hackboxen 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,71 @@
+ require 'rubygems'
+ require 'configliere'
+ require 'rake'
+
+ hb_lib_dir = File.join(File.dirname(__FILE__), '../../../')
+ machine_config = '/etc/hackbox/hackbox.yaml'
+ install_config = File.join(ENV['HOME'], '.hackbox/hackbox.yaml')
+
+ Settings.use :commandline, :config_file
+ Settings.define :namespace, :required => true
+ Settings.define :protocol, :required => true
+ Settings.define :coderoot, :required => true
+ Settings.define :targets, :default => 'catalog'
+ Settings.read(machine_config) if File.exists? machine_config
+ Settings.read(install_config) if File.exists? install_config
+ Settings.resolve!
+
+ # Hackbox directories to be created
+ coderoot = Settings[:coderoot]
+ hackbox = File.join(coderoot, Settings[:namespace].gsub(/\./,'/'), Settings[:protocol])
+ engine = File.join(hackbox, 'engine')
+ config = File.join(hackbox, 'config')
+
+ # Define idempotent directory tasks
+ [ coderoot, hackbox, engine, config ].each { |dir| directory dir }
+
+ # Hackbox files to be created
+ rakefile = File.join(hackbox, 'Rakefile')
+ main = File.join(engine, 'main')
+ config_yml = File.join(config, 'config.yaml')
+ icss_yml = File.join(config, "#{Settings[:protocol]}.icss.yaml")
+ endpoint = File.join(engine, "#{Settings[:protocol]}_endpoint.rb")
+ templates = File.join(hb_lib_dir, 'lib/hackboxen/template')
+
+ # Create a basic endpoint if apeyeye was specified as a target
+ file endpoint, [:config] => engine do |t, args|
+   HackBoxen::Template.new(File.join(templates, "endpoint.rb.erb"), endpoint, args[:config]).substitute!
+ end
+
+ # Create a basic hackbox Rakefile
+ file rakefile => hackbox do
+   HackBoxen::Template.new(File.join(templates, "Rakefile.erb"), rakefile, {}).substitute!
+ end
+
+ # Create a basic executable hackbox main file
+ file main => engine do
+   HackBoxen::Template.new(File.join(templates, 'main.erb'), main, {}).substitute!
+   File.chmod(0755, main)
+ end
+
+ # Create a basic config file
+ file config_yml => config do
+   basic_config = { 'namespace' => Settings[:namespace], 'protocol' => Settings[:protocol] }
+   HackBoxen::Template.new(File.join(templates, "config.yaml.erb"), config_yml, basic_config).substitute!
+ end
+
+ # Create a basic icss file
+ file icss_yml => config do
+   targets = Settings[:targets].split(',')
+   basic_config = {
+     'namespace' => Settings[:namespace],
+     'protocol' => Settings[:protocol],
+     'targets' => targets
+   }
+   HackBoxen::Template.new(File.join(templates, "icss.yaml.erb"), icss_yml, basic_config).substitute!
+   Rake::Task[endpoint].invoke(basic_config) if targets.include? 'apeyeye'
+ end
+
+ task :scaffold => [rakefile, main, config_yml, icss_yml]
+
+
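For orientation, here is what the scaffold above would lay down for a hypothetical configuration; the coderoot, namespace and protocol values below are illustrative, not defaults shipped with the gem:

<pre><code>
# Hypothetical settings, read from /etc/hackbox/hackbox.yaml, ~/.hackbox/hackbox.yaml,
# or Configliere's --namespace / --protocol / --coderoot / --targets command-line flags:
#   coderoot:  /data/code
#   namespace: foo.bar
#   protocol:  baz
#   targets:   catalog,apeyeye
#
# Running the :scaffold task with those settings would create:
#   /data/code/foo/bar/baz/                        (hackbox root)
#   /data/code/foo/bar/baz/Rakefile                (from Rakefile.erb)
#   /data/code/foo/bar/baz/config/config.yaml      (from config.yaml.erb)
#   /data/code/foo/bar/baz/config/baz.icss.yaml    (from icss.yaml.erb)
#   /data/code/foo/bar/baz/engine/main             (from main.erb, chmod 0755)
#   /data/code/foo/bar/baz/engine/baz_endpoint.rb  (from endpoint.rb.erb; only generated when 'apeyeye' is among the targets)
</code></pre>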
@@ -0,0 +1,36 @@
+ require 'erubis'
+
+ module HackBoxen
+   class Template
+
+     attr_accessor :source_template, :output_path, :attributes
+
+     def initialize source_template, output_path, attributes
+       @source_template = source_template
+       @output_path = output_path
+       @attributes = attributes
+     end
+
+     def compile!
+       dest << Erubis::Eruby.new(source).result(attributes)
+       dest << "\n"
+       dest
+     end
+
+     def substitute!
+       compile!
+     end
+
+     protected
+
+     def source
+       File.open(source_template).read
+     end
+
+     def dest
+       return @dest if @dest
+       @dest ||= File.open(output_path, 'w')
+     end
+
+   end
+ end
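A minimal usage sketch of the template helper above, assuming the class is reachable on the load path (the require path and file names here are hypothetical). The attributes hash is handed to Erubis, which exposes its keys as local variables inside the ERB source; that is why the templates that follow can refer to bare `namespace`, `protocol` and `targets`:

<pre><code>
require 'hackboxen/template'   # assumed require path for the class shown above

tpl = HackBoxen::Template.new('/tmp/greeting.erb',   # hypothetical ERB source
                              '/tmp/greeting.txt',   # hypothetical output path
                              'namespace' => 'foo.bar', 'protocol' => 'baz')
tpl.substitute!   # renders with Erubis and writes the result (plus a trailing newline) to the output path
</code></pre>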
@@ -0,0 +1,31 @@
+ require 'hackboxen'
+ #
+ # When you require 'hackboxen' the library establishes where the current hackbox directory
+ # is located and loads all required tasks in order for your hackbox to run to completion
+ #
+
+ task :get_data do
+   #
+   # This task is intended to pull data down from a source. Examples include
+   # the web, an ftp server, and Amazon's simple storage service (s3). As much
+   # as possible this should be the only task that interacts with the 'outside'
+   # world.
+   #
+ end
+
+ task :default => ['hb:create_working_config', 'hb:icss', 'hb:endpoint', :get_data, 'hb:init']
+ #
+ # hb:create_working_config establishes all required directories and serializes all
+ # configuration options out to env/working_config.json. This task is required.
+ #
+ # hb:icss copies the icss.yaml file, if it exists, into its proper place in fixd/data. This
+ # task is not required.
+ #
+ # hb:endpoint copies the endpoint.rb file, if it exists, into its proper place in fixd/code.
+ # This task is not required.
+ #
+ # :get_data is explained above. This task (and any other dependent tasks you wish to write) is
+ # expected only to pull data into the ripd directory, nothing more. This task is required.
+ #
+ # hb:init executes the main file located in engine. This task is required.
+ #
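As a concrete illustration of the :get_data contract described in the comments above, a hackbox could override the task along these lines. The source URL and the ripd path are hypothetical; in a real hackbox the ripd location is derived from the configured dataroot:

<pre><code>
require 'open-uri'
require 'fileutils'

# A minimal sketch: fetch one remote file into the hackbox's ripd/ directory and nothing else.
task :get_data do
  ripd = '/data/ripd/foo/bar/baz'                       # hypothetical ripd directory for this hackbox
  FileUtils.mkdir_p ripd
  File.open(File.join(ripd, 'raw_data.csv'), 'wb') do |f|
    f << open('http://example.com/raw_data.csv').read   # open-uri; the 'outside world' interaction lives here
  end
end
</code></pre>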
@@ -0,0 +1,10 @@
+ ---
+ #
+ # This is a sample config. Any hackbox-specific options or parameters that need to be accessed
+ # during the execution of a hackbox should be put in here.
+ #
+ namespace: <%= namespace %>
+ protocol: <%= protocol %>
+ filesystem_scheme: file
+ under_consideration: true # This flag is set to true for initial publishing, then removed when fully complete
+ update_frequency: monthly # How often the data is refreshed
@@ -0,0 +1,39 @@
+ <% format = "" %>
+ <% indent = 0 %>
+ <% entries = namespace.split('.') << protocol %>
+ <% entries.each_with_index do |part,count| %>
+ <% indent = count %>
+ <% indent.times { |c| format += "  " } %>
+ <% if entries[count] == entries.last %>
+ <% format += "class #{part.split("_").map { |p| p.capitalize }.join("")}Endpoint < Endpoint\n\n" %>
+ <% else %>
+ <% format += "module #{part.split("_").map { |p| p.capitalize }.join("")}\n" %>
+ <% end %>
+ <% end %>
+ <% indent += 1 %>
+ <% targets.each do |target| %>
+ <% case target %>
+ <% when 'mysql' %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "extend Connection::MysqlConnection\n" %>
+ <% when 'hbase' %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "extend Connection::HbaseConnection\n" %>
+ <% when 'geo_index' %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "extend Connection::HbaseGeoConnection\n" %>
+ <% when 'elasticsearch' %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "extend Connection::ElasticSearchConnection\n" %>
+ <% end %>
+ <% end %>
+ <% format += "\n" %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "Put your endpoint code here:\n\n" %>
+ <% indent -= 1 %>
+ <% while indent >= 0 %>
+ <% indent.times { |c| format += "  " } %>
+ <% format += "end\n" %>
+ <% indent -= 1 %>
+ <% end %>
+ <%= format %>
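For reference, with a hypothetical namespace of foo.bar, protocol baz_qux and targets mysql, the template above renders the following skeleton (the "Put your endpoint code here:" line is a placeholder the author is expected to replace):

<pre><code>
module Foo
  module Bar
    class BazQuxEndpoint < Endpoint

      extend Connection::MysqlConnection

      Put your endpoint code here:

    end
  end
end
</code></pre>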
@@ -0,0 +1,125 @@
+ ---
+ namespace: <%= namespace %>
+ protocol: <%= protocol %>
+
+ data_asset:
+   - name: <%= protocol %>_data_asset
+     location: <%= protocol %>_data.tsv
+     type: <%= protocol %>_data_record
+
+ <% if targets.include? 'apeyeye' %>
+ code_asset:
+   - name: <%= protocol %>_code_asset
+     location: code/<%= protocol %>_endpoint.rb
+
+ messages:
+   <%= protocol %>_search: # An example message name
+     request:
+       - name: <%= protocol %>_search_request
+         type: <%= protocol %>_search_request
+     response: <%= protocol %>_search_response_record
+     doc: A clear description of how to interact with the api using this message
+     samples:
+       - request: # A sample request using this message's defined request parameters below
+           - param_1_name: value
+             param_2_name: value
+             param_3_name: value
+
+ <% end %>
+ targets:
+ <% targets.each do |target| %>
+ <% case target %>
+ <% when 'catalog' %>
+   catalog:
+     - name: <%= protocol %>_catalog_entry
+       title: The display title of this catalog entry
+       description: |-
+         A very detailed description of the entry goes here. Ensure proper formatting and clear concise information about the dataset as this field will be the main visibility point of the dataset page.
+       tags:
+         - an
+         - array
+         - of
+         - single-word
+         - tags
+       packages: # You only need this if your dataset will be available for bulk download
+         - data_assets:
+             - <%= protocol %>_data_asset
+ <% if targets.include? 'apeyeye' %>
+           messages:
+             - an array of message names # needs to match the messages entries up above
+ <% end %>
+ <% when 'apeyeye' %>
+   apeyeye:
+     - code_assets:
+         - <%= protocol %>_code_asset
+ <% when 'hbase' %>
+   hbase:
+     # When your data has the following schema (row_key, column_family, column_name, column_value), use
+     - table_name: The hbase table to write data into
+       column_families: An array of column families to write data to
+       loader: fourple_loader
+       data_assets:
+         - <%= protocol %>_data_asset
+     # When your data is simply a tsv record, use these hashes instead
+     - table_name: The hbase table to write data into
+       column_family: A single column family to write data to
+       id_field: The name of the field to use as the row key when indexing
+       loader: tsv_loader
+       data_assets:
+         - <%= protocol %>_data_asset
+ <% when 'geo_index' %>
+   geo_index:
+     - table_name: The hbase table name # must be one of geo_location_infochimps_place, _path or _event
+       min_zoom: An integer specifying the minimum zoom level
+       max_zoom: An integer specifying the maximum zoom level
+       chars_per_page: An integer number of approximately how many characters to display per page
+       sort_field: The field within the Properties hash to sort by. Use -1 if no field is sorted by
+       data_assets:
+         - <%= protocol %>_data_asset
+ <% when 'elasticsearch' %>
+   elasticsearch:
+     - index_name: The name of the index to write data into
+       object_type: The object type to be created in ElasticSearch
+       id_field: Optionally used to define the field to id by during indexing
+       loader: Either tsv_loader or json_loader based on your data type
+       data_assets:
+         - <%= protocol %>_data_asset
+ <% when 'mysql' %>
+   mysql:
+     - database: The name of the MySQL database to be loaded into
+       table_name: The name of the corresponding table to be loaded into
+       data_assets:
+         - <%= protocol %>_data_asset
+ <% end %>
+ <% end %>
+
+ # Any non-basic types declared above must be defined explicitly under this type heading
+ types:
+   - name: <%= protocol %>_data_record
+     type: record
+     doc: Description of the <%= protocol %>_data_record type
+     fields:
+       - name: A name for one of the fields in the <%= protocol %>_data_record type
+         doc: A description for this field
+         type: If this is not a primitive type, make sure you explicitly define it below
+       - name: A name for one of the fields in the <%= protocol %>_data_record type
+         doc: A description for this field
+         type: If this is not a primitive type, make sure you explicitly define it below
+
+ <% if targets.include? 'apeyeye' %>
+   - name: <%= protocol %>_search_request
+     type: record
+     doc: Description of the <%= protocol %>_search_request type
+     fields:
+       - name: A name for one of the fields in the <%= protocol %>_search_request type
+         doc: A description for this field
+         type: If this is not a primitive type, make sure you explicitly define it below
+
+   - name: <%= protocol %>_search_response_record
+     type: record
+     doc: Description of the <%= protocol %>_search_response_record type
+     fields:
+       - name: A name for one of the fields in the <%= protocol %>_search_response_record type
+         doc: A description for this field
+         type: If this is not a primitive type, make sure you explicitly define it below
+ <% end %>
@@ -0,0 +1,31 @@
+ #!/usr/bin/env ruby
+ #
+ # A simple example of an executable main file. This script is NOT required to be ruby.
+ #
+
+ #
+ # inputdir is the first argument your main script will get. It will ALWAYS get this. inputdir
+ # will ALWAYS be a directory that contains ripd/, rawd/, fixd/, env/, and log/.
+ #
+ inputdir = ARGV[0]
+
+ #
+ # outputdir is the second argument your main script will get. It will ALWAYS get this. outputdir
+ # will ALWAYS be the fixd/data/ directory
+ #
+ outputdir = ARGV[1]
+
+ #
+ # Ruby example: read the working_environment.json file in env/ into a ruby hash
+ # (same as a javascript associative array, a java hashmap, a python dictionary, etc)
+ # called 'options' to access the configuration settings used to execute the Rakefile
+ #
+ require 'json'
+ options = JSON.parse(File.read(File.join(inputdir, "env", "working_environment.json")))
+
+ #
+ # If you require 'hackboxen' you can access the default path_to utility method
+ #
+ require 'hackboxen'
+ path_to :fixd_dir # => "[current_dataroot]/fixd/"
+ path_to :hb_engine # => "[current_hackbox]/engine/"
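Putting the pieces together, a complete (if trivial) main script under these conventions could look like the sketch below. The input filename and the csv-to-tsv munging are hypothetical stand-ins for whatever :get_data actually fetched into ripd/:

<pre><code>
#!/usr/bin/env ruby
require 'json'

inputdir, outputdir = ARGV[0], ARGV[1]

# Configuration serialized by hb:create_working_config (includes namespace, protocol, etc.)
options = JSON.parse(File.read(File.join(inputdir, 'env', 'working_environment.json')))

# Turn the (hypothetical) raw file pulled into ripd/ by :get_data into the tsv data asset
# that the generated icss points at, written into fixd/data/.
File.open(File.join(outputdir, "#{options['protocol']}_data.tsv"), 'w') do |out|
  File.foreach(File.join(inputdir, 'ripd', 'raw_data.csv')) do |line|
    out.puts line.chomp.split(',').join("\t")   # naive csv -> tsv, purely for illustration
  end
end
</code></pre>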
@@ -0,0 +1,49 @@
+ WorkingConfig = Configliere::Param.new
+ WorkingConfig.use :commandline, :config_file
+
+ module HackBoxen
+
+   autoload :ConfigValidator, 'hackboxen/utils/config_validator'
+   autoload :Paths, 'hackboxen/utils/paths'
+   autoload :Logging, 'hackboxen/utils/logging'
+
+   def self.find_root_dir
+     start_dir = File.dirname INCLUDING_FILE
+     Dir.chdir start_dir
+     until hackbox_root? Dir.pwd
+       Dir.chdir('..')
+       if Dir.pwd == '/'
+         puts "Warning: not in a Hackbox base directory"
+         return start_dir
+       end
+     end
+     return Dir.pwd
+   end
+
+   def self.hackbox_root? dir = Dir.pwd
+     %w[ engine config Rakefile ].each do |expected|
+       return false unless Dir.entries(dir).include? expected
+     end
+     true
+   end
+
+   def self.current_fs
+     fs = WorkingConfig[:filesystem_scheme] ? WorkingConfig[:filesystem_scheme] : 'file'
+     Swineherd::FileSystem.get fs
+   end
+
+   def self.current
+     hackbox_root? ? File.join(WorkingConfig[:namespace].gsub('.', '/'), WorkingConfig[:protocol]) : 'debug'
+   end
+
+   def self.verify_dependencies
+     %w[ dataroot namespace protocol ].each do |req|
+       raise "Your hackbox config appears to be missing a [#{req}]" unless WorkingConfig[req.to_sym]
+     end
+   end
+
+   def self.read_config cfg
+     WorkingConfig.read cfg if current_fs.exists?(cfg)
+   end
+ end
+
@@ -0,0 +1,63 @@
+ h1. Execution Environment Validator
+
+ Hackboxen usually require resources in their execution environment. If the @WorkingConfig@ for a hackbox contains the key @requires@, then its value must be a hash that declares its requirements. This declaration takes the form of a tree of hashes where each terminal key specifies a particular requirement and the value associated with that key is a configuration specifier for that requirement.
+
+ h2. Requirement Values
+
+ The value for each key may be one of:
+
+ * **Null:** This requirement must exist, but its exact configuration does not need to be precisely stated.
+ * **String:** This requirement must exist and its configuration (e.g. version constraint, location) is specified in the string.
+ * **Array Of Strings:** This requirement has multiple configuration constraints (e.g. min/max version, access to multiple mysql databases).
+ * **Hash:** The key is a category rather than an actual requirement. The value contains actual requirements or subcategories.
+
+ The meaning of a string value is defined by its key. In general, version strings should be specified in Bundler Gemfile syntax. Currently, the evaluator does not actually interpret value strings; it only checks for the existence of keys. However, these values may be needed by external tools or systems and so should be specified if a value other than the default is required.
+
+ h2. Schema
+
+ The following is the current schema for the top of the @requires@ tree (default versions in parentheses):
+
+ * **platform:** The processing environment for this hackbox
+ ** **os:** One of "linux", "osx", "win". ("linux")
+ ** **hardware:** One of "x86", "x86_64" ("x86")
+ * **language:** Languages and/or their libraries
+ ** **ruby:** The @RVM@ ruby version needed by this hackbox ("1.8.7")
+ ** **jars:** If the ruby version is a jruby version, then the needed external jars should be named in this hash.
+ ** **python:** The minimum python version needed by this hackbox ("2.6")
+ * **processing:** These are data processing tools and resources that need to be available
+ ** **pig:** Apache Pig is installed and configured.
+ ** **wukong:** Wukong Hadoop streaming processor is installed and configured.
+ * **shelltools:** A reference to command line tools that must be callable via a shell in the default @PATH@.
+ * **datastore:** Datastores that must be accessible by this hackbox. If the value for a datastore is @null@, then the default store is needed. If the value is a string, then this is the "name" of the required store. If the value is an array of strings, then all of the named stores are required.
+ ** **mysql:**
+ ** **elasticsearch:**
+ ** **hbase:**
+ * **filesystems:** These are the non-local filesystems that the hackbox needs and will access through the filesystem abstraction in swineherd. Local filesystems are always expected to be available.
+ ** **hdfs:**
+ ** **s3:**
+
+ h2. Example
+
+ An example YAML @requires@ specification should look something like:
+
+ <pre><code>
+ requires:
+   language:
+     ruby: 1.9
+     python: 2.6
+     jars:
+       xerces: 4.5
+   shelltools:
+     wget: null
+     curl: null
+     tar: null
+     gcc: null
+   datastore:
+     mysql: null
+     hbase: null
+ </code></pre>
+
+ h2. Evaluation
+
+ To be implemented.
+
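Until then, the checking behavior described above (terminal keys are verified for existence only; value strings are not interpreted) amounts to a simple walk over the @requires@ tree. A minimal illustrative sketch follows; it is not the gem's eventual @ConfigValidator@, and the @available@ list is a hypothetical description of what the environment offers:

<pre><code>
# Illustrative only: report terminal `requires` keys that the environment does not provide.
def missing_requirements(requires, available, trail = [])
  requires.inject([]) do |missing, (key, value)|
    path = (trail + [key.to_s]).join('.')
    if value.is_a?(Hash)                                                 # category: recurse into its children
      missing + missing_requirements(value, available, trail + [key.to_s])
    else                                                                 # terminal key: only presence is checked
      available.include?(path) ? missing : missing << path
    end
  end
end

available = %w[ language.ruby shelltools.wget datastore.mysql ]
requires  = { 'language' => { 'ruby' => '1.9' }, 'shelltools' => { 'wget' => nil, 'curl' => nil } }
missing_requirements(requires, available)   # => ["shelltools.curl"]
</code></pre>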