redtrack 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +3 -0
- data/Gemfile +4 -0
- data/LICENSE +22 -0
- data/README.md +173 -0
- data/Rakefile +2 -0
- data/lib/redtrack.rb +16 -0
- data/lib/redtrack_client.rb +286 -0
- data/lib/redtrack_datatypes.rb +175 -0
- data/lib/redtrack_kinesisclient.rb +238 -0
- data/lib/redtrack_loader.rb +650 -0
- data/lib/redtrack_local_file_stream.rb +126 -0
- data/redtrack.gemspec +17 -0
- metadata +99 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: b09961b41ddfee0668b0c3afc544d7f56618592b
  data.tar.gz: d209715373eef894cf6c4cb70141edea015c1916
SHA512:
  metadata.gz: e03069f4f3ee8ba7a8e02b267ee2cd142dfcdb8dc81d33131b523d5957a4a25a152fac71fb649d31a7d529144e2d3fb498e4a22e69b7d0b4128ddcb3bd00a1eb
  data.tar.gz: 352dc2543e113fdf26e42f393b5221ab45aaccec76743b9f85bfe1f9032f4d7aa6135a7a9f81ab0368a24359f69f269d4134916543ae205d03c5c1c24a81f52c
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
The MIT License (MIT)

Copyright (c) 2014 Red Hot Labs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,173 @@
RedTrack
========
RedTrack provides infrastructure for tracking events and loading them into [AWS Redshift](http://aws.amazon.com/redshift/), using [AWS Kinesis](http://aws.amazon.com/kinesis/) as a data broker. For more information on its motivation, design goals, and architecture, please see this blog post:

# Installation / Dependencies

Add to your Gemfile:
```
gem 'redtrack', git: 'git://github.com/redhotlabs/redtrack.git'
```

Once installed, the library can be used by requiring it:
```
require 'redtrack'
```

You need a Redshift cluster. If you don't have one, launch one starting here: [Redshift AWS console](https://console.aws.amazon.com/redshift/home)

# Getting Started

A full application example showing usage is available here: https://github.com/lrajlich/sinatra_example

RedTrack is used through a client object. To get started, configure and create a RedTrack client, ensure the required AWS resources are provisioned and configured, and then call the APIs.

### Configure & Create RedTrack client
To construct a client object, pass a hash of options, [documented in the next section](https://github.com/redhotlabs/redtrack/blob/master/README.md#constructor-options), to its constructor:
```ruby
redtrack_options = {
  :PARAMETER_NAME => PARAMETER_VALUE,
  ...
}
redtrack_client = RedTrack::Client.new(redtrack_options)
```

##### Constructor options
```:access_key_id``` Required. String. Passed to the [aws ruby sdk](https://github.com/aws/aws-sdk-ruby)<br/>
```:secret_access_key``` Required. String. Passed to the [aws ruby sdk](https://github.com/aws/aws-sdk-ruby)<br/>
```:s3_bucket``` Required. String. Name of the bucket used to store file uploads. Must be in the same region as the Redshift cluster.<br/>
```:region``` Required. String. AWS region. Passed to aws-sdk.<br/>
```:redshift_cluster_name``` Required. String. Name of the Redshift cluster, from the Redshift cluster configuration<br/>
```:redshift_host``` Required. String. The Endpoint under Cluster Database Properties in the Redshift cluster configuration<br/>
```:redshift_port``` Required. String. Port under Cluster Database Properties in the Redshift cluster configuration. Default is 5439<br/>
```:redshift_dbname``` Required. String. Database Name under Cluster Database Properties in the Redshift cluster configuration<br/>
```:redshift_user``` Required. String. Master Username under Cluster Database Properties in the Redshift cluster configuration<br/>
```:redshift_password``` Required. String. Password for the above user<br/>
```:redshift_schema``` Required. Hash. Schema definition for Redshift. For more information, see the [Redshift Schema section](https://github.com/redhotlabs/redtrack#redshift-schema)<br/>
```:kinesis_enabled``` Required. Bool. When true, uses Kinesis as the data broker. When false, writes to a file instead of Kinesis (use this configuration for development only).<br/>

For an example / template configuration, see [example configuration](https://github.com/lrajlich/sinatra_example/blob/master/configuration.rb)
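Putting the options together, a complete configuration might look like the following sketch. Every value below is a placeholder (hypothetical credentials, hostnames, and schema), not a working setting:

```ruby
# Minimal placeholder schema; see the Redshift Schema section for the full format.
SCHEMA = {
  :test_events => {
    :columns => {
      :message => { :type => 'varchar(128)' }
    }
  }
}

# All values are placeholders to be replaced with your own AWS/Redshift settings.
redtrack_options = {
  :access_key_id         => 'YOUR_AWS_ACCESS_KEY_ID',
  :secret_access_key     => 'YOUR_AWS_SECRET_ACCESS_KEY',
  :s3_bucket             => 'your-redtrack-uploads',
  :region                => 'us-east-1',
  :redshift_cluster_name => 'your-cluster',
  :redshift_host         => 'your-cluster.abc123xyz.us-east-1.redshift.amazonaws.com',
  :redshift_port         => '5439',
  :redshift_dbname       => 'analytics',
  :redshift_user         => 'master',
  :redshift_password     => 'YOUR_PASSWORD',
  :redshift_schema       => SCHEMA,
  :kinesis_enabled       => true
}
```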

### Creating AWS resources

RedTrack depends on a number of AWS resources being provisioned and configured. These are:

###### 1) Redshift cluster
This has to be done manually via the [Redshift AWS console](https://console.aws.amazon.com/redshift/home)

###### 2) Redshift Database
Make sure the configuration parameter ```redshift_dbname``` has a corresponding database in Redshift; otherwise loading events will fail. By default, your Redshift cluster has a database created along with the cluster. You can create additional databases with ```psql``` using the ```CREATE DATABASE``` command.

###### 3) Redshift Tables
For every table in your schema, make sure there is a Redshift table with the same name; otherwise loading events will fail. The RedTrack client provides a helper method for creating these tables:
```ruby
redtrack_client.create_table_from_schema('SOME_TABLE_NAME')
```

An example usage can be seen here: [Create table example](https://github.com/lrajlich/sinatra_example/blob/master/setup_redtrack_aws_resources.rb#L12)

###### 4) Kinesis Streams
For every table in your schema, make sure there is a Kinesis stream whose name follows the convention ```<redshift_cluster_name>.<redshift_db_name>.<table_name>```. RedTrack provides a helper method for creating these streams:
```ruby
redtrack_client.create_kinesis_stream_for_table('SOME_TABLE_NAME')
```

An example usage can be seen here: [Create kinesis stream example](https://github.com/lrajlich/sinatra_example/blob/master/setup_redtrack_aws_resources.rb#L26)
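The naming convention can be sketched as a small helper (the function name here is hypothetical; RedTrack computes this internally from the client options):

```ruby
# Hypothetical helper illustrating the stream naming convention:
# <redshift_cluster_name>.<redshift_db_name>.<table_name>
def redtrack_stream_name(cluster_name, db_name, table_name)
  [cluster_name, db_name, table_name].join('.')
end

redtrack_stream_name('mycluster', 'analytics', 'test_events')
# => "mycluster.analytics.test_events"
```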

###### 5) Tracking Tables
Finally, RedTrack keeps internal state to track which events have already been loaded. The ```kinesis_loads``` table has to exist in the database you are loading into. As above, there is a helper method for creating this table:
```ruby
redtrack_client.create_kinesis_loads_table()
```

An example usage can be seen here: [Create kinesis loads table example](https://github.com/lrajlich/sinatra_example/blob/master/setup_redtrack_aws_resources.rb#L19)

# Interface
RedTrack has two interfaces: Write and Loader. The gist is that the Write API is called inline with application logic and writes events to the broker, while the Loader is called asynchronously by a recurring job to read events from the broker and load them into Redshift. For an overview of the architecture, see: <INSERT LINK HERE>.

#### Write API
Your web application interacts with the Write API in-line with web transactions. Write validates the passed data against the RedTrack schema (since the data is loaded asynchronously into Redshift, RedTrack does not validate the write against Redshift directly) and then writes it to the appropriate Kinesis stream.

A simple example:
```ruby
redtrack_client = RedTrack::Client.new(options)
data = {
  :message => "foo",
  :timestamp => Time.now.to_i
}
result = redtrack_client.write("SOME_TABLE",data)
```

For an application example, see [this example usage](https://github.com/lrajlich/sinatra_example/blob/master/app.rb#L34)

#### Loader
The loader runs asynchronously to consume events off the broker and load them into the warehouse. Events are read from Kinesis starting at the last load point, uploaded to S3, and then copied into Redshift. There is a single function, which takes two parameters: a table name and a stream shard index. The stream shard index corresponds to the index in the array of shards returned by a [DescribeStream](http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html) request.

A simple example:
```ruby
loader = redtrack_client.new_loader()
stream_shard_index = 0
loader_result = loader.load_redshift_from_broker("SOME_TABLE_NAME",stream_shard_index)
```
For an application example, see [this load_redshift script example](https://github.com/lrajlich/sinatra_example/blob/master/load_redshift.rb)

# Redshift Schema
One of the features of RedTrack is the ability to pass in a schema matching your table schema. RedTrack can validate that passed events match the schema, and it can generate the SQL statement to create a matching table, or create the table directly. For an overview of the available Redshift schema options, see [the docs](http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html)

To pass a schema, you pass in a hash like this:
```ruby
SCHEMA = {
  :SOME_TABLE_NAME => {
    :columns => {
      :SOME_COLUMN_NAME => {
        :type => 'varchar(32)',
        :constraint => 'not null'
      },
      ... (OTHER COLUMNS)
    },
    :sortkey => 'SOME_COLUMN_NAME',
    :distkey => 'SOME_COLUMN_NAME'
  },
  ... (OTHER TABLES)
}
```

A simple example looks like this:
```ruby
SCHEMAS = {
  :test_events => {
    :columns => {
      :client_ip => { :type => 'varchar(32)', :constraint => 'not null'},
      :timestamp => { :type => 'integer', :constraint => 'not null'},
      :message => { :type => 'varchar(128)' }
    },
    :sortkey => 'timestamp'
  }
}
```
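For the ```test_events``` schema above, the SQL that ```create_table_from_schema``` emits can be sketched as follows. This mirrors the generator in ```lib/redtrack_client.rb``` and is reproduced standalone for illustration only:

```ruby
# Standalone sketch of RedTrack's CREATE TABLE generation for the example schema.
schema = {
  :columns => {
    :client_ip => { :type => 'varchar(32)', :constraint => 'not null' },
    :timestamp => { :type => 'integer', :constraint => 'not null' },
    :message => { :type => 'varchar(128)' }
  },
  :sortkey => 'timestamp'
}

# Build the column list, comma-separating all but the last column.
query = "create table test_events (\n"
schema[:columns].each_with_index do |(name, column), index|
  query += "#{name} #{column[:type]}"
  query += " #{column[:constraint]}" if column[:constraint]
  query += "," if index != schema[:columns].size - 1
  query += "\n"
end
query += ")\nsortkey(#{schema[:sortkey]});\n"

puts query
# create table test_events (
# client_ip varchar(32) not null,
# timestamp integer not null,
# message varchar(128)
# )
# sortkey(timestamp);
```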

#### Redshift Type Support

Since RedTrack loads events asynchronously, events are filtered before they are written to the broker, both to avoid COPY errors and to provide direct feedback to the caller of the ```write``` function.

```varchar(n)``` Supported. Current behavior is to truncate any strings that exceed the provided length<br/>
```char``` Supported. <br/>
```smallint``` Supported. <br/>
```bigint``` Supported. <br/>
```timestamp``` Partially supported. Not all time formats are supported. Redshift's timeformat is very restrictive (simply checking for a valid Ruby time is not sufficient), so validation is done via string matching. [Documentation](http://docs.aws.amazon.com/redshift/latest/dg/r_DATEFORMAT_and_TIMEFORMAT_strings.html)<br/>
```decimal``` Supported. Checks that the value is numeric, i.e., converts to a float.
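As a rough sketch of this filtering behavior (the helper names here are hypothetical; the real checks live in ```lib/redtrack_datatypes.rb```):

```ruby
# Hypothetical helpers mirroring the write-time filtering described above.

# varchar(n): truncate strings that exceed the declared length
def check_varchar(value, length)
  value.to_s[0, length]
end

# timestamp: Redshift accepts "YYYY-MM-DD HH:mm:ss", so validation is a
# string match rather than a Ruby time parse
def redshift_timestamp?(value)
  value.is_a?(String) && !value[/\A\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\z/].nil?
end

check_varchar('a' * 40, 32).length          # => 32
redshift_timestamp?('2014-06-01 12:00:00')  # => true
redshift_timestamp?('June 1, 2014')         # => false
```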

RedTrack type filtering is done [here](https://github.com/redhotlabs/redtrack/blob/master/lib/redtrack_datatypes.rb) and contributions to the filtering logic are welcome.

#### Unsupported Redshift schema options

1) Creating Redshift tables with Redshift column attributes, [from docs](http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html). This includes the following parameters: DEFAULT, IDENTITY, and ENCODE. DISTKEY and SORTKEY are created as table attributes, but not as column attributes. You can manually set attributes on the columns.

2) Creating Redshift tables with table constraints, [from docs](http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html). This includes UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints. You can manually set these on the table schema.

3) Enforcement of unique column constraints, [from docs](http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html). The RedTrack client will not verify that an event's property is actually unique; instead, the events will fail to load.

# Documentation / Further reading

Redshift supports a handful of types: [Redshift Types](http://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html)
data/Rakefile
ADDED
data/lib/redtrack.rb
ADDED
@@ -0,0 +1,16 @@
# Copyright (c) 2014 RedHotLabs, Inc.
# Licensed under the MIT License

# Dependent requires
require 'logger'
require 'aws-sdk'
require 'json'
require 'pg'
require 'time'

# Require all of redtrack library
require 'redtrack_client'
require 'redtrack_kinesisclient'
require 'redtrack_loader'
require 'redtrack_local_file_stream'
require 'redtrack_datatypes'
data/lib/redtrack_client.rb
ADDED
@@ -0,0 +1,286 @@
# The Client provides an application interface for redtrack
#
# Copyright (c) 2014 RedHotLabs, Inc.
# Licensed under the MIT License

module RedTrack
  class Client

    TAG='RedTrack::Client'

    @broker = nil
    @redshift_conn = nil
    @options = nil
    @data_types = nil
    @valid_data_types = nil

    @logger = nil

    # Constructor for the client - initialize instance variables
    #
    # @param [Hash] options Options to the client - see README.md
    def initialize(options)

      # Create logger and add to options (passed to other objects)
      @logger = Logger.new(STDOUT)
      options[:logger] = @logger

      # Create the appropriate broker
      if options[:kinesis_enabled] == true
        @logger.debug("#{TAG} Kinesis enabled. create KinesisClient")
        @broker = RedTrack::KinesisClient.new(options)
      else
        @logger.debug("#{TAG} Kinesis disabled. create FileClient")
        @broker = RedTrack::FileClient.new(options)
      end

      # Bind to the interface for checking data types
      @data_types = RedTrack::DataTypes.new(options)
      @valid_data_types = @data_types.valid_data_types

      aws_options = {
        :access_key_id => options[:access_key_id],
        :secret_access_key => options[:secret_access_key],
        :region => options[:region]
      }
      AWS.config(aws_options)

      @options = options
    end

    # Create a new loader client
    #
    # @param [Hash] loader_options The options to pass to the loader
    # @return [RedTrack::Loader] The loader client
    def new_loader(loader_options={})
      merged_options = merge_options(loader_options)

      if @redshift_conn == nil
        @redshift_conn = new_redshift_connection(loader_options)
      end

      return RedTrack::Loader.new(merged_options,@broker,@redshift_conn)
    end

    # Create a new redshift connection
    #
    # @param [Hash] connection_options A set of options to pass to PG.connect. Uses options passed to redtrack client by default
    # @return [PG::Connection] Postgres client connection
    def new_redshift_connection(connection_options={})
      merged_options = merge_options(connection_options)

      @redshift_conn = PG.connect(
        :host => merged_options[:redshift_host],
        :port => merged_options[:redshift_port],
        :dbname => merged_options[:redshift_dbname],
        :user => merged_options[:redshift_user],
        :password => merged_options[:redshift_password])

      return @redshift_conn
    end

    # Check the data to ensure it conforms to the table schema and write to the data broker for the table.
    # Determines which shard to write to randomly
    #
    # @param [String] table The name of the redshift table to write to
    # @param [Hash] data Hash containing data to write to the table. Key is column name
    # @param [String] partition_key Optional, used to determine which kinesis shard to write the data to
    # @return [Boolean] Whether or not the write succeeded
    def write(table,data,partition_key=nil)

      ## Get table schema
      schema = get_table_schema(table)

      if schema == nil
        raise "Schema does not exist for table name '#{table}'"
      end

      ## Ensure that the keys in the passed data are symbols (this is what's expected)
      data.keys.each do |key|
        if(key.is_a?(Symbol) == false)
          raise "Data key #{key} is not a symbol!"
          # TODO: Convert string keys to symbols instead of raising
        end
      end

      intersection = schema[:columns].keys & data.keys

      ## Validate no data keys are passed that are not in the table schema
      data.keys.each do |key|
        if(intersection.include?(key) == false)
          raise "Data key #{key} is not in schema for #{table} table!"
        end
      end

      ## Validate that "not null" columns are present
      schema[:columns].each do |column_name,column|
        if(column.keys.include?(:constraint) == true && column[:constraint] == "not null" && intersection.include?(column_name) == false)
          raise "Column #{column_name} is missing from passed data"
        end
      end

      ## Validate column types
      schema[:columns].each do |column_name,column|
        if(intersection.include?(column_name) == true)

          value = data[column_name.to_sym]
          column_type = column[:type]

          # Strip any length specifier, eg "varchar(32)" -> "varchar"
          if column_type["("] != nil
            type_name = column_type[/(.*)\(.*/,1]
          else
            type_name = column_type
          end

          if @valid_data_types.include? type_name
            data[column_name.to_sym] = @data_types.send("check_#{type_name}".to_sym,value,column_type,column_name)
          else
            raise "Invalid data type #{type_name}. Valid types: [#{@valid_data_types.join(",")}]"
          end
        end
      end

      ## Serialize as JSON; the data is loaded into redshift as JSON
      data_string = data.to_json

      ## Write the serialized data string to the broker
      partition_key = partition_key || rand(100).to_s
      stream_name = @broker.stream_name(table)
      result = @broker.stream_write(stream_name, data_string, partition_key)

      return result
    end

    # Gets a schema hash object for a specific table
    #
    # @param [String] table The name of the redshift table
    # @return [Hash] Hash object containing the column definitions
    def get_table_schema(table)
      if (@options[:redshift_schema] == nil)
        raise 'Must pass :redshift_schema as option when creating RedTrack client'
      end

      schema = @options[:redshift_schema]

      if schema[table.to_sym]
        result = schema[table.to_sym]
      elsif schema["#{table}"]
        result = schema["#{table}"]
      end

      return result
    end

    # Returns a SQL statement for creating a Redshift table per the defined schema
    #
    # @param [String] table The name of the table
    # @param [Boolean] exec Whether to execute the statement
    # @param [Hash] schema The table schema to use - if not provided, get from passed schema
    # @return [String] Returns the create table string
    def create_table_from_schema(table,exec=true,schema=nil)

      if schema == nil
        schema = get_table_schema(table)
        if !schema
          @logger.warn("#{TAG} No schema exists for table #{table}")
          return false
        end
      end

      query = "create table #{table} (\n"
      schema[:columns].each_with_index do |(column_name,column),index|

        query += "#{column_name} " + column[:type]
        if column[:constraint] != nil
          query += " " + column[:constraint]
        end
        if index != schema[:columns].size - 1
          query += ","
        end
        query += "\n"
      end
      query += ")"
      if schema[:sortkey] != nil
        query += "\nsortkey(" + schema[:sortkey] + ");\n"
      else
        query += ";\n"
      end

      if exec
        conn = new_redshift_connection()
        result = conn.exec(query)
      else
        result = query
      end

      return result
    end

    # Create the kinesis_loads tracking table
    #
    # @return [String] Executes the create table query against redshift and returns the result
    def create_kinesis_loads_table
      schema = {
        :columns => {
          :stream_name => { :type => 'varchar(64)' },
          :shard_id => { :type => 'varchar(64)' },
          :table_name => { :type => 'varchar(64)' },
          :starting_sequence_number => { :type => 'varchar(64)' },
          :ending_sequence_number => { :type => 'varchar(64)' },
          :load_timestamp => { :type => 'timestamp', :constraint => 'not null' }
        },
        :sortkey => 'load_timestamp'
      }

      return create_table_from_schema('kinesis_loads',true,schema)
    end

    # Create a kinesis stream for the table - use configuration
    #
    # @param [String] table The name of the table
    # @param [Integer] shard_count The number of shards in the stream
    def create_kinesis_stream_for_table(table,shard_count=1)
      result = false
      if @options[:kinesis_enabled]
        result = @broker.create_kinesis_stream_for_table(table,shard_count)
      else
        @logger.warn("#{TAG} Kinesis is not enabled. Nothing done.")
      end
      return result
    end

    private

    # Merge options between passed options and the default options in the RedTrack client
    #
    # @param [Hash] options The set of options passed
    def merge_options(options)
      # Clone so that merging does not mutate the client's default options
      merged_options = @options.clone
      options.each do |passed_option_key,passed_option_value|
        merged_options[passed_option_key] = passed_option_value
      end
      return merged_options
    end

    # Determine whether the typed value is a legitimate number (eg, a numeric string)
    #
    # @param [Numeric] value The value to check as a valid numeric
    # @return [Boolean] Whether or not the value is a numeric
    def is_numeric(value)
      Float(value) != nil rescue false
    end

    # Determine whether the typed value is a timestamp as defined by redshift. This is more restrictive than ruby parsing b/c of redshift
    # See: http://docs.aws.amazon.com/redshift/latest/dg/r_DATEFORMAT_and_TIMEFORMAT_strings.html
    #
    # @param [String] value The value to check as a valid timestamp: "YYYY-MM-DD HH:mm:ss" is the only accepted format
    # @return [Boolean] Whether or not the value is a timestamp as accepted by redshift
    def is_redshift_timestamp(value)
      if value.is_a?(String) && value[/\A\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\z/] != nil
        return true
      end
      return false
    end

  end
end