dataduck 0.6.8 → 0.7.0
- checksums.yaml +4 -4
- data/dataduck.gemspec +1 -0
- data/docs/contents.yml +5 -0
- data/docs/integrations/optimizely.md +21 -0
- data/docs/integrations/semrush.md +36 -0
- data/docs/tables/incremental_vs_full_loading.md +70 -0
- data/lib/dataduck/commands.rb +5 -1
- data/lib/dataduck/database.rb +1 -1
- data/lib/dataduck/destination.rb +4 -0
- data/lib/dataduck/logs.rb +5 -0
- data/lib/dataduck/redshift_destination.rb +6 -0
- data/lib/dataduck/table.rb +6 -0
- data/lib/dataduck/version.rb +2 -2
- metadata +19 -2
checksums.yaml
CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef6f3cd5a8054cf855b227324845f2ec365516dd
+  data.tar.gz: 45030532745d7a68988bee5e95a791ed55bae6b0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1d167785a5f64fd8ea77546dcb3f7d19107c3c651aac5933afe09e7eee4eb1674c25a9223b1c724a8229334d669c41604288c29fa4da6e35bfa596ad8bde6a90
+  data.tar.gz: 708a7eb4404cee131bcd946314b57febaa6908c716c4ce2928167c88a0a61f084820f34258a3b5b74cbc97a215759e8cec13e9f89ff25a17aba2453a636d4517
```
data/dataduck.gemspec
CHANGED
data/docs/contents.yml
CHANGED
data/docs/integrations/optimizely.md
ADDED

# Optimizely Integration

Optimizely is a website optimization platform that includes A/B testing and personalization products.

The Optimizely integration uses Optimizely's API to fetch data for your projects, experiments, and variations, then loads them into three tables in your data warehouse.

To use the Optimizely integration, first get an API token from [https://app.optimizely.com/tokens](https://app.optimizely.com/tokens). Then add the following to your project's .env file:

```
optimizely_api_token=YOUR_TOKEN
```

Finally, add the following file to your project's /src/tables directory, naming it optimizely_integration.rb:

```ruby
class OptimizelyIntegration < DataDuck::Optimizely::OptimizelyIntegration
end
```

Now, running `dataduck etl optimizely_integration` will ETL three tables for you: `optimizely_projects`, `optimizely_experiments`, and `optimizely_variations`. The results data can be found on the variations. Additionally, a `dataduck_extracted_at` datetime column indicates how fresh the data is.
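The `dataduck_extracted_at` column makes it easy to check data freshness downstream. As an illustrative sketch (the `stale?` helper below is not part of DataDuck), a staleness check might look like:

```ruby
# Illustrative helper, not part of DataDuck: decide whether extracted data
# is stale based on the value of the dataduck_extracted_at column.
def stale?(extracted_at, max_age_seconds, now = Time.now)
  (now - extracted_at) > max_age_seconds
end

stale?(Time.now - 7200, 3600) # two hours old, one hour allowed => true
stale?(Time.now - 60, 3600)   # one minute old => false
```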
data/docs/integrations/semrush.md
ADDED

# SEMrush Integration

SEMrush is a powerful and versatile competitive intelligence suite for online marketing, from SEO and PPC to social media and video advertising research.

The SEMrush integration is currently focused on SEO. It will create a table called `semrush_organic_results` that shows the Google search ranking for specific phrases. By running this regularly, you can see how your website's or your competitors' websites' search rankings change over time.

To use the SEMrush integration, first add the following to your .env file:

```
semrush_api_key=YOUR_API_KEY
```

Then create a table called `organic_results` with the following:

```ruby
class OrganicResults < DataDuck::SEMRush::OrganicResults
  def display_limit
    20 # Default is 20
  end

  def search_database
    'us' # Default is 'us'
  end

  def phrases
    ['My Phrase 1',
     'Another Phrase',
     'Some Other Keywords',
    ]
  end
end
```

This table will have five columns: date, phrase, rank, domain, and url.

The methods `display_limit` and `search_database` are optional, but can be overridden to fit your particular use case.
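Since `phrases` is an ordinary Ruby method, the phrase list does not have to be hard-coded. A small sketch, with made-up keywords:

```ruby
# Illustrative only: build the phrase list by combining base keywords
# with modifiers instead of hard-coding every combination.
def phrases
  bases = ["etl", "data warehouse"]
  modifiers = ["framework", "tools"]
  bases.product(modifiers).map { |pair| pair.join(" ") }
end

phrases
# => ["etl framework", "etl tools", "data warehouse framework", "data warehouse tools"]
```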
data/docs/tables/incremental_vs_full_loading.md
ADDED

# Incremental vs Full Loading

Loading a table can be performed either incrementally or with a full reload each time. An incremental load is generally better, since it takes less time and transfers less data; however, not all tables can be loaded incrementally.

## Incremental loading

If you are running an ETL process regularly, rather than loading an entire table each time, it is more efficient to load just the rows that have changed. This is known as an incremental load. By default, if a table contains a column called `updated_at`, DataDuck ETL will use incremental loading based on that column. If no such column exists, it will load the entire table each time.

If rows can be deleted from a table, you should not use incremental loading, since DataDuck ETL won't know which rows have been deleted. Soft deleting a row, by setting a column to 'deleted' (for example), is fine to use with incremental loading.

Under the hood, before extracting, DataDuck ETL checks the destination for the latest value of the column, then uses that value to limit the extract query to new or updated rows.
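To make that concrete, here is a simplified sketch of the kind of query an incremental extract produces. This is not DataDuck's actual implementation; the helper name and query shape are illustrative:

```ruby
# Hypothetical sketch: build an extract query that only fetches rows newer
# than the latest value already present in the destination.
def incremental_extract_query(table_name, column, latest_value)
  query = "SELECT * FROM #{ table_name }"
  query += " WHERE #{ column } > '#{ latest_value }'" if latest_value
  query
end

incremental_extract_query("users", "updated_at", "2015-12-01 00:00:00")
# => "SELECT * FROM users WHERE updated_at > '2015-12-01 00:00:00'"
incremental_extract_query("users", "updated_at", nil)
# => "SELECT * FROM users"
```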
If you would like to base an incremental load on a different column, such as `id` or `created_at` (common in cases where the rows are not expected to change, like an event stream), you can do so by giving your table an `extract_by_column` method:

```ruby
class MyTable < DataDuck::Table
  source :source1, ["id", "created_at", "name"]

  def extract_by_column
    'created_at'
  end

  output({
    :id => :integer,
    :created_at => :datetime,
    :name => :string,
  })
end
```

## Full reloads

Fully reloading a table takes longer, so it is only recommended for tables where incremental loading is not possible.

If you would like to fully reload the table each time, you may give your table an `extract_by_column` that returns `nil`. Alternatively, if you want to keep an `extract_by_column` but still reload the entire table each time, you may give it a `should_fully_reload?` method that returns true. An example of when you might want to do this is when you are reloading an entire table, but doing it in batches.

```ruby
class MyTableFullyReloaded < DataDuck::Table
  source :source1, ["id", "created_at", "name"]

  def batch_size
    1_000_000 # if there is a lot of data and you want to use less memory (for example), batching is a good idea
  end

  def extract_by_column
    'created_at'
  end

  def should_fully_reload?
    true
  end

  output({
    :id => :integer,
    :created_at => :datetime,
    :name => :string,
  })
end
```
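To illustrate why `batch_size` matters when fully reloading a large table, batching splits one large extract into bounded chunks. The helper below is illustrative, not part of DataDuck:

```ruby
# Illustrative: compute [offset, limit] pairs for batched extraction so that
# no single query has to hold more than batch_size rows in memory at once.
def batch_ranges(total_rows, batch_size)
  ranges = []
  offset = 0
  while offset < total_rows
    ranges << [offset, [batch_size, total_rows - offset].min]
    offset += batch_size
  end
  ranges
end

batch_ranges(2_500_000, 1_000_000)
# => [[0, 1000000], [1000000, 1000000], [2000000, 500000]]
```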
data/lib/dataduck/commands.rb
CHANGED
```diff
@@ -47,7 +47,11 @@ module DataDuck
       return DataDuck::Commands.help
     end

-    DataDuck::Commands.public_send(command, *args[1..-1])
+    begin
+      DataDuck::Commands.public_send(command, *args[1..-1])
+    rescue Exception => err
+      DataDuck::Logs.error(err)
+    end
   end

   def self.c
```
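The change above wraps command dispatch in a begin/rescue so that a failing command is logged rather than crashing the CLI. The same pattern in isolation (the class and command names here are illustrative):

```ruby
# Illustrative: dispatch a CLI command by name with public_send, and catch
# anything it raises so the caller can log the error instead of crashing.
class ExampleCommands
  def self.greet(name)
    "Hello, #{ name }"
  end

  def self.route(args, errors = [])
    begin
      public_send(args[0], *args[1..-1])
    rescue Exception => err
      errors << err.message # stand-in for DataDuck::Logs.error(err)
      nil
    end
  end
end

ExampleCommands.route(["greet", "world"]) # => "Hello, world"
ExampleCommands.route(["missing"])        # => nil (error recorded, no crash)
```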
data/lib/dataduck/database.rb
CHANGED
data/lib/dataduck/destination.rb
CHANGED
data/lib/dataduck/logs.rb
CHANGED
```diff
@@ -1,4 +1,5 @@
 require 'logger'
+require 'raven'

 module DataDuck
   module Logs
@@ -43,6 +44,10 @@ module DataDuck

     puts "[ERROR] #{ message }"
     @@logger.error(message)
+
+    if ENV['SENTRY_DSN']
+      Raven.capture_exception(err)
+    end
   end

   private
```
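With this change, errors are forwarded to Sentry only when a `SENTRY_DSN` environment variable is set, so reporting is opt-in. It could be configured in the project's .env file; the DSN below is a placeholder (the real value comes from your Sentry project settings):

```
SENTRY_DSN=https://public_key:secret_key@app.getsentry.com/project_id
```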
data/lib/dataduck/redshift_destination.rb
CHANGED

```diff
@@ -277,6 +277,12 @@
     self.query("DROP TABLE zz_dataduck_recreating_old_#{ table.name }")
   end

+  def postprocess!(table)
+    DataDuck::Logs.info "Vacuuming table #{ table.name }"
+    vacuum_type = table.indexes.length == 0 ? "FULL" : "REINDEX"
+    self.query("VACUUM #{ vacuum_type } #{ table.name }")
+  end
+
   def self.value_to_string(value)
     string_value = ''
```
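The new `postprocess!` step chooses the vacuum mode from the table's indexes: a table with no indexes gets `VACUUM FULL`, otherwise `VACUUM REINDEX`. The same query construction in isolation (the helper is illustrative):

```ruby
# Illustrative: mirror of the query construction in postprocess! above.
def vacuum_query(table_name, index_count)
  vacuum_type = index_count == 0 ? "FULL" : "REINDEX"
  "VACUUM #{ vacuum_type } #{ table_name }"
end

vacuum_query("users", 0)  # => "VACUUM FULL users"
vacuum_query("events", 2) # => "VACUUM REINDEX events"
```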
data/lib/dataduck/table.rb
CHANGED
```diff
@@ -111,6 +111,8 @@ module DataDuck
     if self.should_fully_reload?
       destination.finish_fully_reloading_table!(self)
     end
+
+    self.postprocess!(destination, options)
   end

   def extract!(destination = nil, options = {})
@@ -220,6 +222,10 @@
     self.output_schema.keys.sort.map(&:to_s)
   end

+  def postprocess!(destination, options = {})
+    destination.postprocess!(self)
+  end
+
   def recreate!(destination)
     destination.recreate_table!(self)
   end
```
data/lib/dataduck/version.rb
CHANGED
metadata
CHANGED
```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dataduck
 version: !ruby/object:Gem::Version
-  version: 0.6.8
+  version: 0.7.0
 platform: ruby
 authors:
 - Jeff Pickhardt
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-12-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -178,6 +178,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: sentry-raven
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
 description: A straightforward, effective ETL framework.
 email:
 - pickhardt@gmail.com
@@ -206,9 +220,12 @@ files:
 - docs/commands/recreate.md
 - docs/commands/show.md
 - docs/contents.yml
+- docs/integrations/optimizely.md
+- docs/integrations/semrush.md
 - docs/overview/README.md
 - docs/overview/getting_started.md
 - docs/tables/README.md
+- docs/tables/incremental_vs_full_loading.md
 - examples/example/.gitignore
 - examples/example/.ruby-version
 - examples/example/Gemfile
```