dataduck 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 0f87bbaf674b1943242d3ea173a5e34fe00e0724
-   data.tar.gz: b8fbacadd9323ab917498712c8d4f39f1f5ca907
+   metadata.gz: dcc9a5407d2bae97ab0ecb754f95d3c872b92cfe
+   data.tar.gz: d32783d694d625367fb5f732602a6c2a997e8241
  SHA512:
-   metadata.gz: 40bbfce9c990d1542236c59967c31fe3fe5982c84bed12ccaf604c7ce15f2cebc5432b865dfedac5e95607dc37f20e0d681462ee9e7936e30ffdce8361688c96
-   data.tar.gz: fdc25e1ddf3a00faeceb13f11f4b7452f4085b3c0e5ca805137cc21174727aabec052dd335c371cd21c78db7ca0af7ef4e5a93f2876df9abfff531ad90bc8612
+   metadata.gz: 184f298a735a3928a78d5b8e85e22e498d4e26b554d89a2b4201afa0e91d8b812872e0c72e4a6046970b6b8489ff451ad4b1fd93d2271314d133484762189470
+   data.tar.gz: d9e4001147d98b51c3a481f894d52129349ba656d1f6b362845f37618e41b2bdf29005389b106df842ffb8ec02d0b84b45c683a561a49bc4674a88ce392fefcc
data/README.md CHANGED
@@ -18,10 +18,12 @@ See [https://github.com/DataDuckETL/DataDuck/tree/master/examples/example](https
 
  ##### Instructions for using DataDuck ETL
 
- Create a new project, then add the following to your Gemfile:
+ Create a new, empty directory. Inside this directory, create a file named Gemfile, and add the following to it:
 
  ```ruby
- gem 'dataduck', :git => 'git://github.com/DataDuckETL/DataDuck.git'
+ source 'https://rubygems.org'
+
+ gem 'dataduck'
  ```
 
  Then execute:
@@ -23,6 +23,7 @@ Gem::Specification.new do |spec|
 
    spec.add_runtime_dependency "sequel", '~> 4.19'
    spec.add_runtime_dependency "pg", '~> 0.16'
+   spec.add_runtime_dependency "mysql", "~> 2.9"
    spec.add_runtime_dependency "aws-sdk", "~> 2.0"
    spec.add_runtime_dependency "sequel-redshift"
  end
@@ -0,0 +1,7 @@
+ # Documentation
+
+ The documentation directory is viewable at (http://dataducketl.com/docs)[http://dataducketl.com/docs].
+
+ # Autogenerated
+
+ The documentation directory is autogenerated from the main DataDuck ETL git repo. If you would like to add or correct something in the documentation, let us know or make a pull request to (https://github.com/DataDuckETL/DataDuck/docs)[https://github.com/DataDuckETL/DataDuck/docs].
@@ -0,0 +1,6 @@
+ "Overview":
+   "Welcome": README
+   "Getting Started": getting_started
+
+ "Tables":
+   "The Table Class": README
@@ -0,0 +1,24 @@
+ # Overview
+
+ DataDuck ETL is a straightforward, effective extract-transform-load framework for data warehousing. If you want to set
+ up a data warehouse, DataDuck ETL makes it simple and straightforward to do.
+
+ ## Getting Started
+
+ Getting started with DataDuck ETL takes just a few minutes. For instructions, read the
+ [getting started](/docs/overview/getting_started) page.
+
+ ## Why Use a Data Warehouse
+
+ If you already have your data in your main database, and probably use a web analytics product like Google Analytics, you
+ may be wondering why you'd want a data warehouse anyway.
+
+ There's many advantages to using a data warehouse, including:
+
+ - integrating multiple data sources so you can analyze them together
+ - helping to ensure data quality by cleaning up the data and running data quality checks
+ - having a single source of truth that the entire company trusts
+ - connecting business intelligence products for reports and dashboards
+ - using the data warehouse to build models, which may get incorporated back in the product, or used for predictions and company decision making
+ - performance optimizations so your queries run fast
+ - ensuring sensitive data doesn't end up in reports, by not passing it to the data warehouse (encrypted passwords, salts, etc have no practical analytics value, so they are not ETLed)
@@ -0,0 +1,28 @@
+ # Getting Started
+
+ ## Requirements
+
+ DataDuck ETL currently supports extracting from MySQL and PostgreSQL databases. It supports loading into Amazon
+ Redshift. If you would like to extract or load into a database not yet supported, contact us.
+
+ ## Instructions
+
+ First, create a new, empty directory. Inside this directory, create a file named Gemfile with the following:
+
+ ```ruby
+ source 'https://rubygems.org'
+
+ gem 'dataduck'
+ ```
+
+ Then execute:
+
+     $ bundle install
+
+ Finally, run the quickstart command:
+
+     $ dataduck quickstart
+
+ It will ask you for the credentials to your database, and then create the basic setup for your project. After the setup, your project's ETL can be run by running `ruby src/main.rb`
+
+ If you would like to run this regularly, such as every night, it's recommended to use the [whenever](https://github.com/javan/whenever) gem to manage a cron job to regularly run the ETL.
@@ -0,0 +1,48 @@
+ # The Table Class
+
+ If you've run the `dataduck quickstart` command, you'll notice a bunch of table files were generated under /src/tables.
+ Each of these table files inherits from `DataDuck::Table`, the base table class. Tables need to have the `source` and `output` defined.
+
+ You may also define transformations with the `transforms` method and validations with `validates` method.
+
+ ## Example Table
+
+ The following is an example table.
+
+ ```ruby
+ class Decks < DataDuck::Table
+   source :my_database, ["id", "name", "user_id", "cards",
+       "num_wins", "num_losses", "created_at", "updated_at",
+       "is_drafted", "num_draft_wins", "num_draft_losses"]
+
+   transforms :calculate_num_totals
+
+   validates :validates_num_total
+
+   output({
+       :id => :integer,
+       :name => :string,
+       :user_id => :integer,
+       :num_wins => :integer,
+       :num_losses => :integer,
+       :num_total => :integer,
+       :num_draft_total => :integer,
+       :created_at => :datetime,
+       :updated_at => :datetime,
+       :is_drafted => :boolean,
+       # Note that num_draft_wins and num_draft_losses
+       # are not included in the output, but are used in
+       # the transformation.
+   })
+
+   def calculate_num_totals(row)
+     row[:num_total] = row[:num_wins] + row[:num_losses]
+     row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
+     row
+   end
+
+   def validates_num_total(row)
+     return "Deck id #{ row[:id] } has negative value #{ row[:num_total] } for num_total." if row[:num_total] < 0
+   end
+ end
+ ```
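
Reviewer note: the example table in the hunk above can be exercised without DataDuck installed. The following is a minimal standalone sketch, assuming a transform receives a row hash and returns it, and a validation returns an error string or nil. The `process_row` helper and the `OUTPUT_COLUMNS` constant are hypothetical names used only for illustration; how `DataDuck::Table` actually calls these methods is not shown in this diff.

```ruby
# Standalone sketch with no DataDuck dependency. It mimics how the transform
# and validation from the Decks example might act on a single row hash.

# Adds the derived totals to the row and returns it, as the transform does.
def calculate_num_totals(row)
  row[:num_total] = row[:num_wins] + row[:num_losses]
  row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
  row
end

# Returns an error message for a bad row, or nil when the row is fine.
def validates_num_total(row)
  return "Deck id #{row[:id]} has negative value #{row[:num_total]} for num_total." if row[:num_total] < 0
  nil
end

# Hypothetical subset of the output columns (not part of DataDuck).
OUTPUT_COLUMNS = [:id, :num_wins, :num_losses, :num_total, :num_draft_total].freeze

# Hypothetical helper: transform a copy of the row, validate it, then keep
# only the output columns. Returns [output_row, error_or_nil].
def process_row(row)
  transformed = calculate_num_totals(row.dup)
  error = validates_num_total(transformed)
  [transformed.select { |key, _| OUTPUT_COLUMNS.include?(key) }, error]
end

row = { id: 1, num_wins: 3, num_losses: 2, num_draft_wins: 1, num_draft_losses: 4 }
output, error = process_row(row)
```

Here `num_draft_wins` and `num_draft_losses` feed the transformation but are dropped from `output`, matching the comment in the example table.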
@@ -16,6 +16,21 @@ module DataDuck
        end
      end
 
+     def self.prompt_choices(choices = [])
+       while true
+         print "Enter a number 0 - #{ choices.length - 1}\n"
+         choices.each_with_index do |choice, idx|
+           choice_name = choice.is_a?(String) ? choice : choice[1]
+           print "#{ idx }: #{ choice_name }\n"
+         end
+         choice = STDIN.gets.strip.to_i
+         if 0 <= choice && choice < choices.length
+           selected = choices[choice]
+           return selected.is_a?(String) ? selected : selected[0]
+         end
+       end
+     end
+
      def self.acceptable_commands
        ['console', 'quickstart']
      end
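
Reviewer note: `prompt_choices` in the hunk above parses the answer with `String#to_i`, which returns 0 for non-numeric text, so an answer such as "abc" would select the first choice. The following is a hedged sketch of a stricter variant, not the gem's code. The `input:` and `output:` keyword arguments are assumptions added here so the loop can be driven from a `StringIO`, and `Integer(..., exception: false)` needs Ruby 2.6 or newer.

```ruby
require 'stringio'

# Hypothetical variant of prompt_choices with injectable streams.
# Each choice is either a String or a [value, label] pair, as in the diff.
def prompt_choices(choices, input: $stdin, output: $stdout)
  loop do
    output.puts "Enter a number 0 - #{choices.length - 1}"
    choices.each_with_index do |choice, idx|
      label = choice.is_a?(String) ? choice : choice[1]
      output.puts "#{idx}: #{label}"
    end

    line = input.gets
    raise EOFError, "no selection received" if line.nil?

    # Integer(str, 10, exception: false) returns nil for non-numeric text,
    # unlike String#to_i, which would turn "abc" into 0.
    index = Integer(line.strip, 10, exception: false)
    next if index.nil? || index < 0 || index >= choices.length

    selected = choices[index]
    return selected.is_a?(String) ? selected : selected[0]
  end
end

# Simulated session: two rejected answers, then a valid one.
input = StringIO.new("abc\n7\n1\n")
sink = StringIO.new
choice = prompt_choices(
  [[:mysql, "MySQL"], [:postgresql, "PostgreSQL"], [:other, "other"]],
  input: input, output: sink
)
```

In this simulated session the first two answers are rejected and the menu is printed again each time, so `choice` ends up as `:postgresql`.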
@@ -47,7 +62,20 @@ module DataDuck
        puts "Welcome to DataDuck!"
        puts "This quickstart wizard will create your application, assuming the source is a Postgres database and the destination is an Amazon Redshift data warehouse."
 
-       puts "Enter the source (Postgres database) hostname:"
+
+       puts "What kind of database would you like to source from?"
+       db_type = prompt_choices([
+         [:mysql, "MySQL"],
+         [:postgresql, "PostgreSQL"],
+         [:other, "other"],
+       ])
+
+       if db_type == :other
+         puts "You've selected 'other'. Unfortunately, those are the only choices supported at the moment. Contact us at DataDuckETL.com to request support for your database."
+         exit
+       end
+
+       puts "Enter the source hostname:"
        source_host = STDIN.gets.strip
 
        puts "Enter the name of the database when connecting to #{ source_host }:"
@@ -62,8 +90,13 @@ module DataDuck
        puts "Enter the password:"
        source_password = STDIN.noecho(&:gets).chomp
 
-       db_source = DataDuck::PostgresqlSource.new({
-         'type' => 'postgresql',
+       db_class = {
+         mysql: DataDuck::MysqlSource,
+         postgresql: DataDuck::PostgresqlSource,
+       }[db_type]
+
+       db_source = db_class.new({
+         'db_type' => db_type.to_s,
          'host' => source_host,
          'database' => source_database,
          'port' => source_port,
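
Reviewer note: the hash lookup in the hunk above returns nil for an unknown `db_type`, which the wizard avoids by exiting early on `:other`. The following is a minimal sketch of the same dispatch using `Hash#fetch`, so an unsupported key fails with a clear error. The stub classes, `SOURCE_CLASSES`, and `build_source` are hypothetical stand-ins; the real `DataDuck::MysqlSource` and `DataDuck::PostgresqlSource` are not loaded here.

```ruby
# Stand-in classes for illustration only; each just stores its config hash.
MysqlSourceStub = Struct.new(:config)
PostgresqlSourceStub = Struct.new(:config)

# Maps the symbol chosen in the wizard to the class that handles it.
SOURCE_CLASSES = {
  mysql: MysqlSourceStub,
  postgresql: PostgresqlSourceStub,
}.freeze

# Looks up the class for db_type and builds it with the merged config.
# Hash#fetch runs the block for a missing key instead of returning nil.
def build_source(db_type, config)
  klass = SOURCE_CLASSES.fetch(db_type) do
    raise ArgumentError, "unsupported database type: #{db_type.inspect}"
  end
  klass.new(config.merge('db_type' => db_type.to_s))
end

source = build_source(:mysql, 'host' => 'localhost')
```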
@@ -3,7 +3,7 @@ require_relative 'sql_db_source.rb'
  require 'sequel'
 
  module DataDuck
-   class PostrgresqlSource < DataDuck::SqlDbSource
+   class PostgresqlSource < DataDuck::SqlDbSource
      def db_type
        'postgres'
      end
@@ -1,6 +1,6 @@
  module DataDuck
    VERSION_MAJOR = 0
-   VERSION_MINOR = 2
+   VERSION_MINOR = 3
    VERSION_PATCH = 0
    VERSION = [VERSION_MAJOR, VERSION_MINOR, VERSION_PATCH].join('.')
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: dataduck
  version: !ruby/object:Gem::Version
-   version: 0.2.0
+   version: 0.3.0
  platform: ruby
  authors:
  - Jeff Pickhardt
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-10-10 00:00:00.000000000 Z
+ date: 2015-10-11 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -80,6 +80,20 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '0.16'
+ - !ruby/object:Gem::Dependency
+   name: mysql
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
  - !ruby/object:Gem::Dependency
    name: aws-sdk
    requirement: !ruby/object:Gem::Requirement
@@ -127,6 +141,11 @@ files:
  - bin/dataduck
  - bin/setup
  - dataduck.gemspec
+ - docs/README.md
+ - docs/contents.yml
+ - docs/overview/README.md
+ - docs/overview/getting_started.md
+ - docs/tables/README.md
  - examples/example/.gitignore
  - examples/example/.ruby-version
  - examples/example/Gemfile