dataduck 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/README.md +4 -2
- data/dataduck.gemspec +1 -0
- data/docs/README.md +7 -0
- data/docs/contents.yml +6 -0
- data/docs/overview/README.md +24 -0
- data/docs/overview/getting_started.md +28 -0
- data/docs/tables/README.md +48 -0
- data/lib/dataduck/commands.rb +36 -3
- data/lib/dataduck/postgresql_source.rb +1 -1
- data/lib/dataduck/version.rb +1 -1
- metadata +21 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: dcc9a5407d2bae97ab0ecb754f95d3c872b92cfe
+  data.tar.gz: d32783d694d625367fb5f732602a6c2a997e8241
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 184f298a735a3928a78d5b8e85e22e498d4e26b554d89a2b4201afa0e91d8b812872e0c72e4a6046970b6b8489ff451ad4b1fd93d2271314d133484762189470
+  data.tar.gz: d9e4001147d98b51c3a481f894d52129349ba656d1f6b362845f37618e41b2bdf29005389b106df842ffb8ec02d0b84b45c683a561a49bc4674a88ce392fefcc
data/README.md
CHANGED
@@ -18,10 +18,12 @@ See [https://github.com/DataDuckETL/DataDuck/tree/master/examples/example](https
 
 ##### Instructions for using DataDuck ETL
 
-Create a new
+Create a new, empty directory. Inside this directory, create a file named Gemfile, and add the following to it:
 
 ```ruby
-
+source 'https://rubygems.org'
+
+gem 'dataduck'
 ```
 
 Then execute:
data/dataduck.gemspec
CHANGED
@@ -23,6 +23,7 @@ Gem::Specification.new do |spec|
 
   spec.add_runtime_dependency "sequel", '~> 4.19'
   spec.add_runtime_dependency "pg", '~> 0.16'
+  spec.add_runtime_dependency "mysql", "~> 2.9"
   spec.add_runtime_dependency "aws-sdk", "~> 2.0"
   spec.add_runtime_dependency "sequel-redshift"
 end
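The new `mysql` dependency uses the pessimistic `~>` operator, like the `sequel` and `pg` entries around it. As a rough illustration of which versions each constraint admits, the sketch below checks a few version strings with RubyGems' own `Gem::Requirement`. The constraint strings are copied from the gemspec hunk; the `CONSTRAINTS` table, the `allowed?` helper, and the candidate versions are hypothetical and exist only for this example.

```ruby
require 'rubygems'

# Constraint strings copied from the gemspec hunk.
# The table itself is an illustrative helper, not part of dataduck.
CONSTRAINTS = {
  'sequel' => '~> 4.19',
  'pg'     => '~> 0.16',
  'mysql'  => '~> 2.9',
}.freeze

# Returns true when +version+ satisfies the constraint recorded for +gem_name+.
def allowed?(gem_name, version)
  requirement = Gem::Requirement.new(CONSTRAINTS.fetch(gem_name))
  requirement.satisfied_by?(Gem::Version.new(version))
end

# Made-up candidate versions, chosen to sit on both sides of each bound.
%w[2.8.1 2.9.0 2.9.1 2.10.0 3.0.0].each do |version|
  puts "mysql #{version}: #{allowed?('mysql', version)}"
end
```

With a two-segment constraint such as `~> 2.9`, any 2.x release at or above 2.9 is accepted and 3.0 is not, so a future major release of the driver would not be picked up automatically.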
data/docs/README.md
ADDED
@@ -0,0 +1,7 @@
+# Documentation
+
+The documentation directory is viewable at [http://dataducketl.com/docs](http://dataducketl.com/docs).
+
+# Autogenerated
+
+The documentation directory is autogenerated from the main DataDuck ETL git repo. If you would like to add or correct something in the documentation, let us know or make a pull request to [https://github.com/DataDuckETL/DataDuck/docs](https://github.com/DataDuckETL/DataDuck/docs).
data/docs/contents.yml
ADDED
data/docs/overview/README.md
ADDED
@@ -0,0 +1,24 @@
+# Overview
+
+DataDuck ETL is a straightforward, effective extract-transform-load framework for data warehousing. If you want to set
+up a data warehouse, DataDuck ETL makes it simple and straightforward to do.
+
+## Getting Started
+
+Getting started with DataDuck ETL takes just a few minutes. For instructions, read the
+[getting started](/docs/overview/getting_started) page.
+
+## Why Use a Data Warehouse
+
+If you already have your data in your main database, and probably use a web analytics product like Google Analytics, you
+may be wondering why you'd want a data warehouse anyway.
+
+There are many advantages to using a data warehouse, including:
+
+- integrating multiple data sources so you can analyze them together
+- helping to ensure data quality by cleaning up the data and running data quality checks
+- having a single source of truth that the entire company trusts
+- connecting business intelligence products for reports and dashboards
+- using the data warehouse to build models, which may get incorporated back into the product, or used for predictions and company decision making
+- performance optimizations so your queries run fast
+- ensuring sensitive data doesn't end up in reports, by not passing it to the data warehouse (encrypted passwords, salts, etc. have no practical analytics value, so they are not ETLed)
data/docs/overview/getting_started.md
ADDED
@@ -0,0 +1,28 @@
+# Getting Started
+
+## Requirements
+
+DataDuck ETL currently supports extracting from MySQL and PostgreSQL databases. It supports loading into Amazon
+Redshift. If you would like to extract from or load into a database not yet supported, contact us.
+
+## Instructions
+
+First, create a new, empty directory. Inside this directory, create a file named Gemfile with the following:
+
+```ruby
+source 'https://rubygems.org'
+
+gem 'dataduck'
+```
+
+Then execute:
+
+    $ bundle install
+
+Finally, run the quickstart command:
+
+    $ dataduck quickstart
+
+It will ask you for the credentials to your database, and then create the basic setup for your project. After the setup, your project's ETL can be run by running `ruby src/main.rb`.
+
+If you would like to run this regularly, such as every night, it's recommended to use the [whenever](https://github.com/javan/whenever) gem to manage a cron job to regularly run the ETL.
data/docs/tables/README.md
ADDED
@@ -0,0 +1,48 @@
+# The Table Class
+
+If you've run the `dataduck quickstart` command, you'll notice a bunch of table files were generated under /src/tables.
+Each of these table files inherits from `DataDuck::Table`, the base table class. Tables need to have the `source` and `output` defined.
+
+You may also define transformations with the `transforms` method and validations with the `validates` method.
+
+## Example Table
+
+The following is an example table.
+
+```ruby
+class Decks < DataDuck::Table
+  source :my_database, ["id", "name", "user_id", "cards",
+      "num_wins", "num_losses", "created_at", "updated_at",
+      "is_drafted", "num_draft_wins", "num_draft_losses"]
+
+  transforms :calculate_num_totals
+
+  validates :validates_num_total
+
+  output({
+      :id => :integer,
+      :name => :string,
+      :user_id => :integer,
+      :num_wins => :integer,
+      :num_losses => :integer,
+      :num_total => :integer,
+      :num_draft_total => :integer,
+      :created_at => :datetime,
+      :updated_at => :datetime,
+      :is_drafted => :boolean,
+      # Note that num_draft_wins and num_draft_losses
+      # are not included in the output, but are used in
+      # the transformation.
+  })
+
+  def calculate_num_totals(row)
+    row[:num_total] = row[:num_wins] + row[:num_losses]
+    row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
+    row
+  end
+
+  def validates_num_total(row)
+    return "Deck id #{ row[:id] } has negative value #{ row[:num_total] } for num_total." if row[:num_total] < 0
+  end
+end
+```
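To make the row flow in the `Decks` example easier to follow, here is a stand-alone sketch in plain Ruby that applies the same transformation and validation to a single hash. It does not load DataDuck, and it does not claim to show how `DataDuck::Table` invokes these methods internally. The sample row, the `OUTPUT_COLUMNS` constant, and the `output_row` helper are assumptions made for illustration; only the two method bodies mirror the documentation above.

```ruby
# Stand-alone sketch: plain Ruby hashes, no dataduck gem required.

# Mirrors the transformation in the Decks example: adds derived columns.
def calculate_num_totals(row)
  row[:num_total] = row[:num_wins] + row[:num_losses]
  row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
  row
end

# Mirrors the validation in the Decks example: returns an error message
# for a bad row and nil for a good one.
def validates_num_total(row)
  return "Deck id #{row[:id]} has negative value #{row[:num_total]} for num_total." if row[:num_total] < 0
end

# Hypothetical subset of the output columns, used to show that source
# columns needed only by the transformation can be dropped afterwards.
OUTPUT_COLUMNS = [:id, :num_wins, :num_losses, :num_total, :num_draft_total].freeze

# Hypothetical helper: keeps only the columns named in OUTPUT_COLUMNS.
def output_row(row)
  row.select { |column, _value| OUTPUT_COLUMNS.include?(column) }
end

# Made-up sample row standing in for one record from the source database.
sample = { id: 7, num_wins: 3, num_losses: 2, num_draft_wins: 1, num_draft_losses: 4 }

transformed = calculate_num_totals(sample)
error = validates_num_total(transformed)

puts "validation error: #{error.inspect}"
puts "row to load: #{output_row(transformed).inspect}"
```

The point of the sketch is the ordering: the transformation sees every source column, including `num_draft_wins` and `num_draft_losses`, and only afterwards is the row narrowed to the declared output columns.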
data/lib/dataduck/commands.rb
CHANGED
@@ -16,6 +16,21 @@ module DataDuck
       end
     end
 
+    def self.prompt_choices(choices = [])
+      while true
+        print "Enter a number 0 - #{ choices.length - 1}\n"
+        choices.each_with_index do |choice, idx|
+          choice_name = choice.is_a?(String) ? choice : choice[1]
+          print "#{ idx }: #{ choice_name }\n"
+        end
+        choice = STDIN.gets.strip.to_i
+        if 0 <= choice && choice < choices.length
+          selected = choices[choice]
+          return selected.is_a?(String) ? selected : selected[0]
+        end
+      end
+    end
+
     def self.acceptable_commands
       ['console', 'quickstart']
     end
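One detail of `prompt_choices` worth noting for reviewers: `String#to_i` returns 0 for non-numeric input, so typing anything that is not a number selects the first choice. The sketch below is a hypothetical stricter variant, not the gem's implementation. The name `prompt_choices_strict` and the injectable `input` and `output` parameters are assumptions added so that the behaviour can be exercised without a terminal.

```ruby
require 'stringio'

# Hypothetical variant of prompt_choices. It reads from any IO-like object
# and parses with Integer(), which raises on non-numeric text instead of
# silently returning 0 the way String#to_i does.
def prompt_choices_strict(choices, input = $stdin, output = $stdout)
  loop do
    output.puts "Enter a number 0 - #{choices.length - 1}"
    choices.each_with_index do |choice, idx|
      choice_name = choice.is_a?(String) ? choice : choice[1]
      output.puts "#{idx}: #{choice_name}"
    end

    line = input.gets
    return nil if line.nil? # end of input: stop instead of looping forever

    index = begin
      Integer(line.strip, 10)
    rescue ArgumentError
      nil # not a base-10 integer, ask again
    end
    next if index.nil? || index < 0 || index >= choices.length

    selected = choices[index]
    return selected.is_a?(String) ? selected : selected[0]
  end
end

# Usage with scripted input: "abc" and "7" are rejected, "1" is accepted.
scripted = StringIO.new("abc\n7\n1\n")
picked = prompt_choices_strict([[:mysql, "MySQL"], [:postgresql, "PostgreSQL"]], scripted, StringIO.new)
puts picked.inspect
```

Passing the IO objects as parameters is a design choice for testability only; the method in the diff reads from `STDIN` directly.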
@@ -47,7 +62,20 @@ module DataDuck
       puts "Welcome to DataDuck!"
       puts "This quickstart wizard will create your application, assuming the source is a Postgres database and the destination is an Amazon Redshift data warehouse."
 
-
+
+      puts "What kind of database would you like to source from?"
+      db_type = prompt_choices([
+        [:mysql, "MySQL"],
+        [:postgresql, "PostgreSQL"],
+        [:other, "other"],
+      ])
+
+      if db_type == :other
+        puts "You've selected 'other'. Unfortunately, those are the only choices supported at the moment. Contact us at DataDuckETL.com to request support for your database."
+        exit
+      end
+
+      puts "Enter the source hostname:"
       source_host = STDIN.gets.strip
 
       puts "Enter the name of the database when connecting to #{ source_host }:"
@@ -62,8 +90,13 @@ module DataDuck
       puts "Enter the password:"
       source_password = STDIN.noecho(&:gets).chomp
 
-
-
+      db_class = {
+        mysql: DataDuck::MysqlSource,
+        postgresql: DataDuck::PostgresqlSource,
+      }[db_type]
+
+      db_source = db_class.new({
+          'db_type' => db_type.to_s,
           'host' => source_host,
           'database' => source_database,
           'port' => source_port,
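The `db_class` lookup in this hunk indexes a hash with `db_type`, which yields `nil` for any key that is not listed. In the diff that case is guarded earlier, because the wizard exits when `:other` is chosen. As a minimal sketch of a lookup that fails loudly on its own, the example below uses `Hash#fetch` with stand-in classes so that it runs without the gem. `StubSource`, `MysqlSourceStub`, `PostgresqlSourceStub`, `SOURCE_CLASSES`, and `build_source` are all hypothetical names and are not part of DataDuck.

```ruby
# Stand-in classes so the sketch runs without the dataduck gem.
# They only record the options hash they were built with.
class StubSource
  attr_reader :options

  def initialize(options)
    @options = options
  end
end

class MysqlSourceStub < StubSource; end
class PostgresqlSourceStub < StubSource; end

# Same shape as the db_class hash in the hunk, with the stubs swapped in.
SOURCE_CLASSES = {
  mysql: MysqlSourceStub,
  postgresql: PostgresqlSourceStub,
}.freeze

# Builds a source for db_type. Hash#fetch runs the block for unknown keys,
# so an unsupported type raises ArgumentError instead of calling .new on nil.
def build_source(db_type, options = {})
  source_class = SOURCE_CLASSES.fetch(db_type) do
    raise ArgumentError, "unsupported source type: #{db_type.inspect}"
  end
  source_class.new(options.merge('db_type' => db_type.to_s))
end

source = build_source(:postgresql, 'host' => 'localhost')
puts source.class
puts source.options['db_type']
```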
data/lib/dataduck/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dataduck
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.3.0
 platform: ruby
 authors:
 - Jeff Pickhardt
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-10-
+date: 2015-10-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -80,6 +80,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.16'
+- !ruby/object:Gem::Dependency
+  name: mysql
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.9'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.9'
 - !ruby/object:Gem::Dependency
   name: aws-sdk
   requirement: !ruby/object:Gem::Requirement
@@ -127,6 +141,11 @@ files:
 - bin/dataduck
 - bin/setup
 - dataduck.gemspec
+- docs/README.md
+- docs/contents.yml
+- docs/overview/README.md
+- docs/overview/getting_started.md
+- docs/tables/README.md
 - examples/example/.gitignore
 - examples/example/.ruby-version
 - examples/example/Gemfile