dataduck 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/README.md +4 -2
- data/dataduck.gemspec +1 -0
- data/docs/README.md +7 -0
- data/docs/contents.yml +6 -0
- data/docs/overview/README.md +24 -0
- data/docs/overview/getting_started.md +28 -0
- data/docs/tables/README.md +48 -0
- data/lib/dataduck/commands.rb +36 -3
- data/lib/dataduck/postgresql_source.rb +1 -1
- data/lib/dataduck/version.rb +1 -1
- metadata +21 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: dcc9a5407d2bae97ab0ecb754f95d3c872b92cfe
+  data.tar.gz: d32783d694d625367fb5f732602a6c2a997e8241
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 184f298a735a3928a78d5b8e85e22e498d4e26b554d89a2b4201afa0e91d8b812872e0c72e4a6046970b6b8489ff451ad4b1fd93d2271314d133484762189470
+  data.tar.gz: d9e4001147d98b51c3a481f894d52129349ba656d1f6b362845f37618e41b2bdf29005389b106df842ffb8ec02d0b84b45c683a561a49bc4674a88ce392fefcc
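The SHA1 and SHA512 entries are digests of the two members of the .gem archive, so they can be recomputed locally and compared with the values above. A minimal sketch, assuming the members have already been extracted; the helper name `gem_member_checksums` and the file path in the usage comment are hypothetical, and only Ruby's standard `digest` library is used:

```ruby
require 'digest'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a byte
# string, keyed like the two sections of checksums.yaml.
def gem_member_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes),
  }
end

# Usage (illustrative path): read the extracted member in binary mode and
# compare the result with the values recorded in checksums.yaml.
#   sums = gem_member_checksums(File.binread('metadata.gz'))
```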
data/README.md
CHANGED
@@ -18,10 +18,12 @@ See [https://github.com/DataDuckETL/DataDuck/tree/master/examples/example](https
 
 ##### Instructions for using DataDuck ETL
 
-Create a new
+Create a new, empty directory. Inside this directory, create a file named Gemfile, and add the following to it:
 
 ```ruby
-
+source 'https://rubygems.org'
+
+gem 'dataduck'
 ```
 
 Then execute:
data/dataduck.gemspec
CHANGED
@@ -23,6 +23,7 @@ Gem::Specification.new do |spec|
 
   spec.add_runtime_dependency "sequel", '~> 4.19'
   spec.add_runtime_dependency "pg", '~> 0.16'
+  spec.add_runtime_dependency "mysql", "~> 2.9"
   spec.add_runtime_dependency "aws-sdk", "~> 2.0"
   spec.add_runtime_dependency "sequel-redshift"
 end
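The new dependency uses the pessimistic operator, so `"~> 2.9"` accepts 2.9 and later 2.x releases but not 3.0. A short sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes to check this; the candidate version strings are made up for illustration and are not a list of real mysql gem releases:

```ruby
require 'rubygems'

# "~> 2.9" is the constraint added for the mysql gem in this release.
requirement = Gem::Requirement.new('~> 2.9')

# Candidate version strings chosen for illustration only.
%w[2.8.1 2.9 2.9.1 2.10.0 3.0.0].each do |candidate|
  allowed = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{allowed ? 'allowed' : 'rejected'}"
end
```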
data/docs/README.md
ADDED
@@ -0,0 +1,7 @@
+# Documentation
+
+The documentation directory is viewable at (http://dataducketl.com/docs)[http://dataducketl.com/docs].
+
+# Autogenerated
+
+The documentation directory is autogenerated from the main DataDuck ETL git repo. If you would like to add or correct something in the documentation, let us know or make a pull request to (https://github.com/DataDuckETL/DataDuck/docs)[https://github.com/DataDuckETL/DataDuck/docs].
data/docs/contents.yml
ADDED
data/docs/overview/README.md
ADDED
@@ -0,0 +1,24 @@
+# Overview
+
+DataDuck ETL is a straightforward, effective extract-transform-load framework for data warehousing. If you want to set
+up a data warehouse, DataDuck ETL makes it simple and straightforward to do.
+
+## Getting Started
+
+Getting started with DataDuck ETL takes just a few minutes. For instructions, read the
+[getting started](/docs/overview/getting_started) page.
+
+## Why Use a Data Warehouse
+
+If you already have your data in your main database, and probably use a web analytics product like Google Analytics, you
+may be wondering why you'd want a data warehouse anyway.
+
+There's many advantages to using a data warehouse, including:
+
+- integrating multiple data sources so you can analyze them together
+- helping to ensure data quality by cleaning up the data and running data quality checks
+- having a single source of truth that the entire company trusts
+- connecting business intelligence products for reports and dashboards
+- using the data warehouse to build models, which may get incorporated back in the product, or used for predictions and company decision making
+- performance optimizations so your queries run fast
+- ensuring sensitive data doesn't end up in reports, by not passing it to the data warehouse (encrypted passwords, salts, etc have no practical analytics value, so they are not ETLed)
data/docs/overview/getting_started.md
ADDED
@@ -0,0 +1,28 @@
+# Getting Started
+
+## Requirements
+
+DataDuck ETL currently supports extracting from MySQL and PostgreSQL databases. It supports loading into Amazon
+Redshift. If you would like to extract or load into a database not yet supported, contact us.
+
+## Instructions
+
+First, create a new, empty directory. Inside this directory, create a file named Gemfile with the following:
+
+```ruby
+source 'https://rubygems.org'
+
+gem 'dataduck'
+```
+
+Then execute:
+
+$ bundle install
+
+Finally, run the quickstart command:
+
+$ dataduck quickstart
+
+It will ask you for the credentials to your database, and then create the basic setup for your project. After the setup, your project's ETL can be run by running `ruby src/main.rb`
+
+If you would like to run this regularly, such as every night, it's recommended to use the [whenever](https://github.com/javan/whenever) gem to manage a cron job to regularly run the ETL.
data/docs/tables/README.md
ADDED
@@ -0,0 +1,48 @@
+# The Table Class
+
+If you've run the `dataduck quickstart` command, you'll notice a bunch of table files were generated under /src/tables.
+Each of these table files inherits from `DataDuck::Table`, the base table class. Tables need to have the `source` and `output` defined.
+
+You may also define transformations with the `transforms` method and validations with `validates` method.
+
+## Example Table
+
+The following is an example table.
+
+```ruby
+class Decks < DataDuck::Table
+  source :my_database, ["id", "name", "user_id", "cards",
+      "num_wins", "num_losses", "created_at", "updated_at",
+      "is_drafted", "num_draft_wins", "num_draft_losses"]
+
+  transforms :calculate_num_totals
+
+  validates :validates_num_total
+
+  output({
+      :id => :integer,
+      :name => :string,
+      :user_id => :integer,
+      :num_wins => :integer,
+      :num_losses => :integer,
+      :num_total => :integer,
+      :num_draft_total => :integer,
+      :created_at => :datetime,
+      :updated_at => :datetime,
+      :is_drafted => :boolean,
+      # Note that num_draft_wins and num_draft_losses
+      # are not included in the output, but are used in
+      # the transformation.
+  })
+
+  def calculate_num_totals(row)
+    row[:num_total] = row[:num_wins] + row[:num_losses]
+    row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
+    row
+  end
+
+  def validates_num_total(row)
+    return "Deck id #{ row[:id] } has negative value #{ row[:num_total] } for num_total." if row[:num_total] < 0
+  end
+end
+```
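The example table describes a row flow: source columns go in, a transformation adds derived columns, a validation may return an error message, and only the columns named in `output` are kept. The following plain-Ruby sketch illustrates that flow without using DataDuck. The `process_row` helper, the reduced column list, and the order of the steps are assumptions made for illustration, not the library's internals:

```ruby
# Stand-ins for the two methods of the example table, written as lambdas.
calculate_num_totals = lambda do |row|
  row[:num_total] = row[:num_wins] + row[:num_losses]
  row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
  row
end

validates_num_total = lambda do |row|
  "Deck id #{row[:id]} has negative value #{row[:num_total]} for num_total." if row[:num_total] < 0
end

# Shortened output column list (assumption: a subset of the example's output).
output_columns = [:id, :num_wins, :num_losses, :num_total, :num_draft_total]

# Hypothetical helper: apply transforms, collect validation messages,
# then keep only the output columns.
def process_row(row, transforms, validations, output_columns)
  transformed = transforms.reduce(row.dup) { |current, transform| transform.call(current) }
  errors = validations.map { |validation| validation.call(transformed) }.compact
  [transformed.select { |key, _| output_columns.include?(key) }, errors]
end

row = { id: 1, num_wins: 3, num_losses: 2, num_draft_wins: 1, num_draft_losses: 4 }
result, errors = process_row(row, [calculate_num_totals], [validates_num_total], output_columns)
```

Here `num_draft_wins` and `num_draft_losses` feed the transformation but are dropped from `result`, mirroring the comment in the example table.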
data/lib/dataduck/commands.rb
CHANGED
@@ -16,6 +16,21 @@ module DataDuck
       end
     end
 
+    def self.prompt_choices(choices = [])
+      while true
+        print "Enter a number 0 - #{ choices.length - 1}\n"
+        choices.each_with_index do |choice, idx|
+          choice_name = choice.is_a?(String) ? choice : choice[1]
+          print "#{ idx }: #{ choice_name }\n"
+        end
+        choice = STDIN.gets.strip.to_i
+        if 0 <= choice && choice < choices.length
+          selected = choices[choice]
+          return selected.is_a?(String) ? selected : selected[0]
+        end
+      end
+    end
+
     def self.acceptable_commands
       ['console', 'quickstart']
     end
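One detail of the new `prompt_choices` method: `String#to_i` returns 0 for non-numeric input, so typing anything that is not a number selects the first choice. A possible variation is sketched below. It is not part of the gem; the method name `prompt_choices_strict` and the `input:`/`output:` keyword parameters are assumptions added so the behaviour can be exercised without a terminal, and `Integer(..., exception: false)` requires Ruby 2.6 or later:

```ruby
require 'stringio'

# Variation on prompt_choices (hypothetical, not in the gem): streams are
# parameters, and non-numeric input is rejected instead of being read as 0.
def prompt_choices_strict(choices, input: $stdin, output: $stdout)
  loop do
    output.puts "Enter a number 0 - #{choices.length - 1}"
    choices.each_with_index do |choice, idx|
      name = choice.is_a?(String) ? choice : choice[1]
      output.puts "#{idx}: #{name}"
    end

    line = input.gets
    return nil if line.nil? # end of input: give up instead of raising

    # Base 10 is passed explicitly so "010" is not parsed as octal.
    number = Integer(line.strip, 10, exception: false)
    next if number.nil? || number.negative? || number >= choices.length

    selected = choices[number]
    return selected.is_a?(String) ? selected : selected[0]
  end
end

# "abc" and "7" are rejected; "1" selects the second entry.
answer = prompt_choices_strict(
  [[:mysql, "MySQL"], [:postgresql, "PostgreSQL"], [:other, "other"]],
  input: StringIO.new("abc\n7\n1\n"),
  output: StringIO.new
)
```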
@@ -47,7 +62,20 @@ module DataDuck
       puts "Welcome to DataDuck!"
       puts "This quickstart wizard will create your application, assuming the source is a Postgres database and the destination is an Amazon Redshift data warehouse."
 
-
+
+      puts "What kind of database would you like to source from?"
+      db_type = prompt_choices([
+        [:mysql, "MySQL"],
+        [:postgresql, "PostgreSQL"],
+        [:other, "other"],
+      ])
+
+      if db_type == :other
+        puts "You've selected 'other'. Unfortunately, those are the only choices supported at the moment. Contact us at DataDuckETL.com to request support for your database."
+        exit
+      end
+
+      puts "Enter the source hostname:"
       source_host = STDIN.gets.strip
 
       puts "Enter the name of the database when connecting to #{ source_host }:"
@@ -62,8 +90,13 @@ module DataDuck
       puts "Enter the password:"
       source_password = STDIN.noecho(&:gets).chomp
 
-
-
+      db_class = {
+        mysql: DataDuck::MysqlSource,
+        postgresql: DataDuck::PostgresqlSource,
+      }[db_type]
+
+      db_source = db_class.new({
+        'db_type' => db_type.to_s,
         'host' => source_host,
         'database' => source_database,
         'port' => source_port,
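The hunk above picks the source class by looking `db_type` up in a hash. A plain `[]` lookup returns nil for a missing key; the wizard exits earlier when `:other` is chosen, so that case should not be reached there. The sketch below shows the same dispatch with `Hash#fetch`, which fails loudly on an unknown key. The two `Struct` classes are stand-ins so the sketch runs without the dataduck gem, and the `build_source` lambda and the settings values are hypothetical:

```ruby
# Stand-ins for DataDuck::MysqlSource and DataDuck::PostgresqlSource
# (assumption: the real classes accept a settings hash, as in the hunk above).
fake_mysql_source = Struct.new(:config)
fake_postgresql_source = Struct.new(:config)

source_classes = {
  mysql: fake_mysql_source,
  postgresql: fake_postgresql_source,
}.freeze

# Hypothetical helper: look up the class for db_type and build a source,
# raising ArgumentError instead of calling .new on nil.
build_source = lambda do |db_type, settings|
  klass = source_classes.fetch(db_type) do
    raise ArgumentError, "unsupported db_type: #{db_type.inspect}"
  end
  klass.new({ 'db_type' => db_type.to_s }.merge(settings))
end

source = build_source.call(:mysql, { 'host' => 'localhost', 'port' => 3306 })
```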
data/lib/dataduck/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dataduck
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.3.0
 platform: ruby
 authors:
 - Jeff Pickhardt
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-10-
+date: 2015-10-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -80,6 +80,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.16'
+- !ruby/object:Gem::Dependency
+  name: mysql
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.9'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.9'
 - !ruby/object:Gem::Dependency
   name: aws-sdk
   requirement: !ruby/object:Gem::Requirement
@@ -127,6 +141,11 @@ files:
 - bin/dataduck
 - bin/setup
 - dataduck.gemspec
+- docs/README.md
+- docs/contents.yml
+- docs/overview/README.md
+- docs/overview/getting_started.md
+- docs/tables/README.md
 - examples/example/.gitignore
 - examples/example/.ruby-version
 - examples/example/Gemfile