dataduck 0.6.8 → 0.7.0
- checksums.yaml +4 -4
- data/dataduck.gemspec +1 -0
- data/docs/contents.yml +5 -0
- data/docs/integrations/optimizely.md +21 -0
- data/docs/integrations/semrush.md +36 -0
- data/docs/tables/incremental_vs_full_loading.md +70 -0
- data/lib/dataduck/commands.rb +5 -1
- data/lib/dataduck/database.rb +1 -1
- data/lib/dataduck/destination.rb +4 -0
- data/lib/dataduck/logs.rb +5 -0
- data/lib/dataduck/redshift_destination.rb +6 -0
- data/lib/dataduck/table.rb +6 -0
- data/lib/dataduck/version.rb +2 -2
- metadata +19 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef6f3cd5a8054cf855b227324845f2ec365516dd
+  data.tar.gz: 45030532745d7a68988bee5e95a791ed55bae6b0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1d167785a5f64fd8ea77546dcb3f7d19107c3c651aac5933afe09e7eee4eb1674c25a9223b1c724a8229334d669c41604288c29fa4da6e35bfa596ad8bde6a90
+  data.tar.gz: 708a7eb4404cee131bcd946314b57febaa6908c716c4ce2928167c88a0a61f084820f34258a3b5b74cbc97a215759e8cec13e9f89ff25a17aba2453a636d4517
data/dataduck.gemspec
CHANGED
data/docs/contents.yml
CHANGED
data/docs/integrations/optimizely.md
ADDED
@@ -0,0 +1,21 @@
+# Optimizely Integration
+
+Optimizely is a website optimization platform which includes a/b testing and personalization products.
+
+The Optimizely integration uses Optimizely's API to fetch data for your projects, experiments, and variations, then puts them into
+three tables in your data warehouse.
+
+To use the Optimizely integration, first get an API token from [https://app.optimizely.com/tokens](https://app.optimizely.com/tokens). Then add the following to your project's .env file:
+
+```
+optimizely_api_token=YOUR_TOKEN
+```
+
+Finally, add the following file to your project's /src/tables directory, naming it optimizely_integration.rb:
+
+```ruby
+class OptimizelyIntegration < DataDuck::Optimizely::OptimizelyIntegration
+end
+```
+
+Now, running `dataduck etl optimizely_integration` will ETL three tables for you. These tables are `optimizely_projects`, `optimizely_experiments`, and `optimizely_variations`. The results data can be found on the variations. Additionally, a `dataduck_extracted_at` datetime column indicates how fresh the data is.
data/docs/integrations/semrush.md
ADDED
@@ -0,0 +1,36 @@
+# SEMrush Integration
+
+SEMrush is a powerful and versatile competitive intelligence suite for online marketing, from SEO and PPC to social media and video advertising research.
+
+The SEMrush integration is currently focused on SEO. It will create a table called `semrush_organic_results` that shows the Google search ranking for specific phrases. By running this regularly, you can see how the search rankings of your website or your competitors' websites change over time.
+
+To use the SEMrush integration, first add the following to your .env file:
+
+```
+semrush_api_key=YOUR_API_KEY
+```
+
+Then create a table called `organic_results` with the following:
+
+```ruby
+class OrganicResults < DataDuck::SEMRush::OrganicResults
+  def display_limit
+    20 # Default is 20
+  end
+
+  def search_database
+    'us' # Default is 'us'
+  end
+
+  def phrases
+    ['My Phrase 1',
+     "Another Phrase",
+     "Some Other Keywords",
+    ]
+  end
+end
+```
+
+This table will have five columns: date, phrase, rank, domain, and url.
+
+The methods `display_limit` and `search_database` are optional, but can be modified to fit your particular use case.
data/docs/tables/incremental_vs_full_loading.md
ADDED
@@ -0,0 +1,70 @@
+# Incremental vs Full Loading
+
+Loading a table can be performed either incrementally or with a full reload each time. An incremental load is generally
+better, since it takes less time and transfers less data; however, not all tables can be loaded incrementally.
+
+## Incremental loading
+
+If you are running an ETL process regularly, rather than loading an entire table each time, it is more efficient
+to load just the rows that have changed. This is known as an incremental load. By default, if a table contains
+a column called `updated_at`, DataDuck ETL will use incremental loading based off of that column. If no such column
+exists, it will load the entire table each time.
+
+If rows can be deleted from a table, you should not use incremental loading, since DataDuck ETL won't know which rows
+have been deleted. Soft deleting a row, by setting a column to 'deleted' (for example), is fine to use with incremental loading.
+
+Under the hood, before extracting, DataDuck ETL will check the destination for the latest value of the column, then use that value as a lower bound
+when running the extract query.
+
+If you would like to base an incremental load on a different column, such as `id` or `created_at` (common in cases where
+the rows are not expected to change, like an event stream), then you can do so by giving your table a method `extract_by_column`.
+
+```ruby
+class MyTable < DataDuck::Table
+  source :source1, ["id", "created_at", "name"]
+
+  def extract_by_column
+    'created_at'
+  end
+
+  output({
+    :id => :integer,
+    :created_at => :datetime,
+    :name => :string,
+  })
+end
+```
+
+## Full reloads
+
+Fully reloading a table takes longer, so it is only recommended you do this with tables where it is not possible to use
+incremental loads.
+
+If you would like to fully reload the table each time, you may give your table an `extract_by_column` that returns `nil`.
+Alternatively, if you want to have an `extract_by_column` but still reload the entire table each time, you may
+give it a method `should_fully_reload?` that returns true. An example of when you might want to do this is if you are
+reloading an entire table, but doing it in batches.
+
+```ruby
+class MyTableFullyReloaded < DataDuck::Table
+  source :source1, ["id", "created_at", "name"]
+
+  def batch_size
+    1_000_000 # if there is a lot of data, and you want to use less memory (for example), batching is a good idea
+  end
+
+  def extract_by_column
+    'created_at'
+  end
+
+  def should_fully_reload?
+    true
+  end
+
+  output({
+    :id => :integer,
+    :created_at => :datetime,
+    :name => :string,
+  })
+end
+```
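The incremental behavior described above (read the latest value of the extract-by column already in the warehouse, then extract only rows at or past it) can be sketched as follows. This is an illustrative sketch, not DataDuck's actual implementation; the function and parameter names are hypothetical.

```ruby
# Hypothetical sketch of incremental extraction: given the latest value of the
# extract-by column already present in the destination, only newer rows are
# selected; with no extract-by column, the whole table is selected.
def extract_sql(table_name, columns, extract_by_column: nil, latest_value: nil)
  sql = "SELECT #{columns.join(', ')} FROM #{table_name}"
  if extract_by_column && latest_value
    # Use >= rather than > so rows sharing the boundary value are not missed.
    sql += " WHERE #{extract_by_column} >= '#{latest_value}'"
  end
  sql
end
```

For example, `extract_sql("users", ["id", "name"], extract_by_column: "updated_at", latest_value: "2015-12-01")` bounds the query, while omitting the keyword arguments yields a full-table select.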
data/lib/dataduck/commands.rb
CHANGED
@@ -47,7 +47,11 @@ module DataDuck
       return DataDuck::Commands.help
     end
 
-    DataDuck::Commands.public_send(command, *args[1..-1])
+    begin
+      DataDuck::Commands.public_send(command, *args[1..-1])
+    rescue Exception => err
+      DataDuck::Logs.error(err)
+    end
   end
 
   def self.c
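The change above wraps command dispatch in a begin/rescue so that a failing ETL command is logged instead of surfacing as a raw crash. A minimal sketch of the pattern, with illustrative module and handler names rather than DataDuck's real ones:

```ruby
# Illustrative command dispatcher: the first CLI argument names a module
# method, remaining arguments are passed through with public_send, and any
# error is handed to a logging callback instead of propagating to the user.
module Commands
  def self.help
    "usage: dataduck COMMAND"
  end

  def self.boom
    raise "etl failed" # stand-in for a command whose ETL run errors out
  end
end

def run_command(args, error_handler)
  command = args[0]
  return Commands.help unless command && Commands.respond_to?(command)
  begin
    Commands.public_send(command, *args[1..-1])
  rescue Exception => err
    error_handler.call(err)
  end
end
```

Rescuing `Exception` (rather than `StandardError`) is deliberate in the diff: even low-level failures get routed through the logger before the process exits.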
data/lib/dataduck/database.rb
CHANGED
data/lib/dataduck/destination.rb
CHANGED
data/lib/dataduck/logs.rb
CHANGED
@@ -1,4 +1,5 @@
 require 'logger'
+require 'raven'
 
 module DataDuck
   module Logs
@@ -43,6 +44,10 @@ module DataDuck
 
       puts "[ERROR] #{ message }"
       @@logger.error(message)
+
+      if ENV['SENTRY_DSN']
+        Raven.capture_exception(err)
+      end
     end
 
     private
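The pattern above always logs locally but forwards the exception to Sentry only when a DSN is configured. It can be sketched like this, with a `reporter` callable standing in for `Raven` and all names illustrative:

```ruby
# Illustrative error reporter: errors always go to the local logger; they are
# forwarded to an external service only when SENTRY_DSN is set, so environments
# without a DSN never attempt network reporting.
def report_error(err, env, logger, reporter)
  logger.call("[ERROR] #{err.message}")
  reporter.call(err) if env['SENTRY_DSN']
end
```

In production the real code passes `ENV` for `env` and uses `Raven.capture_exception` as the reporter.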
data/lib/dataduck/redshift_destination.rb
CHANGED
@@ -277,6 +277,12 @@
       self.query("DROP TABLE zz_dataduck_recreating_old_#{ table.name }")
     end
 
+    def postprocess!(table)
+      DataDuck::Logs.info "Vacuuming table #{ table.name }"
+      vacuum_type = table.indexes.length == 0 ? "FULL" : "REINDEX"
+      self.query("VACUUM #{ vacuum_type } #{ table.name }")
+    end
+
     def self.value_to_string(value)
       string_value = ''
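The new postprocess step above picks between Redshift's vacuum modes. A sketch of that choice, assuming `indexes` is the table's array of indexed columns (names illustrative):

```ruby
# Illustrative version of the vacuum-type choice in the diff: tables without
# indexes get a plain full vacuum, while tables with indexes get
# VACUUM REINDEX, which in Redshift reanalyzes interleaved sort keys before
# performing the full vacuum.
def vacuum_sql(table_name, indexes)
  vacuum_type = indexes.length == 0 ? "FULL" : "REINDEX"
  "VACUUM #{vacuum_type} #{table_name}"
end
```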
data/lib/dataduck/table.rb
CHANGED
@@ -111,6 +111,8 @@ module DataDuck
       if self.should_fully_reload?
         destination.finish_fully_reloading_table!(self)
       end
+
+      self.postprocess!(destination, options)
     end
 
     def extract!(destination = nil, options = {})
@@ -220,6 +222,10 @@ module DataDuck
       self.output_schema.keys.sort.map(&:to_s)
     end
 
+    def postprocess!(destination, options = {})
+      destination.postprocess!(self)
+    end
+
    def recreate!(destination)
      destination.recreate_table!(self)
    end
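Together with the Redshift change, the new `postprocess!` hook is a small double dispatch: after a table finishes loading, it hands itself to the destination, and each destination class decides what postprocessing (such as vacuuming) means. A simplified sketch with stand-in classes, not DataDuck's actual ones:

```ruby
# Stand-in destination that records which table it was asked to postprocess;
# a real destination (e.g. Redshift) would run VACUUM here instead.
class RecordingDestination
  attr_reader :postprocessed_table

  def postprocess!(table)
    @postprocessed_table = table
  end
end

# Simplified table: after the load steps, it delegates postprocessing to the
# destination, mirroring the hook added in the diff.
class SimpleTable
  def load!(destination)
    # ... extract and load steps would run here ...
    postprocess!(destination)
  end

  def postprocess!(destination, options = {})
    destination.postprocess!(self)
  end
end
```

This keeps warehouse-specific maintenance out of `Table`: a destination with nothing to clean up can simply implement `postprocess!` as a no-op.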
data/lib/dataduck/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dataduck
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.7.0
 platform: ruby
 authors:
 - Jeff Pickhardt
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-12-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -178,6 +178,20 @@ dependencies:
   - - "~>"
     - !ruby/object:Gem::Version
       version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: sentry-raven
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
 description: A straightforward, effective ETL framework.
 email:
 - pickhardt@gmail.com
@@ -206,9 +220,12 @@ files:
 - docs/commands/recreate.md
 - docs/commands/show.md
 - docs/contents.yml
+- docs/integrations/optimizely.md
+- docs/integrations/semrush.md
 - docs/overview/README.md
 - docs/overview/getting_started.md
 - docs/tables/README.md
+- docs/tables/incremental_vs_full_loading.md
 - examples/example/.gitignore
 - examples/example/.ruby-version
 - examples/example/Gemfile
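The new sentry-raven runtime dependency in the gem metadata above corresponds to a standard gemspec declaration. A sketch using the public Rubygems API, with other fields abbreviated (this is not the actual contents of dataduck.gemspec):

```ruby
require 'rubygems'

# Minimal specification showing how a "~> 0.15" runtime dependency on
# sentry-raven is declared; the rest of the real gemspec is omitted.
spec = Gem::Specification.new do |s|
  s.name    = "dataduck"
  s.version = "0.7.0"
  s.summary = "A straightforward, effective ETL framework."
  s.authors = ["Jeff Pickhardt"]
  s.add_runtime_dependency "sentry-raven", "~> 0.15"
end
```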