samidare 0.1.2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: f97f6368bca22b657df37c1279d7b5d07c514c76
4
- data.tar.gz: 62f4b01e91cfda7c95b787e71a047e5d8fb2155f
3
+ metadata.gz: 9bc2846461bb56c026fab7855d4fa3bbb5d9a6fa
4
+ data.tar.gz: fec1ae897422431b75114117f89e4c442d3fa1ac
5
5
  SHA512:
6
- metadata.gz: 64dde2e0f6fd33eae7b2d044acabe546ee097745617410a751b6dfa48e3fd8bd8389d80e820037ec876b299e74637ef6ea1db22dbe56e793b3a55c533e7ef166
7
- data.tar.gz: 1c687b2b61a237444ce5d7e9f1dbf44d73985482a0ef8d31178806a1321a11224b97277d756c26ffffb891f87689e57335e0772ae913340805a5418c77b14ada
6
+ metadata.gz: 77c2030170b64c5006f0d6a773bf58c365dd1b4f1261028ffe8171058f5101bf72c71934278def78147952aae7d4aff6eb7f88131eaca56c0de12aaafb1e9438
7
+ data.tar.gz: 0cd7332f0bf04d0f2d3373411b46c7c3495f9c31dc6464fb934abb154e31fa552b76612feadc62cdc8c431618a389d314400db586cff1fb94260698d88f57e6e
data/.gitignore CHANGED
@@ -1,15 +1,15 @@
1
- /.bundle/
2
- /.yardoc
3
- /Gemfile.lock
4
- /_yardoc/
5
- /coverage/
6
- /doc/
7
- /pkg/
8
- /spec/reports/
9
- /tmp/
10
- *.bundle
11
- *.so
12
- *.o
13
- *.a
14
- .rspec
15
- mkmf.log
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ .rspec
15
+ mkmf.log
data/README.md CHANGED
@@ -1,117 +1,179 @@
1
- # Samidare
2
-
3
- Generate Embulk config and BigQuery schema from MySQL schema and run Embulk.
4
-
5
- ## Installation
6
-
7
- Add this line to your application's Gemfile:
8
-
9
- ```ruby
10
- gem 'samidare'
11
- ```
12
-
13
- And then execute:
14
-
15
- $ bundle
16
-
17
- Or install it yourself as:
18
-
19
- $ gem install samidare
20
-
21
- ## Usage
22
- Samidare requires a `database.yml` and a `table.yml`.
23
- Sample config files are shown below.
24
-
25
- #### database.yml
26
- ```yml
27
- db01:
28
- host: localhost
29
- username: root
30
- password: pswd
31
- database: production
32
- bq_dataset: mysql_db01
33
-
34
- db02:
35
- host: localhost
36
- username: root
37
- password: pswd
38
- database: production
39
- bq_dataset: mysql_db02
40
-
41
- ```
42
-
43
- **Caution: Embulk does not allow a MySQL connection without a password.**
44
-
45
- #### table.yml
46
- ```yml
47
- db01:
48
- tables:
49
- - name: users
50
- - name: events
51
- - name: hobbies
52
-
53
- db02:
54
- tables:
55
- - name: administrators
56
- - name: configs
57
- ```
58
-
59
- Samidare requires BigQuery parameters like below.
60
- ```ruby
61
- [sample.rb]
62
- config = {
63
- 'project_id' => 'BIGQUERY_PROJECT_ID',
64
- 'service_email' => 'SERVICE_ACCOUNT_EMAIL',
65
- 'key' => '/etc/embulk/bigquery.p12',
66
- 'schema_dir' => '/var/tmp/embulk/schema',
67
- 'config_dir' => '/var/tmp/embulk/config',
68
- 'auth_method' => 'private_key'
69
- }
70
-
71
- client = Samidare::EmbulkClient.new
72
- client.generate_config(config)
73
- client.run(config)
74
- ```
75
-
76
- ```bash
77
- bundle exec ruby sample.rb
78
- ```
79
-
80
- ## Features
81
- #### daily snapshot
82
- BigQuery supports table wildcard expressions over a set of daily tables, for example `sales20150701`.
83
- If you need a daily snapshot of a table in BigQuery, set the `daily_snapshot` option in `database.yml` or `table.yml` as shown below.
84
- In `database.yml`, the `daily_snapshot` option affects all tables of that database.
85
- In `table.yml`, it affects only the table it is set on.
86
- **The date suffix is determined by the execution date.**
87
-
88
- ```yml
89
- [database.yml]
90
- production:
91
- host: localhost
92
- username: root
93
- password: pswd
94
- database: production
95
- bq_dataset: mysql
96
- daily_snapshot: true
97
- ```
98
-
99
- ```yml
100
- [table.yml]
101
- production:
102
- tables:
103
- - name: users
104
- - name: events
105
- daily_snapshot: true
106
- - name: hobbies
107
- ```
108
-
109
- Only `events` is renamed to `eventsYYYYMMDD` in BigQuery.
110
-
111
- ## Contributing
112
-
113
- 1. Fork it ( https://github.com/[my-github-username]/samidare/fork )
114
- 2. Create your feature branch (`git checkout -b my-new-feature`)
115
- 3. Commit your changes (`git commit -am 'Add some feature'`)
116
- 4. Push to the branch (`git push origin my-new-feature`)
117
- 5. Create a new Pull Request
1
+ # Samidare
2
+
3
+ Generate Embulk config and BigQuery schema from MySQL schema and run Embulk.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'samidare'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install samidare
20
+
21
+ ## Embulk setup
22
+ `Samidare` is a utility for `Embulk`.
23
+ You need to install `Embulk` and the plugin gems listed below.
24
+
25
+ ```bash
26
+ embulk gem install embulk-input-mysql --version 0.8.2
27
+ embulk gem install embulk-output-bigquery --version 0.4.3
28
+ embulk gem install embulk-parser-jsonl --version 0.2.0
29
+ embulk gem install embulk-formatter-jsonl --version 0.1.4
30
+ ```
31
+
32
+ ## Usage
33
+ Samidare requires a `database.yml` and a `table.yml`.
34
+ Sample config files are shown below.
35
+
36
+ ### database.yml
37
+ ```yml
38
+ db01:
39
+ host: localhost
40
+ username: root
41
+ password: pswd
42
+ database: production
43
+ bq_dataset: mysql_db01
44
+
45
+ db02:
46
+ host: localhost
47
+ username: root
48
+ password: pswd
49
+ database: production
50
+ bq_dataset: mysql_db02
51
+
52
+ ```
53
+
54
+ **Caution: Embulk does not allow a MySQL connection without a password.**
55
+
56
+ ### table.yml
57
+ ```yml
58
+ db01:
59
+ tables:
60
+ - name: users
61
+ - name: events
62
+ - name: hobbies
63
+
64
+ db02:
65
+ tables:
66
+ - name: administrators
67
+ - name: configs
68
+ ```
69
+
70
+ Samidare requires BigQuery parameters like below.
71
+
72
+ ```ruby
73
+ [sample.rb]
74
+ require 'samidare'
75
+
76
+ config = {
77
+ 'project_id' => 'BIGQUERY_PROJECT_ID',
78
+ 'service_email' => 'SERVICE_ACCOUNT_EMAIL',
79
+ 'key' => '/etc/embulk/bigquery.p12',
80
+ 'schema_dir' => '/var/tmp/embulk/schema',
81
+ 'config_dir' => '/var/tmp/embulk/config',
82
+ 'auth_method' => 'private_key'
83
+ }
84
+
85
+ client = Samidare::EmbulkClient.new
86
+ client.generate_config(config)
87
+ client.run(config)
88
+ ```
89
+
90
+ ```bash
91
+ ruby sample.rb
92
+ ```
93
+
94
+ ## Features
95
+ ### process status
96
+ `Samidare` returns the process status as a boolean.
97
+ If all tables succeed, it returns `true`; otherwise it returns `false`.
98
+ This is useful for controlling the flow of the calling system.
99
+
100
+ ```ruby
101
+ process_status = Samidare::EmbulkClient.new.run(config)
102
+ exit 1 unless process_status
103
+ ```
104
+
105
+ ### narrow tables
106
+ You can narrow the target tables from `table.yml`, for example to test or to retry.
107
+ If no target tables are given, `Samidare` runs all tables.
108
+
109
+ ```ruby
110
+ # assuming table.yml lists ['users', 'purchases', 'items']
111
+ target_tables = ['users', 'purchases']
112
+ Samidare::EmbulkClient.new.run(config, target_tables)
113
+ ```
114
+
115
+ ### retry
116
+ You can set a retry count.
117
+ If any table fails, only the failed tables are retried, up to the retry count.
118
+ If no retry count is given, `Samidare` doesn't retry.
119
+
120
+ ```ruby
121
+ # failed tables are retried up to 2 times
122
+ Samidare::EmbulkClient.new.run(config, [], 2)
123
+ ```
124
+
125
+ ### SQL condition
126
+ If you set `condition` on a table in `table.yml`, the generated SQL gets a matching `WHERE` clause, as shown below.
127
+ This is useful for large tables.
128
+
129
+ ```yml
130
+ [table.yml]
131
+ production:
132
+ tables:
133
+ - name: users
134
+ - name: events
135
+ condition: created_at < CURRENT_DATE()
136
+ ```
137
+
138
+ ```sql
139
+ SELECT * FROM users
140
+ SELECT * FROM events WHERE created_at < CURRENT_DATE()
141
+ ```
142
+
143
+ ### daily snapshot
144
+ BigQuery supports table wildcard expressions over a set of daily tables, for example `sales20150701`.
145
+ If you need a daily snapshot of a table in BigQuery, set the `daily_snapshot` option in `database.yml` or `table.yml` as shown below.
146
+ In `database.yml`, the `daily_snapshot` option affects all tables of that database.
147
+ In `table.yml`, it affects only the table it is set on.
148
+ **The date suffix is determined by the execution date.**
149
+
150
+ ```yml
151
+ [database.yml]
152
+ production:
153
+ host: localhost
154
+ username: root
155
+ password: pswd
156
+ database: production
157
+ bq_dataset: mysql
158
+ daily_snapshot: true
159
+ ```
160
+
161
+ ```yml
162
+ [table.yml]
163
+ production:
164
+ tables:
165
+ - name: users
166
+ - name: events
167
+ daily_snapshot: true
168
+ - name: hobbies
169
+ ```
170
+
171
+ Only `events` is renamed to `eventsYYYYMMDD` in BigQuery.
172
+
173
+ ## Contributing
174
+
175
+ 1. Fork it ( https://github.com/[my-github-username]/samidare/fork )
176
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
177
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
178
+ 4. Push to the branch (`git push origin my-new-feature`)
179
+ 5. Create a new Pull Request
data/Rakefile CHANGED
@@ -1,7 +1,7 @@
1
- require "bundler/gem_tasks"
2
- require "rspec/core/rake_task"
3
-
4
- RSpec::Core::RakeTask.new(:spec)
5
-
6
- task :default => :spec
7
-
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
7
+
@@ -1,42 +1,42 @@
1
- require 'samidare/version'
2
- require 'samidare/embulk_utility'
3
- require 'samidare/embulk'
4
- require 'samidare/mysql'
5
-
6
- module Samidare
7
- class EmbulkClient
8
- def generate_config(bq_config)
9
- Samidare::EmbulkUtility::ConfigGenerator.new.generate_config(database_configs, bq_config)
10
- end
11
-
12
- def run(bq_config, target_table_names = [], retry_max = 0)
13
- error_tables = run_and_retry(bq_config, target_table_names, retry_max, 0)
14
- # return batch status (true: all tables succeeded)
15
- error_tables.size == 0
16
- end
17
-
18
- private
19
- def run_and_retry(bq_config, target_table_names = [], retry_max, retry_count)
20
- error_tables = Samidare::Embulk.new.run(
21
- database_configs,
22
- table_configs,
23
- bq_config,
24
- target_table_names)
25
- if error_tables.size > 0 && retry_count < retry_max
26
- puts "------------------------------------"
27
- puts "retry start -> #{retry_count + 1} time"
28
- puts "------------------------------------"
29
- error_tables = run_and_retry(bq_config, error_tables, retry_max, retry_count + 1)
30
- end
31
- error_tables
32
- end
33
-
34
- def database_configs
35
- @database_configs ||= YAML.load_file('database.yml')
36
- end
37
-
38
- def table_configs
39
- @table_configs ||= Samidare::MySQL::TableConfig.generate_table_configs
40
- end
41
- end
42
- end
1
+ require 'samidare/version'
2
+ require 'samidare/embulk_utility'
3
+ require 'samidare/embulk'
4
+ require 'samidare/mysql'
5
+
6
+ module Samidare
7
+ class EmbulkClient
8
+ def generate_config(bq_config)
9
+ Samidare::EmbulkUtility::ConfigGenerator.new.generate_config(database_configs, bq_config)
10
+ end
11
+
12
+ def run(bq_config, target_table_names = [], retry_max = 0)
13
+ error_tables = run_and_retry(bq_config, target_table_names, retry_max, 0)
14
+ # return batch status (true: all tables succeeded)
15
+ error_tables.size == 0
16
+ end
17
+
18
+ private
19
+ def run_and_retry(bq_config, target_table_names = [], retry_max, retry_count)
20
+ error_tables = Samidare::Embulk.new.run(
21
+ database_configs,
22
+ table_configs,
23
+ bq_config,
24
+ target_table_names)
25
+ if error_tables.size > 0 && retry_count < retry_max
26
+ puts "------------------------------------"
27
+ puts "retry start -> #{retry_count + 1} time"
28
+ puts "------------------------------------"
29
+ error_tables = run_and_retry(bq_config, error_tables, retry_max, retry_count + 1)
30
+ end
31
+ error_tables
32
+ end
33
+
34
+ def database_configs
35
+ @database_configs ||= YAML.load_file('database.yml')
36
+ end
37
+
38
+ def table_configs
39
+ @table_configs ||= Samidare::MySQL::TableConfig.generate_table_configs
40
+ end
41
+ end
42
+ end
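The retry flow of `run` and `run_and_retry` above can be tried in isolation. The following is a minimal sketch, not the gem's code: `run_with_retry` is a hypothetical name, and the block stands in for `Samidare::Embulk.new.run`, which is assumed here to return the names of the tables that failed.

```ruby
# Hypothetical stand-in for EmbulkClient#run. The block plays the role of
# Samidare::Embulk.new.run and must return the names of the failed tables.
def run_with_retry(tables, retry_max = 0)
  failed = yield(tables)
  retry_count = 0
  # Re-run only the failed tables until they pass or the retry budget is spent.
  while !failed.empty? && retry_count < retry_max
    retry_count += 1
    failed = yield(failed)
  end
  failed.empty? # true only when every table eventually succeeded
end

# Simulated runner: 'events' fails on its first attempt, then succeeds.
attempts = Hash.new(0)
flaky_runner = lambda do |tables|
  tables.select do |table|
    attempts[table] += 1
    table == 'events' && attempts[table] == 1
  end
end

status_with_retry = run_with_retry(%w[users events], 1, &flaky_runner)
attempts_after_retry = attempts.dup

attempts.clear
status_without_retry = run_with_retry(%w[users events], 0, &flaky_runner)
```

With one retry allowed, only `events` is run a second time and the overall status becomes `true`; with no retries, the same failure yields `false`. The sketch uses a loop where the gem recurses, which is an illustrative choice only.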
@@ -1,85 +1,85 @@
1
- require 'json'
2
- require 'erb'
3
- require 'big_query'
4
- require 'unindent'
5
- require 'date'
6
-
7
- module Samidare
8
- class BigQueryUtility
9
- CONTENTS = <<-EOS.unindent
10
- in:
11
- type: mysql
12
- user: <%= user %>
13
- password: <%= password %>
14
- database: <%= database %>
15
- host: <%= host %>
16
- query: |
17
- <%= query %>
18
- out:
19
- type: bigquery
20
- project: <%= project %>
21
- p12_keyfile_path: <%= p12_keyfile_path %>
22
- service_account_email: <%= service_account_email %>
23
- dataset: <%= dataset %>
24
- table: <%= table_name %>
25
- schema_path: <%= schema_path %>
26
- auto_create_table: 1
27
- path_prefix: <%= path_prefix %>
28
- source_format: NEWLINE_DELIMITED_JSON
29
- file_ext: .json.gz
30
- delete_from_local_when_job_end: 1
31
- formatter:
32
- type: jsonl
33
- encoders:
34
- - {type: gzip}
35
- EOS
36
-
37
- def initialize(config)
38
- @config = config.dup
39
- @current_date = Date.today
40
- end
41
-
42
- def self.generate_schema(columns)
43
- json_body = columns.map { |column| column.to_json }.join(",\n")
44
- "[\n" + json_body + "\n]\n"
45
- end
46
-
47
- def self.generate_sql(table_config, columns)
48
- columns = columns.map { |column| column.converted_value }
49
- sql = "SELECT " + columns.join(",")
50
- sql << " FROM #{table_config.name}"
51
- sql << " WHERE #{table_config.condition}" if table_config.condition
52
- sql << "\n"
53
- sql
54
- end
55
-
56
- def generate_embulk_config(db_name, database_config, table_config, columns)
57
- host = database_config['host']
58
- user = database_config['username']
59
- password = database_config['password']
60
- database = database_config['database']
61
- query = Samidare::BigQueryUtility.generate_sql(table_config, columns)
62
- project = @config['project_id']
63
- p12_keyfile_path = @config['key']
64
- service_account_email = @config['service_email']
65
- dataset = database_config['bq_dataset']
66
- table_name = actual_table_name(table_config.name, database_config['daily_snapshot'] || table_config.daily_snapshot)
67
- schema_path = "#{@config['schema_dir']}/#{db_name}/#{table_config.name}.json"
68
- path_prefix = "/var/tmp/embulk_#{db_name}_#{table_config.name}"
69
-
70
- ERB.new(CONTENTS).result(binding)
71
- end
72
-
73
- def delete_table(dataset, table_name)
74
- @config['dataset'] = dataset
75
-
76
- bq = BigQuery::Client.new(@config)
77
- bq.delete_table(table_name)
78
- end
79
-
80
- def actual_table_name(table_name, daily_snapshot)
81
- return table_name unless daily_snapshot
82
- table_name + @current_date.strftime('%Y%m%d')
83
- end
84
- end
85
- end
1
+ require 'json'
2
+ require 'erb'
3
+ require 'big_query'
4
+ require 'unindent'
5
+ require 'date'
6
+
7
+ module Samidare
8
+ class BigQueryUtility
9
+ CONTENTS = <<-EOS.unindent
10
+ in:
11
+ type: mysql
12
+ user: <%= user %>
13
+ password: <%= password %>
14
+ database: <%= database %>
15
+ host: <%= host %>
16
+ query: |
17
+ <%= query %>
18
+ out:
19
+ type: bigquery
20
+ project: <%= project %>
21
+ p12_keyfile: <%= p12_keyfile %>
22
+ service_account_email: <%= service_account_email %>
23
+ dataset: <%= dataset %>
24
+ table: <%= table_name %>
25
+ schema_path: <%= schema_path %>
26
+ auto_create_table: 1
27
+ path_prefix: <%= path_prefix %>
28
+ source_format: NEWLINE_DELIMITED_JSON
29
+ file_ext: .json.gz
30
+ delete_from_local_when_job_end: 1
31
+ formatter:
32
+ type: jsonl
33
+ encoders:
34
+ - {type: gzip}
35
+ EOS
36
+
37
+ def initialize(config)
38
+ @config = config.dup
39
+ @current_date = Date.today
40
+ end
41
+
42
+ def self.generate_schema(columns)
43
+ json_body = columns.map { |column| column.to_json }.join(",\n")
44
+ "[\n" + json_body + "\n]\n"
45
+ end
46
+
47
+ def self.generate_sql(table_config, columns)
48
+ columns = columns.map { |column| column.converted_value }
49
+ sql = "SELECT " + columns.join(",")
50
+ sql << " FROM #{table_config.name}"
51
+ sql << " WHERE #{table_config.condition}" if table_config.condition
52
+ sql << "\n"
53
+ sql
54
+ end
55
+
56
+ def generate_embulk_config(db_name, database_config, table_config, columns)
57
+ host = database_config['host']
58
+ user = database_config['username']
59
+ password = database_config['password']
60
+ database = database_config['database']
61
+ query = Samidare::BigQueryUtility.generate_sql(table_config, columns)
62
+ project = @config['project_id']
63
+ p12_keyfile = @config['key']
64
+ service_account_email = @config['service_email']
65
+ dataset = database_config['bq_dataset']
66
+ table_name = actual_table_name(table_config.name, database_config['daily_snapshot'] || table_config.daily_snapshot)
67
+ schema_path = "#{@config['schema_dir']}/#{db_name}/#{table_config.name}.json"
68
+ path_prefix = "/var/tmp/embulk_#{db_name}_#{table_config.name}"
69
+
70
+ ERB.new(CONTENTS).result(binding)
71
+ end
72
+
73
+ def delete_table(dataset, table_name)
74
+ @config['dataset'] = dataset
75
+
76
+ bq = BigQuery::Client.new(@config)
77
+ bq.delete_table(table_name)
78
+ end
79
+
80
+ def actual_table_name(table_name, daily_snapshot)
81
+ return table_name unless daily_snapshot
82
+ table_name + @current_date.strftime('%Y%m%d')
83
+ end
84
+ end
85
+ end
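How `generate_sql` and `actual_table_name` behave for a table with a `condition` and `daily_snapshot` can be illustrated with a standalone sketch. The names below are assumptions for illustration: `table_config_struct` is a hypothetical `Struct`, not the gem's `Samidare::MySQL::TableConfig`, and columns are plain strings rather than objects responding to `converted_value`.

```ruby
require 'date'

# Hypothetical stand-in for the gem's table config object.
table_config_struct = Struct.new(:name, :condition, :daily_snapshot)

# Build the SELECT statement, appending WHERE only when a condition is set.
def build_sql(table_config, column_names)
  sql = "SELECT #{column_names.join(',')} FROM #{table_config.name}"
  sql += " WHERE #{table_config.condition}" if table_config.condition
  sql
end

# Append a YYYYMMDD suffix when the table is configured as a daily snapshot.
def snapshot_table_name(table_config, date = Date.today)
  return table_config.name unless table_config.daily_snapshot
  table_config.name + date.strftime('%Y%m%d')
end

users  = table_config_struct.new('users', nil, false)
events = table_config_struct.new('events', 'created_at < CURRENT_DATE()', true)

users_sql    = build_sql(users, %w[id name])
events_sql   = build_sql(events, %w[id created_at])
events_table = snapshot_table_name(events, Date.new(2015, 7, 1))
```

Passing the date explicitly, instead of reading `Date.today` inside the method, keeps the naming deterministic when testing; the gem itself captures the date once in `initialize`.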