samidare 0.1.2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: f97f6368bca22b657df37c1279d7b5d07c514c76
4
- data.tar.gz: 62f4b01e91cfda7c95b787e71a047e5d8fb2155f
3
+ metadata.gz: 9bc2846461bb56c026fab7855d4fa3bbb5d9a6fa
4
+ data.tar.gz: fec1ae897422431b75114117f89e4c442d3fa1ac
5
5
  SHA512:
6
- metadata.gz: 64dde2e0f6fd33eae7b2d044acabe546ee097745617410a751b6dfa48e3fd8bd8389d80e820037ec876b299e74637ef6ea1db22dbe56e793b3a55c533e7ef166
7
- data.tar.gz: 1c687b2b61a237444ce5d7e9f1dbf44d73985482a0ef8d31178806a1321a11224b97277d756c26ffffb891f87689e57335e0772ae913340805a5418c77b14ada
6
+ metadata.gz: 77c2030170b64c5006f0d6a773bf58c365dd1b4f1261028ffe8171058f5101bf72c71934278def78147952aae7d4aff6eb7f88131eaca56c0de12aaafb1e9438
7
+ data.tar.gz: 0cd7332f0bf04d0f2d3373411b46c7c3495f9c31dc6464fb934abb154e31fa552b76612feadc62cdc8c431618a389d314400db586cff1fb94260698d88f57e6e
data/.gitignore CHANGED
@@ -1,15 +1,15 @@
1
- /.bundle/
2
- /.yardoc
3
- /Gemfile.lock
4
- /_yardoc/
5
- /coverage/
6
- /doc/
7
- /pkg/
8
- /spec/reports/
9
- /tmp/
10
- *.bundle
11
- *.so
12
- *.o
13
- *.a
14
- .rspec
15
- mkmf.log
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ .rspec
15
+ mkmf.log
data/README.md CHANGED
@@ -1,117 +1,179 @@
1
- # Samidare
2
-
3
- Generate Embulk config and BigQuery schema from MySQL schema and run Embulk.
4
-
5
- ## Installation
6
-
7
- Add this line to your application's Gemfile:
8
-
9
- ```ruby
10
- gem 'samidare'
11
- ```
12
-
13
- And then execute:
14
-
15
- $ bundle
16
-
17
- Or install it yourself as:
18
-
19
- $ gem install samidare
20
-
21
- ## Usage
22
- Samidare requires `database.yml` and `table.yml`.
23
- Below is a sample config file.
24
-
25
- #### database.yml
26
- ```yml
27
- db01:
28
- host: localhost
29
- username: root
30
- password: pswd
31
- database: production
32
- bq_dataset: mysql_db01
33
-
34
- db02:
35
- host: localhost
36
- username: root
37
- password: pswd
38
- database: production
39
- bq_dataset: mysql_db02
40
-
41
- ```
42
-
43
- **Caution: Embulk requires a password for MySQL; an empty password is not allowed.**
44
-
45
- #### table.yml
46
- ```yml
47
- db01:
48
- tables:
49
- - name: users
50
- - name: events
51
- - name: hobbies
52
-
53
- db02:
54
- tables:
55
- - name: administrators
56
- - name: configs
57
- ```
58
-
59
- Samidare requires BigQuery parameters like below.
60
- ```ruby
61
- [sample.rb]
62
- config = {
63
- 'project_id' => 'BIGQUERY_PROJECT_ID',
64
- 'service_email' => 'SERVICE_ACCOUNT_EMAIL',
65
- 'key' => '/etc/embulk/bigquery.p12',
66
- 'schema_dir' => '/var/tmp/embulk/schema',
67
- 'config_dir' => '/var/tmp/embulk/config',
68
- 'auth_method' => 'private_key'
69
- }
70
-
71
- client = Samidare::EmbulkClient.new
72
- client.generate_config(config)
73
- client.run(config)
74
- ```
75
-
76
- ```bash
77
- bundle exec ruby sample.rb
78
- ```
79
-
80
- ## Features
81
- #### daily snapshot
82
- BigQuery supports table wildcard expressions over a set of daily tables, for example `sales20150701`.
83
- If you need a daily snapshot of a table in BigQuery, set the `daily_snapshot` option in `database.yml` or `table.yml` as below.
84
- In `database.yml`, the `daily_snapshot` option affects all tables.
85
- In `table.yml`, it affects only the table it is set on.
86
- **The date suffix is determined by the execution date.**
87
-
88
- ```yml
89
- [database.yml]
90
- production:
91
- host: localhost
92
- username: root
93
- password: pswd
94
- database: production
95
- bq_dataset: mysql
96
- daily_snapshot: true
97
- ```
98
-
99
- ```yml
100
- [table.yml]
101
- production:
102
- tables:
103
- - name: users
104
- - name: events
105
- daily_snapshot: true
106
- - name: hobbies
107
-
108
- Only `events` is renamed to `eventsYYYYMMDD` for BigQuery.
109
- ```
110
-
111
- ## Contributing
112
-
113
- 1. Fork it ( https://github.com/[my-github-username]/samidare/fork )
114
- 2. Create your feature branch (`git checkout -b my-new-feature`)
115
- 3. Commit your changes (`git commit -am 'Add some feature'`)
116
- 4. Push to the branch (`git push origin my-new-feature`)
117
- 5. Create a new Pull Request
1
+ # Samidare
2
+
3
+ Generate Embulk config and BigQuery schema from MySQL schema and run Embulk.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'samidare'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install samidare
20
+
21
+ ## Embulk setup
22
+ `Samidare` is a utility for `Embulk`.
23
+ You need to install `Embulk` and the plugin gems below.
24
+
25
+ ```bash
26
+ embulk gem install embulk-input-mysql --version 0.8.2
27
+ embulk gem install embulk-output-bigquery --version 0.4.3
28
+ embulk gem install embulk-parser-jsonl --version 0.2.0
29
+ embulk gem install embulk-formatter-jsonl --version 0.1.4
30
+ ```
31
+
32
+ ## Usage
33
+ Samidare requires `database.yml` and `table.yml`.
34
+ Below is a sample config file.
35
+
36
+ ### database.yml
37
+ ```yml
38
+ db01:
39
+ host: localhost
40
+ username: root
41
+ password: pswd
42
+ database: production
43
+ bq_dataset: mysql_db01
44
+
45
+ db02:
46
+ host: localhost
47
+ username: root
48
+ password: pswd
49
+ database: production
50
+ bq_dataset: mysql_db02
51
+
52
+ ```
53
+
54
+ **Caution: Embulk requires a password for MySQL; an empty password is not allowed.**
55
+
56
+ ### table.yml
57
+ ```yml
58
+ db01:
59
+ tables:
60
+ - name: users
61
+ - name: events
62
+ - name: hobbies
63
+
64
+ db02:
65
+ tables:
66
+ - name: administrators
67
+ - name: configs
68
+ ```
69
+
70
+ Samidare requires BigQuery parameters like below.
71
+
72
+ ```ruby
73
+ [sample.rb]
74
+ require 'samidare'
75
+
76
+ config = {
77
+ 'project_id' => 'BIGQUERY_PROJECT_ID',
78
+ 'service_email' => 'SERVICE_ACCOUNT_EMAIL',
79
+ 'key' => '/etc/embulk/bigquery.p12',
80
+ 'schema_dir' => '/var/tmp/embulk/schema',
81
+ 'config_dir' => '/var/tmp/embulk/config',
82
+ 'auth_method' => 'private_key'
83
+ }
84
+
85
+ client = Samidare::EmbulkClient.new
86
+ client.generate_config(config)
87
+ client.run(config)
88
+ ```
89
+
90
+ ```bash
91
+ ruby sample.rb
92
+ ```
93
+
94
+ ## Features
95
+ ### process status
96
+ `Samidare` returns the process status as a boolean.
97
+ If all tables succeed, it returns `true`; otherwise it returns `false`.
98
+ This is useful for controlling the flow of a calling system.
99
+
100
+ ```ruby
101
+ process_status = Samidare::EmbulkClient.new.run(config)
102
+ exit 1 unless process_status
103
+ ```
104
+
105
+ ### narrow tables
106
+ You can narrow the actual target tables from `table.yml`, for example for testing or to retry.
107
+ If no target tables are given, `Samidare` processes all tables.
108
+
109
+ ```ruby
110
+ # in case, all tables are ['users', 'purchases', 'items']
111
+ target_tables = ['users', 'purchases']
112
+ Samidare::EmbulkClient.new.run(config, target_tables)
113
+ ```
114
+
115
+ ### retry
116
+ You can set a retry count.
117
+ If any table fails, only the failed tables are retried, up to the retry count.
118
+ If no retry count is given, `Samidare` doesn't retry.
119
+
120
+ ```ruby
121
+ # failed tables are retried up to 2 times
122
+ Samidare::EmbulkClient.new.run(config, [], 2)
123
+ ```
124
+
125
+ ### SQL condition
126
+ If you set `condition` for a table in `table.yml`, SQL is generated as below.
127
+ This is useful for large tables.
128
+
129
+ ```yml
130
+ [table.yml]
131
+ production:
132
+ tables:
133
+ - name: users
134
+ - name: events
135
+ condition: created_at < CURRENT_DATE()
136
+ ```
137
+
138
+ ```sql
139
+ SELECT * FROM users
140
+ SELECT * FROM events WHERE created_at < CURRENT_DATE()
141
+ ```
142
+
143
+ ### daily snapshot
144
+ BigQuery supports table wildcard expressions over a set of daily tables, for example `sales20150701`.
145
+ If you need a daily snapshot of a table in BigQuery, set the `daily_snapshot` option in `database.yml` or `table.yml` as below.
146
+ In `database.yml`, the `daily_snapshot` option affects all tables.
147
+ In `table.yml`, it affects only the table it is set on.
148
+ **The date suffix is determined by the execution date.**
149
+
150
+ ```yml
151
+ [database.yml]
152
+ production:
153
+ host: localhost
154
+ username: root
155
+ password: pswd
156
+ database: production
157
+ bq_dataset: mysql
158
+ daily_snapshot: true
159
+ ```
160
+
161
+ ```yml
162
+ [table.yml]
163
+ production:
164
+ tables:
165
+ - name: users
166
+ - name: events
167
+ daily_snapshot: true
168
+ - name: hobbies
169
+
170
+ Only `events` is renamed to `eventsYYYYMMDD` for BigQuery.
171
+ ```
172
+
173
+ ## Contributing
174
+
175
+ 1. Fork it ( https://github.com/[my-github-username]/samidare/fork )
176
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
177
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
178
+ 4. Push to the branch (`git push origin my-new-feature`)
179
+ 5. Create a new Pull Request
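
The daily snapshot naming described in the README above can be illustrated with a small standalone sketch. It does not use the gem itself: `snapshot_table_name`, `bigquery_table_names`, and `TABLE_YML` are hypothetical names introduced here, and the date is passed in explicitly so the result is reproducible. It assumes the database-level flag and the per-table flag are combined with a logical OR.

```ruby
require 'yaml'
require 'date'

# Hypothetical helper: append YYYYMMDD to the table name only when
# daily_snapshot is enabled; otherwise keep the name unchanged.
def snapshot_table_name(table_name, daily_snapshot, date = Date.today)
  return table_name unless daily_snapshot
  table_name + date.strftime('%Y%m%d')
end

# Sample table.yml content, taken from the README example.
TABLE_YML = <<~YAML
  production:
    tables:
      - name: users
      - name: events
        daily_snapshot: true
      - name: hobbies
YAML

# Resolve the BigQuery table name for every table. The database-level flag
# (assumed to come from database.yml) applies to all tables; the per-table
# flag applies only to the table it is set on.
def bigquery_table_names(table_yml, db_daily_snapshot, date)
  YAML.safe_load(table_yml).flat_map do |_db_name, db|
    db['tables'].map do |table|
      snapshot_table_name(table['name'], db_daily_snapshot || table['daily_snapshot'], date)
    end
  end
end

puts bigquery_table_names(TABLE_YML, false, Date.new(2015, 7, 1)).inspect
# prints ["users", "events20150701", "hobbies"]
```

With the database-level flag set to `true`, every table in this sketch would receive the date suffix.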
data/Rakefile CHANGED
@@ -1,7 +1,7 @@
1
- require "bundler/gem_tasks"
2
- require "rspec/core/rake_task"
3
-
4
- RSpec::Core::RakeTask.new(:spec)
5
-
6
- task :default => :spec
7
-
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
7
+
@@ -1,42 +1,42 @@
1
- require 'samidare/version'
2
- require 'samidare/embulk_utility'
3
- require 'samidare/embulk'
4
- require 'samidare/mysql'
5
-
6
- module Samidare
7
- class EmbulkClient
8
- def generate_config(bq_config)
9
- Samidare::EmbulkUtility::ConfigGenerator.new.generate_config(database_configs, bq_config)
10
- end
11
-
12
- def run(bq_config, target_table_names = [], retry_max = 0)
13
- error_tables = run_and_retry(bq_config, target_table_names, retry_max, 0)
14
- # return batch status(true: all tables success)
15
- error_tables.size == 0
16
- end
17
-
18
- private
19
- def run_and_retry(bq_config, target_table_names = [], retry_max, retry_count)
20
- error_tables = Samidare::Embulk.new.run(
21
- database_configs,
22
- table_configs,
23
- bq_config,
24
- target_table_names)
25
- if error_tables.size > 0 && retry_count < retry_max
26
- puts "------------------------------------"
27
- puts "retry start -> #{retry_count + 1} time"
28
- puts "------------------------------------"
29
- error_tables = run_and_retry(bq_config, error_tables, retry_max, retry_count + 1)
30
- end
31
- error_tables
32
- end
33
-
34
- def database_configs
35
- @database_configs ||= YAML.load_file('database.yml')
36
- end
37
-
38
- def table_configs
39
- @table_configs ||= Samidare::MySQL::TableConfig.generate_table_configs
40
- end
41
- end
42
- end
1
+ require 'samidare/version'
2
+ require 'samidare/embulk_utility'
3
+ require 'samidare/embulk'
4
+ require 'samidare/mysql'
5
+
6
+ module Samidare
7
+ class EmbulkClient
8
+ def generate_config(bq_config)
9
+ Samidare::EmbulkUtility::ConfigGenerator.new.generate_config(database_configs, bq_config)
10
+ end
11
+
12
+ def run(bq_config, target_table_names = [], retry_max = 0)
13
+ error_tables = run_and_retry(bq_config, target_table_names, retry_max, 0)
14
+ # return batch status(true: all tables success)
15
+ error_tables.size == 0
16
+ end
17
+
18
+ private
19
+ def run_and_retry(bq_config, target_table_names = [], retry_max, retry_count)
20
+ error_tables = Samidare::Embulk.new.run(
21
+ database_configs,
22
+ table_configs,
23
+ bq_config,
24
+ target_table_names)
25
+ if error_tables.size > 0 && retry_count < retry_max
26
+ puts "------------------------------------"
27
+ puts "retry start -> #{retry_count + 1} time"
28
+ puts "------------------------------------"
29
+ error_tables = run_and_retry(bq_config, error_tables, retry_max, retry_count + 1)
30
+ end
31
+ error_tables
32
+ end
33
+
34
+ def database_configs
35
+ @database_configs ||= YAML.load_file('database.yml')
36
+ end
37
+
38
+ def table_configs
39
+ @table_configs ||= Samidare::MySQL::TableConfig.generate_table_configs
40
+ end
41
+ end
42
+ end
@@ -1,85 +1,85 @@
1
- require 'json'
2
- require 'erb'
3
- require 'big_query'
4
- require 'unindent'
5
- require 'date'
6
-
7
- module Samidare
8
- class BigQueryUtility
9
- CONTENTS = <<-EOS.unindent
10
- in:
11
- type: mysql
12
- user: <%= user %>
13
- password: <%= password %>
14
- database: <%= database %>
15
- host: <%= host %>
16
- query: |
17
- <%= query %>
18
- out:
19
- type: bigquery
20
- project: <%= project %>
21
- p12_keyfile_path: <%= p12_keyfile_path %>
22
- service_account_email: <%= service_account_email %>
23
- dataset: <%= dataset %>
24
- table: <%= table_name %>
25
- schema_path: <%= schema_path %>
26
- auto_create_table: 1
27
- path_prefix: <%= path_prefix %>
28
- source_format: NEWLINE_DELIMITED_JSON
29
- file_ext: .json.gz
30
- delete_from_local_when_job_end: 1
31
- formatter:
32
- type: jsonl
33
- encoders:
34
- - {type: gzip}
35
- EOS
36
-
37
- def initialize(config)
38
- @config = config.dup
39
- @current_date = Date.today
40
- end
41
-
42
- def self.generate_schema(columns)
43
- json_body = columns.map { |column| column.to_json }.join(",\n")
44
- "[\n" + json_body + "\n]\n"
45
- end
46
-
47
- def self.generate_sql(table_config, columns)
48
- columns = columns.map { |column| column.converted_value }
49
- sql = "SELECT " + columns.join(",")
50
- sql << " FROM #{table_config.name}"
51
- sql << " WHERE #{table_config.condition}" if table_config.condition
52
- sql << "\n"
53
- sql
54
- end
55
-
56
- def generate_embulk_config(db_name, database_config, table_config, columns)
57
- host = database_config['host']
58
- user = database_config['username']
59
- password = database_config['password']
60
- database = database_config['database']
61
- query = Samidare::BigQueryUtility.generate_sql(table_config, columns)
62
- project = @config['project_id']
63
- p12_keyfile_path = @config['key']
64
- service_account_email = @config['service_email']
65
- dataset = database_config['bq_dataset']
66
- table_name = actual_table_name(table_config.name, database_config['daily_snapshot'] || table_config.daily_snapshot)
67
- schema_path = "#{@config['schema_dir']}/#{db_name}/#{table_config.name}.json"
68
- path_prefix = "/var/tmp/embulk_#{db_name}_#{table_config.name}"
69
-
70
- ERB.new(CONTENTS).result(binding)
71
- end
72
-
73
- def delete_table(dataset, table_name)
74
- @config['dataset'] = dataset
75
-
76
- bq = BigQuery::Client.new(@config)
77
- bq.delete_table(table_name)
78
- end
79
-
80
- def actual_table_name(table_name, daily_snapshot)
81
- return table_name unless daily_snapshot
82
- table_name + @current_date.strftime('%Y%m%d')
83
- end
84
- end
85
- end
1
+ require 'json'
2
+ require 'erb'
3
+ require 'big_query'
4
+ require 'unindent'
5
+ require 'date'
6
+
7
+ module Samidare
8
+ class BigQueryUtility
9
+ CONTENTS = <<-EOS.unindent
10
+ in:
11
+ type: mysql
12
+ user: <%= user %>
13
+ password: <%= password %>
14
+ database: <%= database %>
15
+ host: <%= host %>
16
+ query: |
17
+ <%= query %>
18
+ out:
19
+ type: bigquery
20
+ project: <%= project %>
21
+ p12_keyfile: <%= p12_keyfile %>
22
+ service_account_email: <%= service_account_email %>
23
+ dataset: <%= dataset %>
24
+ table: <%= table_name %>
25
+ schema_path: <%= schema_path %>
26
+ auto_create_table: 1
27
+ path_prefix: <%= path_prefix %>
28
+ source_format: NEWLINE_DELIMITED_JSON
29
+ file_ext: .json.gz
30
+ delete_from_local_when_job_end: 1
31
+ formatter:
32
+ type: jsonl
33
+ encoders:
34
+ - {type: gzip}
35
+ EOS
36
+
37
+ def initialize(config)
38
+ @config = config.dup
39
+ @current_date = Date.today
40
+ end
41
+
42
+ def self.generate_schema(columns)
43
+ json_body = columns.map { |column| column.to_json }.join(",\n")
44
+ "[\n" + json_body + "\n]\n"
45
+ end
46
+
47
+ def self.generate_sql(table_config, columns)
48
+ columns = columns.map { |column| column.converted_value }
49
+ sql = "SELECT " + columns.join(",")
50
+ sql << " FROM #{table_config.name}"
51
+ sql << " WHERE #{table_config.condition}" if table_config.condition
52
+ sql << "\n"
53
+ sql
54
+ end
55
+
56
+ def generate_embulk_config(db_name, database_config, table_config, columns)
57
+ host = database_config['host']
58
+ user = database_config['username']
59
+ password = database_config['password']
60
+ database = database_config['database']
61
+ query = Samidare::BigQueryUtility.generate_sql(table_config, columns)
62
+ project = @config['project_id']
63
+ p12_keyfile = @config['key']
64
+ service_account_email = @config['service_email']
65
+ dataset = database_config['bq_dataset']
66
+ table_name = actual_table_name(table_config.name, database_config['daily_snapshot'] || table_config.daily_snapshot)
67
+ schema_path = "#{@config['schema_dir']}/#{db_name}/#{table_config.name}.json"
68
+ path_prefix = "/var/tmp/embulk_#{db_name}_#{table_config.name}"
69
+
70
+ ERB.new(CONTENTS).result(binding)
71
+ end
72
+
73
+ def delete_table(dataset, table_name)
74
+ @config['dataset'] = dataset
75
+
76
+ bq = BigQuery::Client.new(@config)
77
+ bq.delete_table(table_name)
78
+ end
79
+
80
+ def actual_table_name(table_name, daily_snapshot)
81
+ return table_name unless daily_snapshot
82
+ table_name + @current_date.strftime('%Y%m%d')
83
+ end
84
+ end
85
+ end
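
How `generate_sql` and the ERB template fit together can be shown in a reduced, standalone form. This is a sketch, not the gem's code: `TableConfig` and `Column` are hypothetical `Struct` stand-ins for Samidare's own objects, `TEMPLATE` is a shortened version of `CONTENTS`, and the trailing newline that the original appends to the SQL is omitted to keep the rendered output compact.

```ruby
require 'erb'

# Hypothetical stand-ins for Samidare's table config and column objects.
TableConfig = Struct.new(:name, :condition)
Column = Struct.new(:converted_value)

# Build the SELECT statement; the WHERE clause is added only when the
# table has a condition.
def generate_sql(table_config, columns)
  sql = 'SELECT ' + columns.map(&:converted_value).join(',')
  sql << " FROM #{table_config.name}"
  sql << " WHERE #{table_config.condition}" if table_config.condition
  sql
end

# Shortened template: local variables are filled in through `binding`.
TEMPLATE = <<~ERB_TEMPLATE
  in:
    type: mysql
    query: |
      <%= query %>
  out:
    type: bigquery
    table: <%= table_name %>
ERB_TEMPLATE

def render_config(table_config, columns)
  query = generate_sql(table_config, columns)
  table_name = table_config.name
  ERB.new(TEMPLATE).result(binding)
end

columns = [Column.new('id'), Column.new('created_at')]
puts render_config(TableConfig.new('events', 'created_at < CURRENT_DATE()'), columns)
```

Note that the generated statement lists the converted columns explicitly rather than using `SELECT *`.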