fluent-plugin-bigquery-custom 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 9f634996c0de109e264c651d08ac1e118a9694d2
4
+ data.tar.gz: edfb078ea6100688d83c5bcce0f7f5298a4e7d84
5
+ SHA512:
6
+ metadata.gz: d960bd5956b8ae9da5522f1372e698afaa0807e35ce67d2fe2cdc56837c626d822d43938e4162aa43c151affd52ff5cf02f8cf8fac2c50303aa0f4af0712b232
7
+ data.tar.gz: c8a4e351374c459aebd6ec3a72970c96f8895f9f54140ebf6dc1cadd828ba093407ddd2995a75f95e00bd7d2ac245ef36cf05e8152702bce0723bfa26bd8003d
@@ -0,0 +1,19 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .ruby-version
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ doc/
12
+ lib/bundler/man
13
+ pkg
14
+ rdoc
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
19
+ script/
@@ -0,0 +1,10 @@
1
+ language: ruby
2
+
3
+ rvm:
4
+ - 2.0
5
+ - 2.1
6
+ - 2.2
7
+ - 2.3.0
8
+
9
+ before_install: gem update bundler
10
+ script: bundle exec rake test
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in fluent-plugin-bigquery.gemspec
4
+ gemspec
@@ -0,0 +1,13 @@
1
+ Copyright (c) 2012- TAGOMORI Satoshi
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License");
4
+ you may not use this file except in compliance with the License.
5
+ You may obtain a copy of the License at
6
+
7
+ http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ Unless required by applicable law or agreed to in writing, software
10
+ distributed under the License is distributed on an "AS IS" BASIS,
11
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ See the License for the specific language governing permissions and
13
+ limitations under the License.
@@ -0,0 +1,424 @@
1
+ # fluent-plugin-bigquery-custom
2
+ [![Build Status](https://travis-ci.org/joker1007/fluent-plugin-bigquery.svg?branch=master)](https://travis-ci.org/joker1007/fluent-plugin-bigquery)
3
+
4
+ forked from [kaizenplatform/fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery "kaizenplatform/fluent-plugin-bigquery")
5
+
6
+ -----------
7
+
8
+ [Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.
9
+
10
+ * insert data over streaming inserts
11
+   * for continuous real-time insertions
10
+   * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
11
+ * load data
12
+   * for loading data as batch jobs, suited to large amounts of data
13
+   * https://developers.google.com/bigquery/loading-data-into-bigquery
16
+
17
+ The current version of this plugin supports the Google API with Service Account Authentication, but does not support
18
+ the OAuth flow for installed applications.
19
+
20
+ ## Differences from the original
21
+ - Implement load method
22
+ - Use google-api-client v0.9.pre
23
+ - TimeSlicedOutput based
24
+ - Use `%{time_slice}` placeholder in `table` parameter
25
+ - Add config parameters
26
+   - `skip_invalid_rows`
27
+   - `max_bad_records`
28
+   - `ignore_unknown_values`
29
+ - Improve error handling
30
+
31
+ ## Configuration
32
+
33
+ ### Streaming inserts
34
+
35
+ Configure the insert specification with the target table schema and your credentials. This is the minimum configuration:
36
+
37
+ ```apache
38
+ <match dummy>
39
+ type bigquery
40
+
41
+ method insert # default
42
+
43
+ auth_method private_key # default
44
+ email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
45
+ private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
46
+ # private_key_passphrase notasecret # default
47
+
48
+ project yourproject_id
49
+ dataset yourdataset_id
50
+ table tablename
51
+
52
+ time_format %s
53
+ time_field time
54
+
55
+ field_integer time,status,bytes
56
+ field_string rhost,vhost,path,method,protocol,agent,referer
57
+ field_float requesttime
58
+ field_boolean bot_access,loginsession
59
+ </match>
60
+ ```
61
+
62
+ For high-rate streaming inserts, you should specify the flush interval and buffer chunk options:
63
+
64
+ ```apache
65
+ <match dummy>
66
+ type bigquery
67
+
68
+ method insert # default
69
+
70
+ flush_interval 1 # flush as frequent as possible
71
+
72
+ buffer_chunk_records_limit 300 # default rate limit for users is 100
73
+ buffer_queue_limit 10240 # 1MB * 10240 -> 10GB!
74
+
75
+ num_threads 16
76
+
77
+ auth_method private_key # default
78
+ email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
79
+ private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
80
+ # private_key_passphrase notasecret # default
81
+
82
+ project yourproject_id
83
+ dataset yourdataset_id
84
+ tables accesslog1,accesslog2,accesslog3
85
+
86
+ time_format %s
87
+ time_field time
88
+
89
+ field_integer time,status,bytes
90
+ field_string rhost,vhost,path,method,protocol,agent,referer
91
+ field_float requesttime
92
+ field_boolean bot_access,loginsession
93
+ </match>
94
+ ```
95
+
96
+ Important options for high-rate events are:
97
+
98
+ * `tables`
99
+   * two or more tables can be listed, separated by ','
100
+   * `out_bigquery` uses these tables for table-sharded inserts
101
+   * these tables must have the same schema
102
+ * `buffer_chunk_limit`
103
+   * max size of an insert or chunk (default 1000000 or 1MB)
104
+   * the max size is limited to 1MB on BigQuery
105
+ * `buffer_chunk_records_limit`
106
+   * the streaming inserts API limits the number of records per insert or chunk to 500
107
+   * `out_bigquery` flushes the buffer in chunks of up to 500 records per insert API call
108
+ * `buffer_queue_limit`
109
+   * BigQuery streaming inserts need very small buffer chunks
110
+   * for high-rate events, `buffer_queue_limit` should be configured with a large number
111
+   * with the default configuration, up to about 1GB of memory may be used when network problems occur
112
+     * `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
113
+ * `num_threads`
114
+   * number of threads for parallel insert API calls
115
+   * specify this option for 100 or more records per second
116
+   * 10 or more threads seem to work well for inserts over the internet
117
+   * fewer threads may be enough for Google Compute Engine instances (which have low latency to BigQuery)
118
+ * `flush_interval`
119
+   * interval between data flushes (default 0.25)
120
+   * you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later
121
+
122
+ See [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
123
+ section of the Google BigQuery documentation.
124
+
125
+ ### Load
126
+ ```apache
127
+ <match bigquery>
128
+ type bigquery
129
+
130
+ method load
131
+ buffer_type file
132
+ buffer_path bigquery.*.buffer
133
+ flush_interval 1800
134
+ flush_at_shutdown true
135
+ try_flush_interval 1
136
+ utc
137
+
138
+ auth_method json_key
139
+ json_key json_key_path.json
140
+
141
+ time_format %s
142
+ time_field time
143
+
144
+ project yourproject_id
145
+ dataset yourdataset_id
146
+ auto_create_table true
147
+ table yourtable%{time_slice}
148
+ schema_path bq_schema.json
149
+ </match>
150
+ ```
151
+
152
+ I recommend using a file buffer and a long flush interval.
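+
+ The error-handling parameters added by this fork (`skip_invalid_rows`, `max_bad_records`, `ignore_unknown_values`, listed under "Differences from the original") can be combined with such a configuration. A minimal sketch, assuming they are plain parameters of the `<match>` section and map onto the BigQuery API options of the same names; check the plugin source for exact placement and defaults:
+
+ ```apache
+ <match bigquery>
+   type bigquery
+
+   method load
+   ...
+
+   # assumed placement of the fork's error-handling options
+   skip_invalid_rows true        # streaming inserts: skip invalid rows instead of rejecting the whole request
+   max_bad_records 0             # load jobs: maximum number of bad records tolerated before the job fails
+   ignore_unknown_values false   # ignore values that do not match any column in the table schema
+ </match>
+ ```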
153
+
154
+ ### Authentication
155
+
156
+ Four authentication methods are supported for fetching an access token:
157
+
158
+ 1. Public-Private key pair of a GCP (Google Cloud Platform) service account
159
+ 2. JSON key of a GCP (Google Cloud Platform) service account
160
+ 3. Predefined access token (Compute Engine only)
161
+ 4. Google application default credentials (http://goo.gl/IUuyuX)
162
+
163
+ #### Public-Private key pair of GCP's service account
164
+
165
+ The examples above use the first method. You first need to create a service account (client ID),
166
+ download its private key and deploy the key with fluentd.
167
+
168
+ #### JSON key of a GCP (Google Cloud Platform) service account
169
+
170
+ You first need to create a service account (client ID),
171
+ download its JSON key and deploy the key with fluentd.
172
+
173
+ ```apache
174
+ <match dummy>
175
+ type bigquery
176
+
177
+ auth_method json_key
178
+ json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json
179
+
180
+ project yourproject_id
181
+ dataset yourdataset_id
182
+ table tablename
183
+ ...
184
+ </match>
185
+ ```
186
+
187
+ You can also provide `json_key` as an embedded JSON string, like this.
188
+ You only need to include the `private_key` and `client_email` keys from the JSON key file.
189
+
190
+ ```apache
191
+ <match dummy>
192
+ type bigquery
193
+
194
+ auth_method json_key
195
+ json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}
196
+
197
+ project yourproject_id
198
+ dataset yourdataset_id
199
+ table tablename
200
+ ...
201
+ </match>
202
+ ```
203
+
204
+ #### Predefined access token (Compute Engine only)
205
+
206
+ When you run fluentd on a Google Compute Engine instance,
207
+ you don't need to explicitly create a service account for fluentd.
208
+ In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
209
+ Compute Engine instance. Then you can configure fluentd like this:
210
+
211
+ ```apache
212
+ <match dummy>
213
+ type bigquery
214
+
215
+ auth_method compute_engine
216
+
217
+ project yourproject_id
218
+ dataset yourdataset_id
219
+ table tablename
220
+
221
+ time_format %s
222
+ time_field time
223
+
224
+ field_integer time,status,bytes
225
+ field_string rhost,vhost,path,method,protocol,agent,referer
226
+ field_float requesttime
227
+ field_boolean bot_access,loginsession
228
+ </match>
229
+ ```
230
+
231
+ #### Application default credentials
232
+
233
+ The Application Default Credentials provide a simple way to get authorization credentials for use in calling Google APIs, which are described in detail at http://goo.gl/IUuyuX.
234
+
235
+ In this authentication method, the credentials returned are determined by the environment the code is running in. Conditions are checked in the following order (a minimal configuration example follows the list):
236
+
237
+ 1. The environment variable `GOOGLE_APPLICATION_CREDENTIALS` is checked. If this variable is specified it should point to a JSON key file that defines the credentials.
238
+ 2. The environment variables `GOOGLE_PRIVATE_KEY` and `GOOGLE_CLIENT_EMAIL` are checked. If these variables are set, `GOOGLE_PRIVATE_KEY` should contain the `private_key` and `GOOGLE_CLIENT_EMAIL` the `client_email` from a JSON key.
239
+ 3. The well-known path `$HOME/.config/gcloud/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
240
+ 4. The system default path `/etc/google/auth/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
241
+ 5. If you are running in the Google Compute Engine production environment, the built-in service account associated with the virtual machine instance will be used.
242
+ 6. If none of these conditions is true, an error will occur.
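+
+ A minimal sketch of a configuration using this method, assuming the plugin accepts `application_default` as an `auth_method` value (as later versions of the upstream plugin do):
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   auth_method application_default   # assumed value; check the plugin's supported auth_method list
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+   ...
+ </match>
+ ```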
243
+
244
+ ### Table id formatting
245
+
246
+ The `table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
247
+ format to construct table ids.
248
+ Table ids are formatted at runtime
249
+ using the local time of the fluentd server.
250
+
251
+ For example, with the configuration below,
252
+ data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
253
+
254
+ ```apache
255
+ <match dummy>
256
+ type bigquery
257
+
258
+ ...
259
+
260
+ project yourproject_id
261
+ dataset yourdataset_id
262
+ table accesslog_%Y_%m
263
+
264
+ ...
265
+ </match>
266
+ ```
267
+
268
+ Note that the timestamp of logs and the date in the table id do not always match,
269
+ because there is a time lag between collection and transmission of logs.
270
+
271
+ Alternatively, these options can use the `%{time_slice}` placeholder.
272
+ `%{time_slice}` is replaced with the formatted time slice key at runtime.
273
+
274
+ ```apache
275
+ <match dummy>
276
+ type bigquery
277
+
278
+ ...
279
+
280
+ project yourproject_id
281
+ dataset yourdataset_id
282
+ table accesslog%{time_slice}
283
+
284
+ ...
285
+ </match>
286
+ ```
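+
+ Because this plugin is based on Fluentd's TimeSlicedOutput, the shape of the time slice key is controlled by the standard `time_slice_format` buffer option. A sketch, assuming TimeSlicedOutput's default `%Y%m%d` format applies here:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   # standard TimeSlicedOutput option; %Y%m%d is Fluentd's default
+   time_slice_format %Y%m%d
+
+   project yourproject_id
+   dataset yourdataset_id
+   # with the format above, events for 2014-08-11 go into accesslog20140811
+   table accesslog%{time_slice}
+
+   ...
+ </match>
+ ```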
287
+
288
+ ### Dynamic table creation
289
+
290
+ When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
291
+ The next insertion retry is then expected to succeed.
292
+
293
+ NOTE: the `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance when using `fetch_schema`.
294
+
295
+ ```apache
296
+ <match dummy>
297
+ type bigquery
298
+
299
+ ...
300
+
301
+ auto_create_table true
302
+ table accesslog_%Y_%m
303
+
304
+ ...
305
+ </match>
306
+ ```
307
+
308
+ ### Table schema
309
+
310
+ There are three methods to describe the schema of the target table.
311
+
312
+ 1. List fields in fluent.conf
313
+ 2. Load a schema file in JSON.
314
+ 3. Fetch the schema using the BigQuery API
315
+
316
+ The examples above use the first method. In this method,
317
+ you can also specify nested fields by prefixing them with the name of their parent record field.
318
+
319
+ ```apache
320
+ <match dummy>
321
+ type bigquery
322
+
323
+ ...
324
+
325
+ time_format %s
326
+ time_field time
327
+
328
+ field_integer time,response.status,response.bytes
329
+ field_string request.vhost,request.path,request.method,request.protocol,request.agent,request.referer,remote.host,remote.ip,remote.user
330
+ field_float request.time
331
+ field_boolean request.bot_access,request.loginsession
332
+ </match>
333
+ ```
334
+
335
+ This schema accepts structured JSON data like:
336
+
337
+ ```json
338
+ {
339
+ "request":{
340
+ "time":1391748126.7000976,
341
+ "vhost":"www.example.com",
342
+ "path":"/",
343
+ "method":"GET",
344
+ "protocol":"HTTP/1.1",
345
+ "agent":"HotJava",
346
+ "bot_access":false
347
+ },
348
+ "remote":{ "ip": "192.0.2.1" },
349
+ "response":{
350
+ "status":200,
351
+ "bytes":1024
352
+ }
353
+ }
354
+ ```
355
+
356
+ The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:
357
+
358
+ ```apache
359
+ <match dummy>
360
+ type bigquery
361
+
362
+ ...
363
+
364
+ time_format %s
365
+ time_field time
366
+
367
+ schema_path /path/to/httpd.schema
368
+ field_integer time
369
+ </match>
370
+ ```
371
+ where `/path/to/httpd.schema` is the path to the JSON-encoded schema file that you used to create the table on BigQuery.
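+
+ For reference, such a file uses the standard BigQuery JSON schema format (an array of field definitions). A shortened, hypothetical `httpd.schema` covering some of the fields used above might look like:
+
+ ```json
+ [
+   { "name": "time",        "type": "TIMESTAMP" },
+   { "name": "vhost",       "type": "STRING" },
+   { "name": "path",        "type": "STRING" },
+   { "name": "method",      "type": "STRING" },
+   { "name": "status",      "type": "INTEGER" },
+   { "name": "bytes",       "type": "INTEGER" },
+   { "name": "requesttime", "type": "FLOAT" },
+   { "name": "bot_access",  "type": "BOOLEAN" }
+ ]
+ ```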
372
+
373
+ The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:
374
+
375
+ ```apache
376
+ <match dummy>
377
+ type bigquery
378
+
379
+ ...
380
+
381
+ time_format %s
382
+ time_field time
383
+
384
+ fetch_schema true
385
+ field_integer time
386
+ </match>
387
+ ```
388
+
389
+ If you specify multiple tables in the configuration file, the plugin fetches the schema of each table from BigQuery and merges them.
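+
+ For example, a sketch combining options already shown above:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   tables accesslog1,accesslog2,accesslog3   # the schemas of all listed tables are fetched and merged
+   fetch_schema true
+   field_integer time
+ </match>
+ ```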
390
+
391
+ NOTE: Since JSON does not define how to encode data of the TIMESTAMP type,
392
+ it is still recommended to specify the type of TIMESTAMP fields explicitly (e.g. `field_integer time`), as the examples above do, if you use the second or third method.
393
+
394
+ ### Specifying insertId property
395
+
396
+ BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documentation).
397
+ You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
398
+
399
+ ```apache
400
+ <match dummy>
401
+ type bigquery
402
+
403
+ ...
404
+
405
+ insert_id_field uuid
406
+ field_string uuid
407
+ </match>
408
+ ```
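+
+ If your records do not already carry a suitable unique field, one way to add one upstream of this plugin (a sketch, not part of this plugin, assuming Fluentd v0.12 or later where the `record_transformer` filter is bundled) is to compose a deduplication key from values already in the event:
+
+ ```apache
+ <filter dummy>
+   type record_transformer
+   enable_ruby true
+   <record>
+     # composes a per-record key from the tag, event time, and an existing field;
+     # records repeating the same combination are treated as duplicates by BigQuery
+     uuid ${tag}_${time}_${record["path"]}
+   </record>
+ </filter>
+ ```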
409
+
410
+ ## TODO
411
+
412
+ * Automatically configured flush/buffer options
413
+ * support optional data fields
414
+ * support NULLABLE/REQUIRED/REPEATED field options in field list style of configuration
415
+ * OAuth installed application credentials support
416
+ * Google API discovery expiration
417
+ * Error classes
418
+ * check row size limits
419
+
420
+ ## Authors
421
+
422
+ * @tagomoris: First author, original version
423
+ * KAIZEN platform Inc.: Maintainer, since 2014.08.19 (original version)
424
+ * @joker1007 (forked version)