fluent-plugin-bigquery-custom 0.3.0

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 9f634996c0de109e264c651d08ac1e118a9694d2
+   data.tar.gz: edfb078ea6100688d83c5bcce0f7f5298a4e7d84
+ SHA512:
+   metadata.gz: d960bd5956b8ae9da5522f1372e698afaa0807e35ce67d2fe2cdc56837c626d822d43938e4162aa43c151affd52ff5cf02f8cf8fac2c50303aa0f4af0712b232
+   data.tar.gz: c8a4e351374c459aebd6ec3a72970c96f8895f9f54140ebf6dc1cadd828ba093407ddd2995a75f95e00bd7d2ac245ef36cf05e8152702bce0723bfa26bd8003d
@@ -0,0 +1,19 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ .yardoc
+ .ruby-version
+ Gemfile.lock
+ InstalledFiles
+ _yardoc
+ coverage
+ doc/
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
+ script/
@@ -0,0 +1,10 @@
+ language: ruby
+
+ rvm:
+ - 2.0
+ - 2.1
+ - 2.2
+ - 2.3.0
+
+ before_install: gem update bundler
+ script: bundle exec rake test
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in fluent-plugin-bigquery.gemspec
+ gemspec
@@ -0,0 +1,13 @@
+ Copyright (c) 2012- TAGOMORI Satoshi
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
@@ -0,0 +1,424 @@
+ # fluent-plugin-bigquery-custom
+ [![Build Status](https://travis-ci.org/joker1007/fluent-plugin-bigquery.svg?branch=master)](https://travis-ci.org/joker1007/fluent-plugin-bigquery)
+
+ forked from [kaizenplatform/fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery "kaizenplatform/fluent-plugin-bigquery")
+
+ -----------
+
+ [Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.
+
+ * insert data over streaming inserts
+   * for continuous real-time insertions
+   * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
+ * load data
+   * for data loading as batch jobs, for large amounts of data
+   * https://developers.google.com/bigquery/loading-data-into-bigquery
+
+ The current version of this plugin supports the Google API with Service Account Authentication, but does not support
+ the OAuth flow for installed applications.
+
+ ## Differences from the original
+ - Implement the load method
+ - Use google-api-client v0.9.pre
+ - TimeSlicedOutput based
+   - Use the `%{time_slice}` placeholder in the `table` parameter
+ - Add config parameters (see the sketch below)
+   - `skip_invalid_rows`
+   - `max_bad_records`
+   - `ignore_unknown_values`
+ - Improve error handling
+
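+ A minimal sketch of how the added parameters might appear in a match section; the values are illustrative only and map to the corresponding BigQuery streaming-insert/load options:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   method load
+
+   # parameters added in this fork; the values below are examples, not recommendations
+   skip_invalid_rows true       # keep inserting/loading valid rows even if some rows are invalid
+   max_bad_records 100          # tolerate up to this many bad records per load job
+   ignore_unknown_values true   # ignore values that do not match the table schema
+
+   ...
+ </match>
+ ```
+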
+ ## Configuration
+
+ ### Streaming inserts
+
+ Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   method insert # default
+
+   auth_method private_key # default
+   email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
+   private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
+   # private_key_passphrase notasecret # default
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+
+   time_format %s
+   time_field time
+
+   field_integer time,status,bytes
+   field_string rhost,vhost,path,method,protocol,agent,referer
+   field_float requesttime
+   field_boolean bot_access,loginsession
+ </match>
+ ```
+
+ For high-rate inserts over streaming inserts, you should tune the flush interval and buffer chunk options:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   method insert # default
+
+   flush_interval 1 # flush as frequently as possible
+
+   buffer_chunk_records_limit 300 # default rate limit for users is 100
+   buffer_queue_limit 10240 # 1MB * 10240 -> 10GB!
+
+   num_threads 16
+
+   auth_method private_key # default
+   email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
+   private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
+   # private_key_passphrase notasecret # default
+
+   project yourproject_id
+   dataset yourdataset_id
+   tables accesslog1,accesslog2,accesslog3
+
+   time_format %s
+   time_field time
+
+   field_integer time,status,bytes
+   field_string rhost,vhost,path,method,protocol,agent,referer
+   field_float requesttime
+   field_boolean bot_access,loginsession
+ </match>
+ ```
+
+ Important options for high-rate events are:
+
+ * `tables`
+   * two or more tables can be specified, separated by ','
+   * `out_bigquery` shards inserts across these tables
+   * all of them must have the same schema
+ * `buffer_chunk_limit`
+   * max size of an insert or chunk (default 1000000, i.e. 1MB)
+   * the max size is limited to 1MB by BigQuery
+ * `buffer_chunk_records_limit`
+   * the number of records per streaming-insert API call is limited to 500, per insert or chunk
+   * `out_bigquery` flushes the buffer with 500 records per insert API call
+ * `buffer_queue_limit`
+   * BigQuery streaming inserts need very small buffer chunks
+   * for high-rate events, `buffer_queue_limit` should be set to a large number
+   * with the default configuration, up to 1GB of memory may be used when network problems occur
+     * `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
+ * `num_threads`
+   * number of threads used for insert API calls in parallel
+   * specify this option for 100 or more records per second
+   * 10 or more threads seem good for inserts over the internet
+   * fewer threads may be enough on Google Compute Engine instances (which have low latency to BigQuery)
+ * `flush_interval`
+   * interval between data flushes (default 0.25)
+   * you can set sub-second values such as `0.15` on Fluentd v0.10.42 or later
+
+ See the [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
+ section of the Google BigQuery documentation.
+
+ ### Load
+ ```apache
+ <match bigquery>
+   type bigquery
+
+   method load
+   buffer_type file
+   buffer_path bigquery.*.buffer
+   flush_interval 1800
+   flush_at_shutdown true
+   try_flush_interval 1
+   utc
+
+   auth_method json_key
+   json_key json_key_path.json
+
+   time_format %s
+   time_field time
+
+   project yourproject_id
+   dataset yourdataset_id
+   auto_create_table true
+   table yourtable%{time_slice}
+   schema_path bq_schema.json
+ </match>
+ ```
+
+ I recommend using a file buffer and a long flush interval.
+
+ ### Authentication
+
+ There are four methods supported for fetching an access token for the service account.
+
+ 1. Public-private key pair of a GCP (Google Cloud Platform) service account
+ 2. JSON key of a GCP (Google Cloud Platform) service account
+ 3. Predefined access token (Compute Engine only)
+ 4. Google application default credentials (http://goo.gl/IUuyuX)
+
+ #### Public-private key pair of a GCP service account
+
+ The examples above use the first method. You first need to create a service account (client ID),
+ download its private key and deploy the key with fluentd.
+
+ #### JSON key of a GCP (Google Cloud Platform) service account
+
+ You first need to create a service account (client ID),
+ download its JSON key and deploy the key with fluentd.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   auth_method json_key
+   json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+   ...
+ </match>
+ ```
+
+ You can also provide `json_key` as an embedded JSON string, like this.
+ You only need to include the `private_key` and `client_email` keys from the JSON key file.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   auth_method json_key
+   json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+   ...
+ </match>
+ ```
+
+ #### Predefined access token (Compute Engine only)
+
+ When you run fluentd on a Google Compute Engine instance,
+ you don't need to explicitly create a service account for fluentd.
+ In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
+ Compute Engine instance, then you can configure fluentd like this.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   auth_method compute_engine
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+
+   time_format %s
+   time_field time
+
+   field_integer time,status,bytes
+   field_string rhost,vhost,path,method,protocol,agent,referer
+   field_float requesttime
+   field_boolean bot_access,loginsession
+ </match>
+ ```
+
+ #### Application default credentials
+
+ The Application Default Credentials provide a simple way to get authorization credentials for use in calling Google APIs, and are described in detail at http://goo.gl/IUuyuX.
+
+ In this authentication method, the credentials used are determined by the environment the code is running in. The following sources are checked in order (see the example after the list):
+
+ 1. The environment variable `GOOGLE_APPLICATION_CREDENTIALS` is checked. If this variable is specified, it should point to a JSON key file that defines the credentials.
+ 2. The environment variables `GOOGLE_PRIVATE_KEY` and `GOOGLE_CLIENT_EMAIL` are checked. If these variables are specified, `GOOGLE_PRIVATE_KEY` should contain the `private_key` and `GOOGLE_CLIENT_EMAIL` the `client_email` value from a JSON key.
+ 3. The well-known path `$HOME/.config/gcloud/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
+ 4. The system default path `/etc/google/auth/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
+ 5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
+ 6. If none of these conditions is true, an error will occur.
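+
+ A minimal sketch for this method, assuming `application_default` is the corresponding `auth_method` value in this plugin; the credentials themselves are resolved from the environment as described above:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   # assumed value for this mode; credentials are resolved from the environment,
+   # e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable
+   auth_method application_default
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+   ...
+ </match>
+ ```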
+
+ ### Table id formatting
+
+ `table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
+ format to construct table ids.
+ Table ids are formatted at runtime
+ using the local time of the fluentd server.
+
+ For example, with the configuration below,
+ data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   project yourproject_id
+   dataset yourdataset_id
+   table accesslog_%Y_%m
+
+   ...
+ </match>
+ ```
+
+ Note that the timestamp of logs and the date in the table id do not always match,
+ because there is a time lag between collection and transmission of logs.
+
+ Alternatively, these options can use the `%{time_slice}` placeholder,
+ which is replaced by the formatted time slice key at runtime.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   project yourproject_id
+   dataset yourdataset_id
+   table accesslog%{time_slice}
+
+   ...
+ </match>
+ ```
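+
+ The time slice key is produced by the underlying TimeSlicedOutput buffering, so its format follows the standard `time_slice_format` parameter. A rough sketch, assuming `%Y%m%d` (a common default):
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   time_slice_format %Y%m%d     # with this format, records buffered on 2014-08-20
+   table accesslog%{time_slice} # are inserted into the table accesslog20140820
+
+   ...
+ </match>
+ ```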
+
+ ### Dynamic table creation
+
+ When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
+ The next insertion retry is then expected to succeed.
+
+ NOTE: The `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance when using `fetch_schema`.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   auto_create_table true
+   table accesslog_%Y_%m
+
+   ...
+ </match>
+ ```
+
+ ### Table schema
+
+ There are three methods to describe the schema of the target table:
+
+ 1. List fields in fluent.conf
+ 2. Load a schema file in JSON
+ 3. Fetch the schema using the BigQuery API
+
+ The examples above use the first method. With this method,
+ you can also specify nested fields by prefixing them with the name of their parent record field.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   time_format %s
+   time_field time
+
+   field_integer time,response.status,response.bytes
+   field_string request.vhost,request.path,request.method,request.protocol,request.agent,request.referer,remote.host,remote.ip,remote.user
+   field_float request.time
+   field_boolean request.bot_access,request.loginsession
+ </match>
+ ```
+
+ This schema accepts structured JSON data like:
+
+ ```json
+ {
+   "request":{
+     "time":1391748126.7000976,
+     "vhost":"www.example.com",
+     "path":"/",
+     "method":"GET",
+     "protocol":"HTTP/1.1",
+     "agent":"HotJava",
+     "bot_access":false
+   },
+   "remote":{ "ip": "192.0.2.1" },
+   "response":{
+     "status":200,
+     "bytes":1024
+   }
+ }
+ ```
+
+ The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   time_format %s
+   time_field time
+
+   schema_path /path/to/httpd.schema
+   field_integer time
+ </match>
+ ```
+ where `/path/to/httpd.schema` is the path to the JSON-encoded schema file which you used for creating the table on BigQuery.
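+
+ The schema file uses the standard BigQuery JSON schema format (an array of field definitions). An illustrative sketch using some of the fields from the examples above; your actual file must match the table you created:
+
+ ```json
+ [
+   { "name": "time",        "type": "TIMESTAMP", "mode": "REQUIRED" },
+   { "name": "status",      "type": "INTEGER",   "mode": "NULLABLE" },
+   { "name": "bytes",       "type": "INTEGER",   "mode": "NULLABLE" },
+   { "name": "vhost",       "type": "STRING",    "mode": "NULLABLE" },
+   { "name": "path",        "type": "STRING",    "mode": "NULLABLE" },
+   { "name": "requesttime", "type": "FLOAT",     "mode": "NULLABLE" },
+   { "name": "bot_access",  "type": "BOOLEAN",   "mode": "NULLABLE" }
+ ]
+ ```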
+
+ The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   time_format %s
+   time_field time
+
+   fetch_schema true
+   field_integer time
+ </match>
+ ```
+
+ If you specify multiple tables in the configuration file, the plugin fetches the schema of every table from BigQuery and merges them.
+
+ NOTE: Since JSON does not define how to encode data of the TIMESTAMP type,
+ you are still recommended to specify JSON types for TIMESTAMP fields, as the "time" field does in the examples, if you use the second or third method.
+
+ ### Specifying insertId property
+
+ BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documentation).
+ You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
+
+ ```apache
+ <match dummy>
+   type bigquery
+
+   ...
+
+   insert_id_field uuid
+   field_string uuid
+ </match>
+ ```
+
+ ## TODO
+
+ * Automatically configured flush/buffer options
+ * Support optional data fields
+ * Support NULLABLE/REQUIRED/REPEATED field options in the field-list style of configuration
+ * OAuth installed application credentials support
+ * Google API discovery expiration
+ * Error classes
+ * Check row size limits
+
+ ## Authors
+
+ * @tagomoris: First author, original version
+ * KAIZEN platform Inc.: Maintainer since 2014.08.19 (original version)
+ * @joker1007 (forked version)