fluent-plugin-bigquery-custom 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +19 -0
- data/.travis.yml +10 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +13 -0
- data/README.md +424 -0
- data/Rakefile +11 -0
- data/fluent-plugin-bigquery-custom.gemspec +34 -0
- data/lib/fluent/plugin/bigquery/version.rb +6 -0
- data/lib/fluent/plugin/out_bigquery.rb +727 -0
- data/test/helper.rb +34 -0
- data/test/plugin/test_out_bigquery.rb +1015 -0
- data/test/plugin/testdata/apache.schema +98 -0
- data/test/plugin/testdata/json_key.json +7 -0
- data/test/plugin/testdata/sudo.schema +27 -0
- metadata +218 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 9f634996c0de109e264c651d08ac1e118a9694d2
  data.tar.gz: edfb078ea6100688d83c5bcce0f7f5298a4e7d84
SHA512:
  metadata.gz: d960bd5956b8ae9da5522f1372e698afaa0807e35ce67d2fe2cdc56837c626d822d43938e4162aa43c151affd52ff5cf02f8cf8fac2c50303aa0f4af0712b232
  data.tar.gz: c8a4e351374c459aebd6ec3a72970c96f8895f9f54140ebf6dc1cadd828ba093407ddd2995a75f95e00bd7d2ac245ef36cf05e8152702bce0723bfa26bd8003d
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,13 @@
Copyright (c) 2012- TAGOMORI Satoshi

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
data/README.md
ADDED
@@ -0,0 +1,424 @@
# fluent-plugin-bigquery-custom
[Build Status](https://travis-ci.org/joker1007/fluent-plugin-bigquery)

Forked from [kaizenplatform/fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery "kaizenplatform/fluent-plugin-bigquery")

-----------

[Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.

* insert data over streaming inserts
  * for continuous real-time insertions
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
* load data
  * for batch loading of large amounts of data
  * https://developers.google.com/bigquery/loading-data-into-bigquery

The current version of this plugin supports the Google API with Service Account Authentication, but does not support
the OAuth flow for installed applications.

## Differences from the original

- Implements the `load` method
- Uses google-api-client v0.9.pre
- Based on TimeSlicedOutput
- Supports the `%{time_slice}` placeholder in the `table` parameter
- Adds config parameters (see the sketch below)
  - `skip_invalid_rows`
  - `max_bad_records`
  - `ignore_unknown_values`
- Improved error handling
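
A minimal sketch of how the added parameters might appear in a `<match>` section; the values here are illustrative, and which of them takes effect depends on whether you use `method insert` or `method load`:

```apache
<match dummy>
  type bigquery

  ...

  # streaming inserts: skip rows that BigQuery rejects as invalid instead of failing the whole request
  skip_invalid_rows true
  # ignore record keys/values that do not match the table schema
  ignore_unknown_values true
  # load jobs: maximum number of bad records tolerated per job
  max_bad_records 0

  ...
</match>
```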

## Configuration

### Streaming inserts

Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:

```apache
<match dummy>
  type bigquery

  method insert    # default

  auth_method private_key   # default
  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  table tablename

  time_format %s
  time_field time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requesttime
  field_boolean bot_access,loginsession
</match>
```

For a high rate of inserts over streaming inserts, you should specify flush intervals and buffer chunk options:

```apache
<match dummy>
  type bigquery

  method insert    # default

  flush_interval 1  # flush as frequently as possible

  buffer_chunk_records_limit 300  # default rate limit for users is 100
  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!

  num_threads 16

  auth_method private_key   # default
  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  tables accesslog1,accesslog2,accesslog3

  time_format %s
  time_field time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requesttime
  field_boolean bot_access,loginsession
</match>
```

Important options for high-rate events are:

* `tables`
  * two or more tables can be specified, separated by ','
  * `out_bigquery` uses these tables for table-sharded inserts
  * these tables must have the same schema
* `buffer_chunk_limit`
  * maximum size of an insert or chunk (default 1000000, i.e. 1MB)
  * BigQuery limits the maximum size to 1MB
* `buffer_chunk_records_limit`
  * the streaming inserts API limits the number of records to 500 per insert or chunk
  * `out_bigquery` flushes the buffer in chunks of 500 records, one insert API call per chunk
* `buffer_queue_limit`
  * BigQuery streaming inserts need very small buffer chunks
  * for high-rate events, `buffer_queue_limit` should be set to a large number
  * with the default configuration, up to 1GB of memory may be used during network problems
    * `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
* `num_threads`
  * threads for parallel insert API calls
  * specify this option for 100 or more records per second
  * 10 or more threads seem good for inserts over the internet
  * fewer threads may be enough for Google Compute Engine instances (which have low latency to BigQuery)
* `flush_interval`
  * interval between data flushes (default 0.25)
  * you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later

See the [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
section in the Google BigQuery documentation.

### Load

```apache
<match bigquery>
  type bigquery

  method load
  buffer_type file
  buffer_path bigquery.*.buffer
  flush_interval 1800
  flush_at_shutdown true
  try_flush_interval 1
  utc

  auth_method json_key
  json_key json_key_path.json

  time_format %s
  time_field time

  project yourproject_id
  dataset yourdataset_id
  auto_create_table true
  table yourtable%{time_slice}
  schema_path bq_schema.json
</match>
```

I recommend using a file buffer and a long flush interval.

### Authentication

There are four methods supported to fetch an access token for the service account:

1. Public/private key pair of a GCP (Google Cloud Platform) service account
2. JSON key of a GCP (Google Cloud Platform) service account
3. Predefined access token (Compute Engine only)
4. Google application default credentials (http://goo.gl/IUuyuX)

#### Public/private key pair of a GCP service account

The examples above use the first method. You first need to create a service account (client ID),
download its private key and deploy the key with fluentd.

#### JSON key of a GCP (Google Cloud Platform) service account

You first need to create a service account (client ID),
download its JSON key and deploy the key with fluentd.

```apache
<match dummy>
  type bigquery

  auth_method json_key
  json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json

  project yourproject_id
  dataset yourdataset_id
  table tablename
  ...
</match>
```

You can also provide `json_key` as an embedded JSON string, like this.
You only need to include the `private_key` and `client_email` keys from the JSON key file.

```apache
<match dummy>
  type bigquery

  auth_method json_key
  json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}

  project yourproject_id
  dataset yourdataset_id
  table tablename
  ...
</match>
```

#### Predefined access token (Compute Engine only)

When you run fluentd on a Google Compute Engine instance,
you don't need to explicitly create a service account for fluentd.
In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
Compute Engine instance; then you can configure fluentd like this:

```apache
<match dummy>
  type bigquery

  auth_method compute_engine

  project yourproject_id
  dataset yourdataset_id
  table tablename

  time_format %s
  time_field time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requesttime
  field_boolean bot_access,loginsession
</match>
```

#### Application default credentials

The Application Default Credentials provide a simple way to get authorization credentials for use in calling Google APIs, and are described in detail at http://goo.gl/IUuyuX.

In this authentication method, the credentials returned are determined by the environment the code is running in. Conditions are checked in the following order:

1. The environment variable `GOOGLE_APPLICATION_CREDENTIALS` is checked. If this variable is specified, it should point to a JSON key file that defines the credentials.
2. The environment variables `GOOGLE_PRIVATE_KEY` and `GOOGLE_CLIENT_EMAIL` are checked. If these variables are specified, `GOOGLE_PRIVATE_KEY` should contain the `private_key` and `GOOGLE_CLIENT_EMAIL` the `client_email` value from a JSON key.
3. The well-known path `$HOME/.config/gcloud/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
4. The system default path `/etc/google/auth/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
6. If none of these conditions holds, an error occurs.
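
To use this method, set `auth_method` accordingly and omit the key-related options. A minimal sketch, assuming the `auth_method` value for this mode is `application_default`:

```apache
<match dummy>
  type bigquery

  # credentials are resolved from the environment as described above
  auth_method application_default

  project yourproject_id
  dataset yourdataset_id
  table tablename
  ...
</match>
```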

### Table id formatting

The `table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
format specifiers to construct table ids.
Table ids are formatted at runtime
using the local time of the fluentd server.

For example, with the configuration below,
data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.

```apache
<match dummy>
  type bigquery

  ...

  project yourproject_id
  dataset yourdataset_id
  table accesslog_%Y_%m

  ...
</match>
```

Note that the timestamp of logs and the date in the table id do not always match,
because there is a time lag between collection and transmission of logs.

Alternatively, these options can use the `%{time_slice}` placeholder.
`%{time_slice}` is replaced by the formatted time slice key at runtime.

```apache
<match dummy>
  type bigquery

  ...

  project yourproject_id
  dataset yourdataset_id
  table accesslog%{time_slice}

  ...
</match>
```
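
Since the plugin is based on TimeSlicedOutput, the format of the time slice key is presumably controlled by Fluentd's standard `time_slice_format` buffer parameter (which defaults to a daily `%Y%m%d` slice in Fluentd). A sketch under that assumption:

```apache
<match dummy>
  type bigquery

  ...

  # %{time_slice} then expands to e.g. 201408 -> table accesslog201408
  time_slice_format %Y%m
  table accesslog%{time_slice}

  ...
</match>
```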

### Dynamic table creation

When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
The next retry of the insertion is then expected to succeed.

NOTE: The `auto_create_table` option cannot be used together with `fetch_schema`. You should create the table in advance to use `fetch_schema`.

```apache
<match dummy>
  type bigquery

  ...

  auto_create_table true
  table accesslog_%Y_%m

  ...
</match>
```

### Table schema

There are three methods to describe the schema of the target table:

1. List fields in fluent.conf
2. Load a schema file in JSON
3. Fetch the schema using the BigQuery API

The examples above use the first method. In this method,
you can also specify nested fields by prefixing them with the name of their parent record field.

```apache
<match dummy>
  type bigquery

  ...

  time_format %s
  time_field time

  field_integer time,response.status,response.bytes
  field_string  request.vhost,request.path,request.method,request.protocol,request.agent,request.referer,remote.host,remote.ip,remote.user
  field_float   request.time
  field_boolean request.bot_access,request.loginsession
</match>
```

This schema accepts structured JSON data like:

```json
{
  "request":{
    "time":1391748126.7000976,
    "vhost":"www.example.com",
    "path":"/",
    "method":"GET",
    "protocol":"HTTP/1.1",
    "agent":"HotJava",
    "bot_access":false
  },
  "remote":{ "ip": "192.0.2.1" },
  "response":{
    "status":200,
    "bytes":1024
  }
}
```

The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:

```apache
<match dummy>
  type bigquery

  ...

  time_format %s
  time_field time

  schema_path /path/to/httpd.schema
  field_integer time
</match>
```

where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery.
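
Such a schema file uses BigQuery's standard JSON field-list format. A minimal sketch of what it might contain; the field names and types here are illustrative only:

```json
[
  { "name": "time",        "type": "INTEGER" },
  { "name": "status",      "type": "INTEGER" },
  { "name": "bytes",       "type": "INTEGER" },
  { "name": "vhost",       "type": "STRING" },
  { "name": "path",        "type": "STRING" },
  { "name": "requesttime", "type": "FLOAT" },
  { "name": "bot_access",  "type": "BOOLEAN" }
]
```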

The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:

```apache
<match dummy>
  type bigquery

  ...

  time_format %s
  time_field time

  fetch_schema true
  field_integer time
</match>
```

If you specify multiple tables in the configuration file, the plugin fetches the schema of each table from BigQuery and merges them.

NOTE: Since JSON does not define how to encode data of TIMESTAMP type,
you are still recommended to specify the type of TIMESTAMP fields explicitly, as the "time" field does in the examples, if you use the second or third method.

### Specifying the insertId property

BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documentation).
You can set the `insert_id_field` option to specify the field to use as the `insertId` property.

```apache
<match dummy>
  type bigquery

  ...

  insert_id_field uuid
  field_string uuid
</match>
```

## TODO

* Automatically configured flush/buffer options
* Support optional data fields
* Support NULLABLE/REQUIRED/REPEATED field options in the field-list style of configuration
* OAuth installed application credentials support
* Google API discovery expiration
* Error classes
* Check row size limits

## Authors

* @tagomoris: first author, original version
* KAIZEN platform Inc.: maintainer since 2014.08.19 (original version)
* @joker1007 (forked version)