fluent-plugin-bigquery-test 2.2.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: ff14eb5085151de11780105f37826c8c6359e064e8b03b151593a825392743bb
4
+ data.tar.gz: 04ba1a56eef89610cdce659129de4b887141a992b1836114df07a0bc546e5d95
5
+ SHA512:
6
+ metadata.gz: 8016409a53493922cd2df4d1e1628fb47ca2189392fb3c0eef154e512ee34ce59bb491af4f6dca0d9841af52b81326fd2525f22971a174ff1b1ac2a6f627ac79
7
+ data.tar.gz: 2ec49bcf6281f40887128c079ded78a62f837737a4c264933cb8750233dd64cbdf76ab018260d071a6f2fa0f064b39fa340026e2d3bb20656be448a058a853c9
@@ -0,0 +1,16 @@
1
+ <!-- Please check your config and docs of fluentd !! -->
2
+
3
+ ## Environments
4
+
5
+ - fluentd version:
6
+ - plugin version:
7
+
8
+ ## Configuration
9
+ <!-- Please write your configuration -->
10
+
11
+ ## Expected Behavior
12
+
13
+ ## Actual Behavior
14
+
15
+ ## Log (if you have one)
16
+
@@ -0,0 +1,21 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .ruby-version
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ doc/
12
+ lib/bundler/man
13
+ pkg
14
+ rdoc
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
19
+ script/
20
+
21
+ fluentd-0.12
@@ -0,0 +1,14 @@
1
+ language: ruby
2
+
3
+ rvm:
4
+ - 2.3.7
5
+ - 2.4.4
6
+ - 2.5.1
7
+
8
+ gemfile:
9
+ - Gemfile
10
+
11
+ before_install:
12
+ - gem update bundler
13
+
14
+ script: bundle exec rake test
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in fluent-plugin-bigquery.gemspec
4
+ gemspec
@@ -0,0 +1,13 @@
1
+ Copyright (c) 2012- TAGOMORI Satoshi
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License");
4
+ you may not use this file except in compliance with the License.
5
+ You may obtain a copy of the License at
6
+
7
+ http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ Unless required by applicable law or agreed to in writing, software
10
+ distributed under the License is distributed on an "AS IS" BASIS,
11
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ See the License for the specific language governing permissions and
13
+ limitations under the License.
@@ -0,0 +1,602 @@
1
+ # fluent-plugin-bigquery
2
+
3
+ ## Notice
4
+
5
+ We will transfer the fluent-plugin-bigquery repository to the [fluent-plugins-nursery](https://github.com/fluent-plugins-nursery) organization.
+ This does not change the maintenance plan.
+ The main purpose is to resolve the mismatch between the maintainers and the current organization.
8
+
9
+ ---
10
+
11
+ [Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.
12
+
13
+ - **Plugin type**: Output
14
+
15
+ * insert data over streaming inserts
+   * plugin type is `bigquery_insert`
+   * for continuous real-time insertions
+   * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
+ * load data
+   * plugin type is `bigquery_load`
+   * for data loading as batch jobs, for large amounts of data
+   * https://developers.google.com/bigquery/loading-data-into-bigquery
23
+
24
+ The current version of this plugin supports the Google API with Service Account Authentication, but does not support
+ the OAuth flow for installed applications.
26
+
27
+ ## Support Version
28
+
29
+ | plugin version | fluentd version | ruby version |
30
+ | :----------- | :----------- | :----------- |
31
+ | v0.4.x | 0.12.x | 2.0 or later |
32
+ | v1.x.x | 0.14.x or later | 2.2 or later |
33
+ | v2.x.x | 0.14.x or later | 2.3 or later |
34
+
35
+ ## With Docker image
+ If you use the official Alpine-based fluentd Docker image (https://github.com/fluent/fluentd-docker-image),
+ you need to install the `bigdecimal` gem in your own Dockerfile,
+ because the Alpine-based image contains only a minimal Ruby environment in order to reduce image size.
+ In most cases, the dependency on this bundled gem is not declared in the gemspec,
+ because a bundled gem dependency sometimes restricts the Ruby environment.
41
+
42
+ ## Configuration
43
+
44
+ ### Options
45
+
46
+ #### common
47
+
48
+ | name | type | required? | placeholder? | default | description |
49
+ | :-------------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
50
+ | auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
51
+ | email | string | yes (private_key) | no | nil | GCP Service Account Email |
52
+ | private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
53
+ | private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
54
+ | json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
55
+ | location | string | no | no | nil | BigQuery Data Location. The geographic location of the job. Required except for US and EU. |
56
+ | project | string | yes | yes | nil | |
57
+ | dataset | string | yes | yes | nil | |
58
+ | table | string | yes (unless `tables` is set) | yes | nil | |
+ | tables | array(string) | yes (unless `table` is set) | yes | nil | Multiple table names can be set, separated by `,` |
+ | auto_create_table | bool | no | no | false | If true, creates table automatically |
+ | ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+ | schema | array | yes (unless `fetch_schema` or `schema_path` is set) | no | nil | Schema definition, formatted as JSON. |
+ | schema_path | string | yes (unless `fetch_schema` is set) | no | nil | Schema definition file path, formatted as JSON. |
+ | fetch_schema | bool | yes (unless `schema_path` is set) | no | false | If true, fetch the table schema definition from the BigQuery table automatically. |
+ | fetch_schema_table | string | no | yes | nil | If set, fetch the table schema definition from this table. If `fetch_schema` is false, this param is ignored. |
+ | schema_cache_expire | integer | no | no | 600 | Value is in seconds. If the current time is past the expiration interval, the table schema definition is re-fetched. |
+ | request_timeout_sec | integer | no | no | nil | BigQuery API response timeout |
+ | request_open_timeout_sec | integer | no | no | 60 | BigQuery API connection and request timeout. If you send big data to BigQuery, set a large value. |
+ | time_partitioning_type | enum | no (`day` only) | no | nil | Type of the BigQuery time partitioning feature. |
+ | time_partitioning_field | string | no | no | nil | Field used to determine how to create a time-based partition. |
+ | time_partitioning_expiration | time | no | no | nil | Expiration milliseconds for BigQuery time partitioning. |
72
+ | clustering_fields | array(string) | no | no | nil | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
73
+
74
+ #### bigquery_insert
75
+
76
+ | name | type | required? | placeholder? | default | description |
77
+ | :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
78
+ | template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
79
+ | skip_invalid_rows | bool | no | no | false | |
80
+ | insert_id_field | string | no | no | nil | Use the given key as the `insertId` property of the streaming insert API. See https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor |
+ | add_insert_timestamp | string | no | no | nil | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
+ | allow_retry_insert_errors | bool | no | no | false | Retry inserting rows when insertErrors occur. Rows may be inserted in duplicate. |
83
+
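+ For example, a minimal sketch combining these insert-specific options (the values and the `fluentd_insert_time` column name are illustrative only, not recommendations):
+
+ ```apache
+ <match dummy>
+   @type bigquery_insert
+
+   ...
+
+   skip_invalid_rows true                    # do not fail the whole request because of invalid rows
+   add_insert_timestamp fluentd_insert_time  # column added just before sending rows to BigQuery
+   allow_retry_insert_errors false           # enabling this may insert duplicate rows
+ </match>
+ ```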
84
+ #### bigquery_load
85
+
86
+ | name | type | required? | placeholder? | default | description |
87
+ | :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
88
+ | source_format | enum | no | no | json | Specify the source format: `json`, `csv` or `avro`. If you change this parameter, you must also change the formatter plugin via the `<format>` config section. |
89
+ | max_bad_records | integer | no | no | 0 | If the number of bad records exceeds this value, an invalid error is returned in the job result. |
90
+
91
+ ### Buffer section
92
+
93
+ | name | type | required? | default | description |
94
+ | :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
95
+ | @type | string | no | memory (insert) or file (load) | |
96
+ | chunk_limit_size | integer | no | 1MB (insert) or 1GB (load) | |
97
+ | total_limit_size | integer | no | 1GB (insert) or 32GB (load) | |
98
+ | chunk_records_limit | integer | no | 500 (insert) or nil (load) | |
99
+ | flush_mode | enum | no | interval | default, lazy, interval, immediate |
100
+ | flush_interval | float | no | 1.0 (insert) or 3600 (load) | |
101
+ | flush_thread_interval | float | no | 0.05 (insert) or 5 (load) | |
102
+ | flush_thread_burst_interval | float | no | 0.05 (insert) or 5 (load) | |
103
+
104
+ Other parameters (defined by the base output class) are also available.
+
+ See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb
107
+
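+ For example, a buffer section for `bigquery_load` might look like the following (a sketch only; the path, sizes and intervals are illustrative, not tuned recommendations):
+
+ ```apache
+ <match dummy>
+   @type bigquery_load
+
+   ...
+
+   <buffer>
+     @type file
+     path /var/log/fluentd/bigquery_load.*.buffer  # illustrative path
+     chunk_limit_size 1g
+     total_limit_size 32g
+     flush_interval 3600
+     flush_thread_count 4
+   </buffer>
+ </match>
+ ```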
108
+ ### Inject section
109
+
110
+ This is a replacement for `time_field` and `time_format` from previous versions.
+
+ For example:
113
+
114
+ ```
115
+ <inject>
116
+ time_key time_field_name
117
+ time_type string
118
+ time_format %Y-%m-%d %H:%M:%S
119
+ </inject>
120
+ ```
121
+
122
+ | name | type | required? | default | description |
123
+ | :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
124
+ | hostname_key | string | no | nil | |
125
+ | hostname | string | no | nil | |
126
+ | tag_key | string | no | nil | |
127
+ | time_key | string | no | nil | |
128
+ | time_type | string | no | nil | |
129
+ | time_format | string | no | nil | |
130
+ | localtime | bool | no | true | |
131
+ | utc | bool | no | false | |
132
+ | timezone | string | no | nil | |
133
+
134
+ See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin_helper/inject.rb
135
+
136
+ ### Formatter section
137
+
138
+ This section is for `load` mode only.
+ If you use `insert` mode, only the `json` formatter is used.
+
+ BigQuery supports the `csv`, `json` and `avro` formats. The default is `json`.
+ I recommend using `json` for now.
+
+ For example:
145
+
146
+ ```
147
+ source_format csv
148
+
149
+ <format>
150
+ @type csv
151
+ fields col1, col2, col3
152
+ </format>
153
+ ```
154
+
155
+ See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin_helper/formatter.rb
156
+
157
+ ## Examples
158
+
159
+ ### Streaming inserts
160
+
161
+ Configure the insert specification with the target table schema and your credentials. This is the minimum configuration:
162
+
163
+ ```apache
164
+ <match dummy>
165
+ @type bigquery_insert
166
+
167
+ auth_method private_key # default
168
+ email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
169
+ private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
170
+ # private_key_passphrase notasecret # default
171
+
172
+ project yourproject_id
173
+ dataset yourdataset_id
174
+ table tablename
175
+
176
+ schema [
177
+ {"name": "time", "type": "INTEGER"},
178
+ {"name": "status", "type": "INTEGER"},
179
+ {"name": "bytes", "type": "INTEGER"},
180
+ {"name": "vhost", "type": "STRING"},
181
+ {"name": "path", "type": "STRING"},
182
+ {"name": "method", "type": "STRING"},
183
+ {"name": "protocol", "type": "STRING"},
184
+ {"name": "agent", "type": "STRING"},
185
+ {"name": "referer", "type": "STRING"},
186
+ {"name": "remote", "type": "RECORD", "fields": [
187
+ {"name": "host", "type": "STRING"},
188
+ {"name": "ip", "type": "STRING"},
189
+ {"name": "user", "type": "STRING"}
190
+ ]},
191
+ {"name": "requesttime", "type": "FLOAT"},
192
+ {"name": "bot_access", "type": "BOOLEAN"},
193
+ {"name": "loginsession", "type": "BOOLEAN"}
194
+ ]
195
+ </match>
196
+ ```
197
+
198
+ For high-rate inserts over streaming inserts, you should specify the flush interval and buffer chunk options:
199
+
200
+ ```apache
201
+ <match dummy>
202
+ @type bigquery_insert
203
+
204
+ <buffer>
205
+ flush_interval 0.1 # flush as frequent as possible
206
+
207
+ total_limit_size 10g
208
+
209
+ flush_thread_count 16
210
+ </buffer>
211
+
212
+ auth_method private_key # default
213
+ email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
214
+ private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
215
+ # private_key_passphrase notasecret # default
216
+
217
+ project yourproject_id
218
+ dataset yourdataset_id
219
+ tables accesslog1,accesslog2,accesslog3
220
+
221
+ schema [
222
+ {"name": "time", "type": "INTEGER"},
223
+ {"name": "status", "type": "INTEGER"},
224
+ {"name": "bytes", "type": "INTEGER"},
225
+ {"name": "vhost", "type": "STRING"},
226
+ {"name": "path", "type": "STRING"},
227
+ {"name": "method", "type": "STRING"},
228
+ {"name": "protocol", "type": "STRING"},
229
+ {"name": "agent", "type": "STRING"},
230
+ {"name": "referer", "type": "STRING"},
231
+ {"name": "remote", "type": "RECORD", "fields": [
232
+ {"name": "host", "type": "STRING"},
233
+ {"name": "ip", "type": "STRING"},
234
+ {"name": "user", "type": "STRING"}
235
+ ]},
236
+ {"name": "requesttime", "type": "FLOAT"},
237
+ {"name": "bot_access", "type": "BOOLEAN"},
238
+ {"name": "loginsession", "type": "BOOLEAN"}
239
+ ]
240
+ </match>
241
+ ```
242
+
243
+ Important options for high-rate events are listed below (a consolidated buffer sketch follows the list):
+
+ * `tables`
+   * 2 or more tables are available with ',' separator
+   * `out_bigquery` uses these tables for table-sharded inserts
+   * these tables must have the same schema
+ * `buffer/chunk_limit_size`
+   * max size of an insert or chunk (default 1000000 or 1MB)
+   * the max size is limited to 1MB on BigQuery
+ * `buffer/chunk_records_limit`
+   * the streaming inserts API limits the number of records to 500 per insert or chunk
+   * `out_bigquery` flushes the buffer with 500 records per insert API call
+ * `buffer/queue_length_limit`
+   * BigQuery streaming inserts need very small buffer chunks
+   * for high-rate events, `buffer_queue_limit` should be configured with a large number
+   * up to 1GB of memory may be used under network problems in the default configuration
+     * `chunk_limit_size (default 1MB)` x `queue_length_limit (default 1024)`
+ * `buffer/flush_thread_count`
+   * threads for parallel insert API calls
+   * specify this option for 100 or more records per second
+   * 10 or more threads seem good for inserts over the internet
+   * fewer threads may be enough for Google Compute Engine instances (with low latency to BigQuery)
+ * `buffer/flush_interval`
+   * interval between data flushes (default 0.25)
+   * you can set sub-second values such as `0.15` on Fluentd v0.10.42 or later
268
+
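+ Putting the points above together, a buffer section for high-rate streaming inserts might look like this (a sketch only; the numbers are illustrative, not recommendations):
+
+ ```apache
+ <buffer>
+   flush_interval 0.2       # sub-second flushing keeps chunks small
+   flush_thread_count 16    # parallel insert API calls
+   chunk_limit_size 1m      # streaming inserts are limited to 1MB per request
+   chunk_records_limit 500  # and to 500 records per request
+   total_limit_size 10g
+ </buffer>
+ ```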
269
+ See the [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
+ section in the Google BigQuery documentation.
271
+
272
+ ### Load
273
+ ```apache
274
+ <match bigquery>
275
+ @type bigquery_load
276
+
277
+ <buffer>
278
+ path bigquery.*.buffer
279
+ flush_at_shutdown true
280
+ timekey_use_utc
281
+ </buffer>
282
+
283
+ auth_method json_key
284
+ json_key json_key_path.json
285
+
286
+ project yourproject_id
287
+ dataset yourdataset_id
288
+ auto_create_table true
289
+ table yourtable%{time_slice}
290
+ schema_path bq_schema.json
291
+ </match>
292
+ ```
293
+
294
+ I recommend using a file buffer and a long flush interval.
295
+
296
+ ### Authentication
297
+
298
+ There are four supported methods for fetching an access token for the service account.
299
+
300
+ 1. Public-Private key pair of GCP(Google Cloud Platform)'s service account
301
+ 2. JSON key of GCP(Google Cloud Platform)'s service account
302
+ 3. Predefined access token (Compute Engine only)
303
+ 4. Google application default credentials (http://goo.gl/IUuyuX)
304
+
305
+ #### Public-Private key pair of GCP's service account
306
+
307
+ The examples above use the first one. You first need to create a service account (client ID),
308
+ download its private key and deploy the key with fluentd.
309
+
310
+ #### JSON key of GCP(Google Cloud Platform)'s service account
311
+
312
+ You first need to create a service account (client ID),
313
+ download its JSON key and deploy the key with fluentd.
314
+
315
+ ```apache
316
+ <match dummy>
317
+ @type bigquery_insert
318
+
319
+ auth_method json_key
320
+ json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json
321
+
322
+ project yourproject_id
323
+ dataset yourdataset_id
324
+ table tablename
325
+ ...
326
+ </match>
327
+ ```
328
+
329
+ You can also provide `json_key` as an embedded JSON string like this.
+ You only need to include the `private_key` and `client_email` keys from the JSON key file.
331
+
332
+ ```apache
333
+ <match dummy>
334
+ @type bigquery_insert
335
+
336
+ auth_method json_key
337
+ json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}
338
+
339
+ project yourproject_id
340
+ dataset yourdataset_id
341
+ table tablename
342
+ ...
343
+ </match>
344
+ ```
345
+
346
+ #### Predefined access token (Compute Engine only)
347
+
348
+ When you run fluentd on a Google Compute Engine instance,
+ you don't need to explicitly create a service account for fluentd.
+ In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
+ Compute Engine instance; then you can configure fluentd like this.
352
+
353
+ ```apache
354
+ <match dummy>
355
+ @type bigquery_insert
356
+
357
+ auth_method compute_engine
358
+
359
+ project yourproject_id
360
+ dataset yourdataset_id
361
+ table tablename
362
+
363
+ ...
364
+ </match>
365
+ ```
366
+
367
+ #### Application default credentials
368
+
369
+ The Application Default Credentials provide a simple way to get authorization credentials for use in calling Google APIs, which are described in detail at http://goo.gl/IUuyuX.
370
+
371
+ In this authentication method, the credentials returned are determined by the environment the code is running in. Conditions are checked in the following order (a minimal configuration example follows the list):
372
+
373
+ 1. The environment variable `GOOGLE_APPLICATION_CREDENTIALS` is checked. If this variable is specified it should point to a JSON key file that defines the credentials.
374
+ 2. The environment variables `GOOGLE_PRIVATE_KEY` and `GOOGLE_CLIENT_EMAIL` are checked. If these variables are specified, `GOOGLE_PRIVATE_KEY` should contain the `private_key` and `GOOGLE_CLIENT_EMAIL` the `client_email` from a JSON key.
+ 3. The well-known path `$HOME/.config/gcloud/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
+ 4. The system default path `/etc/google/auth/application_default_credentials.json` is checked. If the file exists, it is used as a JSON key file.
377
+ 5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
378
+ 6. If none of these conditions is true, an error will occur.
379
+
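+ A minimal configuration sketch for this method, assuming one of the conditions above provides valid credentials:
+
+ ```apache
+ <match dummy>
+   @type bigquery_insert
+
+   auth_method application_default
+
+   project yourproject_id
+   dataset yourdataset_id
+   table tablename
+   ...
+ </match>
+ ```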
380
+ ### Table id formatting
381
+
382
+ This plugin supports fluentd-0.14 style placeholders.
383
+
384
+ #### strftime formatting
385
+ The `table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
+ format to construct table ids.
+ Table ids are formatted at runtime
+ using the chunk key time.
+
+ See http://docs.fluentd.org/v0.14/articles/output-plugin-overview
391
+
392
+ For example, with the configuration below,
393
+ data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
394
+
395
+ ```apache
396
+ <match dummy>
397
+ @type bigquery_insert
398
+
399
+ ...
400
+
401
+ project yourproject_id
402
+ dataset yourdataset_id
403
+ table accesslog_%Y_%m
404
+
405
+ <buffer time>
406
+ timekey 1d
407
+ </buffer>
408
+ ...
409
+ </match>
410
+ ```
411
+
412
+ #### record attribute formatting
413
+ The table id can also contain a record attribute placeholder.
414
+
415
+ __CAUTION: format is different with previous version__
416
+
417
+ ```apache
418
+ <match dummy>
419
+ ...
420
+ table accesslog_${status_code}
421
+
422
+ <buffer status_code>
423
+ </buffer>
424
+ ...
425
+ </match>
426
+ ```
427
+
428
+ If an attribute name is given, the time used for formatting is the value of each row.
+ The value for the time should be a UNIX time.
430
+
431
+ #### time_slice_key formatting
432
+
433
+ Use strftime formatting instead.
+
+ The strftime formatting in the current version is based on the chunk key.
+ That is the same as the previous time_slice_key formatting.
437
+
438
+ ### Date partitioned table support
439
+ This plugin can insert (load) into date-partitioned tables.
+
+ Use a placeholder.
442
+
443
+ ```apache
444
+ <match dummy>
445
+ @type bigquery_load
446
+
447
+ ...
448
+ table accesslog$%Y%m%d
449
+
450
+ <buffer time>
451
+ timekey 1d
452
+ </buffer>
453
+ ...
454
+ </match>
455
+ ```
456
+
457
+ However, dynamic table creation does not support date-partitioned tables yet,
+ and streaming inserts are not allowed with the `$%Y%m%d` suffix.
+ If you use a date-partitioned table with streaming inserts, please omit the `$%Y%m%d` suffix from `table`.
460
+
461
+ ### Dynamic table creating
462
+
463
+ When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
+ The next retry of the insertion is expected to succeed.
465
+
466
+ NOTE: The `auto_create_table` option cannot be used with `fetch_schema`. You should create the table ahead of time to use `fetch_schema`.
467
+
468
+ ```apache
469
+ <match dummy>
470
+ @type bigquery_insert
471
+
472
+ ...
473
+
474
+ auto_create_table true
475
+ table accesslog_%Y_%m
476
+
477
+ ...
478
+ </match>
479
+ ```
480
+
481
+ Also, you can create a clustered table by using `clustering_fields`, as in the sketch below.
482
+
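+ A minimal sketch (the partitioning and clustering field names are illustrative and must exist in your schema):
+
+ ```apache
+ <match dummy>
+   @type bigquery_insert
+
+   ...
+
+   auto_create_table true
+   time_partitioning_type day
+   time_partitioning_field event_time      # typically a TIMESTAMP or DATE column; name is illustrative
+   clustering_fields ["vhost", "path"]     # columns to cluster by, in sort order
+ </match>
+ ```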
483
+ ### Table schema
484
+
485
+ There are three methods to describe the schema of the target table.
486
+
487
+ 1. List fields in fluent.conf
488
+ 2. Load a schema file in JSON.
489
+ 3. Fetch a schema using BigQuery API
490
+
491
+ The examples above use the first method. In this method,
+ you can also specify nested fields, as the `remote` RECORD field in the example below shows.
493
+
494
+ ```apache
495
+ <match dummy>
496
+ @type bigquery_insert
497
+
498
+ ...
499
+
500
+ schema [
501
+ {"name": "time", "type": "INTEGER"},
502
+ {"name": "status", "type": "INTEGER"},
503
+ {"name": "bytes", "type": "INTEGER"},
504
+ {"name": "vhost", "type": "STRING"},
505
+ {"name": "path", "type": "STRING"},
506
+ {"name": "method", "type": "STRING"},
507
+ {"name": "protocol", "type": "STRING"},
508
+ {"name": "agent", "type": "STRING"},
509
+ {"name": "referer", "type": "STRING"},
510
+ {"name": "remote", "type": "RECORD", "fields": [
511
+ {"name": "host", "type": "STRING"},
512
+ {"name": "ip", "type": "STRING"},
513
+ {"name": "user", "type": "STRING"}
514
+ ]},
515
+ {"name": "requesttime", "type": "FLOAT"},
516
+ {"name": "bot_access", "type": "BOOLEAN"},
517
+ {"name": "loginsession", "type": "BOOLEAN"}
518
+ ]
519
+ </match>
520
+ ```
521
+
522
+ This schema accepts structured JSON data like:
523
+
524
+ ```json
525
+ {
526
+ "request":{
527
+ "time":1391748126.7000976,
528
+ "vhost":"www.example.com",
529
+ "path":"/",
530
+ "method":"GET",
531
+ "protocol":"HTTP/1.1",
532
+ "agent":"HotJava",
533
+ "bot_access":false
534
+ },
535
+ "remote":{ "ip": "192.0.2.1" },
536
+ "response":{
537
+ "status":200,
538
+ "bytes":1024
539
+ }
540
+ }
541
+ ```
542
+
543
+ The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:
544
+
545
+ ```apache
546
+ <match dummy>
547
+ @type bigquery_insert
548
+
549
+ ...
550
+
551
+ schema_path /path/to/httpd.schema
552
+ </match>
553
+ ```
554
+ where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery. By using an external schema file you are able to write a full schema that supports NULLABLE/REQUIRED/REPEATED; this feature is really useful and adds full flexibility.
555
+
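+ For reference, such a schema file is a JSON array of field definitions; a short sketch (the field names are illustrative) might look like:
+
+ ```json
+ [
+   {"name": "time", "type": "TIMESTAMP", "mode": "REQUIRED"},
+   {"name": "vhost", "type": "STRING", "mode": "NULLABLE"},
+   {"name": "remote", "type": "RECORD", "mode": "NULLABLE", "fields": [
+     {"name": "ip", "type": "STRING", "mode": "NULLABLE"}
+   ]},
+   {"name": "tags", "type": "STRING", "mode": "REPEATED"}
+ ]
+ ```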
556
+ The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:
557
+
558
+ ```apache
559
+ <match dummy>
560
+ @type bigquery_insert
561
+
562
+ ...
563
+
564
+ fetch_schema true
565
+ # fetch_schema_table other_table # if you want to fetch schema from other table
566
+ </match>
567
+ ```
568
+
569
+ If you specify multiple tables in the configuration file, the plugin fetches the schema data for all of them from BigQuery and merges it.
570
+
571
+ NOTE: Since JSON does not define how to encode data of TIMESTAMP type,
+ you are still recommended to specify JSON types for TIMESTAMP fields as the "time" field does in the example, if you use the second or third method.
573
+
574
+ ### Specifying insertId property
575
+
576
+ BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documentation).
+ You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
+ `insert_id_field` can use the fluentd record_accessor format, such as `$['key1'][0]['key2']`.
+ (See https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor for details.)
580
+
581
+ ```apache
582
+ <match dummy>
583
+ @type bigquery_insert
584
+
585
+ ...
586
+
587
+ insert_id_field uuid
588
+ schema [{"name": "uuid", "type": "STRING"}]
589
+ </match>
590
+ ```
591
+
592
+ ## TODO
593
+
594
+ * OAuth installed application credentials support
595
+ * Google API discovery expiration
596
+ * check row size limits
597
+
598
+ ## Authors
599
+
600
+ * @tagomoris: First author, original version
601
+ * KAIZEN platform Inc.: Maintainer, since 2014.08.19
602
+ * @joker1007