google-cloud-bigquery 1.21.2

Files changed (44)
  1. checksums.yaml +7 -0
  2. data/.yardopts +16 -0
  3. data/AUTHENTICATION.md +158 -0
  4. data/CHANGELOG.md +397 -0
  5. data/CODE_OF_CONDUCT.md +40 -0
  6. data/CONTRIBUTING.md +188 -0
  7. data/LICENSE +201 -0
  8. data/LOGGING.md +27 -0
  9. data/OVERVIEW.md +463 -0
  10. data/TROUBLESHOOTING.md +31 -0
  11. data/lib/google-cloud-bigquery.rb +139 -0
  12. data/lib/google/cloud/bigquery.rb +145 -0
  13. data/lib/google/cloud/bigquery/argument.rb +197 -0
  14. data/lib/google/cloud/bigquery/convert.rb +383 -0
  15. data/lib/google/cloud/bigquery/copy_job.rb +316 -0
  16. data/lib/google/cloud/bigquery/credentials.rb +50 -0
  17. data/lib/google/cloud/bigquery/data.rb +526 -0
  18. data/lib/google/cloud/bigquery/dataset.rb +2845 -0
  19. data/lib/google/cloud/bigquery/dataset/access.rb +1021 -0
  20. data/lib/google/cloud/bigquery/dataset/list.rb +162 -0
  21. data/lib/google/cloud/bigquery/encryption_configuration.rb +123 -0
  22. data/lib/google/cloud/bigquery/external.rb +2432 -0
  23. data/lib/google/cloud/bigquery/extract_job.rb +368 -0
  24. data/lib/google/cloud/bigquery/insert_response.rb +180 -0
  25. data/lib/google/cloud/bigquery/job.rb +657 -0
  26. data/lib/google/cloud/bigquery/job/list.rb +162 -0
  27. data/lib/google/cloud/bigquery/load_job.rb +1704 -0
  28. data/lib/google/cloud/bigquery/model.rb +740 -0
  29. data/lib/google/cloud/bigquery/model/list.rb +164 -0
  30. data/lib/google/cloud/bigquery/project.rb +1655 -0
  31. data/lib/google/cloud/bigquery/project/list.rb +161 -0
  32. data/lib/google/cloud/bigquery/query_job.rb +1695 -0
  33. data/lib/google/cloud/bigquery/routine.rb +1108 -0
  34. data/lib/google/cloud/bigquery/routine/list.rb +165 -0
  35. data/lib/google/cloud/bigquery/schema.rb +564 -0
  36. data/lib/google/cloud/bigquery/schema/field.rb +668 -0
  37. data/lib/google/cloud/bigquery/service.rb +589 -0
  38. data/lib/google/cloud/bigquery/standard_sql.rb +495 -0
  39. data/lib/google/cloud/bigquery/table.rb +3340 -0
  40. data/lib/google/cloud/bigquery/table/async_inserter.rb +520 -0
  41. data/lib/google/cloud/bigquery/table/list.rb +172 -0
  42. data/lib/google/cloud/bigquery/time.rb +65 -0
  43. data/lib/google/cloud/bigquery/version.rb +22 -0
  44. metadata +297 -0
@@ -0,0 +1,463 @@
# Google Cloud BigQuery

Google BigQuery enables super-fast, SQL-like queries against massive datasets,
using the processing power of Google's infrastructure. To learn more, read [What
is BigQuery?](https://cloud.google.com/bigquery/what-is-bigquery).

The goal of google-cloud is to provide an API that is comfortable to Rubyists.
Your authentication credentials are detected automatically in Google Cloud
Platform (GCP), including Google Compute Engine (GCE), Google Kubernetes Engine
(GKE), Google App Engine (GAE), Google Cloud Functions (GCF) and Cloud Run. In
other environments you can configure authentication easily, either directly in
your code or via environment variables. Read more about the options for
connecting in the {file:AUTHENTICATION.md Authentication Guide}.

To help you get started quickly, the first few examples below use a public
dataset provided by Google. As soon as you have [signed
up](https://cloud.google.com/bigquery/sign-up) to use BigQuery, and provided
that you stay in the free tier for queries, you should be able to run these
first examples without the need to set up billing or to load data (although
we'll show you how to do that too).

## Listing Datasets and Tables

A BigQuery project contains datasets, which in turn contain tables. Assuming
that you have not yet created datasets or tables in your own project, let's
connect to Google's `bigquery-public-data` project, and see what we find.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

bigquery.datasets.count #=> 1
bigquery.datasets.first.dataset_id #=> "samples"

dataset = bigquery.datasets.first
tables = dataset.tables

tables.count #=> 7
tables.map(&:table_id) #=> [..., "shakespeare", "trigrams", "wikipedia"]
```

In addition to listing all datasets and tables in the project, you can also
retrieve individual datasets and tables by ID. Let's look at the structure of
the `shakespeare` table, which contains an entry for every word in every play
written by Shakespeare.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"

table.headers #=> [:word, :word_count, :corpus, :corpus_date]
table.rows_count #=> 164656
```

Now that you know the column names for the Shakespeare table, let's write and
run a few queries against it.

## Running queries

BigQuery supports two SQL dialects: [standard
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/) and the
older [legacy SQL (BigQuery
SQL)](https://cloud.google.com/bigquery/docs/reference/legacy-sql), as discussed
in the guide [Migrating from legacy
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql).

### Standard SQL

Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.
It is compliant with the SQL 2011 standard, and has extensions that support
querying nested and repeated data. This is the default syntax. It has several
advantages over legacy SQL, including:

* Composability using `WITH` clauses and SQL functions
* Subqueries in the `SELECT` list and `WHERE` clause
* Correlated subqueries
* `ARRAY` and `STRUCT` data types
* Inserts, updates, and deletes
* `COUNT(DISTINCT <expr>)` is exact and scalable, providing the accuracy of
  `EXACT_COUNT_DISTINCT` without its limitations
* Automatic predicate push-down through `JOIN`s
* Complex `JOIN` predicates, including arbitrary expressions

For examples that demonstrate some of these features, see [Standard SQL
highlights](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#standard_sql_highlights).

As shown in this example, standard SQL is the library default:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN ('me', 'I', 'you') GROUP BY word"
data = bigquery.query sql
```

Notice that in standard SQL, a fully-qualified table name uses the following
format: <code>`my-dashed-project.dataset1.tableName`</code>.

### Legacy SQL (formerly BigQuery SQL)

Before version 2.0, BigQuery executed queries using a non-standard SQL dialect
known as BigQuery SQL. This variant is optional, and can be enabled by passing
the flag `legacy_sql: true` with your query. (If you get an SQL syntax error
with a query that may be written in legacy SQL, be sure that you are passing
this option.)

To use legacy SQL, pass the option `legacy_sql: true` with your query:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT TOP(word, 50) as word, COUNT(*) as count " \
      "FROM [bigquery-public-data:samples.shakespeare]"
data = bigquery.query sql, legacy_sql: true
```

Notice that in legacy SQL, a fully-qualified table name uses brackets instead of
back-ticks, and a colon instead of a dot to separate the project and the
dataset: `[my-dashed-project:dataset1.tableName]`.

#### Query parameters

With standard SQL, you can use positional or named query parameters. This
example shows the use of named parameters:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN UNNEST(@words) GROUP BY word"
data = bigquery.query sql, params: { words: ['me', 'I', 'you'] }
```

As demonstrated above, passing the `params` option will automatically set
`standard_sql` to `true`.
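
Positional parameters work the same way, except that the placeholders are `?`
and `params` is an array whose values are matched in order. A minimal sketch
(the query itself is only illustrative):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# Positional parameters are bound to `?` placeholders in order.
sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word = ? GROUP BY word"
data = bigquery.query sql, params: ["me"]
```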

#### Data types

BigQuery standard SQL supports simple data types such as integers, as well as
more complex types such as `ARRAY` and `STRUCT`.

The BigQuery data types are converted to and from Ruby types as follows:

| BigQuery    | Ruby                                 | Notes |
|-------------|--------------------------------------|-------|
| `BOOL`      | `true`/`false`                       | |
| `INT64`     | `Integer`                            | |
| `FLOAT64`   | `Float`                              | |
| `NUMERIC`   | `BigDecimal`                         | Will be rounded to 9 decimal places |
| `STRING`    | `String`                             | |
| `DATETIME`  | `DateTime`                           | `DATETIME` does not support time zone. |
| `DATE`      | `Date`                               | |
| `TIMESTAMP` | `Time`                               | |
| `TIME`      | `Google::Cloud::BigQuery::Time`      | |
| `BYTES`     | `File`, `IO`, `StringIO`, or similar | |
| `ARRAY`     | `Array`                              | Nested arrays and `nil` values are not supported. |
| `STRUCT`    | `Hash`                               | Hash keys may be strings or symbols. |

See [Data
Types](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types)
for an overview of each BigQuery data type, including allowed values.
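
As a quick illustration of the mapping above, Ruby values passed as query
parameters are converted to the corresponding BigQuery types, and values in
query results come back as the listed Ruby types. A rough sketch (the query and
values are only illustrative):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# Ruby types are converted to BigQuery types on the way in...
sql = "SELECT @name AS name, @age AS age, @scores AS scores, @joined AS joined"
data = bigquery.query sql, params: { name: "Alice",
                                     age: 30,
                                     scores: [98.5, 87.0],
                                     joined: Date.new(2019, 1, 15) }

# ...and BigQuery types come back as Ruby types in each row (a Hash).
row = data.first
row[:name]   #=> "Alice"
row[:scores] #=> [98.5, 87.0]
row[:joined] #=> #<Date: 2019-01-15 ...>
```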

### Running Queries

Let's start with the simplest way to run a query. Notice that this time you are
connecting using your own default project. It is necessary to have write access
to the project for running a query, since queries need to create tables to hold
results.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
data = bigquery.query sql

data.next? #=> false
data.first #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
```

The `APPROX_TOP_COUNT` function shown above is just one of a variety of
functions offered by BigQuery. See the [Query Reference (standard
SQL)](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators)
for a full listing.
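
When a result set spans multiple pages, you can page through it with
`next?`/`next`, or let `all` retrieve every page for you. A brief sketch (the
query is only illustrative):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word FROM `bigquery-public-data.samples.shakespeare`"
data = bigquery.query sql

# Page through results explicitly...
data = data.next if data.next?

# ...or let the library fetch all remaining pages.
bigquery.query(sql).all do |row|
  puts row[:word]
end
```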

### Query Jobs

It is usually best not to block for most BigQuery operations, including querying
as well as importing, exporting, and copying data. Therefore, the BigQuery API
provides facilities for managing longer-running jobs. With this approach, an
instance of {Google::Cloud::Bigquery::QueryJob} is returned, rather than an
instance of {Google::Cloud::Bigquery::Data}.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
job = bigquery.query_job sql

job.wait_until_done!
if !job.failed?
  job.data.first
  #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
end
```

Once you have determined that the job is done and has not failed, you can obtain
an instance of {Google::Cloud::Bigquery::Data} by calling `data` on the job
instance. The query results for both of the above examples are stored in
temporary tables with a lifetime of about 24 hours. See the final example below
for a demonstration of how to store query results in a permanent table.

## Creating Datasets and Tables

The first thing you need to do in a new BigQuery project is to create a
{Google::Cloud::Bigquery::Dataset}. Datasets hold tables and control access to
them.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

dataset = bigquery.create_dataset "my_dataset"
```

Now that you have a dataset, you can use it to create a table. Every table is
defined by a schema that may contain nested and repeated fields. The example
below shows a schema with a repeated record field named `cities_lived`. (For
more information about nested and repeated fields, see [Preparing Data for
Loading](https://cloud.google.com/bigquery/preparing-data-for-loading).)

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"

table = dataset.create_table "people" do |schema|
  schema.string "first_name", mode: :required
  schema.record "cities_lived", mode: :repeated do |nested_schema|
    nested_schema.string "place", mode: :required
    nested_schema.integer "number_of_years", mode: :required
  end
end
```

Because of the repeated field in this schema, we cannot use the CSV format to
load data into the table.

## Loading records

To follow along with these examples, you will need to set up billing on the
[Google Developers Console](https://console.developers.google.com).

In addition to CSV, data can be imported from files that are formatted as
[Newline-delimited JSON](http://jsonlines.org/),
[Avro](http://avro.apache.org/),
[ORC](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc),
[Parquet](https://parquet.apache.org/) or from a Google Cloud Datastore backup.
It can also be "streamed" into BigQuery.

### Streaming records

For situations in which you want new data to be available for querying as soon
as possible, inserting individual records directly from your Ruby application is
a great approach.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```
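
The `insert` call returns a response object that you can check for per-row
errors. A minimal sketch that continues the example above, assuming the
response exposes `success?`, `insert_count`, and `insert_errors` as in this
gem's insert response class:

```ruby
response = table.insert rows

if response.success?
  puts "Inserted #{response.insert_count} rows"
else
  response.insert_errors.each do |insert_error|
    puts insert_error.row.inspect
    puts insert_error.errors.inspect
  end
end
```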

To avoid making RPCs (network requests) to retrieve the dataset and table
resources when streaming records, pass the `skip_lookup` option. This creates
local objects without verifying that the resources exist on the BigQuery
service.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset", skip_lookup: true
table = dataset.table "people", skip_lookup: true

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```

There are some trade-offs involved with streaming, so be sure to read the
discussion of data consistency in [Streaming Data Into
BigQuery](https://cloud.google.com/bigquery/streaming-data-into-bigquery).
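
For higher-volume streaming, this gem also ships an asynchronous inserter
(`Table#insert_async`, implemented in `table/async_inserter.rb`) that batches
rows and sends them from a background thread. A rough sketch, assuming the
callback result responds to `error?`, `error`, and `insert_count`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

# Rows are buffered locally and inserted in batches from a background thread.
inserter = table.insert_async do |result|
  if result.error?
    puts "insert failed: #{result.error}"
  else
    puts "inserted #{result.insert_count} rows"
  end
end

inserter.insert [{ "first_name" => "Cara", "cities_lived" => [] }]

# Flush any buffered rows and shut down the background thread.
inserter.stop.wait!
```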

### Uploading a file

To follow along with this example, please download the
[names.zip](http://www.ssa.gov/OACT/babynames/names.zip) archive from the U.S.
Social Security Administration. Inside the archive you will find over 100 files
containing baby name records since the year 1880.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.create_table "baby_names" do |schema|
  schema.string "name", mode: :required
  schema.string "gender", mode: :required
  schema.integer "count", mode: :required
end

file = File.open "names/yob2014.txt"
table.load file, format: "csv"
```

Because the names data, although formatted as CSV, is distributed in files with
a `.txt` extension, this example explicitly passes the `format` option in order
to demonstrate how to handle such situations. Because CSV is the default format
for load operations, the option is not actually necessary. For JSON saved with a
`.txt` extension, however, it would be.
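
The `load` call above waits for the load job to complete before returning. If
you prefer not to block, the `load_job` variant returns a
{Google::Cloud::Bigquery::LoadJob} that you can wait on or poll later; a brief
sketch:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "baby_names"

file = File.open "names/yob2014.txt"
load_job = table.load_job file, format: "csv"

# Block here, or check on the job again later.
load_job.wait_until_done!
load_job.failed? #=> false
```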

## Exporting query results to Google Cloud Storage

The example below shows how to pass the `table` option with a query in order to
store results in a permanent table. It also shows how to export the result data
to a Google Cloud Storage file. In order to follow along, you will need to
enable the Google Cloud Storage API in addition to setting up billing.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
source_table = dataset.table "baby_names"
result_table = dataset.create_table "baby_names_results"

sql = "SELECT name, count " \
      "FROM baby_names " \
      "WHERE gender = 'M' " \
      "ORDER BY count ASC LIMIT 5"
query_job = dataset.query_job sql, table: result_table

query_job.wait_until_done!

if !query_job.failed?
  require "google/cloud/storage"
  require "securerandom"

  storage = Google::Cloud::Storage.new
  bucket_id = "bigquery-exports-#{SecureRandom.uuid}"
  bucket = storage.create_bucket bucket_id
  extract_url = "gs://#{bucket.id}/baby-names.csv"

  result_table.extract extract_url

  # Download to local filesystem
  bucket.files.first.download "baby-names.csv"
end
```

If a table you wish to export contains a large amount of data, you can pass a
wildcard URI to export to multiple files (for sharding), or an array of URIs
(for partitioning), or both. See [Exporting
Data](https://cloud.google.com/bigquery/docs/exporting-data) for details.
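
For example, a wildcard URI shards the export across as many files as needed.
This sketch uses the non-blocking `extract_job` variant and a placeholder
bucket name:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "baby_names_results"

# "my-export-bucket" is a placeholder; use a bucket you own.
extract_job = table.extract_job "gs://my-export-bucket/baby-names-*.csv"

extract_job.wait_until_done!
extract_job.failed? #=> false
```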

## Configuring retries and timeout

You can configure how many times API requests may be automatically retried. When
an API request fails, the response will be inspected to see if the request meets
criteria indicating that it may succeed on retry, such as `500` and `503` status
codes or a specific internal error code such as `rateLimitExceeded`. If it meets
the criteria, the request will be retried after a delay. If another error
occurs, the delay will be increased before a subsequent attempt, until the
`retries` limit is reached.

You can also set the request `timeout` value in seconds.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new retries: 10, timeout: 120
```
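
If you prefer to set these values once for the whole process rather than per
client, the gem's library-wide configuration can be used instead; a sketch
assuming the `configure` block accepts the same `retries` and `timeout`
settings:

```ruby
require "google/cloud/bigquery"

Google::Cloud::Bigquery.configure do |config|
  config.retries = 10
  config.timeout = 120
end

# Clients created afterwards pick up the configured defaults.
bigquery = Google::Cloud::Bigquery.new
```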

See the [BigQuery error
table](https://cloud.google.com/bigquery/troubleshooting-errors#errortable) for
a list of error conditions.

## Additional information

Google BigQuery can be configured to use logging. To learn more, see the
{file:LOGGING.md Logging guide}.