google-cloud-bigquery 1.21.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. checksums.yaml +7 -0
  2. data/.yardopts +16 -0
  3. data/AUTHENTICATION.md +158 -0
  4. data/CHANGELOG.md +397 -0
  5. data/CODE_OF_CONDUCT.md +40 -0
  6. data/CONTRIBUTING.md +188 -0
  7. data/LICENSE +201 -0
  8. data/LOGGING.md +27 -0
  9. data/OVERVIEW.md +463 -0
  10. data/TROUBLESHOOTING.md +31 -0
  11. data/lib/google-cloud-bigquery.rb +139 -0
  12. data/lib/google/cloud/bigquery.rb +145 -0
  13. data/lib/google/cloud/bigquery/argument.rb +197 -0
  14. data/lib/google/cloud/bigquery/convert.rb +383 -0
  15. data/lib/google/cloud/bigquery/copy_job.rb +316 -0
  16. data/lib/google/cloud/bigquery/credentials.rb +50 -0
  17. data/lib/google/cloud/bigquery/data.rb +526 -0
  18. data/lib/google/cloud/bigquery/dataset.rb +2845 -0
  19. data/lib/google/cloud/bigquery/dataset/access.rb +1021 -0
  20. data/lib/google/cloud/bigquery/dataset/list.rb +162 -0
  21. data/lib/google/cloud/bigquery/encryption_configuration.rb +123 -0
  22. data/lib/google/cloud/bigquery/external.rb +2432 -0
  23. data/lib/google/cloud/bigquery/extract_job.rb +368 -0
  24. data/lib/google/cloud/bigquery/insert_response.rb +180 -0
  25. data/lib/google/cloud/bigquery/job.rb +657 -0
  26. data/lib/google/cloud/bigquery/job/list.rb +162 -0
  27. data/lib/google/cloud/bigquery/load_job.rb +1704 -0
  28. data/lib/google/cloud/bigquery/model.rb +740 -0
  29. data/lib/google/cloud/bigquery/model/list.rb +164 -0
  30. data/lib/google/cloud/bigquery/project.rb +1655 -0
  31. data/lib/google/cloud/bigquery/project/list.rb +161 -0
  32. data/lib/google/cloud/bigquery/query_job.rb +1695 -0
  33. data/lib/google/cloud/bigquery/routine.rb +1108 -0
  34. data/lib/google/cloud/bigquery/routine/list.rb +165 -0
  35. data/lib/google/cloud/bigquery/schema.rb +564 -0
  36. data/lib/google/cloud/bigquery/schema/field.rb +668 -0
  37. data/lib/google/cloud/bigquery/service.rb +589 -0
  38. data/lib/google/cloud/bigquery/standard_sql.rb +495 -0
  39. data/lib/google/cloud/bigquery/table.rb +3340 -0
  40. data/lib/google/cloud/bigquery/table/async_inserter.rb +520 -0
  41. data/lib/google/cloud/bigquery/table/list.rb +172 -0
  42. data/lib/google/cloud/bigquery/time.rb +65 -0
  43. data/lib/google/cloud/bigquery/version.rb +22 -0
  44. metadata +297 -0
@@ -0,0 +1,463 @@
# Google Cloud BigQuery

Google BigQuery enables super-fast, SQL-like queries against massive datasets,
using the processing power of Google's infrastructure. To learn more, read [What
is BigQuery?](https://cloud.google.com/bigquery/what-is-bigquery).

The goal of google-cloud is to provide an API that is comfortable to Rubyists.
Your authentication credentials are detected automatically in Google Cloud
Platform (GCP), including Google Compute Engine (GCE), Google Kubernetes Engine
(GKE), Google App Engine (GAE), Google Cloud Functions (GCF) and Cloud Run. In
other environments you can configure authentication easily, either directly in
your code or via environment variables. Read more about the options for
connecting in the {file:AUTHENTICATION.md Authentication Guide}.

To help you get started quickly, the first few examples below use a public
dataset provided by Google. As soon as you have [signed
up](https://cloud.google.com/bigquery/sign-up) to use BigQuery, and provided
that you stay in the free tier for queries, you should be able to run these
first examples without the need to set up billing or to load data (although
we'll show you how to do that too).

## Listing Datasets and Tables

A BigQuery project contains datasets, which in turn contain tables. Assuming
that you have not yet created datasets or tables in your own project, let's
connect to Google's `bigquery-public-data` project, and see what we find.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

bigquery.datasets.count #=> 1
bigquery.datasets.first.dataset_id #=> "samples"

dataset = bigquery.datasets.first
tables = dataset.tables

tables.count #=> 7
tables.map(&:table_id) #=> [..., "shakespeare", "trigrams", "wikipedia"]
```

In addition to listing all datasets and tables in the project, you can also
retrieve individual datasets and tables by ID. Let's look at the structure of
the `shakespeare` table, which contains an entry for every word in every play
written by Shakespeare.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"

table.headers #=> [:word, :word_count, :corpus, :corpus_date]
table.rows_count #=> 164656
```

Now that you know the column names for the Shakespeare table, let's write and
run a few queries against it.

## Running queries

BigQuery supports two SQL dialects: [standard
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/) and the
older [legacy SQL (BigQuery
SQL)](https://cloud.google.com/bigquery/docs/reference/legacy-sql), as discussed
in the guide [Migrating from legacy
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql).

### Standard SQL

Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.
It is compliant with the SQL 2011 standard, and has extensions that support
querying nested and repeated data. This is the default syntax. It has several
advantages over legacy SQL, including:

* Composability using `WITH` clauses and SQL functions
* Subqueries in the `SELECT` list and `WHERE` clause
* Correlated subqueries
* `ARRAY` and `STRUCT` data types
* Inserts, updates, and deletes
* `COUNT(DISTINCT <expr>)` is exact and scalable, providing the accuracy of
  `EXACT_COUNT_DISTINCT` without its limitations
* Automatic predicate push-down through `JOIN`s
* Complex `JOIN` predicates, including arbitrary expressions

For examples that demonstrate some of these features, see [Standard SQL
highlights](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#standard_sql_highlights).

As shown in this example, standard SQL is the library default:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN ('me', 'I', 'you') GROUP BY word"
data = bigquery.query sql
```

Notice that in standard SQL, a fully-qualified table name uses the following
format: <code>`my-dashed-project.dataset1.tableName`</code>.

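The `query` method returns a {Google::Cloud::Bigquery::Data} object, which is
enumerable and yields each row as a hash with symbol keys. A minimal sketch of
iterating the results of the query above:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN ('me', 'I', 'you') GROUP BY word"
data = bigquery.query sql

# Each row is a hash with symbol keys.
data.each do |row|
  puts "#{row[:word]}: #{row[:word_count]}"
end

# Use `all` to enumerate across result pages automatically.
data.all { |row| puts row[:word] }
```
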
### Legacy SQL (formerly BigQuery SQL)

Before version 2.0, BigQuery executed queries using a non-standard SQL dialect
known as BigQuery SQL. This variant is optional, and can be enabled by passing
the flag `legacy_sql: true` with your query. (If you get an SQL syntax error
with a query that may be written in legacy SQL, be sure that you are passing
this option.)

To use legacy SQL, pass the option `legacy_sql: true` with your query:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT TOP(word, 50) as word, COUNT(*) as count " \
      "FROM [bigquery-public-data:samples.shakespeare]"
data = bigquery.query sql, legacy_sql: true
```

Notice that in legacy SQL, a fully-qualified table name uses brackets instead of
back-ticks, and a colon instead of a dot to separate the project and the
dataset: `[my-dashed-project:dataset1.tableName]`.

#### Query parameters

With standard SQL, you can use positional or named query parameters. This
example shows the use of named parameters:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN UNNEST(@words) GROUP BY word"
data = bigquery.query sql, params: { words: ['me', 'I', 'you'] }
```

As demonstrated above, passing the `params` option will automatically set
`standard_sql` to `true`.

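Positional parameters are also supported. A minimal sketch against the same
table, using `?` placeholders in the SQL and an array for `params` (the
corpus and threshold values are arbitrary):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE corpus = ? AND word_count >= ? GROUP BY word"
data = bigquery.query sql, params: ["hamlet", 100]
```
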
#### Data types

BigQuery standard SQL supports simple data types such as integers, as well as
more complex types such as `ARRAY` and `STRUCT`.

The BigQuery data types are converted to and from Ruby types as follows:

| BigQuery    | Ruby                                 | Notes |
|-------------|--------------------------------------|-------|
| `BOOL`      | `true`/`false`                       | |
| `INT64`     | `Integer`                            | |
| `FLOAT64`   | `Float`                              | |
| `NUMERIC`   | `BigDecimal`                         | Will be rounded to 9 decimal places. |
| `STRING`    | `String`                             | |
| `DATETIME`  | `DateTime`                           | `DATETIME` does not support time zones. |
| `DATE`      | `Date`                               | |
| `TIMESTAMP` | `Time`                               | |
| `TIME`      | `Google::Cloud::Bigquery::Time`      | |
| `BYTES`     | `File`, `IO`, `StringIO`, or similar | |
| `ARRAY`     | `Array`                              | Nested arrays and `nil` values are not supported. |
| `STRUCT`    | `Hash`                               | Hash keys may be strings or symbols. |

See [Data
Types](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types)
for an overview of each BigQuery data type, including allowed values.

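For instance, `ARRAY` and `STRUCT` parameters round-trip as Ruby arrays and
hashes. A minimal sketch (the query itself is hypothetical; the conversions
follow the table above):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT @name AS name, @scores AS scores, @home AS home"
data = bigquery.query sql, params: {
  name:   "Anna",                          # STRING
  scores: [98, 87, 92],                    # ARRAY<INT64>
  home:   { place: "Stockholm", years: 2 } # STRUCT
}

row = data.first
row[:scores] #=> [98, 87, 92]
row[:home]   #=> {:place=>"Stockholm", :years=>2}
```
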
### Running Queries

Let's start with the simplest way to run a query. Notice that this time you are
connecting using your own default project. It is necessary to have write access
to the project for running a query, since queries need to create tables to hold
results.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
data = bigquery.query sql

data.next? #=> false
data.first #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
```

The `APPROX_TOP_COUNT` function shown above is just one of a variety of
functions offered by BigQuery. See the [Query Reference (standard
SQL)](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators)
for a full listing.

### Query Jobs

It is usually best not to block for most BigQuery operations, including querying
as well as importing, exporting, and copying data. Therefore, the BigQuery API
provides facilities for managing longer-running jobs. With this approach, an
instance of {Google::Cloud::Bigquery::QueryJob} is returned, rather than an
instance of {Google::Cloud::Bigquery::Data}.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
job = bigquery.query_job sql

job.wait_until_done!
if !job.failed?
  job.data.first
  #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
end
```

Once you have determined that the job is done and has not failed, you can obtain
an instance of {Google::Cloud::Bigquery::Data} by calling `data` on the job
instance. The query results for both of the above examples are stored in
temporary tables with a lifetime of about 24 hours. See the final example below
for a demonstration of how to store query results in a permanent table.

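Because jobs are identified by IDs on the BigQuery service, you can also look a
job up again later instead of holding on to the original object. A minimal
sketch, assuming a hypothetical job ID saved earlier from `job.job_id`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

job = bigquery.job "job_abc123" # hypothetical ID saved from job.job_id

if job && job.done? && !job.failed?
  # Only query jobs expose result data.
  job.data.first if job.is_a? Google::Cloud::Bigquery::QueryJob
end
```
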
## Creating Datasets and Tables

The first thing you need to do in a new BigQuery project is to create a
{Google::Cloud::Bigquery::Dataset}. Datasets hold tables and control access to
them.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

dataset = bigquery.create_dataset "my_dataset"
```

Now that you have a dataset, you can use it to create a table. Every table is
defined by a schema that may contain nested and repeated fields. The example
below shows a schema with a repeated record field named `cities_lived`. (For
more information about nested and repeated fields, see [Preparing Data for
Loading](https://cloud.google.com/bigquery/preparing-data-for-loading).)

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"

table = dataset.create_table "people" do |schema|
  schema.string "first_name", mode: :required
  schema.record "cities_lived", mode: :repeated do |nested_schema|
    nested_schema.string "place", mode: :required
    nested_schema.integer "number_of_years", mode: :required
  end
end
```

Because of the repeated field in this schema, we cannot use the CSV format to
load data into the table.

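Formats that support nested and repeated fields, such as newline-delimited JSON,
work fine. A minimal sketch, assuming a hypothetical local file named
`people.json` containing one JSON record per line that matches the schema above:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

# Each line of the file is a JSON object matching the schema above.
file = File.open "people.json"
table.load file, format: "json"
```
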
## Loading records

To follow along with these examples, you will need to set up billing on the
[Google Developers Console](https://console.developers.google.com).

In addition to CSV, data can be imported from files that are formatted as
[Newline-delimited JSON](http://jsonlines.org/),
[Avro](http://avro.apache.org/),
[ORC](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc),
[Parquet](https://parquet.apache.org/) or from a Google Cloud Datastore backup.
It can also be "streamed" into BigQuery.

### Streaming records

For situations in which you want new data to be available for querying as soon
as possible, inserting individual records directly from your Ruby application is
a great approach.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```

To avoid making RPCs (network requests) to retrieve the dataset and table
resources when streaming records, pass the `skip_lookup` option. This creates
local objects without verifying that the resources exist on the BigQuery
service.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset", skip_lookup: true
table = dataset.table "people", skip_lookup: true

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```

There are some trade-offs involved with streaming, so be sure to read the
discussion of data consistency in [Streaming Data Into
BigQuery](https://cloud.google.com/bigquery/streaming-data-into-bigquery).

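For higher-volume streaming, this version of the gem also ships an asynchronous
inserter (see `table/async_inserter.rb` in the file list above) that batches
rows in a background thread. A rough sketch of typical usage; treat the
result-handling details as assumptions to verify against the class
documentation:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

inserter = table.insert_async do |result|
  if result.error?
    puts "insert failed: #{result.error}"
  else
    puts "inserted #{result.insert_count} rows (#{result.error_count} errors)"
  end
end

rows = [
  { "first_name" => "Carla",
    "cities_lived" => [{ "place" => "Lisbon", "number_of_years" => 3 }] }
]
inserter.insert rows

inserter.stop.wait! # flush pending rows and shut down the background thread
```
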
### Uploading a file

To follow along with this example, please download the
[names.zip](http://www.ssa.gov/OACT/babynames/names.zip) archive from the U.S.
Social Security Administration. Inside the archive you will find over 100 files
containing baby name records since the year 1880.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.create_table "baby_names" do |schema|
  schema.string "name", mode: :required
  schema.string "gender", mode: :required
  schema.integer "count", mode: :required
end

file = File.open "names/yob2014.txt"
table.load file, format: "csv"
```

The names data is formatted as CSV but distributed in files with a `.txt`
extension, so this example passes the `format` option explicitly to show how to
handle such situations. Because CSV is the default format for load operations,
the option is not strictly necessary here; for JSON saved with a `.txt`
extension, however, it would be.

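As with `query` and `query_job`, loading also has a job-based variant for when
you would rather not block while the load runs. A minimal sketch using
`load_job` with the same file:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "baby_names"

load_job = table.load_job File.open("names/yob2014.txt"), format: "csv"

load_job.wait_until_done!
load_job.failed? #=> false
```
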
## Exporting query results to Google Cloud Storage

The example below shows how to pass the `table` option with a query in order to
store results in a permanent table. It also shows how to export the result data
to a Google Cloud Storage file. In order to follow along, you will need to
enable the Google Cloud Storage API in addition to setting up billing.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
source_table = dataset.table "baby_names"
result_table = dataset.create_table "baby_names_results"

sql = "SELECT name, count " \
      "FROM baby_names " \
      "WHERE gender = 'M' " \
      "ORDER BY count ASC LIMIT 5"
query_job = dataset.query_job sql, table: result_table

query_job.wait_until_done!

if !query_job.failed?
  require "google/cloud/storage"
  require "securerandom"

  storage = Google::Cloud::Storage.new
  bucket_id = "bigquery-exports-#{SecureRandom.uuid}"
  bucket = storage.create_bucket bucket_id
  extract_url = "gs://#{bucket.id}/baby-names.csv"

  result_table.extract extract_url

  # Download to local filesystem
  bucket.files.first.download "baby-names.csv"
end
```

If a table you wish to export contains a large amount of data, you can pass a
wildcard URI to export to multiple files (for sharding), or an array of URIs
(for partitioning), or both. See [Exporting
Data](https://cloud.google.com/bigquery/docs/exporting-data) for details.

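For example, a wildcard URI shards the export across as many files as the
service needs. A minimal sketch, assuming a hypothetical bucket named
`my-export-bucket`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
result_table = dataset.table "baby_names_results"

# The service expands the * into numbered shards:
# baby-names-000000000000.csv, baby-names-000000000001.csv, ...
result_table.extract "gs://my-export-bucket/baby-names-*.csv"
```
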
## Configuring retries and timeout

You can configure how many times API requests may be automatically retried. When
an API request fails, the response will be inspected to see if the request meets
criteria indicating that it may succeed on retry, such as `500` and `503` status
codes or a specific internal error code such as `rateLimitExceeded`. If it meets
the criteria, the request will be retried after a delay. If another error
occurs, the delay will be increased before a subsequent attempt, until the
`retries` limit is reached.

You can also set the request `timeout` value in seconds.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new retries: 10, timeout: 120
```

See the [BigQuery error
table](https://cloud.google.com/bigquery/troubleshooting-errors#errortable) for
a list of error conditions.

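These values can also be set once for the whole application through the
library's configuration block rather than on each constructor call; a minimal
sketch, assuming `Google::Cloud::Bigquery.configure` as described in the gem's
configuration docs:

```ruby
require "google/cloud/bigquery"

Google::Cloud::Bigquery.configure do |config|
  config.retries = 10
  config.timeout = 120
end

bigquery = Google::Cloud::Bigquery.new # picks up the configured defaults
```
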
## Additional information

Google BigQuery can be configured to use logging. To learn more, see the
{file:LOGGING.md Logging guide}.