google-cloud-bigquery 1.21.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. checksums.yaml +7 -0
  2. data/.yardopts +16 -0
  3. data/AUTHENTICATION.md +158 -0
  4. data/CHANGELOG.md +397 -0
  5. data/CODE_OF_CONDUCT.md +40 -0
  6. data/CONTRIBUTING.md +188 -0
  7. data/LICENSE +201 -0
  8. data/LOGGING.md +27 -0
  9. data/OVERVIEW.md +463 -0
  10. data/TROUBLESHOOTING.md +31 -0
  11. data/lib/google-cloud-bigquery.rb +139 -0
  12. data/lib/google/cloud/bigquery.rb +145 -0
  13. data/lib/google/cloud/bigquery/argument.rb +197 -0
  14. data/lib/google/cloud/bigquery/convert.rb +383 -0
  15. data/lib/google/cloud/bigquery/copy_job.rb +316 -0
  16. data/lib/google/cloud/bigquery/credentials.rb +50 -0
  17. data/lib/google/cloud/bigquery/data.rb +526 -0
  18. data/lib/google/cloud/bigquery/dataset.rb +2845 -0
  19. data/lib/google/cloud/bigquery/dataset/access.rb +1021 -0
  20. data/lib/google/cloud/bigquery/dataset/list.rb +162 -0
  21. data/lib/google/cloud/bigquery/encryption_configuration.rb +123 -0
  22. data/lib/google/cloud/bigquery/external.rb +2432 -0
  23. data/lib/google/cloud/bigquery/extract_job.rb +368 -0
  24. data/lib/google/cloud/bigquery/insert_response.rb +180 -0
  25. data/lib/google/cloud/bigquery/job.rb +657 -0
  26. data/lib/google/cloud/bigquery/job/list.rb +162 -0
  27. data/lib/google/cloud/bigquery/load_job.rb +1704 -0
  28. data/lib/google/cloud/bigquery/model.rb +740 -0
  29. data/lib/google/cloud/bigquery/model/list.rb +164 -0
  30. data/lib/google/cloud/bigquery/project.rb +1655 -0
  31. data/lib/google/cloud/bigquery/project/list.rb +161 -0
  32. data/lib/google/cloud/bigquery/query_job.rb +1695 -0
  33. data/lib/google/cloud/bigquery/routine.rb +1108 -0
  34. data/lib/google/cloud/bigquery/routine/list.rb +165 -0
  35. data/lib/google/cloud/bigquery/schema.rb +564 -0
  36. data/lib/google/cloud/bigquery/schema/field.rb +668 -0
  37. data/lib/google/cloud/bigquery/service.rb +589 -0
  38. data/lib/google/cloud/bigquery/standard_sql.rb +495 -0
  39. data/lib/google/cloud/bigquery/table.rb +3340 -0
  40. data/lib/google/cloud/bigquery/table/async_inserter.rb +520 -0
  41. data/lib/google/cloud/bigquery/table/list.rb +172 -0
  42. data/lib/google/cloud/bigquery/time.rb +65 -0
  43. data/lib/google/cloud/bigquery/version.rb +22 -0
  44. metadata +297 -0
@@ -0,0 +1,463 @@
# Google Cloud BigQuery

Google BigQuery enables super-fast, SQL-like queries against massive datasets,
using the processing power of Google's infrastructure. To learn more, read [What
is BigQuery?](https://cloud.google.com/bigquery/what-is-bigquery).

The goal of google-cloud is to provide an API that is comfortable to Rubyists.
Your authentication credentials are detected automatically in Google Cloud
Platform (GCP), including Google Compute Engine (GCE), Google Kubernetes Engine
(GKE), Google App Engine (GAE), Google Cloud Functions (GCF) and Cloud Run. In
other environments you can configure authentication easily, either directly in
your code or via environment variables. Read more about the options for
connecting in the {file:AUTHENTICATION.md Authentication Guide}.

To help you get started quickly, the first few examples below use a public
dataset provided by Google. As soon as you have [signed
up](https://cloud.google.com/bigquery/sign-up) to use BigQuery, and provided
that you stay in the free tier for queries, you should be able to run these
first examples without the need to set up billing or to load data (although
we'll show you how to do that too).

## Listing Datasets and Tables

A BigQuery project contains datasets, which in turn contain tables. Assuming
that you have not yet created datasets or tables in your own project, let's
connect to Google's `bigquery-public-data` project, and see what we find.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

bigquery.datasets.count #=> 1
bigquery.datasets.first.dataset_id #=> "samples"

dataset = bigquery.datasets.first
tables = dataset.tables

tables.count #=> 7
tables.map(&:table_id) #=> [..., "shakespeare", "trigrams", "wikipedia"]
```

In addition to listing all datasets and tables in the project, you can also
retrieve individual datasets and tables by ID. Let's look at the structure of
the `shakespeare` table, which contains an entry for every word in every play
written by Shakespeare.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: "bigquery-public-data"

dataset = bigquery.dataset "samples"
table = dataset.table "shakespeare"

table.headers #=> [:word, :word_count, :corpus, :corpus_date]
table.rows_count #=> 164656
```

Now that you know the column names for the Shakespeare table, let's write and
run a few queries against it.

## Running queries

BigQuery supports two SQL dialects: [standard
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/) and the
older [legacy SQL (BigQuery
SQL)](https://cloud.google.com/bigquery/docs/reference/legacy-sql), as discussed
in the guide [Migrating from legacy
SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql).

### Standard SQL

Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.
It is compliant with the SQL 2011 standard, and has extensions that support
querying nested and repeated data. This is the default syntax. It has several
advantages over legacy SQL, including:

* Composability using `WITH` clauses and SQL functions
* Subqueries in the `SELECT` list and `WHERE` clause
* Correlated subqueries
* `ARRAY` and `STRUCT` data types
* Inserts, updates, and deletes
* `COUNT(DISTINCT <expr>)` is exact and scalable, providing the accuracy of
  `EXACT_COUNT_DISTINCT` without its limitations
* Automatic predicate push-down through `JOIN`s
* Complex `JOIN` predicates, including arbitrary expressions

For examples that demonstrate some of these features, see [Standard SQL
highlights](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#standard_sql_highlights).

As shown in this example, standard SQL is the library default:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN ('me', 'I', 'you') GROUP BY word"
data = bigquery.query sql
```

Notice that in standard SQL, a fully-qualified table name uses the following
format: <code>`my-dashed-project.dataset1.tableName`</code>.

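The `query` method returns a {Google::Cloud::Bigquery::Data} object, which is
enumerable and yields each row as a hash with symbol keys. A minimal sketch of
iterating the results of the query above:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN ('me', 'I', 'you') GROUP BY word"
data = bigquery.query sql

# Each row is a hash with symbol keys.
data.each do |row|
  puts "#{row[:word]}: #{row[:word_count]}"
end

# Use `all` to enumerate across result pages automatically.
data.all { |row| puts row[:word] }
```
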
### Legacy SQL (formerly BigQuery SQL)

Before version 2.0, BigQuery executed queries using a non-standard SQL dialect
known as BigQuery SQL. This variant is optional, and can be enabled by passing
the flag `legacy_sql: true` with your query. (If you get an SQL syntax error
with a query that may be written in legacy SQL, be sure that you are passing
this option.)

To use legacy SQL, pass the option `legacy_sql: true` with your query:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT TOP(word, 50) as word, COUNT(*) as count " \
      "FROM [bigquery-public-data:samples.shakespeare]"
data = bigquery.query sql, legacy_sql: true
```

Notice that in legacy SQL, a fully-qualified table name uses brackets instead of
back-ticks, and a colon instead of a dot to separate the project and the
dataset: `[my-dashed-project:dataset1.tableName]`.

#### Query parameters

With standard SQL, you can use positional or named query parameters. This
example shows the use of named parameters:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word IN UNNEST(@words) GROUP BY word"
data = bigquery.query sql, params: { words: ['me', 'I', 'you'] }
```

As demonstrated above, passing the `params` option will automatically set
`standard_sql` to `true`.

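Positional parameters are also supported. A minimal sketch against the same
table, using `?` placeholders in the SQL and an array for `params` (the
corpus and threshold values are arbitrary):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE corpus = ? AND word_count >= ? GROUP BY word"
data = bigquery.query sql, params: ["hamlet", 100]
```
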
#### Data types

BigQuery standard SQL supports simple data types such as integers, as well as
more complex types such as `ARRAY` and `STRUCT`.

The BigQuery data types are converted to and from Ruby types as follows:

| BigQuery    | Ruby                                 | Notes |
|-------------|--------------------------------------|-------|
| `BOOL`      | `true`/`false`                       | |
| `INT64`     | `Integer`                            | |
| `FLOAT64`   | `Float`                              | |
| `NUMERIC`   | `BigDecimal`                         | Will be rounded to 9 decimal places. |
| `STRING`    | `String`                             | |
| `DATETIME`  | `DateTime`                           | `DATETIME` does not support time zones. |
| `DATE`      | `Date`                               | |
| `TIMESTAMP` | `Time`                               | |
| `TIME`      | `Google::Cloud::Bigquery::Time`      | |
| `BYTES`     | `File`, `IO`, `StringIO`, or similar | |
| `ARRAY`     | `Array`                              | Nested arrays and `nil` values are not supported. |
| `STRUCT`    | `Hash`                               | Hash keys may be strings or symbols. |

See [Data
Types](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types)
for an overview of each BigQuery data type, including allowed values.

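For instance, `ARRAY` and `STRUCT` parameters round-trip as Ruby arrays and
hashes. A minimal sketch (the query itself is hypothetical; the conversions
follow the table above):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT @name AS name, @scores AS scores, @home AS home"
data = bigquery.query sql, params: {
  name:   "Anna",                          # STRING
  scores: [98, 87, 92],                    # ARRAY<INT64>
  home:   { place: "Stockholm", years: 2 } # STRUCT
}

row = data.first
row[:scores] #=> [98, 87, 92]
row[:home]   #=> {:place=>"Stockholm", :years=>2}
```
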
### Running Queries

Let's start with the simplest way to run a query. Notice that this time you are
connecting using your own default project. It is necessary to have write access
to the project for running a query, since queries need to create tables to hold
results.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
data = bigquery.query sql

data.next? #=> false
data.first #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
```

The `APPROX_TOP_COUNT` function shown above is just one of a variety of
functions offered by BigQuery. See the [Query Reference (standard
SQL)](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators)
for a full listing.

### Query Jobs

It is usually best not to block for most BigQuery operations, including querying
as well as importing, exporting, and copying data. Therefore, the BigQuery API
provides facilities for managing longer-running jobs. With this approach, an
instance of {Google::Cloud::Bigquery::QueryJob} is returned, rather than an
instance of {Google::Cloud::Bigquery::Data}.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sql = "SELECT APPROX_TOP_COUNT(corpus, 10) as title, " \
      "COUNT(*) as unique_words " \
      "FROM `bigquery-public-data.samples.shakespeare`"
job = bigquery.query_job sql

job.wait_until_done!
if !job.failed?
  job.data.first
  #=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
end
```

Once you have determined that the job is done and has not failed, you can obtain
an instance of {Google::Cloud::Bigquery::Data} by calling `data` on the job
instance. The query results for both of the above examples are stored in
temporary tables with a lifetime of about 24 hours. See the final example below
for a demonstration of how to store query results in a permanent table.

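Because jobs are identified by IDs on the BigQuery service, you can also look a
job up again later instead of holding on to the original object. A minimal
sketch, assuming a hypothetical job ID saved earlier from `job.job_id`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

job = bigquery.job "job_abc123" # hypothetical ID saved from job.job_id

if job && job.done? && !job.failed?
  # Only query jobs expose result data.
  job.data.first if job.is_a? Google::Cloud::Bigquery::QueryJob
end
```
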
## Creating Datasets and Tables

The first thing you need to do in a new BigQuery project is to create a
{Google::Cloud::Bigquery::Dataset}. Datasets hold tables and control access to
them.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

dataset = bigquery.create_dataset "my_dataset"
```

Now that you have a dataset, you can use it to create a table. Every table is
defined by a schema that may contain nested and repeated fields. The example
below shows a schema with a repeated record field named `cities_lived`. (For
more information about nested and repeated fields, see [Preparing Data for
Loading](https://cloud.google.com/bigquery/preparing-data-for-loading).)

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"

table = dataset.create_table "people" do |schema|
  schema.string "first_name", mode: :required
  schema.record "cities_lived", mode: :repeated do |nested_schema|
    nested_schema.string "place", mode: :required
    nested_schema.integer "number_of_years", mode: :required
  end
end
```

Because of the repeated field in this schema, we cannot use the CSV format to
load data into the table.

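Formats that support nested and repeated fields, such as newline-delimited JSON,
work fine. A minimal sketch, assuming a hypothetical local file named
`people.json` containing one JSON record per line that matches the schema above:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

# Each line of the file is a JSON object matching the schema above.
file = File.open "people.json"
table.load file, format: "json"
```
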
## Loading records

To follow along with these examples, you will need to set up billing on the
[Google Developers Console](https://console.developers.google.com).

In addition to CSV, data can be imported from files that are formatted as
[Newline-delimited JSON](http://jsonlines.org/),
[Avro](http://avro.apache.org/),
[ORC](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc),
[Parquet](https://parquet.apache.org/) or from a Google Cloud Datastore backup.
It can also be "streamed" into BigQuery.

### Streaming records

For situations in which you want new data to be available for querying as soon
as possible, inserting individual records directly from your Ruby application is
a great approach.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```

To avoid making RPCs (network requests) to retrieve the dataset and table
resources when streaming records, pass the `skip_lookup` option. This creates
local objects without verifying that the resources exist on the BigQuery
service.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset", skip_lookup: true
table = dataset.table "people", skip_lookup: true

rows = [
  {
    "first_name" => "Anna",
    "cities_lived" => [
      {
        "place" => "Stockholm",
        "number_of_years" => 2
      }
    ]
  },
  {
    "first_name" => "Bob",
    "cities_lived" => [
      {
        "place" => "Seattle",
        "number_of_years" => 5
      },
      {
        "place" => "Austin",
        "number_of_years" => 6
      }
    ]
  }
]
table.insert rows
```

There are some trade-offs involved with streaming, so be sure to read the
discussion of data consistency in [Streaming Data Into
BigQuery](https://cloud.google.com/bigquery/streaming-data-into-bigquery).

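For higher-volume streaming, this version of the gem also ships an asynchronous
inserter (see `table/async_inserter.rb` in the file list above) that batches
rows in a background thread. A rough sketch of typical usage; treat the
result-handling details as assumptions to verify against the class
documentation:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

inserter = table.insert_async do |result|
  if result.error?
    puts "insert failed: #{result.error}"
  else
    puts "inserted #{result.insert_count} rows (#{result.error_count} errors)"
  end
end

rows = [
  { "first_name" => "Carla",
    "cities_lived" => [{ "place" => "Lisbon", "number_of_years" => 3 }] }
]
inserter.insert rows

inserter.stop.wait! # flush pending rows and shut down the background thread
```
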
### Uploading a file

To follow along with this example, please download the
[names.zip](http://www.ssa.gov/OACT/babynames/names.zip) archive from the U.S.
Social Security Administration. Inside the archive you will find over 100 files
containing baby name records since the year 1880.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.create_table "baby_names" do |schema|
  schema.string "name", mode: :required
  schema.string "gender", mode: :required
  schema.integer "count", mode: :required
end

file = File.open "names/yob2014.txt"
table.load file, format: "csv"
```

The names data is formatted as CSV but distributed in files with a `.txt`
extension, so this example passes the `format` option explicitly to show how to
handle such situations. Because CSV is the default format for load operations,
the option is not strictly necessary here; for JSON saved with a `.txt`
extension, however, it would be.

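As with `query` and `query_job`, loading also has a job-based variant for when
you would rather not block while the load runs. A minimal sketch using
`load_job` with the same file:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "baby_names"

load_job = table.load_job File.open("names/yob2014.txt"), format: "csv"

load_job.wait_until_done!
load_job.failed? #=> false
```
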
## Exporting query results to Google Cloud Storage

The example below shows how to pass the `table` option with a query in order to
store results in a permanent table. It also shows how to export the result data
to a Google Cloud Storage file. In order to follow along, you will need to
enable the Google Cloud Storage API in addition to setting up billing.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
source_table = dataset.table "baby_names"
result_table = dataset.create_table "baby_names_results"

sql = "SELECT name, count " \
      "FROM baby_names " \
      "WHERE gender = 'M' " \
      "ORDER BY count ASC LIMIT 5"
query_job = dataset.query_job sql, table: result_table

query_job.wait_until_done!

if !query_job.failed?
  require "google/cloud/storage"
  require "securerandom"

  storage = Google::Cloud::Storage.new
  bucket_id = "bigquery-exports-#{SecureRandom.uuid}"
  bucket = storage.create_bucket bucket_id
  extract_url = "gs://#{bucket.id}/baby-names.csv"

  result_table.extract extract_url

  # Download to local filesystem
  bucket.files.first.download "baby-names.csv"
end
```

If a table you wish to export contains a large amount of data, you can pass a
wildcard URI to export to multiple files (for sharding), or an array of URIs
(for partitioning), or both. See [Exporting
Data](https://cloud.google.com/bigquery/docs/exporting-data) for details.

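For example, a wildcard URI shards the export across as many files as the
service needs. A minimal sketch, assuming a hypothetical bucket named
`my-export-bucket`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
result_table = dataset.table "baby_names_results"

# The service expands the * into numbered shards:
# baby-names-000000000000.csv, baby-names-000000000001.csv, ...
result_table.extract "gs://my-export-bucket/baby-names-*.csv"
```
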
## Configuring retries and timeout

You can configure how many times API requests may be automatically retried. When
an API request fails, the response will be inspected to see if the request meets
criteria indicating that it may succeed on retry, such as `500` and `503` status
codes or a specific internal error code such as `rateLimitExceeded`. If it meets
the criteria, the request will be retried after a delay. If another error
occurs, the delay will be increased before a subsequent attempt, until the
`retries` limit is reached.

You can also set the request `timeout` value in seconds.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new retries: 10, timeout: 120
```

See the [BigQuery error
table](https://cloud.google.com/bigquery/troubleshooting-errors#errortable) for
a list of error conditions.

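These values can also be set once for the whole application through the
library's configuration block rather than on each constructor call; a minimal
sketch, assuming `Google::Cloud::Bigquery.configure` as described in the gem's
configuration docs:

```ruby
require "google/cloud/bigquery"

Google::Cloud::Bigquery.configure do |config|
  config.retries = 10
  config.timeout = 120
end

bigquery = Google::Cloud::Bigquery.new # picks up the configured defaults
```
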
## Additional information

Google BigQuery can be configured to use logging. To learn more, see the
{file:LOGGING.md Logging guide}.