RubyGems - embulk-output-bigquery - Versions diffs - 0.7.4 → 0.7.5 - Mend

embulk-output-bigquery 0.7.4 → 0.7.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +4 -4
data/CHANGELOG.md +3 -0
data/README.md +24 -0
data/embulk-output-bigquery.gemspec +1 -1
data/lib/embulk/output/bigquery/bigquery_client.rb +12 -0
data/lib/embulk/output/bigquery.rb +43 -1
data/test/test_configure.rb +38 -0
metadata +5 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ee5e3d543b40d3a9e1cd8a0d9af9aa05ed84e12abcd805d6b6cc9c9ffc79e825
-  data.tar.gz: 4257cb1626c92e3d46be2dff689b0a4b1efa8983c7531913a195cb074e77bc36
+  metadata.gz: bacb610086a2bbd94300aa3401565e2101bf8b094ef10a75e5c666d768ae5190
+  data.tar.gz: 6121440d4864f5561567ad6a4bc64151377bb8b840a6954e8303435cd83c291d
 SHA512:
-  metadata.gz: 9559bef20b7a5f644871f74bd64dbf90c9776deccbd0a59b2516a8a2fdab9a4952dd78dc8ec365992384286ec2f23ae15fa59d8e71f5dff36a72b38860a74bfe
-  data.tar.gz: edb4085785ad9ae94a53e31f4f13afec71da983856e0b94bde3e080d3280fb9e28c60ce259ecc3e336a77035afc9433e7ece11b3bdd6697c5e8bf7a52462eeb1
+  metadata.gz: 2e9bf6482b42a2d2a159babb0213418330283ec81b9a6bfeb2e85d1b1feed1cbf2d5c955007f055144951a4d51793cb996111229b98035031943731974bc57ae
+  data.tar.gz: 149abe8691c92b5ab32db84cfac98a71a7601b63f1b210eb9bb6011fb5124b80a8cd93fbb475809cd6dafa114792bd7599555dfdc27d6a34a827876efc2aa33d
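
The digests in `checksums.yaml` cover the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. As a rough sketch of how such values could be recomputed locally, the hypothetical helper below (`gem_member_digests` is an illustrative name, not part of RubyGems) hashes a byte string with Ruby's standard `Digest` classes:

```ruby
require 'digest/sha2'

# Hypothetical helper (not a RubyGems API): compute the two digests that
# checksums.yaml records for one archive member, given its raw bytes.
def gem_member_digests(bytes)
  {
    'SHA256' => Digest::SHA256.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes),
  }
end

# For a real check, pass the bytes of an extracted member, for example
# File.binread('data.tar.gz'), and compare with the values listed above.
digests = gem_member_digests('example bytes')
puts digests['SHA256'].length # 64 hex characters
puts digests['SHA512'].length # 128 hex characters
```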

data/CHANGELOG.md CHANGED

@@ -1,3 +1,6 @@
+## 0.7.5 - 2025-05-13
+* [enhancement] Add range partitioning support (Thanks to kitagry) #174
 ## 0.7.4 - 2024-12-19
 * [maintenance] Primary location unless location is set explicitly (Thanks to joker1007) #172

data/README.md CHANGED

@@ -110,6 +110,12 @@ Following options are same as [bq command-line tools](https://cloud.google.com/b
 |  time_partitioning.type           | string   | required  | nil     | The only type supported is DAY, which will generate one partition per day based on data loading time. |
 |  time_partitioning.expiration_ms  | int      | optional  | nil     | Number of milliseconds for which to keep the storage for a partition. |
 |  time_partitioning.field          | string   | optional  | nil     | `DATE` or `TIMESTAMP` column used for partitioning |
+|  range_partitioning               | hash     | optional  | nil     | See [Range Partitioning](#range-partitioning) |
+|  range_partitioning.field         | string   | required  | nil     | `INT64` column used for partitioning |
+|  range-partitioning.range         | hash     | required  | nil     | Defines the ranges for range paritioning |
+|  range-partitioning.range.start   | int      | required  | nil     | The start of range partitioning, inclusive. |
+|  range-partitioning.range.end     | int      | required  | nil     | The end of range partitioning, exclusive. |
+|  range-partitioning.range.interval| int      | required  | nil     | The width of each interval. |
 |  clustering                       | hash     | optional  | nil     | Currently, clustering is supported for partitioned tables, so must be used with `time_partitioning` option. See [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables) |
 |  clustering.fields                | array    | required  | nil     | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
 |  schema_update_options            | array    | optional  | nil     | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE for the current status: `schema_update_options` does not work for `copy` job, that is, is not effective for most of modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes origin table so does not need to update schema. Only `append_direct` can utilize schema update. |
@@ -448,6 +454,24 @@ MEMO: [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/big
 to update the schema of the desitination table as a side effect of the load job, but it is not available for copy job.
 Thus, it was not suitable for embulk-output-bigquery idempotence modes, `append`, `replace`, and `replace_backup`, sigh.
+### Range Partitioning
+See also [Creating and Updating Range-Partitioned Tables](https://cloud.google.com/bigquery/docs/creating-partitioned-tables).
+To load into a partition, specify `range_partitioning` and `table` parameter with a partition decorator as:
+```yaml
+out:
+  type: bigquery
+  table: table_name$1
+  range_partitioning:
+    field: customer_id
+    range:
+      start: 1
+      end: 99999
+      interval: 1
+```
 ## Development
 ### Run example:
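
The README rows added in this hunk describe `range.start` as inclusive, `range.end` as exclusive, and `range.interval` as the width of each partition. The sketch below is a simplified model of those three statements only, not BigQuery's implementation; `range_partition_start` is a hypothetical name, and the keyword `stop` stands in for `end`, which is a reserved word in Ruby.

```ruby
# Hypothetical helper (not part of embulk-output-bigquery): model of the
# README semantics. Returns the lower bound of the partition that would
# hold `value`, or nil when the value lies outside [start, stop).
def range_partition_start(value, start:, stop:, interval:)
  return nil if value < start || value >= stop # start inclusive, end exclusive

  # Integer division finds how many whole intervals separate value from start.
  start + ((value - start) / interval) * interval
end

p range_partition_start(42, start: 0, stop: 100, interval: 10)  # => 40
p range_partition_start(100, start: 0, stop: 100, interval: 10) # => nil
```

With the README's example settings (`start: 1`, `end: 99999`, `interval: 1`), every in-range `customer_id` maps to a partition of its own under this model.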

data/embulk-output-bigquery.gemspec CHANGED

@@ -1,6 +1,6 @@
 Gem::Specification.new do |spec|
   spec.name          = "embulk-output-bigquery"
-  spec.version       = "0.7.4"
+  spec.version       = "0.7.5"
   spec.authors       = ["Satoshi Akama", "Naotoshi Seo"]
   spec.summary       = "Google BigQuery output plugin for Embulk"
   spec.description   = "Embulk plugin that insert records to Google BigQuery."

data/lib/embulk/output/bigquery/bigquery_client.rb CHANGED

@@ -435,6 +435,18 @@ module Embulk
               }
             end
+            options['range_partitioning'] ||= @task['range_partitioning']
+            if options['range_partitioning']
+              body[:range_partitioning] = {
+                field: options['range_partitioning']['field'],
+                range: {
+                  start: options['range_partitioning']['range']['start'].to_s,
+                  end: options['range_partitioning']['range']['end'].to_s,
+                  interval: options['range_partitioning']['range']['interval'].to_s,
+                },
+              }
+            end
             options['clustering'] ||= @task['clustering']
             if options['clustering']
               body[:clustering] = {
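
The hunk above copies the configured integers into the table body as strings via `to_s`, which matches how the BigQuery REST API encodes int64 values in JSON. The following stand-alone sketch reproduces that mapping outside the client class; `range_partitioning_body` is a hypothetical name, and the input hash is assumed to have already passed configuration validation.

```ruby
# Hypothetical stand-alone version of the mapping added in this hunk.
# Input: the `range_partitioning` config hash with string keys.
# Output: the hash assigned to body[:range_partitioning].
def range_partitioning_body(range_partitioning)
  range = range_partitioning['range']
  {
    field: range_partitioning['field'],
    range: {
      # Integers are serialized as strings, as in the diff above.
      start: range['start'].to_s,
      end: range['end'].to_s,
      interval: range['interval'].to_s,
    },
  }
end

config = {
  'field' => 'customer_id',
  'range' => { 'start' => 1, 'end' => 99999, 'interval' => 1 },
}
p range_partitioning_body(config)
```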

data/lib/embulk/output/bigquery.rb CHANGED

@@ -89,6 +89,7 @@ module Embulk
           'ignore_unknown_values'          => config.param('ignore_unknown_values',          :bool,    :default => false),
           'allow_quoted_newlines'          => config.param('allow_quoted_newlines',          :bool,    :default => false),
           'time_partitioning'              => config.param('time_partitioning',              :hash,    :default => nil),
+          'range_partitioning'             => config.param('range_partitioning',             :hash,    :default => nil),
           'clustering'                     => config.param('clustering',                     :hash,    :default => nil), # google-api-ruby-client >= v0.21.0
           'schema_update_options'          => config.param('schema_update_options',          :array,   :default => nil),
@@ -227,14 +228,55 @@ module Embulk
           task['abort_on_error'] = (task['max_bad_records'] == 0)
         end
+        if task['time_partitioning'] && task['range_partitioning']
+          raise ConfigError.new "`time_partitioning` and `range_partitioning` cannot be used at the same time"
+        end
         if task['time_partitioning']
           unless task['time_partitioning']['type']
             raise ConfigError.new "`time_partitioning` must have `type` key"
           end
-        elsif Helper.has_partition_decorator?(task['table'])
+        end
+        if Helper.has_partition_decorator?(task['table'])
+          if task['range_partitioning']
+            raise ConfigError.new "Partition decorators(`#{task['table']}`) don't support `range_partition`"
+          end
           task['time_partitioning'] = {'type' => 'DAY'}
         end
+        if task['range_partitioning']
+          unless task['range_partitioning']['field']
+            raise ConfigError.new "`range_partitioning` must have `field` key"
+          end
+          unless task['range_partitioning']['range']
+            raise ConfigError.new "`range_partitioning` must have `range` key"
+          end
+          range = task['range_partitioning']['range']
+          unless range['start']
+            raise ConfigError.new "`range_partitioning` must have `range.start` key"
+          end
+          unless range['start'].is_a?(Integer)
+            raise ConfigError.new "`range_partitioning.range.start` must be an integer"
+          end
+          unless range['end']
+            raise ConfigError.new "`range_partitioning` must have `range.end` key"
+          end
+          unless range['end'].is_a?(Integer)
+            raise ConfigError.new "`range_partitioning.range.end` must be an integer"
+          end
+          unless range['interval']
+            raise ConfigError.new "`range_partitioning` must have `range.interval` key"
+          end
+          unless range['interval'].is_a?(Integer)
+            raise ConfigError.new "`range_partitioning.range.interval` must be an integer"
+          end
+          if range['start'] + range['interval'] >= range['end']
+            raise ConfigError.new "`range_partitioning.range.start` + `range_partitioning.range.interval` must be less than `range_partitioning.range.end`"
+          end
+        end
         if task['clustering']
           unless task['clustering']['fields']
             raise ConfigError.new "`clustering` must have `fields` key"
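
The `range_partitioning` checks added in this hunk can be exercised in isolation. The sketch below condenses them into one hypothetical method, `validate_range_partitioning!`, with a local `ConfigError` class standing in for Embulk's; the order of checks and the final comparison follow the diff, while the loop and the shortened messages are a rewrite for brevity.

```ruby
# Stand-in for Embulk's ConfigError so the sketch runs on its own.
class ConfigError < StandardError; end

# Hypothetical condensed form of the checks added to Bigquery.configure.
def validate_range_partitioning!(range_partitioning)
  unless range_partitioning['field']
    raise ConfigError, '`range_partitioning` must have `field` key'
  end
  range = range_partitioning['range']
  raise ConfigError, '`range_partitioning` must have `range` key' unless range

  %w[start end interval].each do |key|
    # 0 is truthy in Ruby, so a start of 0 passes the presence check.
    raise ConfigError, "`range_partitioning` must have `range.#{key}` key" unless range[key]
    raise ConfigError, "`range_partitioning.range.#{key}` must be an integer" unless range[key].is_a?(Integer)
  end

  # Same comparison as the diff: start + interval must be strictly below end.
  if range['start'] + range['interval'] >= range['end']
    raise ConfigError, '`range.start` + `range.interval` must be less than `range.end`'
  end
  true
end

validate_range_partitioning!('field' => 'foo', 'range' => { 'start' => 1, 'end' => 3, 'interval' => 1 })
begin
  validate_range_partitioning!('field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 })
rescue ConfigError => e
  puts "rejected: #{e.message}"
end
```

Because the comparison uses `>=`, a range that would hold exactly one partition (`start + interval == end`) is rejected, as the second call shows.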

data/test/test_configure.rb CHANGED

@@ -270,6 +270,44 @@ module Embulk
         assert_equal 'DAY', task['time_partitioning']['type']
       end
+      def test_range_partitioning
+        config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 3, 'interval' => 1 }})
+        assert_nothing_raised { Bigquery.configure(config, schema, processor_count) }
+        # field is required
+        config = least_config.merge('range_partitioning' => {'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        # range is required
+        config = least_config.merge('range_partitioning' => {'field' => 'foo'})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        # range.start is required
+        config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'end' => 2, 'interval' => 1 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        # range.end is required
+        config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'interval' => 1 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        # range.interval is required
+        config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        # range.start + range.interval should be less than range.end
+        config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 2 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+      end
+      def test_time_and_range_partitioning_error
+        config = least_config.merge('time_partitioning' => {'type' => 'DAY'}, 'range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+        config = least_config.merge('table' => 'table_name$20160912', 'range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+        assert_raise { Bigquery.configure(config, schema, processor_count) }
+      end
       def test_clustering
         config = least_config.merge('clustering' => {'fields' => ['field_a']})
         assert_nothing_raised { Bigquery.configure(config, schema, processor_count) }

metadata CHANGED

@@ -1,15 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: embulk-output-bigquery
 version: !ruby/object:Gem::Version
-  version: 0.7.4
+  version: 0.7.5
 platform: ruby
 authors:
 - Satoshi Akama
 - Naotoshi Seo
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-12-19 00:00:00.000000000 Z
+date: 2025-05-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: google-apis-storage_v1
@@ -147,7 +147,7 @@ homepage: https://github.com/embulk/embulk-output-bigquery
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -163,7 +163,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubygems_version: 3.5.3
-signing_key:
+signing_key:
 specification_version: 4
 summary: Google BigQuery output plugin for Embulk
 test_files: