RubyGems - spark-connect - Versions diffs - 0.2.0 → 0.3.0 - Mend

spark-connect 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +4 -4
data/CHANGELOG.md +56 -1
data/README.md +8 -2
data/lib/spark_connect/arrow.rb +5 -1
data/lib/spark_connect/client.rb +7 -3
data/lib/spark_connect/conf.rb +1 -1
data/lib/spark_connect/data_frame.rb +19 -2
data/lib/spark_connect/functions.rb +1 -1
data/lib/spark_connect/session.rb +1 -1
data/lib/spark_connect/version.rb +1 -1
metadata +6 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ad83f60b70501f49722aa4ae51d768d27accbeb056c7f6fe1be695e010116c52
-  data.tar.gz: 528b67cacc0ba1b516aa93228c242d07efcaa86ae297d75c9c11c9d36905a00f
+  metadata.gz: 27eea75da0f1e6659b7371732fc56d2f103e0e79f60662760046303b0af5671f
+  data.tar.gz: 20676c87b08bd17e425722cc74b8fdbd2a2e4fb97e4c0bb15b63a9f2a1602e45
 SHA512:
-  metadata.gz: ceb4e18bcc75a30a38fe4c2446ae192f07a24446f86b5a76b9992b4f6a70e59bfb3d4709f633988eacd00fa85d174b8904b80eb291855dc943a7c8e10f9c7f4d
-  data.tar.gz: e06a6c98eee8042f56aabc5689655ad8bb7b04c843d90d2093d24fc6c82247993c0e47b002125ab774b1ce87c0e3b4203a88de6864a0c3dde1b45e7236db29f0
+  metadata.gz: 3ca57c3c06de5909d836c9f3da9fd52eb70016262e8cbf5be90bf0082ff579c63c49b7586592eb6e465a090f16126a2e47fdac4f47c21402e4af2da356b2dfea
+  data.tar.gz: e68ec46d9a26c71201a456191847825c85d842bcfb76d2001eab07f6c5bbf7147f43feea68eeb85472f2d9b5a91a9ea5dcd43a80a70225fa48edf4e1a35deace

data/CHANGELOG.md CHANGED

@@ -7,6 +7,59 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.3.0] - 2026-06-15
+Minor-version release. Contains the same fixes as 0.2.1 (listed below); there
+are no functional changes beyond 0.2.1.
+### Fixed
+- **Correct `TimestampType` instants in `create_data_frame`.** Timestamp columns
+  were shipped as zone-less Arrow timestamps, so the server interpreted the epoch
+  micros as session-local wall-clock and shifted the value by the session time
+  zone. They are now tagged UTC; `TimestampNTZType` remains zone-less.
+- **`RuntimeConfig#get` with a non-String default no longer raises.** A non-String
+  default (e.g. `conf.get(key, 8)`) was passed straight into a protobuf string
+  field, raising `Google::Protobuf::TypeError`. The default is now coerced to a
+  String, matching `#set`.
+- **No duplicate rows when an execute stream is retried.** The result accumulator
+  was created outside the retry loop, so a mid-stream gRPC failure replayed
+  already-consumed Arrow batches and duplicated rows on retry. The accumulator is
+  now reset per attempt.
+- **`DataFrame#drop_duplicates_within_watermark` is now watermark-aware.** It was a
+  plain alias of `#drop_duplicates` and never set the `within_watermark` flag, so
+  it silently performed an ordinary deduplication. Added a `dropDuplicatesWithinWatermark`
+  alias.
+- **`SparkSession::Builder#app_name` is now applied.** `create` explicitly skipped
+  `spark.app.name`, making `#app_name` a no-op; all builder options are now
+  forwarded to the new session.
+- Corrected misleading doc comments for `Functions#nanvl` and `DataFrame#except_all`.
+## [0.2.1] - 2026-06-15
+### Fixed
+- **Correct `TimestampType` instants in `create_data_frame`.** Timestamp columns
+  were shipped as zone-less Arrow timestamps, so the server interpreted the epoch
+  micros as session-local wall-clock and shifted the value by the session time
+  zone. They are now tagged UTC; `TimestampNTZType` remains zone-less.
+- **`RuntimeConfig#get` with a non-String default no longer raises.** A non-String
+  default (e.g. `conf.get(key, 8)`) was passed straight into a protobuf string
+  field, raising `Google::Protobuf::TypeError`. The default is now coerced to a
+  String, matching `#set`.
+- **No duplicate rows when an execute stream is retried.** The result accumulator
+  was created outside the retry loop, so a mid-stream gRPC failure replayed
+  already-consumed Arrow batches and duplicated rows on retry. The accumulator is
+  now reset per attempt.
+- **`DataFrame#drop_duplicates_within_watermark` is now watermark-aware.** It was a
+  plain alias of `#drop_duplicates` and never set the `within_watermark` flag, so
+  it silently performed an ordinary deduplication. Added a `dropDuplicatesWithinWatermark`
+  alias.
+- **`SparkSession::Builder#app_name` is now applied.** `create` explicitly skipped
+  `spark.app.name`, making `#app_name` a no-op; all builder options are now
+  forwarded to the new session.
+- Corrected misleading doc comments for `Functions#nanvl` and `DataFrame#except_all`.
 ## [0.2.0] - 2026-06-10
 ### Added
@@ -77,6 +130,8 @@ Initial release.
 - Vendored Spark Connect 4.1 protobuf/gRPC definitions and a regeneration script
   (`bin/generate-protos`).
-[Unreleased]: https://github.com/HyukjinKwon/spark-connect-ruby/compare/v0.2.0...HEAD
+[Unreleased]: https://github.com/HyukjinKwon/spark-connect-ruby/compare/v0.3.0...HEAD
+[0.3.0]: https://github.com/HyukjinKwon/spark-connect-ruby/compare/v0.2.1...v0.3.0
+[0.2.1]: https://github.com/HyukjinKwon/spark-connect-ruby/compare/v0.2.0...v0.2.1
 [0.2.0]: https://github.com/HyukjinKwon/spark-connect-ruby/compare/v0.1.0...v0.2.0
 [0.1.0]: https://github.com/HyukjinKwon/spark-connect-ruby/releases/tag/v0.1.0

data/README.md CHANGED

@@ -64,12 +64,14 @@ See the [installation guide](https://hyukjinkwon.github.io/spark-connect-ruby/in
 ## Installation
 ```bash
+gem install rubygems-requirements-system
 gem install spark-connect
 ```
 Or in a `Gemfile`:
 ```ruby
+plugin "rubygems-requirements-system"
 gem "spark-connect"
 ```
@@ -80,10 +82,14 @@ gem "spark-connect"
 curl -fsSL https://archive.apache.org/dist/spark/spark-4.1.0/spark-4.1.0-bin-hadoop3.tgz | tar xz
 cd spark-4.1.0-bin-hadoop3
-# Start the Connect server (requires Java 17+)
-./sbin/start-connect-server.sh --jars "$(pwd)/jars/spark-connect_2.13-4.1.0.jar"
+# Start the Connect server (requires Java 17+).
+# Spark 4.0.0+ bundles the Connect server, so no extra packages are needed.
+./sbin/start-connect-server.sh
 ```
+On **Spark 3.5.x** the Connect server is not bundled; pull it in with
+`--packages "org.apache.spark:spark-connect_2.13:3.5.5"` (use a Scala 2.13 distribution).
 The server listens on `sc://localhost:15002` by default.
 ## Connecting

data/lib/spark_connect/arrow.rb CHANGED

@@ -103,7 +103,11 @@ module SparkConnect
       when Types::StringType, Types::CharType, Types::VarcharType then :string
       when Types::BinaryType then :binary
       when Types::DateType then :date32
-      when Types::TimestampType, Types::TimestampNTZType then { type: :timestamp, unit: :micro }
+      # TimestampType is an instant: tag it UTC so the server reads the epoch
+      # micros as a point in time rather than session-local wall-clock. The NTZ
+      # variant stays zone-less (wall-clock) to match its semantics.
+      when Types::TimestampType then ::Arrow::TimestampDataType.new(:micro, GLib::TimeZone.new("UTC"))
+      when Types::TimestampNTZType then { type: :timestamp, unit: :micro }
       when Types::ArrayType
         { type: :list, field: { name: "element", type: arrow_field_type(data_type.element_type) } }
       when Types::StructType
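
The instant versus wall-clock distinction described in the comment above can be illustrated in plain Ruby, without Arrow or the gem. This is only a sketch: `read_as_instant`, `read_as_wall_clock`, and the `-08:00` session offset are hypothetical names and values chosen for the illustration, and the server's actual conversion logic is not shown in this diff.

```ruby
# Illustration only: plain Ruby Time arithmetic, not the spark-connect or Arrow API.
instant = Time.utc(2026, 6, 15, 12, 0, 0)
epoch_micros = instant.to_i * 1_000_000 + instant.usec

# UTC-tagged reading: the micros denote a point in time, so the instant round-trips.
def read_as_instant(micros)
  Time.at(micros / 1_000_000, micros % 1_000_000).utc
end

# Zone-less reading: the same digits (12:00:00) are taken as local wall-clock
# time in the session zone, which moves the underlying instant.
def read_as_wall_clock(micros, session_offset)
  wall = Time.at(micros / 1_000_000, micros % 1_000_000).utc
  Time.new(wall.year, wall.month, wall.day,
           wall.hour, wall.min, wall.sec, session_offset).utc
end

puts read_as_instant(epoch_micros)              # 2026-06-15 12:00:00 UTC
puts read_as_wall_clock(epoch_micros, "-08:00") # 2026-06-15 20:00:00 UTC
```

With the hypothetical `-08:00` session zone, the zone-less reading lands eight hours away from the intended instant, which is the kind of shift the changelog entry describes.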

data/lib/spark_connect/client.rb CHANGED

@@ -171,16 +171,20 @@ module SparkConnect
         tags: @tags
       )
-      result = ExecuteResult.new([], nil, nil, [], nil, 0)
-      result.pipeline_events = []
+      # Build the accumulator *inside* the retry block so that a mid-stream
+      # failure (which restarts the gRPC stream from the beginning) starts from
+      # a clean slate. Accumulating into a result created outside the block
+      # would re-append already-seen batches and duplicate rows on retry.
       with_retries do
+        result = ExecuteResult.new([], nil, nil, [], nil, 0)
+        result.pipeline_events = []
         responses = @stub.execute_plan(req, metadata: @metadata)
         responses.each do |resp|
           @server_side_session_id = resp.server_side_session_id unless resp.server_side_session_id.empty?
           accumulate(result, resp)
         end
+        result
       end
-      result
     end
     def accumulate(result, resp)
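
The effect of moving the accumulator into the retry block can be reproduced with a small stand-alone sketch. The `with_retries` helper and `FlakyStream` class below are hypothetical stand-ins written for this illustration; the gem's own `with_retries` and gRPC stream are not shown in this diff and may behave differently.

```ruby
# Hypothetical retry helper: re-runs the block on IOError, up to max_attempts.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue IOError
    retry if attempts < max_attempts
    raise
  end
end

# Hypothetical stream: yields batches 1, 2, 3, but on the first pass it
# fails after batch 2, forcing a restart from the beginning.
class FlakyStream
  def initialize
    @calls = 0
  end

  def each
    @calls += 1
    [1, 2, 3].each do |batch|
      raise IOError, "stream dropped" if @calls == 1 && batch == 3
      yield batch
    end
  end
end

# Old shape: the accumulator outlives a failed attempt, so batches repeat.
def collect_outside(stream)
  result = []
  with_retries { stream.each { |batch| result << batch } }
  result
end

# New shape: each attempt starts from an empty accumulator and returns it.
def collect_inside(stream)
  with_retries do
    result = []
    stream.each { |batch| result << batch }
    result
  end
end

p collect_outside(FlakyStream.new) # [1, 2, 1, 2, 3]
p collect_inside(FlakyStream.new)  # [1, 2, 3]
```

Returning the accumulator as the block's value, as the patched code does, also keeps the variable scoped to a single attempt.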

data/lib/spark_connect/conf.rb CHANGED

@@ -39,7 +39,7 @@ module SparkConnect
           Op.new(get: CR::Get.new(keys: [key.to_s]))
         else
           Op.new(get_with_default: CR::GetWithDefault.new(
-            pairs: [Proto::KeyValue.new(key: key.to_s, value: default)]
+            pairs: [Proto::KeyValue.new(key: key.to_s, value: default.to_s)]
           ))
         end
       resp = @client.config(op)
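
Why the `.to_s` matters can be shown with a minimal sketch. `StrictKeyValue` below is a hypothetical class that only mimics a message with string-typed fields; it is not the protobuf API, and `build_pair` is an illustrative helper rather than code from the gem.

```ruby
# Hypothetical stand-in for a message whose fields only accept Strings.
class StrictKeyValue
  attr_reader :key, :value

  def initialize(key:, value:)
    [key, value].each do |field|
      raise TypeError, "expected String, got #{field.class}" unless field.is_a?(String)
    end
    @key = key
    @value = value
  end
end

# Coerce both sides before building the pair, as the patched line does.
def build_pair(key, default)
  StrictKeyValue.new(key: key.to_s, value: default.to_s)
end

pair = build_pair("spark.sql.shuffle.partitions", 8)
p pair.value # "8"
```

Passing `8` directly to `StrictKeyValue.new` would raise `TypeError` in this sketch, which is the shape of failure the changelog entry reports for the unpatched code.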

data/lib/spark_connect/data_frame.rb CHANGED

@@ -182,7 +182,24 @@ module SparkConnect
       build(deduplicate: dedup)
     end
     alias dropDuplicates drop_duplicates
-    alias drop_duplicates_within_watermark drop_duplicates
+    # Drop duplicate rows within the event-time watermark, optionally restricted
+    # to a subset of columns. Unlike {#drop_duplicates}, this is watermark-aware
+    # and is intended for streaming DataFrames (mirrors PySpark's
+    # `dropDuplicatesWithinWatermark`).
+    #
+    # @param subset [Array<String>, nil]
+    # @return [DataFrame]
+    def drop_duplicates_within_watermark(subset = nil)
+      dedup =
+        if subset.nil? || subset.empty?
+          Proto::Deduplicate.new(input: @relation, all_columns_as_keys: true, within_watermark: true)
+        else
+          Proto::Deduplicate.new(input: @relation, column_names: Array(subset).map(&:to_s), within_watermark: true)
+        end
+      build(deduplicate: dedup)
+    end
+    alias dropDuplicatesWithinWatermark drop_duplicates_within_watermark
     # ---- Ordering ----------------------------------------------------------
@@ -311,7 +328,7 @@ module SparkConnect
     end
     alias intersectAll intersect_all
-    # Rows in this DataFrame not in `other` (distinct).
+    # Rows in this DataFrame not in `other`, keeping duplicates - Spark's `EXCEPT ALL`.
     # @return [DataFrame]
     def except_all(other)
       set_op(other, :SET_OP_TYPE_EXCEPT, is_all: true)
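
The corrected doc comment distinguishes `EXCEPT ALL` from a distinct `EXCEPT`. A rough sketch of that difference on plain Ruby arrays follows; `except_distinct` and `except_all` here are illustrative helpers, not the gem's methods, and Spark does not guarantee the row order shown.

```ruby
# EXCEPT (distinct): unique rows of `left` that do not appear in `right`.
def except_distinct(left, right)
  (left - right).uniq
end

# EXCEPT ALL: multiset difference. Each row in `right` cancels at most one
# matching row in `left`; surplus duplicates survive.
def except_all(left, right)
  remaining = right.tally
  left.reject do |row|
    if remaining.fetch(row, 0) > 0
      remaining[row] -= 1
      true
    else
      false
    end
  end
end

left  = %w[a a a b c]
right = %w[a b]

p except_distinct(left, right) # ["c"]
p except_all(left, right)      # ["a", "a", "c"]
```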

data/lib/spark_connect/functions.rb CHANGED

@@ -85,7 +85,7 @@ module SparkConnect
     # @return [Column] first non-null among the given columns.
     def coalesce(*cols) = Column.invoke("coalesce", *cols.map { |c| _col(c) })
-    # @return [Column] `value` if `col` is NaN else `col`.
+    # @return [Column] `col1` if it is not NaN, else `col2`.
     def nanvl(col1, col2) = Column.invoke("nanvl", _col(col1), _col(col2))
     # ---- Constructors of complex types ------------------------------------
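
The corrected `nanvl` comment can be checked against a scalar sketch of the same rule. `nanvl_scalar` is a hypothetical helper operating on Ruby floats; the real function builds a column expression that the server evaluates.

```ruby
# Scalar analogue of nanvl: the first value unless it is NaN, else the second.
def nanvl_scalar(value1, value2)
  value1.is_a?(Float) && value1.nan? ? value2 : value1
end

p nanvl_scalar(1.5, 0.0)        # 1.5
p nanvl_scalar(Float::NAN, 0.0) # 0.0
```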

data/lib/spark_connect/session.rb CHANGED

@@ -308,7 +308,7 @@ module SparkConnect
         url = @remote || ENV["SPARK_REMOTE"] || "sc://localhost:15002"
         client = SparkConnectClient.new(ChannelBuilder.new(url))
         session = SparkSession.new(client)
-        @options.each { |k, v| session.conf.set(k, v) unless k == "spark.app.name" }
+        @options.each { |k, v| session.conf.set(k, v) }
         session
       end
       alias build create
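
The one-line change above can be compared side by side in a small sketch. `FakeConf`, `apply_options_old`, and `apply_options_new` are hypothetical names for this illustration; only the two `each` bodies are taken from the diff.

```ruby
# Hypothetical stand-in for a session's runtime config.
class FakeConf
  attr_reader :settings

  def initialize
    @settings = {}
  end

  def set(key, value)
    @settings[key] = value.to_s
  end
end

# Old behaviour: spark.app.name is skipped, so #app_name has no effect.
def apply_options_old(options, conf)
  options.each { |k, v| conf.set(k, v) unless k == "spark.app.name" }
  conf.settings
end

# New behaviour: every builder option is forwarded.
def apply_options_new(options, conf)
  options.each { |k, v| conf.set(k, v) }
  conf.settings
end

options = { "spark.app.name" => "demo", "spark.sql.shuffle.partitions" => 8 }

p apply_options_old(options, FakeConf.new).key?("spark.app.name") # false
p apply_options_new(options, FakeConf.new)["spark.app.name"]      # "demo"
```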

data/lib/spark_connect/version.rb CHANGED

@@ -2,7 +2,7 @@
 module SparkConnect
   # The released version of the spark-connect gem.
-  VERSION = "0.2.0"
+  VERSION = "0.3.0"
   # The Apache Spark version whose Spark Connect protocol definitions this
   # client is generated against. The client aims to be wire-compatible with

metadata CHANGED

@@ -1,13 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: spark-connect
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Hyukjin Kwon
+autorequire:
 bindir: bin
 cert_chain: []
-date: 1980-01-02 00:00:00.000000000 Z
+date: 2026-06-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: google-protobuf
@@ -129,6 +130,7 @@ metadata:
   documentation_uri: https://hyukjinkwon.github.io/spark-connect-ruby/
   bug_tracker_uri: https://github.com/HyukjinKwon/spark-connect-ruby/issues
   changelog_uri: https://github.com/HyukjinKwon/spark-connect-ruby/blob/main/CHANGELOG.md
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -143,7 +145,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 4.0.11
+rubygems_version: 3.5.22
+signing_key:
 specification_version: 4
 summary: A pure-Ruby client for Apache Spark Connect.
 test_files: []