RubyGems - parquet - Versions diffs - 0.5.2-x86_64-linux-musl → 0.5.4-x86_64-linux-musl - Mend

parquet 0.5.2-x86_64-linux-musl → 0.5.4-x86_64-linux-musl

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: af45363d4392e08a812c1310a6ece11729b0966609cd51831fa60eb0f63004c6
-  data.tar.gz: e81c869e9cec3d80d6ed1f6a2e73ad4c397e2a59d4514f90b5f81b1877c28773
+  metadata.gz: ec4eb34a79658f88850161f93939452c5219fa735d4e01bd9b952b5b048274d0
+  data.tar.gz: a4357e8f18659b1b1ec06457ed3a1252d251ae2ef7b80834c4013b1da31df463
 SHA512:
-  metadata.gz: 9a492fc3a44cfbb9e70f4f477ac93ccfd47b0ed61c53351c9bd2a3bb83927174f3f1079088cf6f1f31fbddfd20b101ad5e031d8a1e6cd9c9e4772041832d519e
-  data.tar.gz: 241d149cd9feea1d8daf5ed70889b453101b146959ea65b6713211be22266eb2eca06ca44e775338f7d4255fe09442f8b23b15cd53a4e25e83952f45515c673d
+  metadata.gz: 7c3770e0f72ad2f4fc00e44b5d24da1c6d3ffc6faa68169fcfd9e6e69ef811709a0745166483fcb39bc87e67faf594e2899e0c0dc7508829a46aa893b41cc3ca
+  data.tar.gz: f11757a10e5f39f05cdbcb9349b062e2f933b47234fb9d6812497d65c6fe49505d8fdcc2c356f1d1c7eb079ebf151aed5f3b6d8b04ae8cbb61ce04a321c974f0
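The SHA256 entries above can be compared against a locally unpacked `data.tar.gz`. A minimal sketch using Ruby's standard `digest` library follows; the helper name `checksum_matches?` and the local file path are illustrative assumptions, not part of the gem.

```ruby
require "digest"

# Hypothetical helper: compares a file's SHA256 hex digest with an expected value.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Expected value copied from the 0.5.4 checksums.yaml entry for data.tar.gz.
EXPECTED_DATA_SHA256 = "a4357e8f18659b1b1ec06457ed3a1252d251ae2ef7b80834c4013b1da31df463"

# Assumed local path: data.tar.gz as extracted from the downloaded .gem archive.
path = "data.tar.gz"
puts checksum_matches?(path, EXPECTED_DATA_SHA256) if File.exist?(path)
```

The same pattern should work for the SHA512 entries by swapping in `Digest::SHA512`.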

data/README.md CHANGED

@@ -8,6 +8,78 @@ This project is a Ruby library wrapping the [parquet-rs](https://github.com/apac
 This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.
+### Metadata
+The `metadata` method provides detailed information about a Parquet file's structure and contents:
+```ruby
+require "parquet"
+# Get metadata from a file path
+metadata = Parquet.metadata("data.parquet")
+# Or from an IO object
+File.open("data.parquet", "rb") do |file|
+  metadata = Parquet.metadata(file)
+end
+# Example metadata output:
+# {
+#   "num_rows" => 3,
+#   "created_by" => "parquet-rs version 54.2.0",
+#   "key_value_metadata" => [
+#     {
+#       "key" => "ARROW:schema",
+#       "value" => "base64_encoded_schema"
+#     }
+#   ],
+#   "schema" => {
+#     "name" => "arrow_schema",
+#     "fields" => [
+#       {
+#         "name" => "id",
+#         "type" => "primitive",
+#         "physical_type" => "INT64",
+#         "repetition" => "OPTIONAL",
+#         "converted_type" => "NONE"
+#       },
+#       # ... other fields
+#     ]
+#   },
+#   "row_groups" => [
+#     {
+#       "num_columns" => 5,
+#       "num_rows" => 3,
+#       "total_byte_size" => 379,
+#       "columns" => [
+#         {
+#           "column_path" => "id",
+#           "num_values" => 3,
+#           "compression" => "UNCOMPRESSED",
+#           "total_compressed_size" => 91,
+#           "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"],
+#           "statistics" => {
+#             "min_is_exact" => true,
+#             "max_is_exact" => true
+#           }
+#         },
+#         # ... other columns
+#       ]
+#     }
+#   ]
+# }
+```
+The metadata includes:
+- Total number of rows
+- File creation information
+- Key-value metadata (including Arrow schema)
+- Detailed schema information for each column
+- Row group information including:
+  - Number of columns and rows
+  - Total byte size
+  - Column-level details (compression, encodings, statistics)
 ### Row-wise Iteration
 The `each_row` method provides sequential access to individual rows:
@@ -236,17 +308,169 @@ schema = Parquet::Schema.define do
     field :description, :string
   end
-  # Nested lists
+  # Nested lists (list of lists of strings)
   field :nested_lists, :list, item: :list do
-    field :item, :string  # For nested lists, inner item must be named 'item'
+    field :item, :string  # REQUIRED: Inner item field MUST be named 'item' for nested lists
   end
   # Map of lists
   field :map_of_lists, :map, key: :string, value: :list do
-    field :item, :int32  # For list items in maps, item must be named 'item'
+    field :item, :int32  # REQUIRED: List items in maps MUST be named 'item'
   end
 end
+### Nested Lists
+When working with nested lists (a list of lists), there are specific requirements:
+1. Using the Schema DSL:
+```ruby
+# A list of lists of strings
+field :nested_lists, :list, item: :list do
+  field :item, :string  # For nested lists, inner item MUST be named 'item'
+end
+```
+2. Using hash-based schema format:
+```ruby
+# A list of lists of integers
+{ "nested_numbers" => "list<list<int32>>" }
+```
+The data for nested lists is structured as an array of arrays:
+```ruby
+# Data for the nested_lists field
+[["a", "b"], ["c", "d", "e"], []]  # Last one is an empty inner list
+```
+### Decimal Data Type
+Parquet supports decimal numbers with configurable precision and scale, which is essential for financial applications where exact decimal representation is critical. The library seamlessly converts between Ruby's `BigDecimal` and Parquet's decimal type.
+#### Decimal Precision and Scale
+When working with decimal fields, you need to understand two key parameters:
+- **Precision**: The total number of significant digits (both before and after the decimal point)
+- **Scale**: The number of digits after the decimal point
+The rules for defining decimals are:
+```ruby
+# No precision/scale specified - uses maximum precision (38) with scale 0
+field :amount1, :decimal  # Equivalent to INTEGER with 38 digits
+# Only precision specified - scale defaults to 0
+field :amount2, :decimal, precision: 10  # 10 digits, no decimal places
+# Only scale specified - uses maximum precision (38)
+field :amount3, :decimal, scale: 2  # 38 digits with 2 decimal places
+# Both precision and scale specified
+field :amount4, :decimal, precision: 10, scale: 2  # 10 digits with 2 decimal places
+```
+#### Financial Data Example
+Here's a practical example for a financial application:
+```ruby
+require "parquet"
+require "bigdecimal"
+# Schema for financial transactions
+schema = Parquet::Schema.define do
+  field :transaction_id, :string, nullable: false
+  field :timestamp, :timestamp_millis, nullable: false
+  field :amount, :decimal, precision: 12, scale: 2  # Up to 10 integer digits (values below 10^10) with 2 decimal places
+  field :balance, :decimal, precision: 16, scale: 2  # Larger precision for running balances
+  field :currency, :string
+  field :exchange_rate, :decimal, precision: 10, scale: 6  # 6 decimal places for forex rates
+  field :fee, :decimal, precision: 8, scale: 2, nullable: true  # Optional fee
+  field :category, :string
+end
+# Sample financial data
+transactions = [
+  [
+    "T-12345",
+    Time.now,
+    BigDecimal("1256.99"),       # amount (directly using BigDecimal)
+    BigDecimal("10250.25"),      # balance
+    "USD",
+    BigDecimal("1.0"),           # exchange_rate
+    BigDecimal("2.50"),          # fee
+    "Groceries"
+  ],
+  [
+    "T-12346",
+    Time.now - 86400,            # yesterday
+    BigDecimal("-89.50"),        # negative amount for withdrawal
+    BigDecimal("10160.75"),      # updated balance
+    "USD",
+    BigDecimal("1.0"),           # exchange_rate
+    nil,                         # no fee
+    "Transportation"
+  ],
+  [
+    "T-12347",
+    Time.now - 172800,           # two days ago
+    BigDecimal("250.00"),        # amount
+    BigDecimal("10410.75"),      # balance
+    "EUR",                       # different currency
+    BigDecimal("1.05463"),       # exchange_rate
+    BigDecimal("1.75"),          # fee
+    "Entertainment"
+  ]
+]
+# Write financial data to Parquet file
+Parquet.write_rows(transactions.each, schema: schema, write_to: "financial_data.parquet")
+# Read back transactions
+Parquet.each_row("financial_data.parquet") do |transaction|
+  # Access decimal fields as BigDecimal objects
+  puts "Transaction: #{transaction['transaction_id']}"
+  puts "  Amount: #{transaction['currency']} #{transaction['amount']}"
+  puts "  Balance: $#{transaction['balance']}"
+  puts "  Fee: #{transaction['fee'] || 'No fee'}"
+  # You can perform precise decimal calculations
+  if transaction['currency'] != 'USD'
+    usd_amount = transaction['amount'] * transaction['exchange_rate']
+    puts "  USD Equivalent: $#{usd_amount.round(2)}"
+  end
+end
+```
+#### Decimal Type Storage Considerations
+Parquet optimizes storage based on the precision:
+- For precision ≤ 9: Uses 4-byte INT32
+- For precision ≤ 18: Uses 8-byte INT64
+- For precision ≤ 38: Uses 16-byte BYTE_ARRAY
+Choose appropriate precision and scale for your data to optimize storage while ensuring adequate range:
+```ruby
+# Banking examples
+field :account_balance, :decimal, precision: 16, scale: 2   # Up to 14 digits before decimal point
+field :interest_rate, :decimal, precision: 8, scale: 6      # Rate with 6 decimal places (e.g., 0.015625)
+# E-commerce examples
+field :product_price, :decimal, precision: 10, scale: 2     # Product price
+field :shipping_weight, :decimal, precision: 6, scale: 3    # Weight in kg with 3 decimal places
+# Analytics examples
+field :conversion_rate, :decimal, precision: 5, scale: 4    # Rate like 0.0123
+field :daily_revenue, :decimal, precision: 14, scale: 2     # Daily revenue with 2 decimal places
+```
+### Sample Data with Nested Structures
+Here's an example showing how to use the schema defined earlier with sample data:
+```ruby
 # Sample data with nested structures
 data = [
   [
@@ -271,7 +495,7 @@ data = [
       "feature1" => { "count" => 5, "description" => "Main feature" },
       "feature2" => { "count" => 3, "description" => "Secondary feature" }
     },
-    [["a", "b"], ["c", "d", "e"]],  # nested_lists
+    [["a", "b"], ["c", "d", "e"]],  # nested_lists (a list of lists of strings)
     {                                # map_of_lists
       "group1" => [1, 2, 3],
       "group2" => [4, 5, 6]
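The README's precision and scale rules can be checked before writing data. The sketch below is plain Ruby with `bigdecimal` and does not call the gem; the helper names `decimal_max`, `fits_decimal?`, and `readme_storage_for` are hypothetical, and the storage labels simply repeat the README's own list rather than anything verified against the native extension.

```ruby
require "bigdecimal"

# Largest magnitude representable with the given precision and scale,
# e.g. precision 12 and scale 2 allow 10 integer digits and 2 fractional digits.
def decimal_max(precision, scale)
  BigDecimal("#{10**precision - 1}e-#{scale}")
end

# True when `value` needs no more than `scale` fractional digits and
# its magnitude fits within `precision` total digits.
def fits_decimal?(value, precision:, scale:)
  value = BigDecimal(value.to_s)
  value == value.round(scale) && value.abs <= decimal_max(precision, scale)
end

# Storage classes as listed in the README section above (labels only).
def readme_storage_for(precision)
  case precision
  when 1..9   then "INT32 (4 bytes)"
  when 10..18 then "INT64 (8 bytes)"
  when 19..38 then "BYTE_ARRAY (16 bytes)"
  else raise ArgumentError, "precision must be between 1 and 38"
  end
end

puts decimal_max(12, 2).to_s("F")
puts fits_decimal?("1.05463", precision: 10, scale: 6)
puts fits_decimal?("0.0123456", precision: 5, scale: 4)
puts readme_storage_for(16)
```

A check like `fits_decimal?` only describes the arithmetic limits; how the gem itself reacts to out-of-range values (rounding, raising) is not shown in this diff.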

data/lib/parquet/3.2/parquet.so CHANGED

Binary file

data/lib/parquet/3.3/parquet.so CHANGED

Binary file

data/lib/parquet/3.4/parquet.so CHANGED

Binary file

data/lib/parquet/schema.rb CHANGED

@@ -11,6 +11,9 @@ module Parquet
     #     field :id, :int64, nullable: false  # ID cannot be null
     #     field :name, :string  # Default nullable: true
     #
+    #     # Decimal field with precision and scale
+    #     field :price, :decimal, precision: 10, scale: 2
+    #
     #     # List with non-nullable items
     #     field :scores, :list, item: :float, item_nullable: false
     #
@@ -45,7 +48,7 @@ module Parquet
       # Define a field in the schema
       # @param name [String, Symbol] field name
-      # @param type [Symbol] data type (:int32, :int64, :string, :list, :map, :struct, etc)
+      # @param type [Symbol] data type (:int32, :int64, :string, :list, :map, :struct, :decimal, etc)
       # @param nullable [Boolean] whether the field can be null (default: true)
       # @param kwargs [Hash] additional options depending on type
       #
@@ -55,6 +58,7 @@ module Parquet
       #   - `key:, value:` if type == :map
       #   - `key_nullable:, value_nullable:` controls nullability of map keys/values (default: true)
       #   - `format:` if you want to store some format string
+      #   - `precision:, scale:` if type == :decimal (precision defaults to 38, scale to 0)
       #   - `nullable:` default to true if not specified
       def field(name, type, nullable: true, **kwargs, &block)
         field_hash = { name: name.to_s, type: type, nullable: !!nullable }
@@ -73,7 +77,21 @@ module Parquet
           raise ArgumentError, "list field `#{name}` requires `item:` type" unless item_type
           # Pass item_nullable if provided, otherwise use true as default
           item_nullable = kwargs[:item_nullable].nil? ? true : !!kwargs[:item_nullable]
-          field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, &block)
+          # Pass precision and scale if type is decimal
+          if item_type == :decimal
+            precision = kwargs[:precision]
+            scale = kwargs[:scale]
+            field_hash[:item] = wrap_subtype(
+              item_type,
+              nullable: item_nullable,
+              precision: precision,
+              scale: scale,
+              &block
+            )
+          else
+            field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, &block)
+          end
         when :map
           # user must specify key:, value:
           key_type = kwargs[:key]
@@ -82,8 +100,47 @@ module Parquet
           # Pass key_nullable and value_nullable if provided, otherwise use true as default
           key_nullable = kwargs[:key_nullable].nil? ? true : !!kwargs[:key_nullable]
           value_nullable = kwargs[:value_nullable].nil? ? true : !!kwargs[:value_nullable]
           field_hash[:key] = wrap_subtype(key_type, nullable: key_nullable)
-          field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, &block)
+          # Pass precision and scale if value type is decimal
+          if value_type == :decimal
+            precision = kwargs[:precision]
+            scale = kwargs[:scale]
+            field_hash[:value] = wrap_subtype(
+              value_type,
+              nullable: value_nullable,
+              precision: precision,
+              scale: scale,
+              &block
+            )
+          else
+            field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, &block)
+          end
+        when :decimal
+          # Store precision and scale for decimal type according to rules:
+          # 1. When neither precision nor scale is provided, use maximum precision (38)
+          # 2. When only precision is provided, scale defaults to 0
+          # 3. When only scale is provided, use maximum precision (38)
+          # 4. When both are provided, use the provided values
+          if kwargs[:precision].nil? && kwargs[:scale].nil?
+            # No precision or scale provided - use maximum precision
+            field_hash[:precision] = 38
+            field_hash[:scale] = 0
+          elsif kwargs[:precision] && kwargs[:scale].nil?
+            # Precision only - scale defaults to 0
+            field_hash[:precision] = kwargs[:precision]
+            field_hash[:scale] = 0
+          elsif kwargs[:precision].nil? && kwargs[:scale]
+            # Scale only - use maximum precision
+            field_hash[:precision] = 38
+            field_hash[:scale] = kwargs[:scale]
+          else
+            # Both provided
+            field_hash[:precision] = kwargs[:precision]
+            field_hash[:scale] = kwargs[:scale]
+          end
         else
           # primitive type: :int32, :int64, :string, etc.
           # do nothing else special
@@ -122,7 +179,7 @@ module Parquet
       # If user said: field "something", :list, item: :struct do ... end
       # we want to recursively parse that sub-struct from the block.
       # So wrap_subtype might be:
-      def wrap_subtype(t, nullable: true, &block)
+      def wrap_subtype(t, nullable: true, precision: nil, scale: nil, &block)
         if t == :struct
           sub_builder = SchemaBuilder.new
           sub_builder.instance_eval(&block) if block
@@ -144,6 +201,34 @@ module Parquet
           end
           { type: :list, nullable: nullable, name: "item", item: sub_builder.fields[0] }
+        elsif t == :decimal
+          # Handle decimal type with precision and scale
+          result = { type: t, nullable: nullable, name: "item" }
+          # Follow the same rules as in field() method:
+          # 1. When neither precision nor scale is provided, use maximum precision (38)
+          # 2. When only precision is provided, scale defaults to 0
+          # 3. When only scale is provided, use maximum precision (38)
+          # 4. When both are provided, use the provided values
+          if precision.nil? && scale.nil?
+            # No precision or scale provided - use maximum precision
+            result[:precision] = 38
+            result[:scale] = 0
+          elsif precision && scale.nil?
+            # Precision only - scale defaults to 0
+            result[:precision] = precision
+            result[:scale] = 0
+          elsif precision.nil? && scale
+            # Scale only - use maximum precision
+            result[:precision] = 38
+            result[:scale] = scale
+          else
+            # Both provided
+            result[:precision] = precision
+            result[:scale] = scale
+          end
+          result
         else
           # e.g. :int32 => { type: :int32, nullable: true }
           { type: t, nullable: nullable, name: "item" }
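The four-branch defaulting that appears in both `field` and `wrap_subtype` above reduces to two fallbacks when the inputs are `nil` or an Integer. The sketch below is not the gem's code: `decimal_params` and `decimal_params_branching` are hypothetical names, and the second method only mirrors the branch structure from the diff so the two can be compared. The results would differ for an explicit `false`, which the shipped branches pass through unchanged.

```ruby
# Hypothetical helper, not part of the gem: applies the documented defaults
# (precision 38, scale 0) for nil or Integer inputs.
def decimal_params(precision: nil, scale: nil)
  { precision: precision || 38, scale: scale || 0 }
end

# Mirror of the branch structure shown in the diff, kept for comparison only.
def decimal_params_branching(precision: nil, scale: nil)
  if precision.nil? && scale.nil?
    { precision: 38, scale: 0 }
  elsif precision && scale.nil?
    { precision: precision, scale: 0 }
  elsif precision.nil? && scale
    { precision: 38, scale: scale }
  else
    { precision: precision, scale: scale }
  end
end

# Walk the four documented cases and confirm both versions agree.
[[nil, nil], [10, nil], [nil, 2], [10, 2]].each do |p, s|
  a = decimal_params(precision: p, scale: s)
  b = decimal_params_branching(precision: p, scale: s)
  puts "#{p.inspect}, #{s.inspect} -> #{a} (same as branching: #{a == b})"
end
```

Because `0` is truthy in Ruby, an explicit `scale: 0` is preserved by both versions.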

data/lib/parquet/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Parquet
-  VERSION = "0.5.2"
+  VERSION = "0.5.4"
 end

data/lib/parquet.rbi CHANGED

@@ -1,6 +1,17 @@
 # typed: true
 module Parquet
+  # Returns metadata information about a Parquet file
+  #
+  # The returned hash contains information about:
+  # - Basic file metadata (num_rows, created_by)
+  # - Schema information (fields, types, etc.)
+  # - Row group details
+  # - Column chunk information (compression, encodings, statistics)
+  sig { params(path: String).returns(T::Hash[String, T.untyped]) }
+  def self.metadata(path)
+  end
   # Options:
   #   - `input`: String, File, or IO object containing parquet data
   #   - `result_type`: String specifying the output format
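The hash returned by `Parquet.metadata` can be walked like any nested Ruby hash. The sketch below does not load the gem: `summarize_metadata` is a hypothetical helper, and `sample` is a trimmed copy of the example output from the README, so the key names are only as reliable as that example.

```ruby
# Hypothetical helper: builds summary lines from a hash shaped like the
# README's example output for Parquet.metadata.
def summarize_metadata(meta)
  lines = ["rows: #{meta["num_rows"]}, created_by: #{meta["created_by"]}"]
  meta.fetch("row_groups", []).each_with_index do |group, i|
    lines << "row group #{i}: #{group["num_rows"]} rows, #{group["total_byte_size"]} bytes"
    group.fetch("columns", []).each do |col|
      lines << "  #{col["column_path"]}: #{col["compression"]}, " \
               "#{col["total_compressed_size"]} bytes, encodings=#{col["encodings"].join("/")}"
    end
  end
  lines
end

# Trimmed copy of the README example values. With the gem installed, this hash
# would presumably come from Parquet.metadata("data.parquet") instead.
sample = {
  "num_rows" => 3,
  "created_by" => "parquet-rs version 54.2.0",
  "row_groups" => [
    {
      "num_columns" => 5,
      "num_rows" => 3,
      "total_byte_size" => 379,
      "columns" => [
        {
          "column_path" => "id",
          "num_values" => 3,
          "compression" => "UNCOMPRESSED",
          "total_compressed_size" => 91,
          "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"]
        }
      ]
    }
  ]
}

puts summarize_metadata(sample)
```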

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: parquet
 version: !ruby/object:Gem::Version
-  version: 0.5.2
+  version: 0.5.4
 platform: x86_64-linux-musl
 authors:
 - Nathan Jaremko
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2025-03-17 00:00:00.000000000 Z
+date: 2025-04-01 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake-compiler