parquet 0.5.2-x86_64-linux-musl → 0.5.4-x86_64-linux-musl

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: af45363d4392e08a812c1310a6ece11729b0966609cd51831fa60eb0f63004c6
4
- data.tar.gz: e81c869e9cec3d80d6ed1f6a2e73ad4c397e2a59d4514f90b5f81b1877c28773
3
+ metadata.gz: ec4eb34a79658f88850161f93939452c5219fa735d4e01bd9b952b5b048274d0
4
+ data.tar.gz: a4357e8f18659b1b1ec06457ed3a1252d251ae2ef7b80834c4013b1da31df463
5
5
  SHA512:
6
- metadata.gz: 9a492fc3a44cfbb9e70f4f477ac93ccfd47b0ed61c53351c9bd2a3bb83927174f3f1079088cf6f1f31fbddfd20b101ad5e031d8a1e6cd9c9e4772041832d519e
7
- data.tar.gz: 241d149cd9feea1d8daf5ed70889b453101b146959ea65b6713211be22266eb2eca06ca44e775338f7d4255fe09442f8b23b15cd53a4e25e83952f45515c673d
6
+ metadata.gz: 7c3770e0f72ad2f4fc00e44b5d24da1c6d3ffc6faa68169fcfd9e6e69ef811709a0745166483fcb39bc87e67faf594e2899e0c0dc7508829a46aa893b41cc3ca
7
+ data.tar.gz: f11757a10e5f39f05cdbcb9349b062e2f933b47234fb9d6812497d65c6fe49505d8fdcc2c356f1d1c7eb079ebf151aed5f3b6d8b04ae8cbb61ce04a321c974f0
data/README.md CHANGED
@@ -8,6 +8,78 @@ This project is a Ruby library wrapping the [parquet-rs](https://github.com/apac
8
8
 
9
9
  This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.
10
10
 
11
+ ### Metadata
12
+
13
+ The `metadata` method provides detailed information about a Parquet file's structure and contents:
14
+
15
+ ```ruby
16
+ require "parquet"
17
+
18
+ # Get metadata from a file path
19
+ metadata = Parquet.metadata("data.parquet")
20
+
21
+ # Or from an IO object
22
+ File.open("data.parquet", "rb") do |file|
23
+ metadata = Parquet.metadata(file)
24
+ end
25
+
26
+ # Example metadata output:
27
+ # {
28
+ # "num_rows" => 3,
29
+ # "created_by" => "parquet-rs version 54.2.0",
30
+ # "key_value_metadata" => [
31
+ # {
32
+ # "key" => "ARROW:schema",
33
+ # "value" => "base64_encoded_schema"
34
+ # }
35
+ # ],
36
+ # "schema" => {
37
+ # "name" => "arrow_schema",
38
+ # "fields" => [
39
+ # {
40
+ # "name" => "id",
41
+ # "type" => "primitive",
42
+ # "physical_type" => "INT64",
43
+ # "repetition" => "OPTIONAL",
44
+ # "converted_type" => "NONE"
45
+ # },
46
+ # # ... other fields
47
+ # ]
48
+ # },
49
+ # "row_groups" => [
50
+ # {
51
+ # "num_columns" => 5,
52
+ # "num_rows" => 3,
53
+ # "total_byte_size" => 379,
54
+ # "columns" => [
55
+ # {
56
+ # "column_path" => "id",
57
+ # "num_values" => 3,
58
+ # "compression" => "UNCOMPRESSED",
59
+ # "total_compressed_size" => 91,
60
+ # "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"],
61
+ # "statistics" => {
62
+ # "min_is_exact" => true,
63
+ # "max_is_exact" => true
64
+ # }
65
+ # },
66
+ # # ... other columns
67
+ # ]
68
+ # }
69
+ # ]
70
+ # }
71
+ ```
72
+
73
+ The metadata includes:
74
+ - Total number of rows
75
+ - File creation information
76
+ - Key-value metadata (including Arrow schema)
77
+ - Detailed schema information for each column
78
+ - Row group information including:
79
+ - Number of columns and rows
80
+ - Total byte size
81
+ - Column-level details (compression, encodings, statistics)
82
+
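Reviewer note (not part of the released README): the metadata hash documented above can be walked with ordinary `Hash` and `Array` calls. The sketch below assumes the key names shown in the example output. `SAMPLE_METADATA` and `summarize_row_groups` are hypothetical names introduced for illustration; in real use the hash would come from `Parquet.metadata`.

```ruby
# Hypothetical sample shaped like the README's example output.
# In real use this hash would be the return value of Parquet.metadata(path).
SAMPLE_METADATA = {
  "num_rows" => 3,
  "created_by" => "parquet-rs version 54.2.0",
  "row_groups" => [
    {
      "num_columns" => 5,
      "num_rows" => 3,
      "total_byte_size" => 379,
      "columns" => [
        {
          "column_path" => "id",
          "num_values" => 3,
          "compression" => "UNCOMPRESSED",
          "total_compressed_size" => 91,
          "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"]
        }
      ]
    }
  ]
}.freeze

# Build one summary hash per row group (illustrative helper, not a library API).
def summarize_row_groups(metadata)
  metadata.fetch("row_groups", []).each_with_index.map do |group, index|
    columns = group.fetch("columns", [])
    {
      index: index,
      rows: group["num_rows"],
      bytes: group["total_byte_size"],
      # Sum the compressed size over the columns that report it.
      compressed_bytes: columns.sum { |c| c.fetch("total_compressed_size", 0) },
      # Distinct compression codecs used in this row group.
      codecs: columns.map { |c| c["compression"] }.uniq
    }
  end
end

summarize_row_groups(SAMPLE_METADATA).each { |summary| p summary }
```

Using `fetch` with defaults keeps the helper tolerant of files whose metadata omits optional keys, which is an assumption about real-world output rather than documented behaviour.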
11
83
  ### Row-wise Iteration
12
84
 
13
85
  The `each_row` method provides sequential access to individual rows:
@@ -236,17 +308,169 @@ schema = Parquet::Schema.define do
236
308
  field :description, :string
237
309
  end
238
310
 
239
- # Nested lists
311
+ # Nested lists (list of lists of strings)
240
312
  field :nested_lists, :list, item: :list do
241
- field :item, :string # For nested lists, inner item must be named 'item'
313
+ field :item, :string # REQUIRED: Inner item field MUST be named 'item' for nested lists
242
314
  end
243
315
 
244
316
  # Map of lists
245
317
  field :map_of_lists, :map, key: :string, value: :list do
246
- field :item, :int32 # For list items in maps, item must be named 'item'
318
+ field :item, :int32 # REQUIRED: List items in maps MUST be named 'item'
247
319
  end
248
320
  end
249
321
 
322
+ ### Nested Lists
323
+
324
+ When working with nested lists (a list of lists), there are specific requirements:
325
+
326
+ 1. Using the Schema DSL:
327
+ ```ruby
328
+ # A list of lists of strings
329
+ field :nested_lists, :list, item: :list do
330
+ field :item, :string # For nested lists, inner item MUST be named 'item'
331
+ end
332
+ ```
333
+
334
+ 2. Using hash-based schema format:
335
+ ```ruby
336
+ # A list of lists of integers
337
+ { "nested_numbers" => "list<list<int32>>" }
338
+ ```
339
+
340
+ The data for nested lists is structured as an array of arrays:
341
+ ```ruby
342
+ # Data for the nested_lists field
343
+ [["a", "b"], ["c", "d", "e"], []] # Last one is an empty inner list
344
+ ```
345
+
346
+ ### Decimal Data Type
347
+
348
+ Parquet supports decimal numbers with configurable precision and scale, which is essential for financial applications where exact decimal representation is critical. The library seamlessly converts between Ruby's `BigDecimal` and Parquet's decimal type.
349
+
350
+ #### Decimal Precision and Scale
351
+
352
+ When working with decimal fields, you need to understand two key parameters:
353
+
354
+ - **Precision**: The total number of significant digits (both before and after the decimal point)
355
+ - **Scale**: The number of digits after the decimal point
356
+
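Reviewer note (not part of the released README): the two definitions above imply a simple fit check, since a value with scale `s` and precision `p` has at most `s` fractional digits and at most `p` digits in total. The sketch below is a minimal illustration using only Ruby's `BigDecimal`. `fits_decimal?` is a hypothetical helper, and the diff does not say how the library itself treats values that do not fit.

```ruby
require "bigdecimal"

# Returns true when `value` can be represented exactly with the given
# precision (total digits) and scale (digits after the decimal point).
# Hypothetical helper for illustration; not part of the parquet gem.
def fits_decimal?(value, precision:, scale:)
  # Shift the decimal point right by `scale` places.
  scaled = value * (10**scale)
  # Any remaining fractional part means the value needs more than `scale` digits.
  return false unless scaled.frac.zero?

  # The unscaled integer must have at most `precision` digits.
  scaled.to_i.abs < 10**precision
end

puts fits_decimal?(BigDecimal("1256.99"), precision: 12, scale: 2)
puts fits_decimal?(BigDecimal("0.001"), precision: 12, scale: 2)
```

The first call checks a value with two fractional digits against scale 2; the second has three fractional digits, so it cannot be stored exactly at that scale.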
357
+ The rules for defining decimals are:
358
+
359
+ ```ruby
360
+ # No precision/scale specified - uses maximum precision (38) with scale 0
361
+ field :amount1, :decimal # Equivalent to INTEGER with 38 digits
362
+
363
+ # Only precision specified - scale defaults to 0
364
+ field :amount2, :decimal, precision: 10 # 10 digits, no decimal places
365
+
366
+ # Only scale specified - uses maximum precision (38)
367
+ field :amount3, :decimal, scale: 2 # 38 digits with 2 decimal places
368
+
369
+ # Both precision and scale specified
370
+ field :amount4, :decimal, precision: 10, scale: 2 # 10 digits with 2 decimal places
371
+ ```
372
+
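Reviewer note (not part of the released README): the four defaulting rules in the block above reduce to "missing precision becomes 38, missing scale becomes 0". The sketch below restates that in a few lines. `resolve_decimal_options` and `MAX_DECIMAL_PRECISION` are hypothetical names, and the helper only covers `nil` or integer arguments.

```ruby
# Maximum precision used when none is given, per the rules above.
MAX_DECIMAL_PRECISION = 38

# Apply the defaulting rules to a precision/scale pair.
# Hypothetical helper for illustration; not the library's own method.
def resolve_decimal_options(precision: nil, scale: nil)
  {
    precision: precision || MAX_DECIMAL_PRECISION, # rules 1 and 3
    scale: scale || 0                              # rules 1 and 2
  }
end

p resolve_decimal_options
p resolve_decimal_options(precision: 10)
p resolve_decimal_options(scale: 2)
p resolve_decimal_options(precision: 10, scale: 2)
```

Collapsing the four cases into two `||` defaults works because each parameter's default does not depend on the other one.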
373
+ #### Financial Data Example
374
+
375
+ Here's a practical example for a financial application:
376
+
377
+ ```ruby
378
+ require "parquet"
379
+ require "bigdecimal"
380
+
381
+ # Schema for financial transactions
382
+ schema = Parquet::Schema.define do
383
+ field :transaction_id, :string, nullable: false
384
+ field :timestamp, :timestamp_millis, nullable: false
385
+ field :amount, :decimal, precision: 12, scale: 2 # Up to 10 digits before the decimal point, 2 after
386
+ field :balance, :decimal, precision: 16, scale: 2 # Larger precision for running balances
387
+ field :currency, :string
388
+ field :exchange_rate, :decimal, precision: 10, scale: 6 # 6 decimal places for forex rates
389
+ field :fee, :decimal, precision: 8, scale: 2, nullable: true # Optional fee
390
+ field :category, :string
391
+ end
392
+
393
+ # Sample financial data
394
+ transactions = [
395
+ [
396
+ "T-12345",
397
+ Time.now,
398
+ BigDecimal("1256.99"), # amount (directly using BigDecimal)
399
+ BigDecimal("10250.25"), # balance
400
+ "USD",
401
+ BigDecimal("1.0"), # exchange_rate
402
+ BigDecimal("2.50"), # fee
403
+ "Groceries"
404
+ ],
405
+ [
406
+ "T-12346",
407
+ Time.now - 86400, # yesterday
408
+ BigDecimal("-89.50"), # negative amount for withdrawal
409
+ BigDecimal("10160.75"), # updated balance
410
+ "USD",
411
+ BigDecimal("1.0"), # exchange_rate
412
+ nil, # no fee
413
+ "Transportation"
414
+ ],
415
+ [
416
+ "T-12347",
417
+ Time.now - 172800, # two days ago
418
+ BigDecimal("250.00"), # amount
419
+ BigDecimal("10410.75"), # balance
420
+ "EUR", # different currency
421
+ BigDecimal("1.05463"), # exchange_rate
422
+ BigDecimal("1.75"), # fee
423
+ "Entertainment"
424
+ ]
425
+ ]
426
+
427
+ # Write financial data to Parquet file
428
+ Parquet.write_rows(transactions.each, schema: schema, write_to: "financial_data.parquet")
429
+
430
+ # Read back transactions
431
+ Parquet.each_row("financial_data.parquet") do |transaction|
432
+ # Access decimal fields as BigDecimal objects
433
+ puts "Transaction: #{transaction['transaction_id']}"
434
+ puts " Amount: #{transaction['currency']} #{transaction['amount']}"
435
+ puts " Balance: $#{transaction['balance']}"
436
+ puts " Fee: #{transaction['fee'] || 'No fee'}"
437
+
438
+ # You can perform precise decimal calculations
439
+ if transaction['currency'] != 'USD'
440
+ usd_amount = transaction['amount'] * transaction['exchange_rate']
441
+ puts " USD Equivalent: $#{usd_amount.round(2)}"
442
+ end
443
+ end
444
+ ```
445
+
446
+ #### Decimal Type Storage Considerations
447
+
448
+ Parquet optimizes storage based on the precision:
449
+ - For precision ≤ 9: Uses 4-byte INT32
450
+ - For precision ≤ 18: Uses 8-byte INT64
451
+ - For precision ≤ 38: Uses 16-byte BYTE_ARRAY
452
+
453
+ Choose appropriate precision and scale for your data to optimize storage while ensuring adequate range:
454
+
455
+ ```ruby
456
+ # Banking examples
457
+ field :account_balance, :decimal, precision: 16, scale: 2 # Up to 14 digits before decimal point
458
+ field :interest_rate, :decimal, precision: 8, scale: 6 # Rate with 6 decimal places (e.g., 0.015625)
459
+
460
+ # E-commerce examples
461
+ field :product_price, :decimal, precision: 10, scale: 2 # Product price
462
+ field :shipping_weight, :decimal, precision: 6, scale: 3 # Weight in kg with 3 decimal places
463
+
464
+ # Analytics examples
465
+ field :conversion_rate, :decimal, precision: 5, scale: 4 # Rate like 0.0123
466
+ field :daily_revenue, :decimal, precision: 14, scale: 2 # Daily revenue with 2 decimal places
467
+ ```
468
+
469
+ ### Sample Data with Nested Structures
470
+
471
+ Here's an example showing how to use the schema defined earlier with sample data:
472
+
473
+ ```ruby
250
474
  # Sample data with nested structures
251
475
  data = [
252
476
  [
@@ -271,7 +495,7 @@ data = [
271
495
  "feature1" => { "count" => 5, "description" => "Main feature" },
272
496
  "feature2" => { "count" => 3, "description" => "Secondary feature" }
273
497
  },
274
- [["a", "b"], ["c", "d", "e"]], # nested_lists
498
+ [["a", "b"], ["c", "d", "e"]], # nested_lists (a list of lists of strings)
275
499
  { # map_of_lists
276
500
  "group1" => [1, 2, 3],
277
501
  "group2" => [4, 5, 6]
Binary file
Binary file
Binary file
@@ -11,6 +11,9 @@ module Parquet
11
11
  # field :id, :int64, nullable: false # ID cannot be null
12
12
  # field :name, :string # Default nullable: true
13
13
  #
14
+ # # Decimal field with precision and scale
15
+ # field :price, :decimal, precision: 10, scale: 2
16
+ #
14
17
  # # List with non-nullable items
15
18
  # field :scores, :list, item: :float, item_nullable: false
16
19
  #
@@ -45,7 +48,7 @@ module Parquet
45
48
 
46
49
  # Define a field in the schema
47
50
  # @param name [String, Symbol] field name
48
- # @param type [Symbol] data type (:int32, :int64, :string, :list, :map, :struct, etc)
51
+ # @param type [Symbol] data type (:int32, :int64, :string, :list, :map, :struct, :decimal, etc)
49
52
  # @param nullable [Boolean] whether the field can be null (default: true)
50
53
  # @param kwargs [Hash] additional options depending on type
51
54
  #
@@ -55,6 +58,7 @@ module Parquet
55
58
  # - `key:, value:` if type == :map
56
59
  # - `key_nullable:, value_nullable:` controls nullability of map keys/values (default: true)
57
60
  # - `format:` if you want to store some format string
61
+ # - `precision:, scale:` if type == :decimal (precision defaults to 38, scale to 0)
58
62
  # - `nullable:` default to true if not specified
59
63
  def field(name, type, nullable: true, **kwargs, &block)
60
64
  field_hash = { name: name.to_s, type: type, nullable: !!nullable }
@@ -73,7 +77,21 @@ module Parquet
73
77
  raise ArgumentError, "list field `#{name}` requires `item:` type" unless item_type
74
78
  # Pass item_nullable if provided, otherwise use true as default
75
79
  item_nullable = kwargs[:item_nullable].nil? ? true : !!kwargs[:item_nullable]
76
- field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, &block)
80
+
81
+ # Pass precision and scale if type is decimal
82
+ if item_type == :decimal
83
+ precision = kwargs[:precision]
84
+ scale = kwargs[:scale]
85
+ field_hash[:item] = wrap_subtype(
86
+ item_type,
87
+ nullable: item_nullable,
88
+ precision: precision,
89
+ scale: scale,
90
+ &block
91
+ )
92
+ else
93
+ field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, &block)
94
+ end
77
95
  when :map
78
96
  # user must specify key:, value:
79
97
  key_type = kwargs[:key]
@@ -82,8 +100,47 @@ module Parquet
82
100
  # Pass key_nullable and value_nullable if provided, otherwise use true as default
83
101
  key_nullable = kwargs[:key_nullable].nil? ? true : !!kwargs[:key_nullable]
84
102
  value_nullable = kwargs[:value_nullable].nil? ? true : !!kwargs[:value_nullable]
103
+
85
104
  field_hash[:key] = wrap_subtype(key_type, nullable: key_nullable)
86
- field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, &block)
105
+
106
+ # Pass precision and scale if value type is decimal
107
+ if value_type == :decimal
108
+ precision = kwargs[:precision]
109
+ scale = kwargs[:scale]
110
+ field_hash[:value] = wrap_subtype(
111
+ value_type,
112
+ nullable: value_nullable,
113
+ precision: precision,
114
+ scale: scale,
115
+ &block
116
+ )
117
+ else
118
+ field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, &block)
119
+ end
120
+ when :decimal
121
+ # Store precision and scale for decimal type according to rules:
122
+ # 1. When neither precision nor scale is provided, use maximum precision (38)
123
+ # 2. When only precision is provided, scale defaults to 0
124
+ # 3. When only scale is provided, use maximum precision (38)
125
+ # 4. When both are provided, use the provided values
126
+
127
+ if kwargs[:precision].nil? && kwargs[:scale].nil?
128
+ # No precision or scale provided - use maximum precision
129
+ field_hash[:precision] = 38
130
+ field_hash[:scale] = 0
131
+ elsif kwargs[:precision] && kwargs[:scale].nil?
132
+ # Precision only - scale defaults to 0
133
+ field_hash[:precision] = kwargs[:precision]
134
+ field_hash[:scale] = 0
135
+ elsif kwargs[:precision].nil? && kwargs[:scale]
136
+ # Scale only - use maximum precision
137
+ field_hash[:precision] = 38
138
+ field_hash[:scale] = kwargs[:scale]
139
+ else
140
+ # Both provided
141
+ field_hash[:precision] = kwargs[:precision]
142
+ field_hash[:scale] = kwargs[:scale]
143
+ end
87
144
  else
88
145
  # primitive type: :int32, :int64, :string, etc.
89
146
  # do nothing else special
@@ -122,7 +179,7 @@ module Parquet
122
179
  # If user said: field "something", :list, item: :struct do ... end
123
180
  # we want to recursively parse that sub-struct from the block.
124
181
  # So wrap_subtype might be:
125
- def wrap_subtype(t, nullable: true, &block)
182
+ def wrap_subtype(t, nullable: true, precision: nil, scale: nil, &block)
126
183
  if t == :struct
127
184
  sub_builder = SchemaBuilder.new
128
185
  sub_builder.instance_eval(&block) if block
@@ -144,6 +201,34 @@ module Parquet
144
201
  end
145
202
 
146
203
  { type: :list, nullable: nullable, name: "item", item: sub_builder.fields[0] }
204
+ elsif t == :decimal
205
+ # Handle decimal type with precision and scale
206
+ result = { type: t, nullable: nullable, name: "item" }
207
+
208
+ # Follow the same rules as in field() method:
209
+ # 1. When neither precision nor scale is provided, use maximum precision (38)
210
+ # 2. When only precision is provided, scale defaults to 0
211
+ # 3. When only scale is provided, use maximum precision (38)
212
+ # 4. When both are provided, use the provided values
213
+ if precision.nil? && scale.nil?
214
+ # No precision or scale provided - use maximum precision
215
+ result[:precision] = 38
216
+ result[:scale] = 0
217
+ elsif precision && scale.nil?
218
+ # Precision only - scale defaults to 0
219
+ result[:precision] = precision
220
+ result[:scale] = 0
221
+ elsif precision.nil? && scale
222
+ # Scale only - use maximum precision
223
+ result[:precision] = 38
224
+ result[:scale] = scale
225
+ else
226
+ # Both provided
227
+ result[:precision] = precision
228
+ result[:scale] = scale
229
+ end
230
+
231
+ result
147
232
  else
148
233
  # e.g. :int32 => { type: :int32, nullable: true }
149
234
  { type: t, nullable: nullable, name: "item" }
@@ -1,3 +1,3 @@
1
1
  module Parquet
2
- VERSION = "0.5.2"
2
+ VERSION = "0.5.4"
3
3
  end
data/lib/parquet.rbi CHANGED
@@ -1,6 +1,17 @@
1
1
  # typed: true
2
2
 
3
3
  module Parquet
4
+ # Returns metadata information about a Parquet file
5
+ #
6
+ # The returned hash contains information about:
7
+ # - Basic file metadata (num_rows, created_by)
8
+ # - Schema information (fields, types, etc.)
9
+ # - Row group details
10
+ # - Column chunk information (compression, encodings, statistics)
11
+ sig { params(path: String).returns(T::Hash[String, T.untyped]) }
12
+ def self.metadata(path)
13
+ end
14
+
4
15
  # Options:
5
16
  # - `input`: String, File, or IO object containing parquet data
6
17
  # - `result_type`: String specifying the output format
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: parquet
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.2
4
+ version: 0.5.4
5
5
  platform: x86_64-linux-musl
6
6
  authors:
7
7
  - Nathan Jaremko
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2025-03-17 00:00:00.000000000 Z
11
+ date: 2025-04-01 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rake-compiler