parquet 0.5.3-x86_64-linux-musl → 0.5.4-x86_64-linux-musl

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 995f809814285f31a46bdcd914d5ce40da819e9af3e750ab466c06f0cf87c2de
-  data.tar.gz: e834018c1b0e445b224e36a362471d1697881eee319d91f2de0dfc4b80efce12
+  metadata.gz: ec4eb34a79658f88850161f93939452c5219fa735d4e01bd9b952b5b048274d0
+  data.tar.gz: a4357e8f18659b1b1ec06457ed3a1252d251ae2ef7b80834c4013b1da31df463
 SHA512:
-  metadata.gz: c11e1f68789cd21a4218a306a813516bee8ffac4c1903b09ca141b77cb706ace60afb5d8194eaec5fc7891e3f2fef570d062a907a1a5371121b717bcd34a286e
-  data.tar.gz: 1081a6ca4a5b0c11a96ca856397da5279bf8d44496c3493437130a70b2b8161b76094ce222c449965f3bcd2868731102894affb285d4e8143af3da9b19907a6e
+  metadata.gz: 7c3770e0f72ad2f4fc00e44b5d24da1c6d3ffc6faa68169fcfd9e6e69ef811709a0745166483fcb39bc87e67faf594e2899e0c0dc7508829a46aa893b41cc3ca
+  data.tar.gz: f11757a10e5f39f05cdbcb9349b062e2f933b47234fb9d6812497d65c6fe49505d8fdcc2c356f1d1c7eb079ebf151aed5f3b6d8b04ae8cbb61ce04a321c974f0
data/README.md CHANGED
@@ -8,6 +8,78 @@ This project is a Ruby library wrapping the [parquet-rs](https://github.com/apac
 
 This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.
 
+### Metadata
+
+The `metadata` method provides detailed information about a Parquet file's structure and contents:
+
+```ruby
+require "parquet"
+
+# Get metadata from a file path
+metadata = Parquet.metadata("data.parquet")
+
+# Or from an IO object
+File.open("data.parquet", "rb") do |file|
+  metadata = Parquet.metadata(file)
+end
+
+# Example metadata output:
+# {
+#   "num_rows" => 3,
+#   "created_by" => "parquet-rs version 54.2.0",
+#   "key_value_metadata" => [
+#     {
+#       "key" => "ARROW:schema",
+#       "value" => "base64_encoded_schema"
+#     }
+#   ],
+#   "schema" => {
+#     "name" => "arrow_schema",
+#     "fields" => [
+#       {
+#         "name" => "id",
+#         "type" => "primitive",
+#         "physical_type" => "INT64",
+#         "repetition" => "OPTIONAL",
+#         "converted_type" => "NONE"
+#       },
+#       # ... other fields
+#     ]
+#   },
+#   "row_groups" => [
+#     {
+#       "num_columns" => 5,
+#       "num_rows" => 3,
+#       "total_byte_size" => 379,
+#       "columns" => [
+#         {
+#           "column_path" => "id",
+#           "num_values" => 3,
+#           "compression" => "UNCOMPRESSED",
+#           "total_compressed_size" => 91,
+#           "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"],
+#           "statistics" => {
+#             "min_is_exact" => true,
+#             "max_is_exact" => true
+#           }
+#         },
+#         # ... other columns
+#       ]
+#     }
+#   ]
+# }
+```
+
+The metadata includes:
+- Total number of rows
+- File creation information
+- Key-value metadata (including Arrow schema)
+- Detailed schema information for each column
+- Row group information including:
+  - Number of columns and rows
+  - Total byte size
+  - Column-level details (compression, encodings, statistics)
+
 ### Row-wise Iteration
 
 The `each_row` method provides sequential access to individual rows:
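Reviewer note on the new Metadata section: the sketch below shows one way the returned hash might be consumed, assuming it has the shape of the README's example output. The hash literal is hand-written sample data rather than real output from `Parquet.metadata`, and `summarize_metadata` is a hypothetical helper, not part of the gem.

```ruby
# Hand-written sample mirroring the shape of the README's example output.
# Assumption: real output from Parquet.metadata may contain additional keys.
sample_metadata = {
  "num_rows" => 3,
  "created_by" => "parquet-rs version 54.2.0",
  "schema" => {
    "name" => "arrow_schema",
    "fields" => [
      { "name" => "id", "type" => "primitive", "physical_type" => "INT64",
        "repetition" => "OPTIONAL", "converted_type" => "NONE" }
    ]
  },
  "row_groups" => [
    {
      "num_columns" => 5,
      "num_rows" => 3,
      "total_byte_size" => 379,
      "columns" => [
        { "column_path" => "id", "num_values" => 3,
          "compression" => "UNCOMPRESSED", "total_compressed_size" => 91,
          "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"] }
      ]
    }
  ]
}

# Hypothetical helper: condenses the metadata hash into a few headline numbers.
def summarize_metadata(metadata)
  row_groups = metadata.fetch("row_groups", [])
  columns = row_groups.flat_map { |rg| rg.fetch("columns", []) }
  {
    rows: metadata["num_rows"],
    row_groups: row_groups.size,
    field_names: metadata.dig("schema", "fields").to_a.map { |f| f["name"] },
    compressed_bytes: columns.sum { |c| c.fetch("total_compressed_size", 0) },
    codecs: columns.map { |c| c["compression"] }.uniq
  }
end

summary = summarize_metadata(sample_metadata)
puts summary
```

Using `fetch` with defaults and `dig` keeps the helper tolerant of files whose metadata omits a key, which seems prudent since the README only shows one example.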
@@ -236,17 +308,169 @@ schema = Parquet::Schema.define do
     field :description, :string
   end
 
-  # Nested lists
+  # Nested lists (list of lists of strings)
   field :nested_lists, :list, item: :list do
-    field :item, :string # For nested lists, inner item must be named 'item'
+    field :item, :string # REQUIRED: Inner item field MUST be named 'item' for nested lists
   end
 
   # Map of lists
   field :map_of_lists, :map, key: :string, value: :list do
-    field :item, :int32 # For list items in maps, item must be named 'item'
+    field :item, :int32 # REQUIRED: List items in maps MUST be named 'item'
   end
 end
 
+### Nested Lists
+
+When working with nested lists (a list of lists), there are specific requirements:
+
+1. Using the Schema DSL:
+```ruby
+# A list of lists of strings
+field :nested_lists, :list, item: :list do
+  field :item, :string # For nested lists, inner item MUST be named 'item'
+end
+```
+
+2. Using hash-based schema format:
+```ruby
+# A list of lists of integers
+{ "nested_numbers" => "list<list<int32>>" }
+```
+
+The data for nested lists is structured as an array of arrays:
+```ruby
+# Data for the nested_lists field
+[["a", "b"], ["c", "d", "e"], []] # Last one is an empty inner list
+```
+
+### Decimal Data Type
+
+Parquet supports decimal numbers with configurable precision and scale, which is essential for financial applications where exact decimal representation is critical. The library seamlessly converts between Ruby's `BigDecimal` and Parquet's decimal type.
+
+#### Decimal Precision and Scale
+
+When working with decimal fields, you need to understand two key parameters:
+
+- **Precision**: The total number of significant digits (both before and after the decimal point)
+- **Scale**: The number of digits after the decimal point
+
+The rules for defining decimals are:
+
+```ruby
+# No precision/scale specified - uses maximum precision (38) with scale 0
+field :amount1, :decimal # Equivalent to INTEGER with 38 digits
+
+# Only precision specified - scale defaults to 0
+field :amount2, :decimal, precision: 10 # 10 digits, no decimal places
+
+# Only scale specified - uses maximum precision (38)
+field :amount3, :decimal, scale: 2 # 38 digits with 2 decimal places
+
+# Both precision and scale specified
+field :amount4, :decimal, precision: 10, scale: 2 # 10 digits with 2 decimal places
+```
+
+#### Financial Data Example
+
+Here's a practical example for a financial application:
+
+```ruby
+require "parquet"
+require "bigdecimal"
+
+# Schema for financial transactions
+schema = Parquet::Schema.define do
+  field :transaction_id, :string, nullable: false
+  field :timestamp, :timestamp_millis, nullable: false
+  field :amount, :decimal, precision: 12, scale: 2 # Supports up to 10^10 with 2 decimal places
+  field :balance, :decimal, precision: 16, scale: 2 # Larger precision for running balances
+  field :currency, :string
+  field :exchange_rate, :decimal, precision: 10, scale: 6 # 6 decimal places for forex rates
+  field :fee, :decimal, precision: 8, scale: 2, nullable: true # Optional fee
+  field :category, :string
+end
+
+# Sample financial data
+transactions = [
+  [
+    "T-12345",
+    Time.now,
+    BigDecimal("1256.99"), # amount (directly using BigDecimal)
+    BigDecimal("10250.25"), # balance
+    "USD",
+    BigDecimal("1.0"), # exchange_rate
+    BigDecimal("2.50"), # fee
+    "Groceries"
+  ],
+  [
+    "T-12346",
+    Time.now - 86400, # yesterday
+    BigDecimal("-89.50"), # negative amount for withdrawal
+    BigDecimal("10160.75"), # updated balance
+    "USD",
+    BigDecimal("1.0"), # exchange_rate
+    nil, # no fee
+    "Transportation"
+  ],
+  [
+    "T-12347",
+    Time.now - 172800, # two days ago
+    BigDecimal("250.00"), # amount
+    BigDecimal("10410.75"), # balance
+    "EUR", # different currency
+    BigDecimal("1.05463"), # exchange_rate
+    BigDecimal("1.75"), # fee
+    "Entertainment"
+  ]
+]
+
+# Write financial data to Parquet file
+Parquet.write_rows(transactions.each, schema: schema, write_to: "financial_data.parquet")
+
+# Read back transactions
+Parquet.each_row("financial_data.parquet") do |transaction|
+  # Access decimal fields as BigDecimal objects
+  puts "Transaction: #{transaction['transaction_id']}"
+  puts " Amount: #{transaction['currency']} #{transaction['amount']}"
+  puts " Balance: $#{transaction['balance']}"
+  puts " Fee: #{transaction['fee'] || 'No fee'}"
+
+  # You can perform precise decimal calculations
+  if transaction['currency'] != 'USD'
+    usd_amount = transaction['amount'] * transaction['exchange_rate']
+    puts " USD Equivalent: $#{usd_amount.round(2)}"
+  end
+end
+```
+
+#### Decimal Type Storage Considerations
+
+Parquet optimizes storage based on the precision:
+- For precision ≤ 9: Uses 4-byte INT32
+- For precision ≤ 18: Uses 8-byte INT64
+- For precision ≤ 38: Uses 16-byte BYTE_ARRAY
+
+Choose appropriate precision and scale for your data to optimize storage while ensuring adequate range:
+
+```ruby
+# Banking examples
+field :account_balance, :decimal, precision: 16, scale: 2 # Up to 14 digits before decimal point
+field :interest_rate, :decimal, precision: 8, scale: 6 # Rate with 6 decimal places (e.g., 0.015625)
+
+# E-commerce examples
+field :product_price, :decimal, precision: 10, scale: 2 # Product price
+field :shipping_weight, :decimal, precision: 6, scale: 3 # Weight in kg with 3 decimal places
+
+# Analytics examples
+field :conversion_rate, :decimal, precision: 5, scale: 4 # Rate like 0.0123
+field :daily_revenue, :decimal, precision: 14, scale: 2 # Daily revenue with 2 decimal places
+```
+
+### Sample Data with Nested Structures
+
+Here's an example showing how to use the schema defined earlier with sample data:
+
+```ruby
 # Sample data with nested structures
 data = [
   [
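Reviewer note on the new Decimal sections: the sketch below works through the precision/scale arithmetic the README describes. `max_decimal`, `fits_decimal?`, and `readme_storage_for` are hypothetical helpers written for illustration, not part of the gem. The storage mapping simply restates the three README bullets; the physical type actually chosen by the underlying writer is not confirmed here.

```ruby
require "bigdecimal"

# Hypothetical helper: largest magnitude that fits DECIMAL(precision, scale),
# i.e. `precision` nines with the decimal point shifted `scale` places left.
def max_decimal(precision, scale)
  raise ArgumentError, "precision must be 1..38" unless (1..38).cover?(precision)
  raise ArgumentError, "scale must be 0..precision" unless (0..precision).cover?(scale)

  BigDecimal(10**precision - 1) * BigDecimal("1e#{-scale}")
end

# Hypothetical helper: true when the value is within range and has no more
# than `scale` fractional digits.
def fits_decimal?(value, precision:, scale:)
  decimal = BigDecimal(value.to_s)
  decimal.abs <= max_decimal(precision, scale) && decimal.round(scale) == decimal
end

# Restates the README's storage bullets (assumption: taken at face value).
def readme_storage_for(precision)
  case precision
  when 1..9 then "INT32"
  when 10..18 then "INT64"
  when 19..38 then "16-byte BYTE_ARRAY"
  else raise ArgumentError, "precision must be 1..38"
  end
end

puts max_decimal(12, 2).to_s("F")
puts fits_decimal?("1.05463", precision: 10, scale: 6)
puts readme_storage_for(16)
```

Under this arithmetic, the `amount` field declared with `precision: 12, scale: 2` tops out just below 10^10, which matches the "up to 10^10" comment in the financial example.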
@@ -271,7 +495,7 @@ data = [
       "feature1" => { "count" => 5, "description" => "Main feature" },
       "feature2" => { "count" => 3, "description" => "Secondary feature" }
     },
-    [["a", "b"], ["c", "d", "e"]], # nested_lists
+    [["a", "b"], ["c", "d", "e"]], # nested_lists (a list of lists of strings)
     { # map_of_lists
       "group1" => [1, 2, 3],
       "group2" => [4, 5, 6]
Binary file
Binary file
Binary file
@@ -58,7 +58,7 @@ module Parquet
 # - `key:, value:` if type == :map
 # - `key_nullable:, value_nullable:` controls nullability of map keys/values (default: true)
 # - `format:` if you want to store some format string
-# - `precision:, scale:` if type == :decimal (precision defaults to 18, scale to 2)
+# - `precision:, scale:` if type == :decimal (precision defaults to 38, scale to 0)
 # - `nullable:` default to true if not specified
 def field(name, type, nullable: true, **kwargs, &block)
   field_hash = { name: name.to_s, type: type, nullable: !!nullable }
@@ -77,12 +77,18 @@ module Parquet
 raise ArgumentError, "list field `#{name}` requires `item:` type" unless item_type
 # Pass item_nullable if provided, otherwise use true as default
 item_nullable = kwargs[:item_nullable].nil? ? true : !!kwargs[:item_nullable]
-
+
 # Pass precision and scale if type is decimal
 if item_type == :decimal
   precision = kwargs[:precision]
   scale = kwargs[:scale]
-  field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, precision: precision, scale: scale, &block)
+  field_hash[:item] = wrap_subtype(
+    item_type,
+    nullable: item_nullable,
+    precision: precision,
+    scale: scale,
+    &block
+  )
 else
   field_hash[:item] = wrap_subtype(item_type, nullable: item_nullable, &block)
 end
@@ -94,14 +100,20 @@ module Parquet
 # Pass key_nullable and value_nullable if provided, otherwise use true as default
 key_nullable = kwargs[:key_nullable].nil? ? true : !!kwargs[:key_nullable]
 value_nullable = kwargs[:value_nullable].nil? ? true : !!kwargs[:value_nullable]
-
+
 field_hash[:key] = wrap_subtype(key_type, nullable: key_nullable)
-
+
 # Pass precision and scale if value type is decimal
 if value_type == :decimal
   precision = kwargs[:precision]
   scale = kwargs[:scale]
-  field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, precision: precision, scale: scale, &block)
+  field_hash[:value] = wrap_subtype(
+    value_type,
+    nullable: value_nullable,
+    precision: precision,
+    scale: scale,
+    &block
+  )
 else
   field_hash[:value] = wrap_subtype(value_type, nullable: value_nullable, &block)
 end
@@ -111,7 +123,7 @@ module Parquet
 # 2. When only precision is provided, scale defaults to 0
 # 3. When only scale is provided, use maximum precision (38)
 # 4. When both are provided, use the provided values
-
+
 if kwargs[:precision].nil? && kwargs[:scale].nil?
   # No precision or scale provided - use maximum precision
   field_hash[:precision] = 38
@@ -192,7 +204,7 @@ module Parquet
 elsif t == :decimal
   # Handle decimal type with precision and scale
   result = { type: t, nullable: nullable, name: "item" }
-
+
   # Follow the same rules as in field() method:
   # 1. When neither precision nor scale is provided, use maximum precision (38)
   # 2. When only precision is provided, scale defaults to 0
@@ -215,7 +227,7 @@ module Parquet
     result[:precision] = precision
     result[:scale] = scale
   end
-
+
   result
 else
   # e.g. :int32 => { type: :int32, nullable: true }
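Reviewer note on the decimal defaults: the four rules quoted in the comments of these hunks (and the corrected doc comment, which now says precision defaults to 38 and scale to 0) can be restated as a small standalone function. `resolve_decimal_params` is a hypothetical name used only for this sketch; it is not the gem's actual implementation, which is only partially visible in the diff.

```ruby
# Hypothetical restatement of the four defaulting rules from the comments:
# 1. neither given   -> maximum precision (38), scale 0
# 2. only precision  -> scale defaults to 0
# 3. only scale      -> maximum precision (38)
# 4. both given      -> use the provided values
def resolve_decimal_params(precision: nil, scale: nil)
  max_precision = 38

  if precision.nil? && scale.nil?
    { precision: max_precision, scale: 0 }
  elsif scale.nil?
    { precision: precision, scale: 0 }
  elsif precision.nil?
    { precision: max_precision, scale: scale }
  else
    { precision: precision, scale: scale }
  end
end

p resolve_decimal_params
p resolve_decimal_params(precision: 10)
p resolve_decimal_params(scale: 2)
p resolve_decimal_params(precision: 10, scale: 2)
```

The four calls correspond to the `amount1` through `amount4` fields in the README's "rules for defining decimals" example.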
@@ -1,3 +1,3 @@
 module Parquet
-  VERSION = "0.5.3"
+  VERSION = "0.5.4"
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: parquet
 version: !ruby/object:Gem::Version
-  version: 0.5.3
+  version: 0.5.4
 platform: x86_64-linux-musl
 authors:
 - Nathan Jaremko