parquet 0.7.2-aarch64-linux → 0.8.0-aarch64-linux

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e522f51be0304d36a528be23ae06ec426c96daa4d0890455862dadd3985c3023
4
- data.tar.gz: 857bc42992bf7afd986429be73f030ceabc90b6f6de6f90db3a214633f41213e
3
+ metadata.gz: 89db55543839853aef62e3f11511ba4ff54ee1896c7085b83d9e6e44d2b10335
4
+ data.tar.gz: 07db9c35d4c4777d9339f92b22112760c9ff0e96091b7dc26940c54ed4df03b8
5
5
  SHA512:
6
- metadata.gz: 1002992a1414104900ae6d378b022813bdc2900bc86bc727ec21f16e9a0e2474572b232387547c1dcd3206d0d3aa9264dcfa400c75fa07206907327c63948a62
7
- data.tar.gz: 04f27657c9c721a41db2f293b576e7481c836798f6cb2b22cf5289a8c08f269dc63234f2f25217d15174b8755c32aec6ea971736668d352731ff7ddc59578549
6
+ metadata.gz: 3f78682dc2b7b4d5aa3fa186b7cf743b1808c93644ee2097d76051c21e900da23578493a3d0e637b4bc61dd9ce06a85865fdf46e33a24acf46ff059519b5199f
7
+ data.tar.gz: ca2f9c79cfd0faf50567c7df2d73cbbea508e5ade7234ac8d2791288d0f174d1b34f707e63ef29765cdd6573ca34ac8ee0694f21d3ba46d7a8fd0ee49eb2f3cc
data/Gemfile CHANGED
@@ -16,5 +16,6 @@ end
16
16
 
17
17
  group :test do
18
18
  gem "csv"
19
+ gem "logger"
19
20
  gem "minitest", "~> 5.0"
20
21
  end
data/README.md CHANGED
@@ -2,616 +2,412 @@
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/parquet.svg)](https://badge.fury.io/rb/parquet)
4
4
 
5
- This project is a Ruby library wrapping the [`parquet`](https://github.com/apache/arrow-rs/tree/main/parquet) rust crate.
5
+ Read and write [Apache Parquet](https://parquet.apache.org/) files from Ruby. This gem wraps the official Apache [`parquet`](https://github.com/apache/arrow-rs/tree/main/parquet) Rust crate, providing:
6
6
 
7
- ## Usage
7
+ - **High performance** columnar data storage and retrieval
8
+ - **Memory-efficient** streaming APIs for large datasets
9
+ - **Full compatibility** with the Apache Parquet specification
10
+ - **Simple, Ruby-native** APIs that feel natural
8
11
 
9
- This library provides high-level bindings to `parquet` with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with subset of columns.
12
+ ## Why Use This Library?
10
13
 
11
- ### Metadata
14
+ Apache Parquet is the de facto standard for analytical data storage, offering:
15
+ - **Efficient compression** - typically 2-10x smaller than CSV
16
+ - **Fast columnar access** - read only the columns you need
17
+ - **Rich type system** - preserves data types, including nested structures
18
+ - **Wide ecosystem support** - works with Spark, Pandas, DuckDB, and more
12
19
 
13
- The `metadata` method provides detailed information about a Parquet file's structure and contents:
20
+ ## Installation
21
+
22
+ Add this line to your application's Gemfile:
23
+
24
+ ```ruby
25
+ gem 'parquet'
26
+ ```
27
+
28
+ Then execute:
29
+
30
+ ```bash
31
+ $ bundle install
32
+ ```
33
+
34
+ Or install it directly:
35
+
36
+ ```bash
37
+ $ gem install parquet
38
+ ```
39
+
40
+ ## Quick Start
41
+
42
+ ### Reading Data
14
43
 
15
44
  ```ruby
16
45
  require "parquet"
17
46
 
18
- # Get metadata from a file path
19
- metadata = Parquet.metadata("data.parquet")
47
+ # Read Parquet files row by row
48
+ Parquet.each_row("data.parquet") do |row|
49
+ puts row # => {"id" => 1, "name" => "Alice", "score" => 95.5}
50
+ end
20
51
 
21
- # Or from an IO object
22
- File.open("data.parquet", "rb") do |file|
23
- metadata = Parquet.metadata(file)
52
+ # Or column by column for better performance
53
+ Parquet.each_column("data.parquet", batch_size: 1000) do |batch|
54
+ puts batch # => {"id" => [1, 2, ...], "name" => ["Alice", "Bob", ...]}
24
55
  end
56
+ ```
57
+
58
+ ### Writing Data
25
59
 
26
- # Example metadata output:
27
- # {
28
- # "num_rows" => 3,
29
- # "created_by" => "parquet-rs version 54.2.0",
30
- # "key_value_metadata" => [
31
- # {
32
- # "key" => "ARROW:schema",
33
- # "value" => "base64_encoded_schema"
34
- # }
35
- # ],
36
- # "schema" => {
37
- # "name" => "arrow_schema",
38
- # "fields" => [
39
- # {
40
- # "name" => "id",
41
- # "type" => "primitive",
42
- # "physical_type" => "INT64",
43
- # "repetition" => "OPTIONAL",
44
- # "converted_type" => "NONE"
45
- # },
46
- # # ... other fields
47
- # ]
48
- # },
49
- # "row_groups" => [
50
- # {
51
- # "num_columns" => 5,
52
- # "num_rows" => 3,
53
- # "total_byte_size" => 379,
54
- # "columns" => [
55
- # {
56
- # "column_path" => "id",
57
- # "num_values" => 3,
58
- # "compression" => "UNCOMPRESSED",
59
- # "total_compressed_size" => 91,
60
- # "encodings" => ["PLAIN", "RLE", "RLE_DICTIONARY"],
61
- # "statistics" => {
62
- # "min_is_exact" => true,
63
- # "max_is_exact" => true
64
- # }
65
- # },
66
- # # ... other columns
67
- # ]
68
- # }
69
- # ]
70
- # }
60
+ ```ruby
61
+ # Define your schema
62
+ schema = [
63
+ { "id" => "int64" },
64
+ { "name" => "string" },
65
+ { "score" => "double" }
66
+ ]
67
+
68
+ # Write row by row
69
+ rows = [
70
+ [1, "Alice", 95.5],
71
+ [2, "Bob", 82.3]
72
+ ]
73
+
74
+ Parquet.write_rows(rows.each, schema: schema, write_to: "output.parquet")
71
75
  ```
72
76
 
73
- The metadata includes:
74
- - Total number of rows
75
- - File creation information
76
- - Key-value metadata (including Arrow schema)
77
- - Detailed schema information for each column
78
- - Row group information including:
79
- - Number of columns and rows
80
- - Total byte size
81
- - Column-level details (compression, encodings, statistics)
77
+ ## Reading Parquet Files
82
78
 
83
- ### Row-wise Iteration
79
+ The library provides two APIs for reading data, each optimized for different use cases:
84
80
 
85
- The `each_row` method provides sequential access to individual rows:
81
+ ### Row-wise Reading (Sequential Access)
86
82
 
87
- ```ruby
88
- require "parquet"
83
+ Best for: Processing records one at a time, data transformations, ETL pipelines
89
84
 
90
- # Basic usage with default hash output
85
+ ```ruby
86
+ # Basic usage - returns hashes
91
87
  Parquet.each_row("data.parquet") do |row|
92
- puts row.inspect # {"id"=>1, "name"=>"name_1"}
88
+ puts row # => {"id" => 1, "name" => "Alice"}
93
89
  end
94
90
 
95
- # Array output for more efficient memory usage
91
+ # Memory-efficient array format
96
92
  Parquet.each_row("data.parquet", result_type: :array) do |row|
97
- puts row.inspect # [1, "name_1"]
93
+ puts row # => [1, "Alice"]
98
94
  end
99
95
 
100
- # Select specific columns to reduce I/O
96
+ # Read specific columns only
101
97
  Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
102
- puts row.inspect
98
+ # Only requested columns are loaded from disk
103
99
  end
104
100
 
105
- # Reading from IO objects
101
+ # Works with IO objects
106
102
  File.open("data.parquet", "rb") do |file|
107
103
  Parquet.each_row(file) do |row|
108
- puts row.inspect
104
+ # Process row
109
105
  end
110
106
  end
111
107
  ```
112
108
 
113
- ### Column-wise Iteration
109
+ ### Column-wise Reading (Analytical Access)
114
110
 
115
- The `each_column` method reads data in column-oriented batches, which is typically more efficient for analytical queries:
111
+ Best for: Analytics, aggregations, when you need few columns from wide tables
116
112
 
117
113
  ```ruby
118
- require "parquet"
119
-
120
- # Process columns in batches of 1024 rows
121
- Parquet.each_column("data.parquet", batch_size: 1024) do |batch|
122
- # With result_type: :hash (default)
123
- puts batch.inspect
124
- # {
125
- # "id" => [1, 2, ..., 1024],
126
- # "name" => ["name_1", "name_2", ..., "name_1024"]
127
- # }
114
+ # Process data in column batches
115
+ Parquet.each_column("data.parquet", batch_size: 1000) do |batch|
116
+ # batch is a hash of column_name => array_of_values
117
+ ids = batch["id"] # => [1, 2, 3, ..., 1000]
118
+ names = batch["name"] # => ["Alice", "Bob", ...]
119
+
120
+ # Perform columnar operations
121
+ avg_id = ids.sum.to_f / ids.length
128
122
  end
129
123
 
130
- # Array output with specific columns
124
+ # Array format for more control
131
125
  Parquet.each_column("data.parquet",
132
- columns: ["id", "name"],
133
126
  result_type: :array,
134
- batch_size: 1024) do |batch|
135
- puts batch.inspect
136
- # [
137
- # [1, 2, ..., 1024], # id column
138
- # ["name_1", "name_2", ...] # name column
139
- # ]
127
+ columns: ["id", "name"]) do |batch|
128
+ # batch is an array of arrays
129
+ # [[1, 2, ...], ["Alice", "Bob", ...]]
140
130
  end
141
131
  ```
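An illustrative sketch that is not part of the released README: the example above averages within a single batch, so a whole-file aggregate has to be accumulated across batches. The `batches` array below is a hypothetical stand-in for what `Parquet.each_column` would yield in its default hash format; the column names and values are assumptions.

```ruby
# Stand-in for the batches Parquet.each_column would yield:
# each batch is a hash of column name => array of values.
batches = [
  { "id" => [1, 2, 3], "score" => [95.5, 82.3, 88.7] },
  { "id" => [4, 5],    "score" => [91.2, 79.8] }
]

total = 0.0
count = 0

# Keep only a running sum and count so no batch has to stay in memory.
batches.each do |batch|
  scores = batch["score"]
  total += scores.sum
  count += scores.length
end

mean_score = count.zero? ? nil : total / count
puts format("mean score over %d rows: %.2f", count, mean_score) if mean_score
```

Keeping only the running totals means memory use stays bounded by one batch, whatever the file size.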
142
132
 
143
- ### Arguments
144
-
145
- Both methods accept these common arguments:
133
+ ### File Metadata
146
134
 
147
- - `input`: Path string or IO-like object containing Parquet data
148
- - `result_type`: Output format (`:hash` or `:array`, defaults to `:hash`)
149
- - `columns`: Optional array of column names to read (improves performance)
135
+ Inspect file structure without reading data:
150
136
 
151
- Additional arguments for `each_column`:
137
+ ```ruby
138
+ metadata = Parquet.metadata("data.parquet")
152
139
 
153
- - `batch_size`: Number of rows per batch (defaults to implementation-defined value)
140
+ puts metadata["num_rows"] # Total row count
141
+ puts metadata["created_by"] # Writer identification
142
+ puts metadata["schema"]["fields"] # Column definitions
143
+ puts metadata["row_groups"].size # Number of row groups
144
+ ```
154
145
 
155
- When no block is given, both methods return an Enumerator.
146
+ ## Writing Parquet Files
156
147
 
157
- ### Writing Row-wise Data
148
+ ### Row-wise Writing
158
149
 
159
- The `write_rows` method allows you to write data row by row:
150
+ Best for: Streaming data, converting from other formats, memory-constrained environments
160
151
 
161
152
  ```ruby
162
- require "parquet"
163
-
164
- # Define the schema for your data
153
+ # Basic schema definition
165
154
  schema = [
166
155
  { "id" => "int64" },
167
156
  { "name" => "string" },
168
- { "score" => "double" }
157
+ { "active" => "boolean" },
158
+ { "balance" => "double" }
169
159
  ]
170
160
 
171
- # Create an enumerator that yields arrays of row values
172
- rows = [
173
- [1, "Alice", 95.5],
174
- [2, "Bob", 82.3],
175
- [3, "Charlie", 88.7]
176
- ].each
177
-
178
- # Write to a file
179
- Parquet.write_rows(rows, schema: schema, write_to: "data.parquet")
180
-
181
- # Write to an IO object
182
- File.open("data.parquet", "wb") do |file|
183
- Parquet.write_rows(rows, schema: schema, write_to: file)
161
+ # Stream data from any enumerable
162
+ rows = CSV.foreach("input.csv").map do |row|
163
+ [row[0].to_i, row[1], row[2] == "true", row[3].to_f]
184
164
  end
185
165
 
186
- # Optionally specify batch size (default is 1000)
187
- Parquet.write_rows(rows,
188
- schema: schema,
189
- write_to: "data.parquet",
190
- batch_size: 500
191
- )
192
-
193
- # Optionally specify memory threshold for flushing (default is 64MB)
194
- Parquet.write_rows(rows,
195
- schema: schema,
196
- write_to: "data.parquet",
197
- flush_threshold: 32 * 1024 * 1024 # 32MB
198
- )
199
-
200
- # Optionally specify sample size for row size estimation (default is 100)
201
166
  Parquet.write_rows(rows,
202
167
  schema: schema,
203
- write_to: "data.parquet",
204
- sample_size: 200 # Sample 200 rows for size estimation
168
+ write_to: "output.parquet",
169
+ batch_size: 5000 # Rows per batch; must be positive (default: 1000)
205
170
  )
206
171
  ```
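An illustrative sketch that is not part of the released README: calling `.map` on an Enumerator builds the whole array before anything is written, so a lazy enumerator may be a better fit for large inputs. The sketch uses a `StringIO` and a plain comma split as stand-ins for a real file and CSV parser. Whether `Parquet.write_rows` accepts an `Enumerator::Lazy` is an assumption that the diff does not confirm.

```ruby
require "stringio"

# Stand-in for a CSV file on disk; a real pipeline would open a File instead.
input = StringIO.new("1,Alice,true,10.5\n2,Bob,false,20.25\n")

# each_line without a block returns an Enumerator; .lazy defers the map so
# each row is converted only when the consumer asks for it.
rows = input.each_line.lazy.map do |line|
  id, name, active, balance = line.chomp.split(",")
  [id.to_i, name, active == "true", balance.to_f]
end

first_row = rows.first # only the first line has been read and converted
puts first_row.inspect
```

Because the source is an IO object, the enumerator is single-pass: rows already consumed are not replayed.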
207
172
 
208
- ### Writing Column-wise Data
173
+ ### Column-wise Writing
209
174
 
210
- The `write_columns` method provides a more efficient way to write data in column-oriented batches:
175
+ Best for: Pre-columnar data, better compression, higher performance
211
176
 
212
177
  ```ruby
213
- require "parquet"
178
+ # Prepare columnar data
179
+ ids = [1, 2, 3, 4, 5]
180
+ names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
181
+ scores = [95.5, 82.3, 88.7, 91.2, 79.8]
182
+
183
+ # Create batches
184
+ batches = [[
185
+ ids, # First column
186
+ names, # Second column
187
+ scores # Third column
188
+ ]]
214
189
 
215
- # Define the schema
216
190
  schema = [
217
191
  { "id" => "int64" },
218
192
  { "name" => "string" },
219
193
  { "score" => "double" }
220
194
  ]
221
195
 
222
- # Create batches of column data
223
- batches = [
224
- # First batch
225
- [
226
- [1, 2], # id column
227
- ["Alice", "Bob"], # name column
228
- [95.5, 82.3] # score column
229
- ],
230
- # Second batch
231
- [
232
- [3], # id column
233
- ["Charlie"], # name column
234
- [88.7] # score column
235
- ]
236
- ]
237
-
238
- # Create an enumerator from the batches
239
- columns = batches.each
240
-
241
- # Write to a parquet file with default ZSTD compression
242
- Parquet.write_columns(columns, schema: schema, write_to: "data.parquet")
243
-
244
- # Write to a parquet file with specific compression and memory threshold
245
- Parquet.write_columns(columns,
196
+ Parquet.write_columns(batches.each,
246
197
  schema: schema,
247
- write_to: "data.parquet",
248
- compression: "snappy", # Supported: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
249
- flush_threshold: 32 * 1024 * 1024 # 32MB
198
+ write_to: "output.parquet",
199
+ compression: "snappy" # Options: none, snappy, gzip, lz4, zstd
250
200
  )
251
-
252
- # Write to an IO object
253
- File.open("data.parquet", "wb") do |file|
254
- Parquet.write_columns(columns, schema: schema, write_to: file)
255
- end
256
201
  ```
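An illustrative sketch that is not part of the released README: when data starts out as rows, it can be reshaped into the batch layout shown above (one array per column, in schema order) with `each_slice` and `transpose`. The batch size of 2 is an arbitrary value chosen for the example, not a library default.

```ruby
rows = [
  [1, "Alice", 95.5],
  [2, "Bob", 82.3],
  [3, "Charlie", 88.7],
  [4, "Diana", 91.2],
  [5, "Eve", 79.8]
]

batch_rows = 2 # illustrative batch size, not a library default

# each_slice groups rows into batches; transpose turns each group of rows
# into one array per column (ids, names, scores).
batches = rows.each_slice(batch_rows).map(&:transpose)

batches.each { |batch| p batch }
```

The last batch is simply shorter when the row count is not a multiple of the batch size.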
257
202
 
258
- The following data types are supported in the schema:
259
-
260
- - `int8`, `int16`, `int32`, `int64`
261
- - `uint8`, `uint16`, `uint32`, `uint64`
262
- - `float`, `double`
263
- - `string`
264
- - `binary`
265
- - `boolean`
266
- - `date32`
267
- - `timestamp_millis`, `timestamp_micros`, `timestamp_second`, `timestamp_nanos`
268
- - `time_millis`, `time_micros`
269
-
270
- ### Timestamp Timezone Handling
271
-
272
- **CRITICAL PARQUET SPECIFICATION LIMITATION**: The Apache Parquet format specification only supports two types of timestamps:
273
- 1. **UTC-normalized timestamps** (when ANY timezone is specified) - `isAdjustedToUTC = true`
274
- 2. **Local/unzoned timestamps** (when NO timezone is specified) - `isAdjustedToUTC = false`
275
-
276
- This means that specific timezone offsets like "+09:00" or "America/New_York" CANNOT be preserved in Parquet files. This is not a limitation of this Ruby library, but of the Parquet format itself.
277
-
278
- **When Writing:**
279
- - If the schema specifies ANY timezone (whether it's "UTC", "+09:00", "America/New_York", etc.):
280
- - Time values are converted to UTC before storing
281
- - The file metadata sets `isAdjustedToUTC = true`
282
- - The original timezone information is LOST
283
- - If the schema doesn't specify a timezone:
284
- - Timestamps are stored as local/unzoned time (no conversion)
285
- - The file metadata sets `isAdjustedToUTC = false`
286
- - These represent "wall clock" times without timezone context
287
-
288
- **When Reading:**
289
- - If the Parquet file has `isAdjustedToUTC = true` (ANY timezone was specified during writing):
290
- - Time objects are returned in UTC
291
- - The original timezone (e.g., "+09:00") is NOT recoverable
292
- - If the file has `isAdjustedToUTC = false` (NO timezone was specified):
293
- - Time objects are returned as local time in your system's timezone
294
- - These are "wall clock" times without timezone information
203
+ `write_columns` also accepts `logger:` with the same Ruby logger interface as
204
+ row writes.
295
205
 
296
- ```ruby
297
- # Preferred approach: use has_timezone to be explicit about UTC vs local storage
298
- schema = Parquet::Schema.define do
299
- field :timestamp_utc, :timestamp_millis, has_timezone: true # Stored as UTC (default)
300
- field :timestamp_local, :timestamp_millis, has_timezone: false # Stored as local/unzoned
301
- field :timestamp_default, :timestamp_millis # Default: UTC storage
302
- end
206
+ ## Data Types
303
207
 
304
- # Legacy approach still supported (any timezone value means UTC storage)
305
- schema_legacy = Parquet::Schema.define do
306
- field :timestamp_utc, :timestamp_millis, timezone: "UTC" # Stored as UTC
307
- field :timestamp_tokyo, :timestamp_millis, timezone: "+09:00" # Also stored as UTC!
308
- field :timestamp_local, :timestamp_millis # No timezone - local
309
- end
208
+ ### Basic Types
310
209
 
311
- # Time values will be converted based on schema
312
- rows = [
313
- [
314
- Time.new(2024, 1, 1, 12, 0, 0, "+03:00"), # Converted to UTC if has_timezone: true
315
- Time.new(2024, 1, 1, 12, 0, 0, "-05:00"), # Kept as local if has_timezone: false
316
- Time.new(2024, 1, 1, 12, 0, 0) # Kept as local (default)
317
- ]
210
+ ```ruby
211
+ schema = [
212
+ # Integers
213
+ { "tiny" => "int8" }, # -128 to 127
214
+ { "small" => "int16" }, # -32,768 to 32,767
215
+ { "medium" => "int32" }, # ±2 billion
216
+ { "large" => "int64" }, # ±9 quintillion
217
+
218
+ # Unsigned integers
219
+ { "ubyte" => "uint8" }, # 0 to 255
220
+ { "ushort" => "uint16" }, # 0 to 65,535
221
+ { "uint" => "uint32" }, # 0 to 4 billion
222
+ { "ulong" => "uint64" }, # 0 to 18 quintillion
223
+
224
+ # Floating point
225
+ { "price" => "float" }, # 32-bit precision
226
+ { "amount" => "double" }, # 64-bit precision
227
+
228
+ # Other basics
229
+ { "name" => "string" },
230
+ { "data" => "binary" },
231
+ { "active" => "boolean" }
318
232
  ]
233
+ ```
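An illustrative sketch that is not part of the released README: the comments above round the wider integer ranges ("±2 billion", "0 to 4 billion"). The exact bounds follow from the bit width, and a value can be checked against them before writing. The helper names `int_range` and `fits?` are hypothetical, and how the gem itself reacts to out-of-range values is not shown in this diff.

```ruby
# Exact bounds for an integer type of the given bit width.
def int_range(bits, signed:)
  if signed
    (-(2**(bits - 1)))..(2**(bits - 1) - 1)
  else
    0..(2**bits - 1)
  end
end

ranges = {
  "int8"   => int_range(8,  signed: true),
  "int16"  => int_range(16, signed: true),
  "int32"  => int_range(32, signed: true),
  "int64"  => int_range(64, signed: true),
  "uint8"  => int_range(8,  signed: false),
  "uint16" => int_range(16, signed: false),
  "uint32" => int_range(32, signed: false),
  "uint64" => int_range(64, signed: false)
}

# True when value is an Integer inside the given range.
def fits?(value, range)
  value.is_a?(Integer) && range.cover?(value)
end

ranges.each { |type, range| puts "#{type}: #{range.min} to #{range.max}" }
puts fits?(300, ranges["uint8"])
```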
319
234
 
320
- Parquet.write_rows(rows.each, schema: schema, write_to: "timestamps.parquet")
235
+ ### Date and Time Types
321
236
 
322
- # Reading back - timezone presence determines UTC vs local
323
- Parquet.each_row("timestamps.parquet") do |row|
324
- # row["timestamp_utc"] => Time object in UTC
325
- # row["timestamp_local"] => Time object in local timezone
326
- # row["timestamp_default"] => Time object in local timezone
327
- end
328
-
329
- # If you need to preserve specific timezone information, store it separately:
330
- schema_with_tz = Parquet::Schema.define do
331
- field :timestamp, :timestamp_millis, has_timezone: true # Store as UTC
332
- field :original_timezone, :string # Store timezone as string
333
- end
237
+ ```ruby
238
+ schema = [
239
+ # Date (days since Unix epoch)
240
+ { "date" => "date32" },
241
+
242
+ # Timestamps (with different precisions)
243
+ { "created_sec" => "timestamp_second" },
244
+ { "created_ms" => "timestamp_millis" }, # Most common
245
+ { "created_us" => "timestamp_micros" },
246
+ { "created_ns" => "timestamp_nanos" },
247
+
248
+ # Time of day (without date)
249
+ { "time_ms" => "time_millis" }, # Milliseconds since midnight
250
+ { "time_us" => "time_micros" } # Microseconds since midnight
251
+ ]
334
252
  ```
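An illustrative sketch that is not part of the released README: the comments above describe `date32` as days since the Unix epoch and the time-of-day types as milliseconds or microseconds since midnight. The arithmetic below shows what those stored numbers look like for one sample date and time. It illustrates the encodings only; the gem's own conversion code is not shown in this diff.

```ruby
require "date"

# date32: whole days since 1970-01-01.
epoch = Date.new(1970, 1, 1)
days_since_epoch = (Date.new(2024, 1, 1) - epoch).to_i

# time_millis / time_micros: elapsed time since midnight.
t = Time.utc(2024, 1, 1, 12, 30, 15)
seconds_since_midnight = t.hour * 3600 + t.min * 60 + t.sec
millis_since_midnight = seconds_since_midnight * 1000 + t.usec / 1000
micros_since_midnight = seconds_since_midnight * 1_000_000 + t.usec

# timestamp_millis: milliseconds since the Unix epoch.
epoch_millis = (t.to_r * 1000).to_i

puts days_since_epoch
puts millis_since_midnight
puts micros_since_midnight
puts epoch_millis
```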
335
253
 
336
- ## Architecture
254
+ ### Decimal Type (Financial Data)
337
255
 
338
- This library uses a modular, trait-based architecture that separates language-agnostic Parquet operations from Ruby-specific bindings:
256
+ For exact decimal arithmetic (no floating-point errors):
339
257
 
340
- - **parquet-core**: Language-agnostic core functionality for Parquet file operations
341
- - Pure Rust implementation without Ruby dependencies
342
- - Traits for customizable I/O operations (`ChunkReader`) and value conversion (`ValueConverter`)
343
- - Efficient Arrow-based reader and writer implementations
344
-
345
- - **parquet-ruby-adapter**: Ruby-specific adapter layer
346
- - Implements core traits for Ruby integration
347
- - Handles Ruby value conversion through the `ValueConverter` trait
348
- - Manages Ruby I/O objects through the `ChunkReader` trait
258
+ ```ruby
259
+ require "bigdecimal"
260
+
261
+ schema = [
262
+ # Financial amounts with 2 decimal places
263
+ { "price" => "decimal", "precision" => 10, "scale" => 2 }, # Up to 99,999,999.99
264
+ { "balance" => "decimal", "precision" => 15, "scale" => 2 }, # Larger amounts
349
265
 
350
- - **parquet gem**: Ruby FFI bindings
351
- - Provides high-level Ruby API
352
- - Manages memory safety between Ruby and Rust
353
- - Supports both file-based and IO-based operations
266
+ # High-precision calculations
267
+ { "rate" => "decimal", "precision" => 10, "scale" => 8 } # 8 decimal places
268
+ ]
354
269
 
355
- This architecture enables:
356
- - Clear separation of concerns between core functionality and language bindings
357
- - Easy testing of core logic without Ruby dependencies
358
- - Potential reuse of core functionality for other language bindings
359
- - Type-safe interfaces through Rust's trait system
270
+ # Use BigDecimal for exact values
271
+ data = [[
272
+ BigDecimal("19.99"),
273
+ BigDecimal("1234567.89"),
274
+ BigDecimal("0.00000123")
275
+ ]]
276
+ ```
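An illustrative sketch that is not part of the released README: precision is the total number of digits and scale is the number of digits after the decimal point, so a value fits when its scaled form is a whole number with at most `precision` digits. The helper `fits_decimal?` is hypothetical, and how the gem handles values that do not fit is not shown in this diff.

```ruby
require "bigdecimal"

# True when value can be represented exactly with the given precision
# (total digits) and scale (digits after the decimal point).
def fits_decimal?(value, precision:, scale:)
  unscaled = value * (BigDecimal(10)**scale)
  unscaled.frac.zero? && unscaled.abs < BigDecimal(10)**precision
end

puts fits_decimal?(BigDecimal("99999999.99"), precision: 10, scale: 2)  # largest value that fits
puts fits_decimal?(BigDecimal("100000000.00"), precision: 10, scale: 2) # one digit too many
puts fits_decimal?(BigDecimal("19.999"), precision: 10, scale: 2)       # too many decimal places

# Why BigDecimal rather than Float for money:
puts 0.1 + 0.2 == 0.3
puts BigDecimal("0.1") + BigDecimal("0.2") == BigDecimal("0.3")
```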
360
277
 
361
- ### Schema DSL for Complex Data Types
278
+ ## Complex Data Structures
362
279
 
363
- In addition to the hash-based schema definition shown above, this library provides a more expressive DSL for defining complex schemas with nested structures:
280
+ The library includes a powerful Schema DSL for defining nested data:
364
281
 
365
- ```ruby
366
- require "parquet"
282
+ ### Using the Schema DSL
367
283
 
368
- # Define a complex schema using the Schema DSL
284
+ ```ruby
369
285
  schema = Parquet::Schema.define do
370
- field :id, :int64, nullable: false # Required field
371
- field :name, :string # Optional field (nullable: true is default)
286
+ # Simple fields
287
+ field :id, :int64, nullable: false # Required field
288
+ field :name, :string # Optional by default
372
289
 
373
- # Nested struct
290
+ # Nested structure
374
291
  field :address, :struct do
375
292
  field :street, :string
376
293
  field :city, :string
377
- field :zip, :string
378
- field :coordinates, :struct do
379
- field :latitude, :double
380
- field :longitude, :double
294
+ field :location, :struct do
295
+ field :lat, :double
296
+ field :lng, :double
381
297
  end
382
298
  end
383
299
 
384
- # List of primitives
385
- field :scores, :list, item: :float
300
+ # Lists
301
+ field :tags, :list, item: :string
302
+ field :scores, :list, item: :int32
303
+
304
+ # Maps (dictionaries)
305
+ field :metadata, :map, key: :string, value: :string
386
306
 
387
- # List of structs
307
+ # Complex combinations
388
308
  field :contacts, :list, item: :struct do
389
309
  field :name, :string
390
- field :phone, :string
310
+ field :email, :string
391
311
  field :primary, :boolean
392
312
  end
393
-
394
- # Map with string values
395
- field :metadata, :map, key: :string, value: :string
396
-
397
- # Map with struct values
398
- field :properties, :map, key: :string, value: :struct do
399
- field :count, :int32
400
- field :description, :string
401
- end
402
-
403
- # Nested lists (list of lists of strings)
404
- field :nested_lists, :list, item: :list do
405
- field :item, :string # REQUIRED: Inner item field MUST be named 'item' for nested lists
406
- end
407
-
408
- # Map of lists
409
- field :map_of_lists, :map, key: :string, value: :list do
410
- field :item, :int32 # REQUIRED: List items in maps MUST be named 'item'
411
- end
412
- end
413
-
414
- ### Nested Lists
415
-
416
- When working with nested lists (a list of lists), there are specific requirements:
417
-
418
- 1. Using the Schema DSL:
419
- ```ruby
420
- # A list of lists of strings
421
- field :nested_lists, :list, item: :list do
422
- field :item, :string # For nested lists, inner item MUST be named 'item'
423
313
  end
424
314
  ```
425
315
 
426
- 2. Using hash-based schema format:
427
- ```ruby
428
- # A list of lists of integers
429
- { "nested_numbers" => "list<list<int32>>" }
430
- ```
316
+ ### Writing Complex Data
431
317
 
432
- The data for nested lists is structured as an array of arrays:
433
318
  ```ruby
434
- # Data for the nested_lists field
435
- [["a", "b"], ["c", "d", "e"], []] # Last one is an empty inner list
436
- ```
437
-
438
- ### Decimal Data Type
439
-
440
- Parquet supports decimal numbers with configurable precision and scale, which is essential for financial applications where exact decimal representation is critical. The library seamlessly converts between Ruby's `BigDecimal` and Parquet's decimal type.
441
-
442
- #### Decimal Precision and Scale
443
-
444
- When working with decimal fields, you need to understand two key parameters:
445
-
446
- - **Precision**: The total number of significant digits (both before and after the decimal point)
447
- - **Scale**: The number of digits after the decimal point
448
-
449
- The rules for defining decimals are:
319
+ data = [[
320
+ 1, # id
321
+ "Alice Johnson", # name
322
+ { # address
323
+ "street" => "123 Main St",
324
+ "city" => "Springfield",
325
+ "location" => {
326
+ "lat" => 40.7128,
327
+ "lng" => -74.0060
328
+ }
329
+ },
330
+ ["ruby", "parquet", "data"], # tags
331
+ [85, 92, 88], # scores
332
+ { "dept" => "Engineering" }, # metadata
333
+ [ # contacts
334
+ { "name" => "Bob", "email" => "bob@example.com", "primary" => true },
335
+ { "name" => "Carol", "email" => "carol@example.com", "primary" => false }
336
+ ]
337
+ ]]
450
338
 
451
- ```ruby
452
- # No precision/scale specified - uses maximum precision (38) with scale 0
453
- field :amount1, :decimal # Equivalent to INTEGER with 38 digits
339
+ Parquet.write_rows(data.each, schema: schema, write_to: "complex.parquet")
340
+ ```
454
341
 
455
- # Only precision specified - scale defaults to 0
456
- field :amount2, :decimal, precision: 10 # 10 digits, no decimal places
342
+ ## ⚠️ Important Limitations
457
343
 
458
- # Only scale specified - uses maximum precision (38)
459
- field :amount3, :decimal, scale: 2 # 38 digits with 2 decimal places
344
+ ### Timezone Handling in Parquet
460
345
 
461
- # Both precision and scale specified
462
- field :amount4, :decimal, precision: 10, scale: 2 # 10 digits with 2 decimal places
463
- ```
346
+ The Parquet specification supports only two kinds of timestamp, so a specific timezone offset or name cannot be stored:
464
347
 
465
- #### Financial Data Example
348
+ 1. **UTC-normalized**: Any timestamp with timezone info (including "+09:00" or "America/New_York") is converted to UTC
349
+ 2. **Local/unzoned**: Timestamps without timezone info are stored as-is
466
350
 
467
- Here's a practical example for a financial application:
351
+ **For UTC-normalized timestamps, the original timezone information is permanently lost.** This is not a limitation of this library but of the Parquet format itself.
468
352
 
469
353
  ```ruby
470
- require "parquet"
471
- require "bigdecimal"
472
-
473
- # Schema for financial transactions
474
354
  schema = Parquet::Schema.define do
475
- field :transaction_id, :string, nullable: false
476
- field :timestamp, :timestamp_millis, nullable: false
477
- field :amount, :decimal, precision: 12, scale: 2 # Supports up to 10^10 with 2 decimal places
478
- field :balance, :decimal, precision: 16, scale: 2 # Larger precision for running balances
479
- field :currency, :string
480
- field :exchange_rate, :decimal, precision: 10, scale: 6 # 6 decimal places for forex rates
481
- field :fee, :decimal, precision: 8, scale: 2, nullable: true # Optional fee
482
- field :category, :string
483
- end
355
+ # These BOTH store in UTC - timezone info is lost!
356
+ field :timestamp_utc, :timestamp_millis, timezone: "UTC"
357
+ field :timestamp_tokyo, :timestamp_millis, timezone: "+09:00"
484
358
 
485
- # Sample financial data
486
- transactions = [
487
- [
488
- "T-12345",
489
- Time.now,
490
- BigDecimal("1256.99"), # amount (directly using BigDecimal)
491
- BigDecimal("10250.25"), # balance
492
- "USD",
493
- BigDecimal("1.0"), # exchange_rate
494
- BigDecimal("2.50"), # fee
495
- "Groceries"
496
- ],
497
- [
498
- "T-12346",
499
- Time.now - 86400, # yesterday
500
- BigDecimal("-89.50"), # negative amount for withdrawal
501
- BigDecimal("10160.75"), # updated balance
502
- "USD",
503
- BigDecimal("1.0"), # exchange_rate
504
- nil, # no fee
505
- "Transportation"
506
- ],
507
- [
508
- "T-12347",
509
- Time.now - 172800, # two days ago
510
- BigDecimal("250.00"), # amount
511
- BigDecimal("10410.75"), # balance
512
- "EUR", # different currency
513
- BigDecimal("1.05463"), # exchange_rate
514
- BigDecimal("1.75"), # fee
515
- "Entertainment"
516
- ]
517
- ]
359
+ # This stores as local time (no timezone)
360
+ field :timestamp_local, :timestamp_millis
361
+ end
518
362
 
519
- # Write financial data to Parquet file
520
- Parquet.write_rows(transactions.each, schema: schema, write_to: "financial_data.parquet")
521
-
522
- # Read back transactions
523
- Parquet.each_row("financial_data.parquet") do |transaction|
524
- # Access decimal fields as BigDecimal objects
525
- puts "Transaction: #{transaction['transaction_id']}"
526
- puts " Amount: #{transaction['currency']} #{transaction['amount']}"
527
- puts " Balance: $#{transaction['balance']}"
528
- puts " Fee: #{transaction['fee'] || 'No fee'}"
529
-
530
- # You can perform precise decimal calculations
531
- if transaction['currency'] != 'USD'
532
- usd_amount = transaction['amount'] * transaction['exchange_rate']
533
- puts " USD Equivalent: $#{usd_amount.round(2)}"
534
- end
363
+ # If you need timezone preservation, store it separately:
364
+ schema = Parquet::Schema.define do
365
+ field :timestamp, :timestamp_millis, has_timezone: true # UTC storage
366
+ field :original_tz, :string # "America/New_York"
535
367
  end
536
368
  ```
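An illustrative sketch that is not part of the released README: the round trip below uses only Ruby's core `Time` to show what UTC normalization discards and how a separately stored offset restores the wall-clock time. The variable names are hypothetical. Core `Time#getlocal` accepts fixed offsets such as `"+09:00"`; resolving a named zone such as `"America/New_York"` would need a timezone library, which is outside this sketch.

```ruby
tokyo_noon = Time.new(2024, 1, 1, 12, 0, 0, "+09:00")

# What a UTC-normalized column keeps: the same instant, with offset zero.
stored = tokyo_noon.getutc

# What would go in the separate string column.
original_tz = "+09:00"

# Reading back: reapply the stored offset to recover the wall-clock time.
restored = stored.getlocal(original_tz)

puts stored
puts restored
```

The instant is identical in all three objects; only the offset used for display differs.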
537
369
 
538
- #### Decimal Type Storage Considerations
370
+ ## Performance Tips
539
371
 
540
- Parquet optimizes storage based on the precision:
541
- - For precision 9: Uses 4-byte INT32
542
- - For precision 18: Uses 8-byte INT64
543
- - For precision 38: Uses 16-byte BYTE_ARRAY
372
+ 1. **Use column-wise reading** when you need only a few columns from wide tables
373
+ 2. **Specify columns parameter** to avoid reading unnecessary data
374
+ 3. **Choose appropriate batch sizes**:
375
+ - Larger batches = better throughput but more memory
376
+ - Smaller batches = less memory but more overhead
377
+ 4. **Pre-sort data** by commonly filtered columns for better compression
544
378
 
545
- Choose appropriate precision and scale for your data to optimize storage while ensuring adequate range:
546
379
 
547
- ```ruby
548
- # Banking examples
549
- field :account_balance, :decimal, precision: 16, scale: 2 # Up to 14 digits before decimal point
550
- field :interest_rate, :decimal, precision: 8, scale: 6 # Rate with 6 decimal places (e.g., 0.015625)
380
+ ## Memory Management
551
381
 
552
- # E-commerce examples
553
- field :product_price, :decimal, precision: 10, scale: 2 # Product price
554
- field :shipping_weight, :decimal, precision: 6, scale: 3 # Weight in kg with 3 decimal places
382
+ Control memory usage with flush thresholds:
555
383
 
556
- # Analytics examples
557
- field :conversion_rate, :decimal, precision: 5, scale: 4 # Rate like 0.0123
558
- field :daily_revenue, :decimal, precision: 14, scale: 2 # Daily revenue with 2 decimal places
384
+ ```ruby
385
+ Parquet.write_rows(huge_dataset.each,
386
+ schema: schema,
387
+ write_to: "output.parquet",
388
+ batch_size: 1000, # Rows per batch (must be positive) before a flush is considered
389
+ flush_threshold: 32 * 1024**2 # Flush if batch exceeds 32MB
390
+ )
559
391
  ```
560
392
 
561
- ### Sample Data with Nested Structures
393
+ Write batch and sample sizes are bounded before buffer allocation. Very large
394
+ batch sizes are rejected, and wide schemas have a lower effective batch cap so
395
+ the writer cannot reserve unbounded per-column value slots.
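An illustrative sketch that is not part of the released README: a back-of-envelope estimate of how many rows fit under a flush threshold can help when choosing `batch_size`. The per-row size heuristic below (8 bytes per number, byte length per string) is an assumption made for the example. It is not the gem's estimator, and the gem's actual caps are not stated in this diff.

```ruby
flush_threshold = 32 * 1024**2 # 32 MiB expressed in bytes

sample = [
  [1, "Alice", 95.5],
  [2, "Bob", 82.3],
  [3, "Charlie", 88.7]
]

# Crude per-row size: 8 bytes per number plus the byte length of each string.
def approx_row_bytes(row)
  row.sum { |v| v.is_a?(String) ? v.bytesize : 8 }
end

avg_row_bytes = sample.sum { |row| approx_row_bytes(row) } / sample.length.to_f
rows_per_flush = (flush_threshold / avg_row_bytes).floor

puts "average row size: #{avg_row_bytes} bytes"
puts "rows under the threshold: #{rows_per_flush}"
```

An estimate like this only bounds the raw values; in-memory buffers and encoding overhead will make the real figure lower.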
562
396
 
563
- Here's an example showing how to use the schema defined earlier with sample data:
397
+ ## Architecture
564
398
 
565
- ```ruby
566
- # Sample data with nested structures
567
- data = [
568
- [
569
- 1, # id
570
- "John Doe", # name
571
- { # address (struct)
572
- "street" => "123 Main St",
573
- "city" => "Springfield",
574
- "zip" => "12345",
575
- "coordinates" => {
576
- "latitude" => 37.7749,
577
- "longitude" => -122.4194
578
- }
579
- },
580
- [85.5, 92.0, 78.5], # scores (list of floats)
581
- [ # contacts (list of structs)
582
- { "name" => "Contact 1", "phone" => "555-1234", "primary" => true },
583
- { "name" => "Contact 2", "phone" => "555-5678", "primary" => false }
584
- ],
585
- { "created" => "2023-01-01", "status" => "active" }, # metadata (map)
586
- { # properties (map of structs)
587
- "feature1" => { "count" => 5, "description" => "Main feature" },
588
- "feature2" => { "count" => 3, "description" => "Secondary feature" }
589
- },
590
- [["a", "b"], ["c", "d", "e"]], # nested_lists (a list of lists of strings)
591
- { # map_of_lists
592
- "group1" => [1, 2, 3],
593
- "group2" => [4, 5, 6]
594
- }
595
- ]
596
- ]
399
+ This gem uses a modular architecture:
597
400
 
598
- # Write to a parquet file using the schema
599
- Parquet.write_rows(data.each, schema: schema, write_to: "complex_data.parquet")
401
+ - **parquet-core**: Language-agnostic Rust core for Parquet operations
402
+ - **parquet-ruby-adapter**: Ruby-specific FFI adapter layer
403
+ - **parquet gem**: High-level Ruby API
600
404
 
601
- # Read back the data
602
- Parquet.each_row("complex_data.parquet") do |row|
603
- puts row.inspect
604
- end
605
- ```
405
+ Take a look at [ARCH.md](./ARCH.md)
406
+
407
+ ## Contributing
606
408
 
607
- The Schema DSL supports:
409
+ Bug reports and pull requests are welcome on GitHub at https://github.com/njaremko/parquet-ruby.
608
410
 
609
- - **Primitive types**: All standard Parquet types (`int32`, `string`, etc.)
610
- - **Complex types**: Structs, lists, and maps with arbitrary nesting
611
- - **Nullability control**: Specify which fields can contain null values with `nullable: false/true`
612
- - **List item nullability**: Control whether list items can be null with `item_nullable: false/true`
613
- - **Map key/value nullability**: Control whether map keys or values can be null with `value_nullable: false/true`
411
+ ## License
614
412
 
615
- Note: When using List and Map types, you need to provide at least:
616
- - For lists: The `item:` parameter specifying the item type
617
- - For maps: Both `key:` and `value:` parameters specifying key and value types
413
+ The gem is available as open source under the terms of the MIT License.
Binary file
Binary file
Binary file
Binary file
@@ -116,8 +116,12 @@ module Parquet
116
116
  key_type = kwargs[:key]
117
117
  value_type = kwargs[:value]
118
118
  raise ArgumentError, "map field `#{name}` requires `key:` and `value:`" if key_type.nil? || value_type.nil?
119
- # Pass key_nullable and value_nullable if provided, otherwise use true as default
120
- key_nullable = kwargs[:key_nullable].nil? ? true : !!kwargs[:key_nullable]
119
+ # Map keys are required by the Parquet spec. Reject an explicit nullable
120
+ # key at this boundary rather than letting it fail deep in the writer.
121
+ if kwargs[:key_nullable]
122
+ raise ArgumentError, "map field `#{name}` keys are always required; remove `key_nullable: true`"
123
+ end
124
+ key_nullable = false
121
125
  value_nullable = kwargs[:value_nullable].nil? ? true : !!kwargs[:value_nullable]
122
126
 
123
127
  field_hash[:key] = wrap_subtype(key_type, nullable: key_nullable)
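
Note on the hunk above: the new guard only rejects a truthy `key_nullable`, so `key_nullable: false` and an omitted option are both still accepted, and the key is then always treated as required. A minimal standalone sketch of that behaviour follows. `map_nullability` is a hypothetical helper name used for illustration only; it copies the checks from the diff but is not part of the gem and does not call `wrap_subtype`.

```ruby
# Hypothetical mirror of the map-field option handling shown in the hunk above.
# `map_nullability` is an illustrative name, not a method provided by the gem.
def map_nullability(name, **kwargs)
  key_type = kwargs[:key]
  value_type = kwargs[:value]
  raise ArgumentError, "map field `#{name}` requires `key:` and `value:`" if key_type.nil? || value_type.nil?

  # Any truthy `key_nullable` is rejected; nil and false both pass through.
  if kwargs[:key_nullable]
    raise ArgumentError, "map field `#{name}` keys are always required; remove `key_nullable: true`"
  end

  # Values keep the previous default: nullable unless explicitly disabled.
  value_nullable = kwargs[:value_nullable].nil? ? true : !!kwargs[:value_nullable]
  { key_nullable: false, value_nullable: value_nullable }
end

p map_nullability("tags", key: :string, value: :int32)
p map_nullability("tags", key: :string, value: :int32, key_nullable: false, value_nullable: false)

begin
  map_nullability("tags", key: :string, value: :int32, key_nullable: true)
rescue ArgumentError => e
  puts e.message
end
```

Callers that passed `key_nullable: true` under 0.7.2 would, on this reading of the diff, now get an `ArgumentError` at schema definition time instead of a failure later in the writer.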
@@ -1,3 +1,3 @@
1
1
  module Parquet
2
- VERSION = "0.7.2"
2
+ VERSION = "0.8.0"
3
3
  end
data/lib/parquet.rbi CHANGED
@@ -18,12 +18,29 @@ module Parquet
18
18
  # ("hash" or "array" or :hash or :array)
19
19
  # - `columns`: When present, only the specified columns will be included in the output.
20
20
  # This is useful for reducing how much data is read and improving performance.
21
+ # - `string_storage`: How string *values* become Ruby strings (default `:copy`). Hash keys
22
+ # (struct field names and top-level column names) are always interned and
23
+ # reused regardless of this setting.
24
+ # - `:copy` allocates a fresh mutable String per value.
25
+ # - `:intern` deduplicates low-cardinality equal values into frozen interned
26
+ # Strings up to a bounded per-read cache, then falls back to frozen copies.
27
+ # A transient copy still happens per value, so it is not a per-value speedup.
28
+ # - `:shared` returns frozen, zero-copy strings backed by Rust memory for
29
+ # short, repeated, low-cardinality values. Each read returns at most the
30
+ # configured number of shared values and only values up to the configured
31
+ # byte size; values past those bounds become frozen copies. New process-wide
32
+ # leaks are also capped by the requested budget and hard process ceilings.
33
+ # All `:shared` results are frozen. Not recommended for high-cardinality or
34
+ # large-blob string columns.
35
+ # Pass a hash to set the `:shared` leak budget, e.g.
36
+ # `{ mode: :shared, max_entries: 16_384, max_value_bytes: 1024 }`.
21
37
  sig do
22
38
  params(
23
39
  input: T.any(String, File, StringIO, IO),
24
40
  result_type: T.nilable(T.any(String, Symbol)),
25
41
  columns: T.nilable(T::Array[String]),
26
- strict: T.nilable(T::Boolean)
42
+ strict: T.nilable(T::Boolean),
43
+ string_storage: T.nilable(T.any(String, Symbol, T::Hash[Symbol, T.untyped]))
27
44
  ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
28
45
  end
29
46
  sig do
@@ -32,10 +49,11 @@ module Parquet
32
49
  result_type: T.nilable(T.any(String, Symbol)),
33
50
  columns: T.nilable(T::Array[String]),
34
51
  strict: T.nilable(T::Boolean),
52
+ string_storage: T.nilable(T.any(String, Symbol, T::Hash[Symbol, T.untyped])),
35
53
  blk: T.nilable(T.proc.params(row: T.any(T::Hash[String, T.untyped], T::Array[T.untyped])).void)
36
54
  ).returns(NilClass)
37
55
  end
38
- def self.each_row(input, result_type: nil, columns: nil, strict: nil, &blk)
56
+ def self.each_row(input, result_type: nil, columns: nil, strict: nil, string_storage: nil, &blk)
39
57
  end
40
58
 
41
59
  # Options:
@@ -44,13 +62,16 @@ module Parquet
44
62
  # ("hash" or "array" or :hash or :array)
45
63
  # - `columns`: When present, only the specified columns will be included in the output.
46
64
  # - `batch_size`: When present, specifies the number of rows per batch
65
+ # - `string_storage`: How string values become Ruby strings (`:copy` (default), `:intern`,
66
+ # or `:shared`). See `each_row` for the semantics of each mode.
47
67
  sig do
48
68
  params(
49
69
  input: T.any(String, File, StringIO, IO),
50
70
  result_type: T.nilable(T.any(String, Symbol)),
51
71
  columns: T.nilable(T::Array[String]),
52
72
  batch_size: T.nilable(Integer),
53
- strict: T.nilable(T::Boolean)
73
+ strict: T.nilable(T::Boolean),
74
+ string_storage: T.nilable(T.any(String, Symbol, T::Hash[Symbol, T.untyped]))
54
75
  ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
55
76
  end
56
77
  sig do
@@ -60,11 +81,12 @@ module Parquet
60
81
  columns: T.nilable(T::Array[String]),
61
82
  batch_size: T.nilable(Integer),
62
83
  strict: T.nilable(T::Boolean),
84
+ string_storage: T.nilable(T.any(String, Symbol, T::Hash[Symbol, T.untyped])),
63
85
  blk:
64
86
  T.nilable(T.proc.params(batch: T.any(T::Hash[String, T::Array[T.untyped]], T::Array[T::Array[T.untyped]])).void)
65
87
  ).returns(NilClass)
66
88
  end
67
- def self.each_column(input, result_type: nil, columns: nil, batch_size: nil, strict: nil, &blk)
89
+ def self.each_column(input, result_type: nil, columns: nil, batch_size: nil, strict: nil, string_storage: nil, &blk)
68
90
  end
69
91
 
70
92
  # Options:
@@ -79,11 +101,19 @@ module Parquet
79
101
  # - `date32`
80
102
  # - `timestamp_millis`, `timestamp_micros`
81
103
  # - `write_to`: String path or IO object to write the parquet file to
82
- # - `batch_size`: Optional batch size for writing (defaults to 1000)
83
- # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
104
+ # - `batch_size`: Optional positive batch size for writing (defaults to 1000, at most 1_000_000
105
+ # for one-column schemas; wide schemas may have a lower safety cap)
106
+ # - `flush_threshold`: Optional threshold in bytes for the writer's in-progress (encoded)
107
+ # buffer before a row group is flushed (defaults to 100MB)
84
108
  # - `compression`: Optional compression type to use (defaults to "zstd")
85
109
  # Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
86
- # - `sample_size`: Optional number of rows to sample for size estimation (defaults to 100)
110
+ # - `sample_size`: Optional positive number of rows to sample for size estimation
111
+ # (defaults to 100, at most 10_000)
112
+ # - `string_cache`: Deduplicate repeated string values while writing. `false` (default)
113
+ # disables it, `true` enables it with a default capacity, and an Integer
114
+ # enables it with that many retained distinct strings (at most 65_536).
115
+ # Retention also skips values larger than 4KB and stops after 16MB of
116
+ # cached string content.
87
117
  sig do
88
118
  params(
89
119
  read_from: T::Enumerator[T::Array[T.untyped]],
@@ -92,7 +122,8 @@ module Parquet
92
122
  batch_size: T.nilable(Integer),
93
123
  flush_threshold: T.nilable(Integer),
94
124
  compression: T.nilable(String),
95
- sample_size: T.nilable(Integer)
125
+ sample_size: T.nilable(Integer),
126
+ string_cache: T.nilable(T.any(T::Boolean, Integer))
96
127
  ).void
97
128
  end
98
129
  def self.write_rows(
@@ -102,7 +133,8 @@ module Parquet
102
133
  batch_size: nil,
103
134
  flush_threshold: nil,
104
135
  compression: nil,
105
- sample_size: nil
136
+ sample_size: nil,
137
+ string_cache: nil
106
138
  )
107
139
  end
108
140
 
@@ -119,18 +151,28 @@ module Parquet
119
151
  # - `timestamp_millis`, `timestamp_micros`
120
152
  # - Looks like [{"column_name" => {"type" => "date32", "format" => "%Y-%m-%d"}}, {"column_name" => "int8"}]
121
153
  # - `write_to`: String path or IO object to write the parquet file to
122
- # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
154
+ # - `flush_threshold`: Optional threshold in bytes for the writer's in-progress (encoded)
155
+ # buffer before a row group is flushed (defaults to 100MB)
123
156
  # - `compression`: Optional compression type to use (defaults to "zstd")
124
157
  # Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
158
+ # - `logger`: Optional Ruby logger for column-write progress messages
125
159
  sig do
126
160
  params(
127
161
  read_from: T::Enumerator[T::Array[T::Array[T.untyped]]],
128
162
  schema: T::Array[T::Hash[String, String]],
129
163
  write_to: T.any(String, IO),
130
164
  flush_threshold: T.nilable(Integer),
131
- compression: T.nilable(String)
165
+ compression: T.nilable(String),
166
+ logger: T.nilable(T.untyped)
132
167
  ).void
133
168
  end
134
- def self.write_columns(read_from, schema:, write_to:, flush_threshold: nil, compression: nil)
169
+ def self.write_columns(
170
+ read_from,
171
+ schema:,
172
+ write_to:,
173
+ flush_threshold: nil,
174
+ compression: nil,
175
+ logger: nil
176
+ )
135
177
  end
136
178
  end
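
Note on the `string_cache` comment in the `write_rows` hunk above: it describes three retention bounds (at most 65_536 distinct strings, no values larger than 4KB, and no more than 16MB of cached content). The sketch below is one way to picture how such bounds could interact. It is an assumption-laden illustration in plain Ruby: `BoundedStringCache`, its constants, and the exact cut-off behaviour (for example, rejecting a capacity of zero, or refusing a value that would push the total past 16MB) are hypothetical and are not taken from the gem, whose cache is implemented in native code.

```ruby
# Illustrative only: a bounded dedup cache with the three limits described in
# the `string_cache` comment. All names here are hypothetical.
class BoundedStringCache
  MAX_VALUE_BYTES = 4 * 1024          # larger values are never retained
  MAX_TOTAL_BYTES = 16 * 1024 * 1024  # retention stops past this much content

  attr_reader :retained_bytes

  def initialize(capacity)
    # Assumed validation: the diff only states the 65_536 upper bound.
    raise ArgumentError, "capacity must be between 1 and 65_536" unless (1..65_536).cover?(capacity)

    @capacity = capacity
    @entries = {}
    @retained_bytes = 0
  end

  # Returns the retained copy for a repeated value, otherwise the value itself.
  def fetch(value)
    cached = @entries[value]
    return cached if cached
    return value unless retainable?(value)

    frozen = value.dup.freeze
    @entries[frozen] = frozen
    @retained_bytes += frozen.bytesize
    frozen
  end

  def size
    @entries.size
  end

  private

  def retainable?(value)
    value.bytesize <= MAX_VALUE_BYTES &&
      @entries.size < @capacity &&
      @retained_bytes + value.bytesize <= MAX_TOTAL_BYTES
  end
end

cache = BoundedStringCache.new(2)
first = cache.fetch("us-east")
again = cache.fetch("us-" + "east")
puts first.equal?(again) # repeated values resolve to one retained object
```

Once any bound is hit, new values in this sketch simply pass through uncached while existing entries keep deduplicating, which matches the "retention stops" wording in the comment as far as the diff reveals it.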
metadata CHANGED
@@ -1,15 +1,29 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: parquet
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.2
4
+ version: 0.8.0
5
5
  platform: aarch64-linux
6
6
  authors:
7
7
  - Nathan Jaremko
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2025-07-05 00:00:00.000000000 Z
11
+ date: 2026-06-25 00:00:00.000000000 Z
12
12
  dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bigdecimal
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
13
27
  - !ruby/object:Gem::Dependency
14
28
  name: rake-compiler
15
29
  requirement: !ruby/object:Gem::Requirement
@@ -42,6 +56,7 @@ files:
42
56
  - lib/parquet/3.2/parquet.so
43
57
  - lib/parquet/3.3/parquet.so
44
58
  - lib/parquet/3.4/parquet.so
59
+ - lib/parquet/4.0/parquet.so
45
60
  - lib/parquet/schema.rb
46
61
  - lib/parquet/version.rb
47
62
  homepage: https://github.com/njaremko/parquet-ruby
@@ -65,7 +80,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
65
80
  version: '3.2'
66
81
  - - "<"
67
82
  - !ruby/object:Gem::Version
68
- version: 3.5.dev
83
+ version: 4.1.dev
69
84
  required_rubygems_version: !ruby/object:Gem::Requirement
70
85
  requirements:
71
86
  - - ">="