parquet 0.2.15-x86_64-darwin

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 2d007ac814c3524e11fbca3d9c92423eca8852ed147678adf1cd20074c22b47b
+   data.tar.gz: c146411e985c37aedd097733390791d98a01f21fd09b784a3843b2cff8f47c05
+ SHA512:
+   metadata.gz: 8c089c06a9df4ce9e5a2470309cf01419eeaa8745c495fcd0e83baa3e57f2099a7a5c6068f4d79afb27195141f628ebdab1c1633135b1e171e0c707c3f9a0923
+   data.tar.gz: e9fc400d937282c37bf7996bb4a27b2cb2390cea707a9d0ee93241a2daf5249e8de57fac047b70166bdd729b4c805835008cfe88fe158108460a96cb2e301370
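These entries are SHA256 and SHA512 digests of the two archives packed inside the `.gem` file. A minimal verification sketch, assuming the gem has been unpacked (a `.gem` is a plain tar archive, so `metadata.gz` and `data.tar.gz` end up next to `checksums.yaml`):

```ruby
require "digest"
require "yaml"

checksums = YAML.load_file("checksums.yaml")

checksums.each do |algorithm, files|          # "SHA256", then "SHA512"
  digest_class = Digest.const_get(algorithm)  # Digest::SHA256 / Digest::SHA512
  files.each do |name, expected|
    actual = digest_class.file(name).hexdigest
    puts "#{algorithm} #{name}: #{actual == expected ? 'OK' : 'MISMATCH'}"
  end
end
```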
data/Gemfile ADDED
@@ -0,0 +1,19 @@
+ source "https://rubygems.org"
+
+ gem "rb_sys", "~> 0.9.56"
+ gem "rake"
+ gem "bigdecimal"
+
+ # Use local version of parquet
+ gemspec
+
+ group :development do
+   # gem "benchmark-ips", "~> 2.12"
+   # gem "polars-df"
+   # gem "duckdb"
+ end
+
+ group :test do
+   gem "csv"
+   gem "minitest", "~> 5.0"
+ end
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Nathan Jaremko
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,197 @@
+ # parquet-ruby
+
+ [![Gem Version](https://badge.fury.io/rb/parquet.svg)](https://badge.fury.io/rb/parquet)
+
+ This project is a Ruby library wrapping the [parquet-rs](https://github.com/apache/parquet-rs) Rust crate.
+
+ ## Usage
+
+ This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.
+
+ ### Row-wise Iteration
+
+ The `each_row` method provides sequential access to individual rows:
+
+ ```ruby
+ require "parquet"
+
+ # Basic usage with default hash output
+ Parquet.each_row("data.parquet") do |row|
+   puts row.inspect # {"id"=>1, "name"=>"name_1"}
+ end
+
+ # Array output for more efficient memory usage
+ Parquet.each_row("data.parquet", result_type: :array) do |row|
+   puts row.inspect # [1, "name_1"]
+ end
+
+ # Select specific columns to reduce I/O
+ Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
+   puts row.inspect
+ end
+
+ # Reading from IO objects
+ File.open("data.parquet", "rb") do |file|
+   Parquet.each_row(file) do |row|
+     puts row.inspect
+   end
+ end
+ ```
+
+ ### Column-wise Iteration
+
+ The `each_column` method reads data in column-oriented batches, which is typically more efficient for analytical queries:
+
+ ```ruby
+ require "parquet"
+
+ # Process columns in batches of 1024 rows
+ Parquet.each_column("data.parquet", batch_size: 1024) do |batch|
+   # With result_type: :hash (default)
+   puts batch.inspect
+   # {
+   #   "id" => [1, 2, ..., 1024],
+   #   "name" => ["name_1", "name_2", ..., "name_1024"]
+   # }
+ end
+
+ # Array output with specific columns
+ Parquet.each_column("data.parquet",
+                     columns: ["id", "name"],
+                     result_type: :array,
+                     batch_size: 1024) do |batch|
+   puts batch.inspect
+   # [
+   #   [1, 2, ..., 1024],        # id column
+   #   ["name_1", "name_2", ...] # name column
+   # ]
+ end
+ ```
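+
+ Column batches also make simple aggregations straightforward. A minimal sketch, assuming `data.parquet` has a numeric `id` column:
+
+ ```ruby
+ require "parquet"
+
+ total = 0
+ Parquet.each_column("data.parquet", columns: ["id"], result_type: :array) do |batch|
+   total += batch[0].sum # batch[0] holds this batch's values for the id column
+ end
+ puts total
+ ```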
+
+ ### Arguments
+
+ Both methods accept these common arguments:
+
+ - `input`: Path string or IO-like object containing Parquet data
+ - `result_type`: Output format (`:hash` or `:array`, defaults to `:hash`)
+ - `columns`: Optional array of column names to read (improves performance)
+
+ Additional arguments for `each_column`:
+
+ - `batch_size`: Number of rows per batch (defaults to an implementation-defined value)
+
+ When no block is given, both methods return an Enumerator.
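+
+ The Enumerator form composes with Ruby's usual enumerable methods. A minimal sketch, assuming `data.parquet` exists:
+
+ ```ruby
+ require "parquet"
+
+ rows = Parquet.each_row("data.parquet")
+ puts rows.first(10).inspect # take only the first ten rows
+ ```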
+
+ ### Writing Row-wise Data
+
+ The `write_rows` method allows you to write data row by row:
+
+ ```ruby
+ require "parquet"
+
+ # Define the schema for your data
+ schema = [
+   { "id" => "int64" },
+   { "name" => "string" },
+   { "score" => "double" }
+ ]
+
+ # Create an enumerator that yields arrays of row values
+ rows = [
+   [1, "Alice", 95.5],
+   [2, "Bob", 82.3],
+   [3, "Charlie", 88.7]
+ ].each
+
+ # Write to a file
+ Parquet.write_rows(rows, schema: schema, write_to: "data.parquet")
+
+ # Write to an IO object
+ File.open("data.parquet", "wb") do |file|
+   Parquet.write_rows(rows, schema: schema, write_to: file)
+ end
+
+ # Optionally specify batch size (default is 1000)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   batch_size: 500
+ )
+
+ # Optionally specify memory threshold for flushing (default is 64MB)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   flush_threshold: 32 * 1024 * 1024 # 32MB
+ )
+
+ # Optionally specify sample size for row size estimation (default is 100)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   sample_size: 200 # Sample 200 rows for size estimation
+ )
+ ```
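+
+ Because `write_rows` takes an Enumerator, it can also stream rows from another source without materializing them all in memory. A minimal sketch, assuming a hypothetical `scores.csv` with `id`, `name`, and `score` headers and the schema from above:
+
+ ```ruby
+ require "csv"
+ require "parquet"
+
+ rows = Enumerator.new do |yielder|
+   CSV.foreach("scores.csv", headers: true) do |r|
+     yielder << [Integer(r["id"]), r["name"], Float(r["score"])]
+   end
+ end
+
+ Parquet.write_rows(rows, schema: schema, write_to: "scores.parquet")
+ ```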
+
+ ### Writing Column-wise Data
+
+ The `write_columns` method provides a more efficient way to write data in column-oriented batches:
+
+ ```ruby
+ require "parquet"
+
+ # Define the schema
+ schema = [
+   { "id" => "int64" },
+   { "name" => "string" },
+   { "score" => "double" }
+ ]
+
+ # Create batches of column data
+ batches = [
+   # First batch
+   [
+     [1, 2],           # id column
+     ["Alice", "Bob"], # name column
+     [95.5, 82.3]      # score column
+   ],
+   # Second batch
+   [
+     [3],         # id column
+     ["Charlie"], # name column
+     [88.7]       # score column
+   ]
+ ]
+
+ # Create an enumerator from the batches
+ columns = batches.each
+
+ # Write to a parquet file with default ZSTD compression
+ Parquet.write_columns(columns, schema: schema, write_to: "data.parquet")
+
+ # Write to a parquet file with specific compression and memory threshold
+ Parquet.write_columns(columns,
+   schema: schema,
+   write_to: "data.parquet",
+   compression: "snappy", # Supported: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   flush_threshold: 32 * 1024 * 1024 # 32MB
+ )
+
+ # Write to an IO object
+ File.open("data.parquet", "wb") do |file|
+   Parquet.write_columns(columns, schema: schema, write_to: file)
+ end
+ ```
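+
+ If your data starts out row-oriented, one way to produce column batches is to slice the rows and transpose each slice. A minimal sketch in plain Ruby, reusing the `schema` above:
+
+ ```ruby
+ rows = [[1, "Alice", 95.5], [2, "Bob", 82.3], [3, "Charlie", 88.7]]
+
+ # Group rows into slices, then flip each slice into per-column arrays.
+ batches = rows.each_slice(2).map(&:transpose)
+
+ Parquet.write_columns(batches.each, schema: schema, write_to: "data.parquet")
+ ```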
+
+ The following data types are supported in the schema:
+
+ - `int8`, `int16`, `int32`, `int64`
+ - `uint8`, `uint16`, `uint32`, `uint64`
+ - `float`, `double`
+ - `string`
+ - `binary`
+ - `boolean`
+ - `date32`
+ - `timestamp_millis`, `timestamp_micros`
+
+ Note: List and Map types are currently not supported.
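+
+ The type signatures in `lib/parquet.rbi` also show a hash form for schema entries that carries a format string alongside the type. A sketch based on that comment (the exact parsing behavior is not documented here):
+
+ ```ruby
+ schema = [
+   { "id" => "int64" },
+   { "joined" => { "type" => "date32", "format" => "%Y-%m-%d" } }
+ ]
+
+ rows = [[1, "2024-01-15"]].each
+ Parquet.write_rows(rows, schema: schema, write_to: "dates.parquet")
+ ```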
data/Rakefile ADDED
@@ -0,0 +1,43 @@
+ # frozen_string_literal: true
+
+ require "rake/testtask"
+ require "rb_sys/extensiontask"
+
+ task default: :test
+
+ GEMSPEC = Gem::Specification.load("parquet.gemspec")
+
+
+ platforms = [
+   "x86_64-linux",
+   "x86_64-linux-musl",
+   "aarch64-linux",
+   "aarch64-linux-musl",
+   "x86_64-darwin",
+   "arm64-darwin"
+ ]
+
+ RbSys::ExtensionTask.new("parquet", GEMSPEC) do |ext|
+   ext.lib_dir = "lib/parquet"
+   ext.ext_dir = "ext/parquet"
+   ext.cross_compile = true
+   ext.cross_platform = platforms
+   ext.cross_compiling do |spec|
+     spec.dependencies.reject! { |dep| dep.name == "rb_sys" }
+     spec.files.reject! { |file| File.fnmatch?("ext/*", file, File::FNM_EXTGLOB) }
+   end
+ end
+
+ Rake::TestTask.new do |t|
+   t.deps << :compile
+   t.test_files = FileList[File.expand_path("test/*_test.rb", __dir__)]
+   t.libs << "lib"
+   t.libs << "test"
+ end
+
+ task :release do
+   sh "bundle exec rake test"
+   sh "mkdir -p pkg"
+   sh "gem build parquet.gemspec -o pkg/parquet-#{Parquet::VERSION}.gem"
+   sh "gem push pkg/parquet-#{Parquet::VERSION}.gem"
+ end
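With this Rakefile, `bundle exec rake` runs the test suite with the extension compiled first (the test task declares `:compile` as a dependency via `t.deps`), while `RbSys::ExtensionTask` supplies the compile and cross-compilation tasks for the listed platforms, following rake-compiler's conventions.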
data/lib/parquet/3.2/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/3.3/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/3.4/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/version.rb ADDED
@@ -0,0 +1,3 @@
+ module Parquet
+   VERSION = "0.2.15"
+ end
data/lib/parquet.rb ADDED
@@ -0,0 +1,10 @@
+ require_relative "parquet/version"
+
+ begin
+   require "parquet/#{RUBY_VERSION.to_f}/parquet"
+ rescue LoadError
+   require "parquet/parquet"
+ end
+
+ module Parquet
+ end
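The `begin`/`rescue` above first loads the precompiled extension matching the running Ruby series (`RUBY_VERSION.to_f` yields e.g. `3.4`, matching `lib/parquet/3.4/parquet.bundle` in this gem's file list) and falls back to a locally built `parquet/parquet` when no precompiled build matches.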
data/lib/parquet.rbi ADDED
@@ -0,0 +1,113 @@
+ # typed: true
+
+ module Parquet
+   # Options:
+   # - `input`: String, File, or IO object containing parquet data
+   # - `result_type`: String or Symbol specifying the output format
+   #   ("hash" or "array", :hash or :array)
+   # - `columns`: When present, only the specified columns will be included in the output.
+   #   This is useful for reducing how much data is read and improving performance.
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String])
+     ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
+   end
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       blk: T.nilable(T.proc.params(row: T.any(T::Hash[String, T.untyped], T::Array[T.untyped])).void)
+     ).returns(NilClass)
+   end
+   def self.each_row(input, result_type: nil, columns: nil, &blk)
+   end
+
+   # Options:
+   # - `input`: String, File, or IO object containing parquet data
+   # - `result_type`: String or Symbol specifying the output format
+   #   ("hash" or "array", :hash or :array)
+   # - `columns`: When present, only the specified columns will be included in the output.
+   # - `batch_size`: When present, specifies the number of rows per batch
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       batch_size: T.nilable(Integer)
+     ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
+   end
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       batch_size: T.nilable(Integer),
+       blk:
+         T.nilable(T.proc.params(batch: T.any(T::Hash[String, T::Array[T.untyped]], T::Array[T::Array[T.untyped]])).void)
+     ).returns(NilClass)
+   end
+   def self.each_column(input, result_type: nil, columns: nil, batch_size: nil, &blk)
+   end
+
+   # Options:
+   # - `read_from`: An Enumerator yielding arrays of values representing each row
+   # - `schema`: Array of hashes specifying column names and types. Supported types:
+   #   - `int8`, `int16`, `int32`, `int64`
+   #   - `uint8`, `uint16`, `uint32`, `uint64`
+   #   - `float`, `double`
+   #   - `string`
+   #   - `binary`
+   #   - `boolean`
+   #   - `date32`
+   #   - `timestamp_millis`, `timestamp_micros`
+   # - `write_to`: String path or IO object to write the parquet file to
+   # - `batch_size`: Optional batch size for writing (defaults to 1000)
+   # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
+   # - `compression`: Optional compression type to use (defaults to "zstd")
+   #   Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   # - `sample_size`: Optional number of rows to sample for size estimation (defaults to 100)
+   sig do
+     params(
+       read_from: T::Enumerator[T::Array[T.untyped]],
+       schema: T::Array[T::Hash[String, String]],
+       write_to: T.any(String, IO),
+       batch_size: T.nilable(Integer),
+       flush_threshold: T.nilable(Integer),
+       compression: T.nilable(String),
+       sample_size: T.nilable(Integer)
+     ).void
+   end
+   def self.write_rows(read_from, schema:, write_to:, batch_size: nil, flush_threshold: nil, compression: nil, sample_size: nil)
+   end
+
+   # Options:
+   # - `read_from`: An Enumerator yielding arrays of column batches
+   # - `schema`: Array of hashes specifying column names and types. Supported types:
+   #   - `int8`, `int16`, `int32`, `int64`
+   #   - `uint8`, `uint16`, `uint32`, `uint64`
+   #   - `float`, `double`
+   #   - `string`
+   #   - `binary`
+   #   - `boolean`
+   #   - `date32`
+   #   - `timestamp_millis`, `timestamp_micros`
+   #   Entries may also take a hash form carrying a format string, e.g.
+   #   [{"column_name" => {"type" => "date32", "format" => "%Y-%m-%d"}}, {"column_name" => "int8"}]
+   # - `write_to`: String path or IO object to write the parquet file to
+   # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
+   # - `compression`: Optional compression type to use (defaults to "zstd")
+   #   Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   sig do
+     params(
+       read_from: T::Enumerator[T::Array[T::Array[T.untyped]]],
+       schema: T::Array[T::Hash[String, String]],
+       write_to: T.any(String, IO),
+       flush_threshold: T.nilable(Integer),
+       compression: T.nilable(String)
+     ).void
+   end
+   def self.write_columns(read_from, schema:, write_to:, flush_threshold: nil, compression: nil)
+   end
+ end
metadata ADDED
@@ -0,0 +1,78 @@
+ --- !ruby/object:Gem::Specification
+ name: parquet
+ version: !ruby/object:Gem::Version
+   version: 0.2.15
+ platform: x86_64-darwin
+ authors:
+ - Nathan Jaremko
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2025-02-04 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: rake-compiler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+ description: |2
+     Parquet is a high-performance Parquet library for Ruby, written in Rust.
+     It wraps the official Apache Rust implementation to provide fast, correct Parquet parsing.
+ email:
+ - nathan@jaremko.ca
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - Gemfile
+ - LICENSE
+ - README.md
+ - Rakefile
+ - lib/parquet.rb
+ - lib/parquet.rbi
+ - lib/parquet/3.2/parquet.bundle
+ - lib/parquet/3.3/parquet.bundle
+ - lib/parquet/3.4/parquet.bundle
+ - lib/parquet/version.rb
+ homepage: https://github.com/njaremko/parquet-ruby
+ licenses:
+ - MIT
+ metadata:
+   homepage_uri: https://github.com/njaremko/parquet-ruby
+   source_code_uri: https://github.com/njaremko/parquet-ruby
+   readme_uri: https://github.com/njaremko/parquet-ruby/blob/main/README.md
+   changelog_uri: https://github.com/njaremko/parquet-ruby/blob/main/CHANGELOG.md
+   documentation_uri: https://www.rubydoc.info/gems/parquet
+   funding_uri: https://github.com/sponsors/njaremko
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.2'
+   - - "<"
+     - !ruby/object:Gem::Version
+       version: 3.5.dev
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.23
+ signing_key:
+ specification_version: 4
+ summary: Parquet library for Ruby, written in Rust
+ test_files: []