parquet 0.2.15-x86_64-darwin

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 2d007ac814c3524e11fbca3d9c92423eca8852ed147678adf1cd20074c22b47b
+   data.tar.gz: c146411e985c37aedd097733390791d98a01f21fd09b784a3843b2cff8f47c05
+ SHA512:
+   metadata.gz: 8c089c06a9df4ce9e5a2470309cf01419eeaa8745c495fcd0e83baa3e57f2099a7a5c6068f4d79afb27195141f628ebdab1c1633135b1e171e0c707c3f9a0923
+   data.tar.gz: e9fc400d937282c37bf7996bb4a27b2cb2390cea707a9d0ee93241a2daf5249e8de57fac047b70166bdd729b4c805835008cfe88fe158108460a96cb2e301370
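These entries are SHA256 and SHA512 digests of the two archives packed inside the `.gem` file. A minimal verification sketch, assuming the gem has been unpacked (a `.gem` is a plain tar archive, so `metadata.gz` and `data.tar.gz` end up next to `checksums.yaml`):

```ruby
require "digest"
require "yaml"

checksums = YAML.load_file("checksums.yaml")

checksums.each do |algorithm, files|          # "SHA256", then "SHA512"
  digest_class = Digest.const_get(algorithm)  # Digest::SHA256 / Digest::SHA512
  files.each do |name, expected|
    actual = digest_class.file(name).hexdigest
    puts "#{algorithm} #{name}: #{actual == expected ? 'OK' : 'MISMATCH'}"
  end
end
```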
data/Gemfile ADDED
@@ -0,0 +1,19 @@
+ source "https://rubygems.org"
+
+ gem "rb_sys", "~> 0.9.56"
+ gem "rake"
+ gem "bigdecimal"
+
+ # Use local version of parquet
+ gemspec
+
+ group :development do
+   # gem "benchmark-ips", "~> 2.12"
+   # gem "polars-df"
+   # gem "duckdb"
+ end
+
+ group :test do
+   gem "csv"
+   gem "minitest", "~> 5.0"
+ end
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Nathan Jaremko
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,197 @@
+ # parquet-ruby
+
+ [![Gem Version](https://badge.fury.io/rb/parquet.svg)](https://badge.fury.io/rb/parquet)
+
+ This project is a Ruby library wrapping the [parquet-rs](https://github.com/apache/parquet-rs) Rust crate.
+
+ ## Usage
+
+ This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.
+
+ ### Row-wise Iteration
+
+ The `each_row` method provides sequential access to individual rows:
+
+ ```ruby
+ require "parquet"
+
+ # Basic usage with default hash output
+ Parquet.each_row("data.parquet") do |row|
+   puts row.inspect # {"id"=>1, "name"=>"name_1"}
+ end
+
+ # Array output for more efficient memory usage
+ Parquet.each_row("data.parquet", result_type: :array) do |row|
+   puts row.inspect # [1, "name_1"]
+ end
+
+ # Select specific columns to reduce I/O
+ Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
+   puts row.inspect
+ end
+
+ # Reading from IO objects
+ File.open("data.parquet", "rb") do |file|
+   Parquet.each_row(file) do |row|
+     puts row.inspect
+   end
+ end
+ ```
+
+ ### Column-wise Iteration
+
+ The `each_column` method reads data in column-oriented batches, which is typically more efficient for analytical queries:
+
+ ```ruby
+ require "parquet"
+
+ # Process columns in batches of 1024 rows
+ Parquet.each_column("data.parquet", batch_size: 1024) do |batch|
+   # With result_type: :hash (default)
+   puts batch.inspect
+   # {
+   #   "id" => [1, 2, ..., 1024],
+   #   "name" => ["name_1", "name_2", ..., "name_1024"]
+   # }
+ end
+
+ # Array output with specific columns
+ Parquet.each_column("data.parquet",
+                     columns: ["id", "name"],
+                     result_type: :array,
+                     batch_size: 1024) do |batch|
+   puts batch.inspect
+   # [
+   #   [1, 2, ..., 1024],        # id column
+   #   ["name_1", "name_2", ...] # name column
+   # ]
+ end
+ ```
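+
+ Column batches also make simple aggregations straightforward. A minimal sketch, assuming `data.parquet` has a numeric `id` column:
+
+ ```ruby
+ require "parquet"
+
+ total = 0
+ Parquet.each_column("data.parquet", columns: ["id"], result_type: :array) do |batch|
+   total += batch[0].sum # batch[0] holds this batch's values for the id column
+ end
+ puts total
+ ```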
+
+ ### Arguments
+
+ Both methods accept these common arguments:
+
+ - `input`: Path string or IO-like object containing Parquet data
+ - `result_type`: Output format (`:hash` or `:array`, defaults to `:hash`)
+ - `columns`: Optional array of column names to read (improves performance)
+
+ Additional arguments for `each_column`:
+
+ - `batch_size`: Number of rows per batch (defaults to an implementation-defined value)
+
+ When no block is given, both methods return an Enumerator.
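+
+ The Enumerator form composes with Ruby's usual enumerable methods. A minimal sketch, assuming `data.parquet` exists:
+
+ ```ruby
+ require "parquet"
+
+ rows = Parquet.each_row("data.parquet")
+ puts rows.first(10).inspect # take only the first ten rows
+ ```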
+
+ ### Writing Row-wise Data
+
+ The `write_rows` method allows you to write data row by row:
+
+ ```ruby
+ require "parquet"
+
+ # Define the schema for your data
+ schema = [
+   { "id" => "int64" },
+   { "name" => "string" },
+   { "score" => "double" }
+ ]
+
+ # Create an enumerator that yields arrays of row values
+ rows = [
+   [1, "Alice", 95.5],
+   [2, "Bob", 82.3],
+   [3, "Charlie", 88.7]
+ ].each
+
+ # Write to a file
+ Parquet.write_rows(rows, schema: schema, write_to: "data.parquet")
+
+ # Write to an IO object
+ File.open("data.parquet", "wb") do |file|
+   Parquet.write_rows(rows, schema: schema, write_to: file)
+ end
+
+ # Optionally specify batch size (default is 1000)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   batch_size: 500
+ )
+
+ # Optionally specify memory threshold for flushing (default is 64MB)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   flush_threshold: 32 * 1024 * 1024 # 32MB
+ )
+
+ # Optionally specify sample size for row size estimation (default is 100)
+ Parquet.write_rows(rows,
+   schema: schema,
+   write_to: "data.parquet",
+   sample_size: 200 # Sample 200 rows for size estimation
+ )
+ ```
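+
+ Because `write_rows` takes an Enumerator, it can also stream rows from another source without materializing them all in memory. A minimal sketch, assuming a hypothetical `scores.csv` with `id`, `name`, and `score` headers and the schema from above:
+
+ ```ruby
+ require "csv"
+ require "parquet"
+
+ rows = Enumerator.new do |yielder|
+   CSV.foreach("scores.csv", headers: true) do |r|
+     yielder << [Integer(r["id"]), r["name"], Float(r["score"])]
+   end
+ end
+
+ Parquet.write_rows(rows, schema: schema, write_to: "scores.parquet")
+ ```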
+
+ ### Writing Column-wise Data
+
+ The `write_columns` method provides a more efficient way to write data in column-oriented batches:
+
+ ```ruby
+ require "parquet"
+
+ # Define the schema
+ schema = [
+   { "id" => "int64" },
+   { "name" => "string" },
+   { "score" => "double" }
+ ]
+
+ # Create batches of column data
+ batches = [
+   # First batch
+   [
+     [1, 2],           # id column
+     ["Alice", "Bob"], # name column
+     [95.5, 82.3]      # score column
+   ],
+   # Second batch
+   [
+     [3],         # id column
+     ["Charlie"], # name column
+     [88.7]       # score column
+   ]
+ ]
+
+ # Create an enumerator from the batches
+ columns = batches.each
+
+ # Write to a parquet file with default ZSTD compression
+ Parquet.write_columns(columns, schema: schema, write_to: "data.parquet")
+
+ # Write to a parquet file with specific compression and memory threshold
+ Parquet.write_columns(columns,
+   schema: schema,
+   write_to: "data.parquet",
+   compression: "snappy", # Supported: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   flush_threshold: 32 * 1024 * 1024 # 32MB
+ )
+
+ # Write to an IO object
+ File.open("data.parquet", "wb") do |file|
+   Parquet.write_columns(columns, schema: schema, write_to: file)
+ end
+ ```
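+
+ If your data starts out row-oriented, one way to produce column batches is to slice the rows and transpose each slice. A minimal sketch in plain Ruby, reusing the `schema` above:
+
+ ```ruby
+ rows = [[1, "Alice", 95.5], [2, "Bob", 82.3], [3, "Charlie", 88.7]]
+
+ # Group rows into slices, then flip each slice into per-column arrays.
+ batches = rows.each_slice(2).map(&:transpose)
+
+ Parquet.write_columns(batches.each, schema: schema, write_to: "data.parquet")
+ ```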
+
+ The following data types are supported in the schema:
+
+ - `int8`, `int16`, `int32`, `int64`
+ - `uint8`, `uint16`, `uint32`, `uint64`
+ - `float`, `double`
+ - `string`
+ - `binary`
+ - `boolean`
+ - `date32`
+ - `timestamp_millis`, `timestamp_micros`
+
+ Note: List and Map types are currently not supported.
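+
+ The type signatures in `lib/parquet.rbi` also show a hash form for schema entries that carries a format string alongside the type. A sketch based on that comment (the exact parsing behavior is not documented here):
+
+ ```ruby
+ schema = [
+   { "id" => "int64" },
+   { "joined" => { "type" => "date32", "format" => "%Y-%m-%d" } }
+ ]
+
+ rows = [[1, "2024-01-15"]].each
+ Parquet.write_rows(rows, schema: schema, write_to: "dates.parquet")
+ ```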
data/Rakefile ADDED
@@ -0,0 +1,43 @@
+ # frozen_string_literal: true
+
+ require "rake/testtask"
+ require "rb_sys/extensiontask"
+
+ task default: :test
+
+ GEMSPEC = Gem::Specification.load("parquet.gemspec")
+
+
+ platforms = [
+   "x86_64-linux",
+   "x86_64-linux-musl",
+   "aarch64-linux",
+   "aarch64-linux-musl",
+   "x86_64-darwin",
+   "arm64-darwin"
+ ]
+
+ RbSys::ExtensionTask.new("parquet", GEMSPEC) do |ext|
+   ext.lib_dir = "lib/parquet"
+   ext.ext_dir = "ext/parquet"
+   ext.cross_compile = true
+   ext.cross_platform = platforms
+   ext.cross_compiling do |spec|
+     spec.dependencies.reject! { |dep| dep.name == "rb_sys" }
+     spec.files.reject! { |file| File.fnmatch?("ext/*", file, File::FNM_EXTGLOB) }
+   end
+ end
+
+ Rake::TestTask.new do |t|
+   t.deps << :compile
+   t.test_files = FileList[File.expand_path("test/*_test.rb", __dir__)]
+   t.libs << "lib"
+   t.libs << "test"
+ end
+
+ task :release do
+   sh "bundle exec rake test"
+   sh "mkdir -p pkg"
+   sh "gem build parquet.gemspec -o pkg/parquet-#{Parquet::VERSION}.gem"
+   sh "gem push pkg/parquet-#{Parquet::VERSION}.gem"
+ end
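With this Rakefile, `bundle exec rake` runs the test suite with the extension compiled first (the test task declares `:compile` as a dependency via `t.deps`), while `RbSys::ExtensionTask` supplies the compile and cross-compilation tasks for the listed platforms, following rake-compiler's conventions.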
data/lib/parquet/3.2/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/3.3/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/3.4/parquet.bundle ADDED (binary file, contents not shown)
data/lib/parquet/version.rb ADDED
@@ -0,0 +1,3 @@
+ module Parquet
+   VERSION = "0.2.15"
+ end
data/lib/parquet.rb ADDED
@@ -0,0 +1,10 @@
+ require_relative "parquet/version"
+
+ begin
+   require "parquet/#{RUBY_VERSION.to_f}/parquet"
+ rescue LoadError
+   require "parquet/parquet"
+ end
+
+ module Parquet
+ end
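The `begin`/`rescue` above first loads the precompiled extension matching the running Ruby series (`RUBY_VERSION.to_f` yields e.g. `3.4`, matching `lib/parquet/3.4/parquet.bundle` in this gem's file list) and falls back to a locally built `parquet/parquet` when no precompiled build matches.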
data/lib/parquet.rbi ADDED
@@ -0,0 +1,113 @@
+ # typed: true
+
+ module Parquet
+   # Options:
+   # - `input`: String, File, or IO object containing parquet data
+   # - `result_type`: String or Symbol specifying the output format
+   #   ("hash" or "array", :hash or :array)
+   # - `columns`: When present, only the specified columns will be included in the output.
+   #   This is useful for reducing how much data is read and improving performance.
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String])
+     ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
+   end
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       blk: T.nilable(T.proc.params(row: T.any(T::Hash[String, T.untyped], T::Array[T.untyped])).void)
+     ).returns(NilClass)
+   end
+   def self.each_row(input, result_type: nil, columns: nil, &blk)
+   end
+
+   # Options:
+   # - `input`: String, File, or IO object containing parquet data
+   # - `result_type`: String or Symbol specifying the output format
+   #   ("hash" or "array", :hash or :array)
+   # - `columns`: When present, only the specified columns will be included in the output.
+   # - `batch_size`: When present, specifies the number of rows per batch
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       batch_size: T.nilable(Integer)
+     ).returns(T::Enumerator[T.any(T::Hash[String, T.untyped], T::Array[T.untyped])])
+   end
+   sig do
+     params(
+       input: T.any(String, File, StringIO, IO),
+       result_type: T.nilable(T.any(String, Symbol)),
+       columns: T.nilable(T::Array[String]),
+       batch_size: T.nilable(Integer),
+       blk:
+         T.nilable(T.proc.params(batch: T.any(T::Hash[String, T::Array[T.untyped]], T::Array[T::Array[T.untyped]])).void)
+     ).returns(NilClass)
+   end
+   def self.each_column(input, result_type: nil, columns: nil, batch_size: nil, &blk)
+   end
+
+   # Options:
+   # - `read_from`: An Enumerator yielding arrays of values representing each row
+   # - `schema`: Array of hashes specifying column names and types. Supported types:
+   #   - `int8`, `int16`, `int32`, `int64`
+   #   - `uint8`, `uint16`, `uint32`, `uint64`
+   #   - `float`, `double`
+   #   - `string`
+   #   - `binary`
+   #   - `boolean`
+   #   - `date32`
+   #   - `timestamp_millis`, `timestamp_micros`
+   # - `write_to`: String path or IO object to write the parquet file to
+   # - `batch_size`: Optional batch size for writing (defaults to 1000)
+   # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
+   # - `compression`: Optional compression type to use (defaults to "zstd")
+   #   Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   # - `sample_size`: Optional number of rows to sample for size estimation (defaults to 100)
+   sig do
+     params(
+       read_from: T::Enumerator[T::Array[T.untyped]],
+       schema: T::Array[T::Hash[String, String]],
+       write_to: T.any(String, IO),
+       batch_size: T.nilable(Integer),
+       flush_threshold: T.nilable(Integer),
+       compression: T.nilable(String),
+       sample_size: T.nilable(Integer)
+     ).void
+   end
+   def self.write_rows(read_from, schema:, write_to:, batch_size: nil, flush_threshold: nil, compression: nil, sample_size: nil)
+   end
+
+   # Options:
+   # - `read_from`: An Enumerator yielding arrays of column batches
+   # - `schema`: Array of hashes specifying column names and types. Supported types:
+   #   - `int8`, `int16`, `int32`, `int64`
+   #   - `uint8`, `uint16`, `uint32`, `uint64`
+   #   - `float`, `double`
+   #   - `string`
+   #   - `binary`
+   #   - `boolean`
+   #   - `date32`
+   #   - `timestamp_millis`, `timestamp_micros`
+   #   Entries may also take a hash form carrying a format string, e.g.
+   #   [{"column_name" => {"type" => "date32", "format" => "%Y-%m-%d"}}, {"column_name" => "int8"}]
+   # - `write_to`: String path or IO object to write the parquet file to
+   # - `flush_threshold`: Optional memory threshold in bytes before flushing (defaults to 64MB)
+   # - `compression`: Optional compression type to use (defaults to "zstd")
+   #   Supported values: "none", "uncompressed", "snappy", "gzip", "lz4", "zstd"
+   sig do
+     params(
+       read_from: T::Enumerator[T::Array[T::Array[T.untyped]]],
+       schema: T::Array[T::Hash[String, String]],
+       write_to: T.any(String, IO),
+       flush_threshold: T.nilable(Integer),
+       compression: T.nilable(String)
+     ).void
+   end
+   def self.write_columns(read_from, schema:, write_to:, flush_threshold: nil, compression: nil)
+   end
+ end
metadata ADDED
@@ -0,0 +1,78 @@
+ --- !ruby/object:Gem::Specification
+ name: parquet
+ version: !ruby/object:Gem::Version
+   version: 0.2.15
+ platform: x86_64-darwin
+ authors:
+ - Nathan Jaremko
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2025-02-04 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: rake-compiler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+ description: |2
+     Parquet is a high-performance Parquet library for Ruby, written in Rust.
+     It wraps the official Apache Rust implementation to provide fast, correct Parquet parsing.
+ email:
+ - nathan@jaremko.ca
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - Gemfile
+ - LICENSE
+ - README.md
+ - Rakefile
+ - lib/parquet.rb
+ - lib/parquet.rbi
+ - lib/parquet/3.2/parquet.bundle
+ - lib/parquet/3.3/parquet.bundle
+ - lib/parquet/3.4/parquet.bundle
+ - lib/parquet/version.rb
+ homepage: https://github.com/njaremko/parquet-ruby
+ licenses:
+ - MIT
+ metadata:
+   homepage_uri: https://github.com/njaremko/parquet-ruby
+   source_code_uri: https://github.com/njaremko/parquet-ruby
+   readme_uri: https://github.com/njaremko/parquet-ruby/blob/main/README.md
+   changelog_uri: https://github.com/njaremko/parquet-ruby/blob/main/CHANGELOG.md
+   documentation_uri: https://www.rubydoc.info/gems/parquet
+   funding_uri: https://github.com/sponsors/njaremko
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.2'
+   - - "<"
+     - !ruby/object:Gem::Version
+       version: 3.5.dev
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.23
+ signing_key:
+ specification_version: 4
+ summary: Parquet library for Ruby, written in Rust
+ test_files: []