parqueteur 1.2.0 → 1.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 056b9208a8bffcd163464dbd2cf276a9b0704e96788b77555d545eb339a4e798
4
- data.tar.gz: 1e20d31b1fc6f198fee42546939ce289d71d66f65ffa66562cdd7841e0f24f61
3
+ metadata.gz: 715ea84521855ea8978e4f944fd6ae07f192bf97fdd40893b4b9bd292a3fe0b5
4
+ data.tar.gz: d05798d68479c37a8d7028cd65ff35438399de4459d84d4797399940c63513f6
5
5
  SHA512:
6
- metadata.gz: fe08a7b282c4ededc08acb5aa9f4b485ead828aee4fd1444e8bb1af80cc56ea8c20411aefe136809f91ad808bee52db261218e8b5e6b7538bfa53d1eb38eb4b5
7
- data.tar.gz: 0fee8ec94698b7b4c9d3a089fd0094a52bd83dfda56d0652f8a5b08dfe84a88b251736e62a9da7f510e0fa3d1842e2551161178ce30b5e0f5c6ee9b903917a2c
6
+ metadata.gz: 1a4d74f311c64f79c6e339ba05e631feb94552268936f58c86e0e9bcf70bb46fec8a94d452501cd395203ea33ab9440e015e4503f255cfcecc7703e6fc8d0a1b
7
+ data.tar.gz: 344ea6420b6c08bbe61f4f534a92c26f563193d3f6dc995dd7eb09d9df5e9f3342971bbd53570d8db0f6b910e457e62e95811c04ba2245578de3b8bb245f7dc1
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- parqueteur (1.2.0)
4
+ parqueteur (1.3.0)
5
5
  red-parquet (~> 5.0)
6
6
 
7
7
  GEM
data/README.md CHANGED
@@ -1,15 +1,29 @@
1
1
  # Parqueteur
2
2
 
3
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/parqueteur`. To experiment with that code, run `bin/console` for an interactive prompt.
3
+ Parqueteur enables you to generate Apache Parquet files from raw data.
4
4
 
5
- TODO: Delete this and the text above, and describe your gem
5
+ ## Dependencies
6
6
 
7
+ Since I only tested Parqueteur on Ubuntu, I don't have any install scripts for other operating systems.
8
+ ### Debian/Ubuntu packages
9
+ - `libgirepository1.0-dev`
10
+ - `libarrow-dev`
11
+ - `libarrow-glib-dev`
12
+ - `libparquet-dev`
13
+ - `libparquet-glib-dev`
14
+
15
+ You can check `scripts/apache-arrow-ubuntu-install.sh` script for a quick way to install all of them.
7
16
  ## Installation
8
17
 
9
18
  Add this line to your application's Gemfile:
10
19
 
11
20
  ```ruby
12
- gem 'parqueteur'
21
+ gem 'parqueteur', '~> 1.0'
22
+ ```
23
+
24
+ > (optional) If you don't want to require Parqueteur globally you can add `require: false` to the Gemfile instruction:
25
+ ```ruby
26
+ gem 'parqueteur', '~> 1.0', require: false
13
27
  ```
14
28
 
15
29
  And then execute:
@@ -22,14 +36,35 @@ Or install it yourself as:
22
36
 
23
37
  ## Usage
24
38
 
25
- TODO: Write usage instructions here
39
+ Parqueteur provides an elegant way to generate Apache Parquet files from a defined schema.
40
+ ```ruby
41
+ require 'parqueteur'
26
42
 
27
- ## Development
43
+ class FooParquetConverter < Parqueteur::Converter
44
+ column :id, :bigint
45
+ column :reference, :string
46
+ end
28
47
 
29
- After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
48
+ data = [
49
+ { 'id' => 1, 'reference' => 'hello world 1' },
50
+ { 'id' => 2, 'reference' => 'hello world 2' },
51
+ { 'id' => 3, 'reference' => 'hello world 3' }
52
+ ]
30
53
 
31
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
54
+ # initialize Converter with Parquet GZIP compression mode
55
+ converter = FooParquetConverter.new(data, compression: :gzip)
56
+
57
+ # write result to file
58
+ converter.write('hello_world.parquet')
59
+
60
+ # in-memory result (StringIO)
61
+ converter.to_io
62
+
63
+ # write to temporary file (Tempfile)
64
+ # don't forget to `close` / `unlink` it after usage
65
+ converter.to_tmpfile
66
+ ```
32
67
 
33
68
  ## Contributing
34
69
 
35
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/parqueteur.
70
+ Bug reports and pull requests are welcome on GitHub at https://github.com/pocketsizesun/parqueteur-ruby.
@@ -3,7 +3,7 @@ require 'parqueteur'
3
3
  require 'securerandom'
4
4
  require 'benchmark'
5
5
 
6
- class Foo < Parqueteur::Converter
6
+ class FooParquetConverter < Parqueteur::Converter
7
7
  column :id, :bigint
8
8
  column :reference, :string
9
9
  column :hash, :map, key: :string, value: :string
@@ -48,5 +48,9 @@ data = 10000.times.collect do |i|
48
48
  end
49
49
  puts "data generation OK"
50
50
 
51
- path = 'tmp/test.parquet'
52
- Foo.convert_to(data, path, compression: :gzip)
51
+ # initialize Converter with Parquet GZIP compression mode
52
+ converter = FooParquetConverter.new(data, compression: :gzip)
53
+
54
+ # write result to file
55
+ converter.write('tmp/example.gzip-compressed.parquet')
56
+ converter.write('tmp/example.no-gzip.parquet', compression: false)
@@ -0,0 +1,56 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+
4
+ class FooParquetConverter < Parqueteur::Converter
5
+ column :id, :bigint
6
+ column :my_string_array, :array, elements: :string
7
+ column :my_date, :date
8
+ column :my_decimal, :decimal, precision: 12, scale: 4
9
+ column :my_int, :integer
10
+ column :my_map, :map, key: :string, value: :string
11
+ column :my_string, :string
12
+ column :my_struct, :struct do
13
+ field :my_struct_str, :string
14
+ field :my_struct_int, :integer
15
+ end
16
+ column :my_time, :time
17
+ column :my_timestamp, :timestamp
18
+ end
19
+
20
+ data = 100.times.collect do |i|
21
+ {
22
+ 'id' => i,
23
+ 'my_string_array' => %w[a b c],
24
+ 'my_date' => Date.today,
25
+ 'my_decimal' => BigDecimal('789000.5678'),
26
+ 'my_int' => rand(1..10),
27
+ 'my_map' => { 'a' => 'b' },
28
+ 'my_string' => 'Hello World',
29
+ 'my_struct' => {
30
+ 'my_struct_str' => 'Hello World',
31
+ 'my_struct_int' => 1
32
+ },
33
+ 'my_time' => 3600,
34
+ 'my_timestamp' => Time.now
35
+ }
36
+ end
37
+
38
+ # initialize Converter with Parquet GZIP compression mode
39
+ converter = FooParquetConverter.new(data, compression: :gzip)
40
+
41
+ # write result to file
42
+ converter.write('tmp/hello_world.compressed.parquet')
43
+ converter.write('tmp/hello_world.parquet', compression: false)
44
+
45
+ # in-memory result (StringIO)
46
+ converter.to_io
47
+
48
+ # write to temporary file (Tempfile)
49
+ # don't forget to `close` / `unlink` it after usage
50
+ converter.to_tmpfile
51
+
52
+ # Arrow Table
53
+ table = converter.to_arrow_table
54
+ table.each_record do |record|
55
+ pp record.to_h
56
+ end
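The column types in this example correspond to Arrow physical encodings. The following is a rough sketch in plain Ruby, without the Arrow bindings. It assumes the usual Arrow semantics: `date32` counts days since 1970-01-01, `time32` with a `:second` unit counts seconds since midnight, and `decimal(precision, scale)` stores an unscaled integer of at most `precision` digits. The helper names `days_since_epoch`, `seconds_since_midnight` and `fits_decimal?` are hypothetical and not part of Parqueteur.

```ruby
require 'date'

# Hypothetical helpers, for illustration only (not part of Parqueteur).

# date32: whole days elapsed since the Unix epoch.
def days_since_epoch(date)
  (date - Date.new(1970, 1, 1)).to_i
end

# time32 with a :second unit: seconds elapsed since midnight.
def seconds_since_midnight(hour, min, sec)
  hour * 3600 + min * 60 + sec
end

# decimal(precision, scale): the value times 10**scale must be an integer
# with at most `precision` digits.
def fits_decimal?(value, precision:, scale:)
  unscaled = Rational(value) * 10**scale
  unscaled.denominator == 1 && unscaled.abs < 10**precision
end

puts days_since_epoch(Date.new(2021, 10, 3))                 # 18903
puts seconds_since_midnight(1, 0, 0)                         # 3600, as in 'my_time'
puts fits_decimal?('789000.5678', precision: 12, scale: 4)   # true
```

Under these assumptions, the `'my_time' => 3600` value in the example reads as 01:00:00, and `BigDecimal('789000.5678')` fits the declared `precision: 12, scale: 4`.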
@@ -41,12 +41,14 @@ module Parqueteur
41
41
  @compression = kwargs.fetch(:compression, nil)&.to_sym
42
42
  end
43
43
 
44
- def split(size)
44
+ def split(size, batch_size: nil, compression: nil)
45
45
  Enumerator.new do |arr|
46
+ options = {
47
+ batch_size: batch_size || @batch_size,
48
+ compression: compression || @compression
49
+ }
46
50
  @input.each_slice(size) do |records|
47
- local_converter = self.class.new(
48
- records, batch_size: @batch_size, compression: @compression
49
- )
51
+ local_converter = self.class.new(records, **options)
50
52
  file = local_converter.to_tmpfile
51
53
  arr << file
52
54
  file.close
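`split` wraps `each_slice` in an `Enumerator`, so each chunk file is only produced when the consumer asks for it. Below is a minimal stand-alone sketch of that pattern, assuming plain text lines instead of Parquet output. `each_chunk_file` is a hypothetical name, and closing plus unlinking each file right after it has been yielded is a choice made for this sketch, not something the hunk above confirms.

```ruby
require 'tempfile'

# Hypothetical helper (not part of Parqueteur): lazily yields one Tempfile
# per slice of `records`.
def each_chunk_file(records, size)
  Enumerator.new do |yielder|
    records.each_slice(size) do |slice|
      file = Tempfile.new('chunk')
      begin
        file.write(slice.join("\n"))
        file.rewind
        yielder << file # the consumer uses the file here
      ensure
        file.close
        file.unlink     # the file is gone once the consumer moves on
      end
    end
  end
end

chunk_sizes = each_chunk_file((1..10).to_a, 4).map { |f| f.read.lines.size }
p chunk_sizes # [4, 4, 2]
```

Because each file is closed as soon as the consumer's block returns, callers of such an enumerator should read or copy the file inside the block rather than keep a reference to it.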
@@ -55,23 +57,31 @@ module Parqueteur
55
57
  end
56
58
  end
57
59
 
58
- def split_by_io(size)
60
+ def split_by_io(size, batch_size: nil, compression: nil)
59
61
  Enumerator.new do |arr|
62
+ options = {
63
+ batch_size: batch_size || @batch_size,
64
+ compression: compression || @compression
65
+ }
60
66
  @input.each_slice(size) do |records|
61
- local_converter = self.class.new(records)
67
+ local_converter = self.class.new(records, **options)
62
68
  arr << local_converter.to_io
63
69
  end
64
70
  end
65
71
  end
66
72
 
67
- def write(path)
73
+ def write(path, batch_size: nil, compression: nil)
74
+ compression = @compression if compression.nil?
75
+ batch_size = @batch_size if batch_size.nil?
68
76
  arrow_schema = self.class.columns.arrow_schema
69
77
  writer_properties = Parquet::WriterProperties.new
70
- writer_properties.set_compression(@compression) unless @compression.nil?
78
+ if !compression.nil? && compression != false
79
+ writer_properties.set_compression(compression)
80
+ end
71
81
 
72
82
  Arrow::FileOutputStream.open(path, false) do |output|
73
83
  Parquet::ArrowFileWriter.open(arrow_schema, output, writer_properties) do |writer|
74
- @input.each_slice(@batch_size) do |records|
84
+ @input.each_slice(batch_size) do |records|
75
85
  arrow_table = build_arrow_table(records)
76
86
  writer.write_table(arrow_table, 1024)
77
87
  end
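The hunks above resolve per-call options in two different ways: `split` and `split_by_io` use `compression || @compression`, while `write` only falls back when the argument is `nil`. The two differ when a caller passes `compression: false`. A small sketch of the difference, using hypothetical stand-alone functions rather than the gem's methods:

```ruby
# Hypothetical stand-ins for the two fallback styles seen in the diff.

# `||` treats both nil and false as "not given".
def resolve_with_or(value, default)
  value || default
end

# A nil check keeps an explicit `false`, so it can mean "no compression".
def resolve_with_nil_check(value, default)
  value.nil? ? default : value
end

default = :gzip
p resolve_with_or(nil, default)           # :gzip
p resolve_with_or(false, default)         # :gzip
p resolve_with_nil_check(nil, default)    # :gzip
p resolve_with_nil_check(false, default)  # false
```

Read this way, `write(path, compression: false)` produces uncompressed output, whereas `split(size, compression: false)` would, as far as the shown lines go, still fall back to the converter's default compression.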
@@ -81,32 +91,32 @@ module Parqueteur
81
91
  true
82
92
  end
83
93
 
84
- def to_tmpfile
94
+ def to_tmpfile(options = {})
85
95
  tempfile = Tempfile.new
86
96
  tempfile.binmode
87
- write(tempfile.path)
97
+ write(tempfile.path, **options)
88
98
  tempfile.rewind
89
99
  tempfile
90
100
  end
91
101
 
92
- def to_io
93
- tmpfile = to_tmpfile
102
+ def to_io(options = {})
103
+ tmpfile = to_tmpfile(options)
94
104
  strio = StringIO.new(tmpfile.read)
95
105
  tmpfile.close
96
106
  tmpfile.unlink
97
107
  strio
98
108
  end
99
109
 
100
- def to_arrow_table
101
- file = to_tmpfile
110
+ def to_arrow_table(options = {})
111
+ file = to_tmpfile(options)
102
112
  table = Arrow::Table.load(file.path, format: :parquet)
103
113
  file.close
104
114
  file.unlink
105
115
  table
106
116
  end
107
117
 
108
- def to_blob
109
- to_io.read
118
+ def to_blob(options = {})
119
+ to_tmpfile(options).read
110
120
  end
111
121
 
112
122
  private
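`to_io` and `to_arrow_table` follow the same lifecycle: write to a `Tempfile`, read it back, then `close` and `unlink`. The new `to_blob` reads from `to_tmpfile` directly and, in the lines shown, does not close the file itself. Here is a sketch of that lifecycle with an `ensure` block, using placeholder bytes in place of Parquet data; `with_tmpfile_bytes` is a hypothetical helper, not a Parqueteur method.

```ruby
require 'tempfile'
require 'stringio'

# Hypothetical helper: writes `bytes` to a Tempfile, yields the rewound file
# to the block, and always closes and unlinks it afterwards.
def with_tmpfile_bytes(bytes)
  tempfile = Tempfile.new
  tempfile.binmode
  tempfile.write(bytes)
  tempfile.rewind
  yield tempfile
ensure
  if tempfile
    tempfile.close
    tempfile.unlink
  end
end

# In-memory copy, comparable in spirit to `to_io`.
strio = with_tmpfile_bytes('PAR1') { |file| StringIO.new(file.read) }
puts strio.read
```

The `ensure` clause runs even if the block raises, so the temporary file does not outlive the call.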
@@ -7,14 +7,24 @@ module Parqueteur
7
7
  def self.registered_types
8
8
  @registered_types ||= {
9
9
  array: Parqueteur::Types::ArrayType,
10
+ bigdecimal: Parqueteur::Types::Decimal256Type,
10
11
  bigint: Parqueteur::Types::Int64Type,
11
12
  boolean: Parqueteur::Types::BooleanType,
13
+ date: Parqueteur::Types::Date32Type,
14
+ date32: Parqueteur::Types::Date64Type,
15
+ date64: Parqueteur::Types::Date64Type,
16
+ decimal: Parqueteur::Types::Decimal128Type,
17
+ decimal128: Parqueteur::Types::Decimal128Type,
18
+ decimal256: Parqueteur::Types::Decimal256Type,
12
19
  int32: Parqueteur::Types::Int32Type,
13
20
  int64: Parqueteur::Types::Int64Type,
14
21
  integer: Parqueteur::Types::Int32Type,
15
22
  map: Parqueteur::Types::MapType,
16
23
  string: Parqueteur::Types::StringType,
17
24
  struct: Parqueteur::Types::StructType,
25
+ time: Parqueteur::Types::Time32Type,
26
+ time32: Parqueteur::Types::Time32Type,
27
+ time64: Parqueteur::Types::Time64Type,
18
28
  timestamp: Parqueteur::Types::TimestampType
19
29
  }
20
30
  end
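The registry is a plain symbol-to-class hash, so several DSL names can alias one implementation (`:decimal` and `:decimal128`, `:time` and `:time32`). Note that, as written in this hunk, `:date32` points at `Date64Type` rather than `Date32Type`; the diff does not say whether that is intended. Below is a reduced sketch of alias lookup, with symbols standing in for the type classes. The `resolve_type` helper, the `REGISTERED_TYPES` constant and the `UnknownType` error are hypothetical names for this illustration.

```ruby
# Symbols stand in for the real type classes; this is not the gem's resolver.
class UnknownType < StandardError; end

REGISTERED_TYPES = {
  date: :Date32Type,
  date64: :Date64Type,
  decimal: :Decimal128Type,
  decimal128: :Decimal128Type,
  decimal256: :Decimal256Type,
  time: :Time32Type,
  time32: :Time32Type,
  time64: :Time64Type
}.freeze

# Hypothetical helper: looks up a DSL name, failing loudly on unknown names.
def resolve_type(name)
  REGISTERED_TYPES.fetch(name.to_sym) do
    raise UnknownType, "unknown column type: #{name.inspect}"
  end
end

p resolve_type(:decimal) # :Decimal128Type
p resolve_type('time')   # :Time32Type
```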
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Date32Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Date32ArrayBuilder.build(values)
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Date32DataType.new
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Date64Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Date64ArrayBuilder.build([values])
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Date64DataType.new
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Decimal128Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Decimal128ArrayBuilder.build(@arrow_type, values)
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Decimal128DataType.new(
12
+ precision: @options.fetch(:precision),
13
+ scale: @options.fetch(:scale)
14
+ )
15
+ end
16
+ end
17
+ end
18
+ end
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Decimal256Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Decimal256ArrayBuilder.build(@arrow_type, values)
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Decimal256DataType.new(
12
+ precision: @options.fetch(:precision),
13
+ scale: @options.fetch(:scale)
14
+ )
15
+ end
16
+ end
17
+ end
18
+ end
@@ -21,5 +21,3 @@ module Parqueteur
21
21
  end
22
22
  end
23
23
  end
24
-
25
- # when :integer
@@ -21,5 +21,3 @@ module Parqueteur
21
21
  end
22
22
  end
23
23
  end
24
-
25
- # when :integer
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Time32Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Time32Array.new(
8
+ @options.fetch(:precision, :second), values
9
+ )
10
+ end
11
+
12
+ def arrow_type_builder
13
+ Arrow::Time32DataType.new(
14
+ options.fetch(:unit, :second)
15
+ )
16
+ end
17
+ end
18
+ end
19
+ end
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Time64Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Time64Array.new(
8
+ @options.fetch(:precision, :second), values
9
+ )
10
+ end
11
+
12
+ def arrow_type_builder
13
+ Arrow::Time64DataType.new(
14
+ options.fetch(:unit, :second)
15
+ )
16
+ end
17
+ end
18
+ end
19
+ end
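In both `Time32Type` and `Time64Type` as shown, `build_value_array` reads the `:precision` option while `arrow_type_builder` reads `:unit`, each defaulting to `:second`. If a column passes only one of the two keys, the array and the data type may end up with different units. A plain-Ruby sketch of that effect, assuming a hypothetical options hash and assuming `:milli` is an accepted unit symbol:

```ruby
# Hypothetical option hash, as a column declaration might pass it.
options = { unit: :milli }

# Mirrors the two lookups in the diff: different keys, same default.
array_unit = options.fetch(:precision, :second)
type_unit  = options.fetch(:unit, :second)
p [array_unit, type_unit] # [:second, :milli]

# Reading one key in both places keeps the two in agreement.
unit = options.fetch(:unit, :second)
p [unit, unit]            # [:milli, :milli]
```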
@@ -9,7 +9,9 @@ module Parqueteur
9
9
  module Types
10
10
  class TimestampType < Parqueteur::Type
11
11
  def build_value_array(values)
12
- Arrow::TimestampArray.new(values)
12
+ Arrow::TimestampArray.new(
13
+ options.fetch(:unit, :second), values
14
+ )
13
15
  end
14
16
 
15
17
  def arrow_type_builder
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Parqueteur
4
- VERSION = '1.2.0'
4
+ VERSION = '1.3.0'
5
5
  end
data/lib/parqueteur.rb CHANGED
@@ -3,6 +3,7 @@
3
3
  require 'json'
4
4
  require 'singleton'
5
5
  require 'tempfile'
6
+ require 'parquet'
6
7
 
7
8
  require_relative 'parqueteur/version'
8
9
  require 'parqueteur/column'
@@ -14,16 +15,20 @@ require 'parqueteur/type'
14
15
  require 'parqueteur/type_resolver'
15
16
  require 'parqueteur/types/array_type'
16
17
  require 'parqueteur/types/boolean_type'
18
+ require 'parqueteur/types/date32_type'
19
+ require 'parqueteur/types/date64_type'
20
+ require 'parqueteur/types/decimal128_type'
21
+ require 'parqueteur/types/decimal256_type'
17
22
  require 'parqueteur/types/int32_type'
18
23
  require 'parqueteur/types/int64_type'
19
24
  require 'parqueteur/types/map_type'
20
25
  require 'parqueteur/types/string_type'
21
26
  require 'parqueteur/types/struct_type'
27
+ require 'parqueteur/types/time32_type'
28
+ require 'parqueteur/types/time64_type'
22
29
  require 'parqueteur/types/timestamp_type'
23
- require 'parquet'
24
30
 
25
31
  module Parqueteur
26
32
  class Error < StandardError; end
27
33
  class TypeNotFound < Error; end
28
- # Your code goes here...
29
34
  end
data/parqueteur.gemspec CHANGED
@@ -8,8 +8,8 @@ Gem::Specification.new do |spec|
8
8
  spec.authors = ["Julien D."]
9
9
  spec.email = ["julien@pocketsizesun.com"]
10
10
  spec.license = 'Apache-2.0'
11
- spec.summary = 'Parqueteur - A Ruby gem that convert JSON to Parquet'
12
- spec.description = 'Convert JSON to Parquet'
11
+ spec.summary = 'Parqueteur - A Ruby gem that convert data to Parquet'
12
+ spec.description = 'Convert data to Parquet'
13
13
  spec.homepage = 'https://github.com/pocketsizesun/parqueteur-ruby'
14
14
  spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
15
15
 
@@ -0,0 +1,18 @@
1
+ #!/bin/sh
2
+
3
+ if [ $(dpkg-query -W -f='${Status}' apache-arrow-apt-source 2>/dev/null | grep -c "ok installed") -eq 1 ]
4
+ then
5
+ exit 0
6
+ fi
7
+
8
+ LSB_RELEASE_CODENAME_SHORT=$(lsb_release --codename --short)
9
+
10
+ apt-get update
11
+ apt-get install -y -V ca-certificates lsb-release wget
12
+ wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
13
+ apt-get install -y -V ./apache-arrow-apt-source-latest-${LSB_RELEASE_CODENAME_SHORT}.deb
14
+ rm ./apache-arrow-apt-source-latest-${LSB_RELEASE_CODENAME_SHORT}.deb
15
+ apt-get update
16
+ apt-get install -y libgirepository1.0-dev libarrow-dev libarrow-glib-dev libparquet-dev libparquet-glib-dev
17
+
18
+ exit 0
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: parqueteur
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Julien D.
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2021-10-02 00:00:00.000000000 Z
11
+ date: 2021-10-03 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: red-parquet
@@ -24,7 +24,7 @@ dependencies:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
26
  version: '5.0'
27
- description: Convert JSON to Parquet
27
+ description: Convert data to Parquet
28
28
  email:
29
29
  - julien@pocketsizesun.com
30
30
  executables: []
@@ -38,11 +38,12 @@ files:
38
38
  - Rakefile
39
39
  - bin/console
40
40
  - bin/setup
41
+ - examples/convert-and-compression.rb
41
42
  - examples/convert-methods.rb
42
43
  - examples/convert-to-io.rb
43
44
  - examples/convert-with-chunks.rb
44
- - examples/convert-with-compression.rb
45
45
  - examples/convert-without-compression.rb
46
+ - examples/hello-world.rb
46
47
  - lib/parqueteur.rb
47
48
  - lib/parqueteur/column.rb
48
49
  - lib/parqueteur/column_collection.rb
@@ -53,15 +54,21 @@ files:
53
54
  - lib/parqueteur/type_resolver.rb
54
55
  - lib/parqueteur/types/array_type.rb
55
56
  - lib/parqueteur/types/boolean_type.rb
57
+ - lib/parqueteur/types/date32_type.rb
58
+ - lib/parqueteur/types/date64_type.rb
59
+ - lib/parqueteur/types/decimal128_type.rb
60
+ - lib/parqueteur/types/decimal256_type.rb
56
61
  - lib/parqueteur/types/int32_type.rb
57
62
  - lib/parqueteur/types/int64_type.rb
58
63
  - lib/parqueteur/types/map_type.rb
59
64
  - lib/parqueteur/types/string_type.rb
60
65
  - lib/parqueteur/types/struct_type.rb
66
+ - lib/parqueteur/types/time32_type.rb
67
+ - lib/parqueteur/types/time64_type.rb
61
68
  - lib/parqueteur/types/timestamp_type.rb
62
69
  - lib/parqueteur/version.rb
63
70
  - parqueteur.gemspec
64
- - test.json
71
+ - scripts/apache-arrow-ubuntu-install.sh
65
72
  homepage: https://github.com/pocketsizesun/parqueteur-ruby
66
73
  licenses:
67
74
  - Apache-2.0
@@ -85,5 +92,5 @@ requirements: []
85
92
  rubygems_version: 3.2.3
86
93
  signing_key:
87
94
  specification_version: 4
88
- summary: Parqueteur - A Ruby gem that convert JSON to Parquet
95
+ summary: Parqueteur - A Ruby gem that convert data to Parquet
89
96
  test_files: []
data/test.json DELETED
@@ -1 +0,0 @@
1
- [{"id":1,"reference":"coucou","hash":{"a":"b"},"valid":true,"hash2":{},"numbers":[1,2,3],"map_array":[]},{"id":2,"reference":"coucou","hash":{"c":"d"},"valid":false,"hash2":{},"numbers":[4,5,6],"map_array":[]},{"id":3,"reference":"coucou","hash":{"e":"f"},"valid":true,"hash2":{"x":[1,2,3]},"numbers":[7,8,9],"map_array":[{"x":"y"}]}]