parqueteur 1.1.0 → 1.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 90ffcadf5b78e4ffc3329eac9be1be34af9209a77a39a6395eefc1c56afa7ce6
-   data.tar.gz: 1d7f5257d3f86443e0d13b789d7449565198f6cda1563d111a36ad3728264044
+   metadata.gz: 86f767a68e38cdd93da015e4ddfcda06b2eefe553ecb6d7a423b4cc0f2752183
+   data.tar.gz: cd741b1d023e44fcc14c08192b9791b675339736527f47dbe2393e85bdff9d07
  SHA512:
-   metadata.gz: 262039094dd3aa5890f9d1a87d836eb64004c51fc57456d7880a36332295ac8a447fcd045c6bd2e053986ad4e9981021448b39e20010d0e41fef6b7e233d91ca
-   data.tar.gz: 73988e1f836acbe22e26c20b8559d8f79e9eeda55f9612fbefae7aa5bef131187b7a0716a3bd7804db5321b29b19f54b383e8594830814ae765d2c7b30986f59
+   metadata.gz: 58882a4d2d1ea5a0cb53f5643f96ac504352b87e6c1436d10841a616d2867bf127c70a31374f6ff2f455fb2894d29fe8cc439ca8a0efca5fe519c00ee312c8c5
+   data.tar.gz: ac185e758d8c0fa19d11ac05e96f77f93cb8260e055d203d1d0d71dd04e0ff6db70d773824a8a6fe5c63e5f235ab179f3417ac689b590403b69192cbad0bde98
data/Gemfile.lock CHANGED
@@ -1,21 +1,21 @@
  PATH
    remote: .
    specs:
-     parqueteur (1.1.0)
+     parqueteur (1.3.1)
        red-parquet (~> 5.0)
 
  GEM
    remote: https://rubygems.org/
    specs:
-     bigdecimal (3.0.0)
-     extpp (0.0.9)
-     gio2 (3.4.6)
-       gobject-introspection (= 3.4.6)
-     glib2 (3.4.6)
+     bigdecimal (3.0.2)
+     extpp (0.1.0)
+     gio2 (3.4.9)
+       gobject-introspection (= 3.4.9)
+     glib2 (3.4.9)
        native-package-installer (>= 1.0.3)
        pkg-config (>= 1.3.5)
-     gobject-introspection (3.4.6)
-       glib2 (= 3.4.6)
+     gobject-introspection (3.4.9)
+       glib2 (= 3.4.9)
      native-package-installer (1.1.1)
      pkg-config (1.4.6)
      rake (13.0.6)
data/README.md CHANGED
@@ -1,15 +1,31 @@
  # Parqueteur
 
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/parqueteur`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/parqueteur.svg)](https://badge.fury.io/rb/parqueteur)
 
- TODO: Delete this and the text above, and describe your gem
+ Parqueteur enables you to generate Apache Parquet files from raw data.
 
+ ## Dependencies
+
+ Since I only tested Parqueteur on Ubuntu, I don't have any install scripts for other operating systems.
+ ### Debian/Ubuntu packages
+ - `libgirepository1.0-dev`
+ - `libarrow-dev`
+ - `libarrow-glib-dev`
+ - `libparquet-dev`
+ - `libparquet-glib-dev`
+
+ You can check the `scripts/apache-arrow-ubuntu-install.sh` script for a quick way to install all of them.
  ## Installation
 
  Add this line to your application's Gemfile:
 
  ```ruby
- gem 'parqueteur'
+ gem 'parqueteur', '~> 1.0'
+ ```
+
+ > (optional) If you don't want to require Parqueteur globally, you can add `require: false` to the Gemfile entry:
+ ```ruby
+ gem 'parqueteur', '~> 1.0', require: false
  ```
 
  And then execute:
@@ -22,14 +38,127 @@ Or install it yourself as:
 
  ## Usage
 
- TODO: Write usage instructions here
+ Parqueteur provides an elegant way to generate Apache Parquet files from a defined schema.
+
+ Converters accept any object that implements `Enumerable` as a data source.
+
+ ### Working example
+
+ ```ruby
+ require 'parqueteur'
+
+ class FooParquetConverter < Parqueteur::Converter
+   column :id, :bigint
+   column :reference, :string
+   column :datetime, :timestamp
+ end
+
+ data = [
+   { 'id' => 1, 'reference' => 'hello world 1', 'datetime' => Time.now },
+   { 'id' => 2, 'reference' => 'hello world 2', 'datetime' => Time.now },
+   { 'id' => 3, 'reference' => 'hello world 3', 'datetime' => Time.now }
+ ]
+
+ # initialize Converter with Parquet GZIP compression mode
+ converter = FooParquetConverter.new(data, compression: :gzip)
 
- ## Development
+ # write result to file
+ converter.write('hello_world.parquet')
 
- After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+ # in-memory result (StringIO)
+ converter.to_io
+
+ # write to temporary file (Tempfile)
+ # don't forget to `close` / `unlink` it after usage
+ converter.to_tmpfile
+
+ # convert to Arrow::Table
+ pp converter.to_arrow_table
+ ```
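For one-off conversions there are also class-level helpers: the `lib/parqueteur/converter.rb` diff further down adds `convert` (returns an in-memory `StringIO`) and `convert_to` (writes straight to a path). A minimal sketch based on the `examples/convert-*.rb` files in this changeset, with illustrative data:

```ruby
require 'parqueteur'

class FooParquetConverter < Parqueteur::Converter
  column :id, :bigint
  column :reference, :string
end

data = [{ 'id' => 1, 'reference' => 'hello world 1' }]

# convert    -> StringIO holding the Parquet payload
# convert_to -> writes the Parquet file to the given path
io = FooParquetConverter.convert(data, compression: :gzip)
FooParquetConverter.convert_to(data, 'hello_world.parquet', compression: :gzip)
```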
+
+ ### Using transformers
+
+ You can use transformers to apply transformations to data items.
+
+ From `examples/cars.rb`:
+
+ ```ruby
+ require 'parqueteur'
+
+ class Car
+   attr_reader :name, :production_year
+
+   def initialize(name, production_year)
+     @name = name
+     @production_year = production_year
+   end
+ end
+
+ class CarParquetConverter < Parqueteur::Converter
+   column :name, :string
+   column :production_year, :integer
+
+   transform do |car|
+     {
+       'name' => car.name,
+       'production_year' => car.production_year
+     }
+   end
+ end
+
+ cars = [
+   Car.new('Alfa Romeo 75', 1985),
+   Car.new('Alfa Romeo 33', 1983),
+   Car.new('Audi A3', 1996),
+   Car.new('Audi A4', 1994),
+   Car.new('BMW 503', 1956),
+   Car.new('BMW X5', 1999)
+ ]
+
+ # initialize Converter with Parquet GZIP compression mode
+ converter = CarParquetConverter.new(cars, compression: :gzip)
+
+ # convert to Arrow::Table and print it
+ pp converter.to_arrow_table
+ ```
+
+ Output:
+ ```
+ #<Arrow::Table:0x7fc1fb24b958 ptr=0x7fc1faedd910>
+ # name production_year
+ 0 Alfa Romeo 75 1985
+ 1 Alfa Romeo 33 1983
+ 2 Audi A3 1996
+ 3 Audi A4 1994
+ 4 BMW 503 1956
+ 5 BMW X5 1999
+ ```
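Transforms can also be registered by method name instead of a block, and multiple transforms run in declaration order; `examples/readme-example.rb`, added later in this changeset, combines both styles. A condensed sketch (the class and column names are illustrative):

```ruby
require 'parqueteur'

class BeerParquetConverter < Parqueteur::Converter
  column :id, :bigint
  column :beers_count, :integer

  # a Symbol refers to an instance method that receives and returns the item
  transform :add_beers

  private

  def add_beers(item)
    item['beers_count'] += rand(1..3)
    item
  end
end

data = 10.times.lazy.map { |i| { 'id' => i + 1, 'beers_count' => 0 } }
BeerParquetConverter.new(data, compression: :gzip).write('beers.parquet')
```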
 
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+ ### Available Types
+
+ | Name (Symbol) | Apache Parquet Type |
+ | ------------- | ------------------- |
+ | `:array` | `Array` |
+ | `:bigdecimal` | `Decimal256` |
+ | `:bigint` | `Int64` or `UInt64` with `unsigned: true` option |
+ | `:boolean` | `Boolean` |
+ | `:date` | `Date32` |
+ | `:date32` | `Date32` |
+ | `:date64` | `Date64` |
+ | `:decimal` | `Decimal128` |
+ | `:decimal128` | `Decimal128` |
+ | `:decimal256` | `Decimal256` |
+ | `:int32` | `Int32` or `UInt32` with `unsigned: true` option |
+ | `:int64` | `Int64` or `UInt64` with `unsigned: true` option |
+ | `:integer` | `Int32` or `UInt32` with `unsigned: true` option |
+ | `:map` | `Map` |
+ | `:string` | `String` |
+ | `:struct` | `Struct` |
+ | `:time` | `Time32` |
+ | `:time32` | `Time32` |
+ | `:time64` | `Time64` |
+ | `:timestamp` | `Timestamp` |
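The parameterized types above (array `elements:`, map `key:`/`value:`, decimal `precision:`/`scale:`, struct fields) are all exercised by `examples/hello-world.rb` later in this diff; the `unsigned: true` flag appears only in this table, so treat that line as an assumption. A compact sketch combining them, with an illustrative converter:

```ruby
require 'bigdecimal'
require 'date'
require 'parqueteur'

class TypedConverter < Parqueteur::Converter
  column :id, :bigint, unsigned: true # assumption: unsigned flag as described in the table above
  column :price, :decimal, precision: 12, scale: 4
  column :tags, :array, elements: :string
  column :attributes, :map, key: :string, value: :string
  column :created_on, :date
  column :my_struct, :struct do
    field :label, :string
    field :rank, :integer
  end
end

row = {
  'id' => 1,
  'price' => BigDecimal('19.99'),
  'tags' => %w[a b],
  'attributes' => { 'color' => 'red' },
  'created_on' => Date.today,
  'my_struct' => { 'label' => 'hello', 'rank' => 1 }
}

TypedConverter.new([row]).write('typed.parquet')
```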
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/parqueteur.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/pocketsizesun/parqueteur-ruby.
data/examples/cars.rb ADDED
@@ -0,0 +1,40 @@
+ require 'bundler/setup'
+ require 'parqueteur'
+
+ class Car
+   attr_reader :name, :production_year
+
+   def initialize(name, production_year)
+     @name = name
+     @production_year = production_year
+   end
+ end
+
+ class CarParquetConverter < Parqueteur::Converter
+   column :name, :string
+   column :production_year, :integer
+
+   transform do |car|
+     {
+       'name' => car.name,
+       'production_year' => car.production_year
+     }
+   end
+ end
+
+ cars = [
+   Car.new('Alfa Romeo 75', 1985),
+   Car.new('Alfa Romeo 33', 1983),
+   Car.new('Audi A3', 1996),
+   Car.new('Audi A4', 1994),
+   Car.new('BMW 503', 1956),
+   Car.new('BMW X5', 1999)
+ ]
+
+ # initialize Converter with Parquet GZIP compression mode
+ converter = CarParquetConverter.new(
+   cars, compression: :gzip
+ )
+
+ # convert the cars to an Arrow::Table and print it
+ pp converter.to_arrow_table
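A quick way to check what this example family produces is to write a file and load it back with Red Arrow; a small self-contained sketch (converter and filenames are illustrative), mirroring the `Arrow::Table.load(..., format: :parquet)` call used in `lib/parqueteur/converter.rb` below:

```ruby
require 'parqueteur'

class PointConverter < Parqueteur::Converter
  column :x, :integer
  column :y, :integer
end

# write a tiny Parquet file, then read it back to verify the round trip
PointConverter.new([{ 'x' => 1, 'y' => 2 }]).write('points.parquet')

table = Arrow::Table.load('points.parquet', format: :parquet)
table.each_record { |record| puts record.to_h }
```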
@@ -0,0 +1,56 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+ require 'securerandom'
4
+ require 'benchmark'
5
+
6
+ class FooParquetConverter < Parqueteur::Converter
7
+ column :id, :bigint
8
+ column :reference, :string
9
+ column :hash, :map, key: :string, value: :string
10
+ # column :hash2, :map, key: :string, value: :string
11
+ # column :hash3, :map, key: :string, value: :string
12
+ column :valid, :boolean
13
+ column :total, :integer
14
+ column :numbers, :array, elements: :integer
15
+ column :my_struct, :struct do
16
+ field :test, :string
17
+ field :mon_nombre, :integer
18
+ end
19
+ end
20
+
21
+ def random_hash
22
+ {
23
+ 'a' => SecureRandom.hex(128),
24
+ 'b' => SecureRandom.hex(128),
25
+ 'c' => SecureRandom.hex(128),
26
+ 'd' => SecureRandom.hex(128),
27
+ 'e' => SecureRandom.hex(128),
28
+ 'f' => SecureRandom.hex(128),
29
+ 'g' => SecureRandom.hex(128),
30
+ 'h' => SecureRandom.hex(128),
31
+ 'i' => SecureRandom.hex(128),
32
+ 'j' => SecureRandom.hex(128),
33
+ 'k' => SecureRandom.hex(128),
34
+ }
35
+ end
36
+
37
+ data = 10000.times.collect do |i|
38
+ {
39
+ 'id' => i + 1,
40
+ 'reference' => "coucou:#{i}",
41
+ 'hash' => random_hash,
42
+ # 'hash2' => random_hash,
43
+ # 'hash3' => random_hash,
44
+ 'valid' => rand < 0.5,
45
+ 'total' => rand(100..500),
46
+ 'numbers' => [1, 2, 3]
47
+ }
48
+ end
49
+ puts "data generation OK"
50
+
51
+ # initialize Converter with Parquet GZIP compression mode
52
+ converter = FooParquetConverter.new(data, compression: :gzip)
53
+
54
+ # write result to file
55
+ converter.write('tmp/example.gzip-compressed.parquet')
56
+ converter.write('tmp/example.no-gzip.parquet', compression: false)
@@ -0,0 +1,54 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+ require 'securerandom'
4
+ require 'benchmark'
5
+
6
+ class Foo < Parqueteur::Converter
7
+ column :id, :bigint
8
+ column :reference, :string
9
+ column :hash, :map, key: :string, value: :string
10
+ # column :hash2, :map, key: :string, value: :string
11
+ # column :hash3, :map, key: :string, value: :string
12
+ column :valid, :boolean
13
+ column :total, :integer
14
+ column :numbers, :array, elements: :integer
15
+ column :my_struct, :struct do
16
+ field :test, :string
17
+ field :mon_nombre, :integer
18
+ end
19
+ end
20
+
21
+ def random_hash
22
+ {
23
+ 'a' => SecureRandom.hex(128),
24
+ 'b' => SecureRandom.hex(128),
25
+ 'c' => SecureRandom.hex(128),
26
+ 'd' => SecureRandom.hex(128),
27
+ 'e' => SecureRandom.hex(128),
28
+ 'f' => SecureRandom.hex(128),
29
+ 'g' => SecureRandom.hex(128),
30
+ 'h' => SecureRandom.hex(128),
31
+ 'i' => SecureRandom.hex(128),
32
+ 'j' => SecureRandom.hex(128),
33
+ 'k' => SecureRandom.hex(128),
34
+ }
35
+ end
36
+
37
+ data = 10000.times.collect do |i|
38
+ {
39
+ 'id' => i + 1,
40
+ 'reference' => "coucou:#{i}",
41
+ 'hash' => random_hash,
42
+ # 'hash2' => random_hash,
43
+ # 'hash3' => random_hash,
44
+ 'valid' => rand < 0.5,
45
+ 'total' => rand(100..500),
46
+ 'numbers' => [1, 2, 3]
47
+ }
48
+ end
49
+ puts "data generation OK"
50
+
51
+ converter = Foo.new(data, compression: :gzip)
52
+ pp converter.to_io
53
+ pp converter.to_arrow_table
54
+ converter.write('tmp/test.parquet')
@@ -0,0 +1,52 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+ require 'securerandom'
4
+ require 'benchmark'
5
+
6
+ class Foo < Parqueteur::Converter
7
+ column :id, :bigint
8
+ column :reference, :string
9
+ column :hash, :map, key: :string, value: :string
10
+ # column :hash2, :map, key: :string, value: :string
11
+ # column :hash3, :map, key: :string, value: :string
12
+ column :valid, :boolean
13
+ column :total, :integer
14
+ column :numbers, :array, elements: :integer
15
+ column :my_struct, :struct do
16
+ field :test, :string
17
+ field :mon_nombre, :integer
18
+ end
19
+ end
20
+
21
+ def random_hash
22
+ {
23
+ 'a' => SecureRandom.hex(128),
24
+ 'b' => SecureRandom.hex(128),
25
+ 'c' => SecureRandom.hex(128),
26
+ 'd' => SecureRandom.hex(128),
27
+ 'e' => SecureRandom.hex(128),
28
+ 'f' => SecureRandom.hex(128),
29
+ 'g' => SecureRandom.hex(128),
30
+ 'h' => SecureRandom.hex(128),
31
+ 'i' => SecureRandom.hex(128),
32
+ 'j' => SecureRandom.hex(128),
33
+ 'k' => SecureRandom.hex(128),
34
+ }
35
+ end
36
+
37
+ data = 10000.times.collect do |i|
38
+ {
39
+ 'id' => i + 1,
40
+ 'reference' => "coucou:#{i}",
41
+ 'hash' => random_hash,
42
+ # 'hash2' => random_hash,
43
+ # 'hash3' => random_hash,
44
+ 'valid' => rand < 0.5,
45
+ 'total' => rand(100..500),
46
+ 'numbers' => [1, 2, 3]
47
+ }
48
+ end
49
+ puts "data generation OK"
50
+
51
+ io = Foo.convert(data)
52
+ pp io.read
@@ -0,0 +1,54 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+ require 'securerandom'
4
+ require 'benchmark'
5
+
6
+ class Foo < Parqueteur::Converter
7
+ column :id, :bigint
8
+ column :reference, :string
9
+ column :hash, :map, key: :string, value: :string
10
+ # column :hash2, :map, key: :string, value: :string
11
+ # column :hash3, :map, key: :string, value: :string
12
+ column :valid, :boolean
13
+ column :total, :integer
14
+ column :numbers, :array, elements: :integer
15
+ column :my_struct, :struct do
16
+ field :test, :string
17
+ field :mon_nombre, :integer
18
+ end
19
+ end
20
+
21
+ def random_hash
22
+ {
23
+ 'a' => SecureRandom.hex(128),
24
+ 'b' => SecureRandom.hex(128),
25
+ 'c' => SecureRandom.hex(128),
26
+ 'd' => SecureRandom.hex(128),
27
+ 'e' => SecureRandom.hex(128),
28
+ 'f' => SecureRandom.hex(128),
29
+ 'g' => SecureRandom.hex(128),
30
+ 'h' => SecureRandom.hex(128),
31
+ 'i' => SecureRandom.hex(128),
32
+ 'j' => SecureRandom.hex(128),
33
+ 'k' => SecureRandom.hex(128),
34
+ }
35
+ end
36
+
37
+ data = 10000.times.collect do |i|
38
+ {
39
+ 'id' => i + 1,
40
+ 'reference' => "coucou:#{i}",
41
+ 'hash' => random_hash,
42
+ # 'hash2' => random_hash,
43
+ # 'hash3' => random_hash,
44
+ 'valid' => rand < 0.5,
45
+ 'total' => rand(100..500),
46
+ 'numbers' => [1, 2, 3]
47
+ }
48
+ end
49
+ puts "data generation OK"
50
+
51
+ converter = Foo.new(data, compression: :gzip)
52
+ converter.split(200).each_with_index do |chunk, idx|
53
+ puts "#{idx}: #{chunk.path}"
54
+ end
@@ -0,0 +1,52 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+ require 'securerandom'
4
+ require 'benchmark'
5
+
6
+ class Foo < Parqueteur::Converter
7
+ column :id, :bigint
8
+ column :reference, :string
9
+ column :hash, :map, key: :string, value: :string
10
+ # column :hash2, :map, key: :string, value: :string
11
+ # column :hash3, :map, key: :string, value: :string
12
+ column :valid, :boolean
13
+ column :total, :integer
14
+ column :numbers, :array, elements: :integer
15
+ column :my_struct, :struct do
16
+ field :test, :string
17
+ field :mon_nombre, :integer
18
+ end
19
+ end
20
+
21
+ def random_hash
22
+ {
23
+ 'a' => SecureRandom.hex(128),
24
+ 'b' => SecureRandom.hex(128),
25
+ 'c' => SecureRandom.hex(128),
26
+ 'd' => SecureRandom.hex(128),
27
+ 'e' => SecureRandom.hex(128),
28
+ 'f' => SecureRandom.hex(128),
29
+ 'g' => SecureRandom.hex(128),
30
+ 'h' => SecureRandom.hex(128),
31
+ 'i' => SecureRandom.hex(128),
32
+ 'j' => SecureRandom.hex(128),
33
+ 'k' => SecureRandom.hex(128),
34
+ }
35
+ end
36
+
37
+ data = 10000.times.collect do |i|
38
+ {
39
+ 'id' => i + 1,
40
+ 'reference' => "coucou:#{i}",
41
+ 'hash' => random_hash,
42
+ # 'hash2' => random_hash,
43
+ # 'hash3' => random_hash,
44
+ 'valid' => rand < 0.5,
45
+ 'total' => rand(100..500),
46
+ 'numbers' => [1, 2, 3]
47
+ }
48
+ end
49
+ puts "data generation OK"
50
+
51
+ path = 'tmp/test.parquet'
52
+ Foo.convert_to(data, path)
@@ -0,0 +1,57 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+
4
+ class FooParquetConverter < Parqueteur::Converter
5
+ column :id, :bigint
6
+ column :my_string_array, :array, elements: :string
7
+ column :my_date, :date
8
+ column :my_decimal, :decimal, precision: 12, scale: 4
9
+ column :my_int, :integer
10
+ column :my_map, :map, key: :string, value: :string
11
+ column :my_string, :string
12
+ column :my_struct, :struct do
13
+ field :my_struct_str, :string
14
+ field :my_struct_int, :integer
15
+ end
16
+ column :my_time, :time
17
+ column :my_timestamp, :timestamp
18
+ end
19
+
20
+ data = 100.times.collect do |i|
21
+ {
22
+ 'id' => i,
23
+ 'my_string_array' => %w[a b c],
24
+ 'my_date' => Date.today,
25
+ 'my_decimal' => BigDecimal('0.03'),
26
+ 'my_int' => rand(1..10),
27
+ 'my_map' => { 'a' => 'b' },
28
+ 'my_string' => 'Hello World',
29
+ 'my_struct' => {
30
+ 'my_struct_str' => 'Hello World',
31
+ 'my_struct_int' => 1
32
+ },
33
+ 'my_time' => 3600,
34
+ 'my_timestamp' => Time.now
35
+ }
36
+ end
37
+
38
+ # initialize Converter with Parquet GZIP compression mode
39
+ converter = FooParquetConverter.new(data, compression: :gzip)
40
+
41
+ # write result to file
42
+ converter.write('tmp/hello_world.compressed.parquet')
43
+ converter.write('tmp/hello_world.parquet', compression: false)
44
+
45
+ # in-memory result (StringIO)
46
+ converter.to_io
47
+
48
+ # write to temporary file (Tempfile)
49
+ # don't forget to `close` / `unlink` it after usage
50
+ converter.to_tmpfile
51
+
52
+ # Arrow Table
53
+ table = converter.to_arrow_table
54
+ table.each_record do |record|
55
+ # pp record['my_decimal'].to_f
56
+ pp record.to_h
57
+ end
@@ -0,0 +1,44 @@
1
+ require 'bundler/setup'
2
+ require 'parqueteur'
3
+
4
+ class FooParquetConverter < Parqueteur::Converter
5
+ column :id, :bigint
6
+ column :reference, :string
7
+ column :datetime, :timestamp
8
+ column :beers_count, :integer
9
+
10
+ transform do |item|
11
+ item.merge(
12
+ 'datetime' => Time.now
13
+ )
14
+ end
15
+
16
+ transform :add_beers
17
+
18
+ private
19
+
20
+ def add_beers(item)
21
+ item['beers_count'] += rand(1..3)
22
+ item
23
+ end
24
+ end
25
+
26
+ data = 10.times.lazy.map do |i|
27
+ { 'id' => i + 1, 'reference' => 'hello world 1', 'beers_count' => 0 }
28
+ end
29
+
30
+ # initialize Converter with Parquet GZIP compression mode
31
+ converter = FooParquetConverter.new(data, compression: :gzip)
32
+
33
+ # write result to file
34
+ converter.write('tmp/hello_world.parquet')
35
+
36
+ # in-memory result (StringIO)
37
+ converter.to_io
38
+
39
+ # write to temporary file (Tempfile)
40
+ # don't forget to `close` / `unlink` it after usage
41
+ converter.to_tmpfile
42
+
43
+ # convert to Arrow::Table
44
+ pp converter.to_arrow_table
@@ -2,7 +2,7 @@
2
2
 
3
3
  module Parqueteur
4
4
  class Converter
5
- attr_reader :schema
5
+ DEFAULT_BATCH_SIZE = 10
6
6
 
7
7
  def self.inline(&block)
8
8
  Class.new(self, &block)
@@ -24,104 +24,137 @@ module Parqueteur
24
24
  transforms << (method_name || block)
25
25
  end
26
26
 
27
- def self.convert(input, output: nil)
28
- converter = new(input)
29
- if !output.nil?
30
- converter.write(output)
31
- else
32
- converter.to_blob
27
+ def self.convert(input, **kwargs)
28
+ new(input, **kwargs).to_io
29
+ end
30
+
31
+ def self.convert_to(input, output_path, **kwargs)
32
+ converter = new(input, **kwargs)
33
+ converter.write(output_path)
34
+ end
35
+
36
+ # @param [Enumerable] An enumerable object
37
+ # @option [Symbol] compression - :gzip
38
+ def initialize(input, **kwargs)
39
+ @input = Parqueteur::Input.from(input)
40
+ @batch_size = kwargs.fetch(:batch_size, DEFAULT_BATCH_SIZE)
41
+ @compression = kwargs.fetch(:compression, nil)&.to_sym
42
+ end
43
+
44
+ def split(size, batch_size: nil, compression: nil)
45
+ Enumerator.new do |arr|
46
+ options = {
47
+ batch_size: batch_size || @batch_size,
48
+ compression: compression || @compression
49
+ }
50
+ @input.each_slice(size) do |records|
51
+ local_converter = self.class.new(records, **options)
52
+ file = local_converter.to_tmpfile
53
+ arr << file
54
+ file.close
55
+ file.unlink
56
+ end
33
57
  end
34
58
  end
35
59
 
36
- def initialize(input, options = {})
37
- @input = Parqueteur::Input.from(input, options)
38
- end
39
-
40
- def write(output)
41
- case output
42
- when :io
43
- to_io
44
- when String
45
- to_arrow_table.save(output)
46
- when StringIO, IO
47
- buffer = Arrow::ResizableBuffer.new(0)
48
- to_arrow_table.save(buffer, format: :parquet)
49
- output.write(buffer.data.to_s)
50
- output.rewind
51
- output
52
- else
53
- raise ArgumentError, "unsupported output: #{output.class}, accepted: String (filename), IO, StringIO"
60
+ def split_by_io(size, batch_size: nil, compression: nil)
61
+ Enumerator.new do |arr|
62
+ options = {
63
+ batch_size: batch_size || @batch_size,
64
+ compression: compression || @compression
65
+ }
66
+ @input.each_slice(size) do |records|
67
+ local_converter = self.class.new(records, **options)
68
+ arr << local_converter.to_io
69
+ end
70
+ end
71
+ end
72
+
73
+ def write(path, batch_size: nil, compression: nil)
74
+ compression = @compression if compression.nil?
75
+ batch_size = @batch_size if batch_size.nil?
76
+ arrow_schema = self.class.columns.arrow_schema
77
+ writer_properties = Parquet::WriterProperties.new
78
+ if !compression.nil? && compression != false
79
+ writer_properties.set_compression(compression)
80
+ end
81
+
82
+ Arrow::FileOutputStream.open(path, false) do |output|
83
+ Parquet::ArrowFileWriter.open(arrow_schema, output, writer_properties) do |writer|
84
+ @input.each_slice(batch_size) do |records|
85
+ arrow_table = build_arrow_table(records)
86
+ writer.write_table(arrow_table, 1024)
87
+ end
88
+ end
54
89
  end
90
+
91
+ true
92
+ end
93
+
94
+ def to_tmpfile(options = {})
95
+ tempfile = Tempfile.new
96
+ tempfile.binmode
97
+ write(tempfile.path, **options)
98
+ tempfile.rewind
99
+ tempfile
55
100
  end
56
101
 
57
- def to_s
58
- inspect
102
+ def to_io(options = {})
103
+ tmpfile = to_tmpfile(options)
104
+ strio = StringIO.new(tmpfile.read)
105
+ tmpfile.close
106
+ tmpfile.unlink
107
+ strio
59
108
  end
60
109
 
61
- def to_io
62
- write(StringIO.new)
110
+ def to_arrow_table(options = {})
111
+ file = to_tmpfile(options)
112
+ table = Arrow::Table.load(file.path, format: :parquet)
113
+ file.close
114
+ file.unlink
115
+ table
63
116
  end
64
117
 
65
- def to_blob
66
- write(StringIO.new).read
118
+ def to_blob(options = {})
119
+ to_tmpfile(options).read
67
120
  end
68
121
 
69
- def to_arrow_table
122
+ private
123
+
124
+ def build_arrow_table(records)
70
125
  transforms = self.class.transforms
71
126
 
72
- chunks = self.class.columns.each_with_object({}) do |column, hash|
127
+ values = self.class.columns.each_with_object({}) do |column, hash|
73
128
  hash[column.name] = []
74
129
  end
75
- items_count = 0
76
- @input.each_slice(100) do |items|
77
- values = self.class.columns.each_with_object({}) do |column, hash|
78
- hash[column.name] = []
79
- end
80
130
 
81
- items.each do |item|
82
- if transforms.length > 0
83
- transforms.each do |transform|
84
- item = \
85
- if transform.is_a?(Symbol)
86
- __send__(transform, item)
87
- else
88
- transform.call(item)
89
- end
90
- end
131
+ records.each do |item|
132
+ if transforms.length > 0
133
+ transforms.each do |transform|
134
+ item = \
135
+ if transform.is_a?(Symbol)
136
+ __send__(transform, item)
137
+ else
138
+ transform.call(item)
139
+ end
91
140
  end
141
+ end
92
142
 
93
- values.each_key do |value_key|
94
- if item.key?(value_key)
95
- values[value_key] << item[value_key]
96
- else
97
- values[value_key] << nil
98
- end
143
+ values.each_key do |value_key|
144
+ if item.key?(value_key)
145
+ values[value_key] << item[value_key]
146
+ else
147
+ values[value_key] << nil
99
148
  end
100
149
  end
150
+ end
101
151
 
102
- values.each_with_object(chunks) do |item, hash|
152
+ Arrow::Table.new(
153
+ values.each_with_object({}) do |item, hash|
103
154
  column = self.class.columns.find(item[0])
104
- hash[item[0]].push(
105
- column.type.build_value_array(item[1])
106
- )
155
+ hash[item[0]] = column.type.build_value_array(item[1])
107
156
  end
108
-
109
- items_count += items.length
110
- end
111
-
112
- if items_count > 0
113
- Arrow::Table.new(
114
- chunks.transform_values! do |value|
115
- Arrow::ChunkedArray.new(value)
116
- end
117
- )
118
- else
119
- Arrow::Table.new(
120
- self.class.columns.each_with_object({}) do |column, hash|
121
- hash[column.name] = column.type.build_value_array([])
122
- end
123
- )
124
- end
157
+ )
125
158
  end
126
159
  end
127
160
  end
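The rewritten `Converter` above also gains `split` and `split_by_io`, which slice the input into fixed-size chunks and yield one Parquet payload per chunk (a `Tempfile` or a `StringIO`, respectively). A hedged sketch of consuming the `StringIO` variant, with an illustrative converter and output naming:

```ruby
require 'parqueteur'

class EventConverter < Parqueteur::Converter
  column :id, :bigint
  column :reference, :string
end

data = 1_000.times.map { |i| { 'id' => i + 1, 'reference' => "ref:#{i}" } }
converter = EventConverter.new(data, compression: :gzip)

# split_by_io returns an Enumerator; each element is a StringIO holding one Parquet chunk
converter.split_by_io(250).each_with_index do |io, idx|
  File.binwrite("events.part-#{idx}.parquet", io.read)
end
```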
@@ -4,40 +4,25 @@ module Parqueteur
4
4
  class Input
5
5
  include Enumerable
6
6
 
7
- def self.from(arg, options = {})
8
- new(
9
- case arg
10
- when String
11
- if File.exist?(arg)
12
- File.new(arg, 'r')
13
- else
14
- arg.split("\n")
15
- end
16
- when Enumerable
17
- arg
18
- end,
19
- options
20
- )
7
+ def self.from(arg)
8
+ return arg if arg.is_a?(self)
9
+
10
+ new(arg)
21
11
  end
22
12
 
23
- def initialize(source, options = {})
13
+ def initialize(source)
14
+ unless source.is_a?(Enumerable)
15
+ raise ArgumentError, 'Enumerable object expected'
16
+ end
17
+
24
18
  @source = source
25
- @options = options
26
19
  end
27
20
 
28
21
  def each(&block)
29
- case @source
30
- when File
31
- if @options.fetch(:json_newlines, true) == true
32
- @source.each_line do |line|
33
- yield(JSON.parse(line.strip))
34
- end
35
- else
36
- JSON.parse(@source.read).each(&block)
37
- end
38
- @source.rewind
39
- when Enumerable
22
+ if block_given?
40
23
  @source.each(&block)
24
+ else
25
+ @source.to_enum(:each)
41
26
  end
42
27
  end
43
28
  end
@@ -7,14 +7,24 @@ module Parqueteur
7
7
  def self.registered_types
8
8
  @registered_types ||= {
9
9
  array: Parqueteur::Types::ArrayType,
10
+ bigdecimal: Parqueteur::Types::Decimal256Type,
10
11
  bigint: Parqueteur::Types::Int64Type,
11
12
  boolean: Parqueteur::Types::BooleanType,
13
+ date: Parqueteur::Types::Date32Type,
14
+ date32: Parqueteur::Types::Date32Type,
15
+ date64: Parqueteur::Types::Date64Type,
16
+ decimal: Parqueteur::Types::Decimal128Type,
17
+ decimal128: Parqueteur::Types::Decimal128Type,
18
+ decimal256: Parqueteur::Types::Decimal256Type,
12
19
  int32: Parqueteur::Types::Int32Type,
13
20
  int64: Parqueteur::Types::Int64Type,
14
21
  integer: Parqueteur::Types::Int32Type,
15
22
  map: Parqueteur::Types::MapType,
16
23
  string: Parqueteur::Types::StringType,
17
24
  struct: Parqueteur::Types::StructType,
25
+ time: Parqueteur::Types::Time32Type,
26
+ time32: Parqueteur::Types::Time32Type,
27
+ time64: Parqueteur::Types::Time64Type,
18
28
  timestamp: Parqueteur::Types::TimestampType
19
29
  }
20
30
  end
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Date32Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Date32ArrayBuilder.build(values)
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Date32DataType.new
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Date64Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Date64ArrayBuilder.build([values])
8
+ end
9
+
10
+ def arrow_type_builder
11
+ Arrow::Date64DataType.new
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,29 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Decimal128Type < Parqueteur::Type
6
+ def initialize(options = {}, &block)
7
+ @scale = options.fetch(:scale)
8
+ @precision = options.fetch(:precision)
9
+ @format_str = "%.#{@scale}f"
10
+ super(options, &block)
11
+ end
12
+
13
+ def build_value_array(values)
14
+ Arrow::Decimal128ArrayBuilder.build(
15
+ @arrow_type,
16
+ values.map do |value|
17
+ Arrow::Decimal128.new(format(@format_str, BigDecimal(value)))
18
+ end
19
+ )
20
+ end
21
+
22
+ def arrow_type_builder
23
+ Arrow::Decimal128DataType.new(
24
+ @precision, @scale
25
+ )
26
+ end
27
+ end
28
+ end
29
+ end
@@ -0,0 +1,29 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Decimal256Type < Parqueteur::Type
6
+ def initialize(options = {}, &block)
7
+ @scale = options.fetch(:scale)
8
+ @precision = options.fetch(:precision)
9
+ @format_str = "%.#{@scale}f"
10
+ super(options, &block)
11
+ end
12
+
13
+ def build_value_array(values)
14
+ Arrow::Decimal256ArrayBuilder.build(
15
+ @arrow_type,
16
+ values.map do |value|
17
+ Arrow::Decimal256.new(format(@format_str, BigDecimal(value)))
18
+ end
19
+ )
20
+ end
21
+
22
+ def arrow_type_builder
23
+ Arrow::Decimal256DataType.new(
24
+ @precision, @scale
25
+ )
26
+ end
27
+ end
28
+ end
29
+ end
@@ -21,5 +21,3 @@ module Parqueteur
21
21
  end
22
22
  end
23
23
  end
24
-
25
- # when :integer
@@ -21,5 +21,3 @@ module Parqueteur
21
21
  end
22
22
  end
23
23
  end
24
-
25
- # when :integer
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Time32Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Time32Array.new(
8
+ @options.fetch(:precision, :second), values
9
+ )
10
+ end
11
+
12
+ def arrow_type_builder
13
+ Arrow::Time32DataType.new(
14
+ options.fetch(:unit, :second)
15
+ )
16
+ end
17
+ end
18
+ end
19
+ end
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Parqueteur
4
+ module Types
5
+ class Time64Type < Parqueteur::Type
6
+ def build_value_array(values)
7
+ Arrow::Time64Array.new(
8
+ @options.fetch(:precision, :second), values
9
+ )
10
+ end
11
+
12
+ def arrow_type_builder
13
+ Arrow::Time64DataType.new(
14
+ options.fetch(:unit, :second)
15
+ )
16
+ end
17
+ end
18
+ end
19
+ end
@@ -9,7 +9,9 @@ module Parqueteur
9
9
  module Types
10
10
  class TimestampType < Parqueteur::Type
11
11
  def build_value_array(values)
12
- Arrow::TimestampArray.new(values)
12
+ Arrow::TimestampArray.new(
13
+ options.fetch(:unit, :second), values
14
+ )
13
15
  end
14
16
 
15
17
  def arrow_type_builder
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Parqueteur
-   VERSION = '1.1.0'
+   VERSION = '1.3.1'
  end
data/lib/parqueteur.rb CHANGED
@@ -2,9 +2,10 @@
 
  require 'json'
  require 'singleton'
+ require 'tempfile'
+ require 'parquet'
 
- require_relative "parqueteur/version"
- require 'parqueteur/chunked_converter'
+ require_relative 'parqueteur/version'
  require 'parqueteur/column'
  require 'parqueteur/column_collection'
  require 'parqueteur/converter'
@@ -14,16 +15,20 @@ require 'parqueteur/type'
  require 'parqueteur/type_resolver'
  require 'parqueteur/types/array_type'
  require 'parqueteur/types/boolean_type'
+ require 'parqueteur/types/date32_type'
+ require 'parqueteur/types/date64_type'
+ require 'parqueteur/types/decimal128_type'
+ require 'parqueteur/types/decimal256_type'
  require 'parqueteur/types/int32_type'
  require 'parqueteur/types/int64_type'
  require 'parqueteur/types/map_type'
  require 'parqueteur/types/string_type'
  require 'parqueteur/types/struct_type'
+ require 'parqueteur/types/time32_type'
+ require 'parqueteur/types/time64_type'
  require 'parqueteur/types/timestamp_type'
- require 'parquet'
 
  module Parqueteur
    class Error < StandardError; end
    class TypeNotFound < Error; end
-   # Your code goes here...
  end
data/parqueteur.gemspec CHANGED
@@ -8,8 +8,8 @@ Gem::Specification.new do |spec|
    spec.authors = ["Julien D."]
    spec.email = ["julien@pocketsizesun.com"]
    spec.license = 'Apache-2.0'
-   spec.summary = 'Parqueteur - A Ruby gem that convert JSON to Parquet'
-   spec.description = 'Convert JSON to Parquet'
+   spec.summary = 'Parqueteur - A Ruby gem that convert data to Parquet'
+   spec.description = 'Convert data to Parquet'
    spec.homepage = 'https://github.com/pocketsizesun/parqueteur-ruby'
    spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
 
@@ -0,0 +1,18 @@
+ #!/bin/sh
+
+ if [ $(dpkg-query -W -f='${Status}' apache-arrow-apt-source 2>/dev/null | grep -c "ok installed") -eq 1 ]
+ then
+   exit 0
+ fi
+
+ LSB_RELEASE_CODENAME_SHORT=$(lsb_release --codename --short)
+
+ apt-get update
+ apt-get install -y -V ca-certificates lsb-release wget
+ wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+ apt-get install -y -V ./apache-arrow-apt-source-latest-${LSB_RELEASE_CODENAME_SHORT}.deb
+ rm ./apache-arrow-apt-source-latest-${LSB_RELEASE_CODENAME_SHORT}.deb
+ apt-get update
+ apt-get install -y libgirepository1.0-dev libarrow-dev libarrow-glib-dev libparquet-dev libparquet-glib-dev
+
+ exit 0
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: parqueteur
  version: !ruby/object:Gem::Version
-   version: 1.1.0
+   version: 1.3.1
  platform: ruby
  authors:
  - Julien D.
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2021-10-02 00:00:00.000000000 Z
+ date: 2021-10-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: red-parquet
@@ -24,7 +24,7 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '5.0'
- description: Convert JSON to Parquet
+ description: Convert data to Parquet
  email:
  - julien@pocketsizesun.com
  executables: []
@@ -38,9 +38,15 @@ files:
  - Rakefile
  - bin/console
  - bin/setup
- - example.rb
+ - examples/cars.rb
+ - examples/convert-and-compression.rb
+ - examples/convert-methods.rb
+ - examples/convert-to-io.rb
+ - examples/convert-with-chunks.rb
+ - examples/convert-without-compression.rb
+ - examples/hello-world.rb
+ - examples/readme-example.rb
  - lib/parqueteur.rb
- - lib/parqueteur/chunked_converter.rb
  - lib/parqueteur/column.rb
  - lib/parqueteur/column_collection.rb
  - lib/parqueteur/converter.rb
@@ -50,15 +56,21 @@ files:
  - lib/parqueteur/type_resolver.rb
  - lib/parqueteur/types/array_type.rb
  - lib/parqueteur/types/boolean_type.rb
+ - lib/parqueteur/types/date32_type.rb
+ - lib/parqueteur/types/date64_type.rb
+ - lib/parqueteur/types/decimal128_type.rb
+ - lib/parqueteur/types/decimal256_type.rb
  - lib/parqueteur/types/int32_type.rb
  - lib/parqueteur/types/int64_type.rb
  - lib/parqueteur/types/map_type.rb
  - lib/parqueteur/types/string_type.rb
  - lib/parqueteur/types/struct_type.rb
+ - lib/parqueteur/types/time32_type.rb
+ - lib/parqueteur/types/time64_type.rb
  - lib/parqueteur/types/timestamp_type.rb
  - lib/parqueteur/version.rb
  - parqueteur.gemspec
- - test.json
+ - scripts/apache-arrow-ubuntu-install.sh
  homepage: https://github.com/pocketsizesun/parqueteur-ruby
  licenses:
  - Apache-2.0
@@ -82,5 +94,5 @@ requirements: []
  rubygems_version: 3.2.3
  signing_key:
  specification_version: 4
- summary: Parqueteur - A Ruby gem that convert JSON to Parquet
+ summary: Parqueteur - A Ruby gem that convert data to Parquet
  test_files: []
data/example.rb DELETED
@@ -1,39 +0,0 @@
- require 'bundler/setup'
- require 'parqueteur'
-
- class Foo < Parqueteur::Converter
-   column :id, :bigint
-   column :reference, :string
-   column :hash, :map, key: :string, value: :string
-   column :valid, :boolean
-   column :total, :integer
-   column :numbers, :array, elements: :integer
-   column :my_struct, :struct do
-     field :test, :string
-     field :mon_nombre, :integer
-   end
- end
-
- LETTERS = ('a'..'z').to_a
-
- data = 1000.times.collect do |i|
-   {
-     'id' => i + 1,
-     'reference' => "coucou:#{i}",
-     'hash' => { 'a' => LETTERS.sample },
-     'valid' => rand < 0.5,
-     'total' => rand(100..500),
-     'numbers' => [1, 2, 3],
-     'my_struct' => {
-       'test' => 'super'
-     }
-   }
- end
-
- # chunked_converter = Parqueteur::ChunkedConverter.new(data, Foo)
- # pp chunked_converter.write_files('test')
- puts Foo.convert(data, output: 'tmp/test.parquet')
- table = Arrow::Table.load('tmp/test.parquet')
- table.each_record do |record|
-   puts record.to_h
- end
data/lib/parqueteur/chunked_converter.rb DELETED
@@ -1,28 +0,0 @@
- # frozen_string_literal: true
-
- module Parqueteur
-   class ChunkedConverter
-     attr_reader :schema
-
-     def initialize(input, converter, chunk_size = 200)
-       @input = Parqueteur::Input.from(input)
-       @converter = converter
-       @chunk_size = chunk_size
-     end
-
-     def chunks
-       Enumerator.new do |arr|
-         @input.each_slice(@chunk_size) do |chunk|
-           local_converter = @converter.new(chunk)
-           arr << local_converter.to_io
-         end
-       end
-     end
-
-     def write_files(prefix)
-       chunks.each_with_index do |chunk, idx|
-         File.write("#{prefix}.#{idx}.parquet", chunk.read)
-       end
-     end
-   end
- end
data/test.json DELETED
@@ -1 +0,0 @@
- [{"id":1,"reference":"coucou","hash":{"a":"b"},"valid":true,"hash2":{},"numbers":[1,2,3],"map_array":[]},{"id":2,"reference":"coucou","hash":{"c":"d"},"valid":false,"hash2":{},"numbers":[4,5,6],"map_array":[]},{"id":3,"reference":"coucou","hash":{"e":"f"},"valid":true,"hash2":{"x":[1,2,3]},"numbers":[7,8,9],"map_array":[{"x":"y"}]}]