rover-df 0.2.2 → 0.2.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 0452f4e042fe699042ceebd158a63957b2c2aad0c6fb5652b5e1bb0c49b39f5f
-   data.tar.gz: 81eca93e309798632b1192b12d50a44828a07c82b84b1f58da9406968761960f
+   metadata.gz: 01e2a90ba133ae07ad6ad482bdca985df806d6a073fa2d93029b2b7e1b55dc49
+   data.tar.gz: 96f4171420dea68b38cffdd5a365657bc464f6d1f0c4f6bf1aefb20377c56179
  SHA512:
-   metadata.gz: feb735bbf9fd17006b2a66416527cd280241082db0bb61b3c1a16317833baa96392b0d5fe70f15ceb8878247747ee966da5fdea607600620e6aa806103c5547c
-   data.tar.gz: f56e61bb2869beddf953eaf64e3759ab3987b6c53f918d6c52be6d8efcb22a16a9aba8bff727ab6d477e1981cabe76be70ce5e875a5de2db5298dd2f6654163c
+   metadata.gz: 2451d6844c7ece459e61c8e1499047f8efd6472a0d57317b7e2e1110527d843e8177c16ccbb1aeb0fd61e647fdd4291ebf73d4bfe008560eff7b963b1ac22ee6
+   data.tar.gz: 18ad0cfb8fc22aeb63d2e1b11333b1a5989c7bcc0f2b5fbebedb11acf3d3dc26e7235109e1501d0e4b9a06b5aa7e47b71bdb23de6d6fcd87c9fb53d2bf0be330
data/CHANGELOG.md CHANGED
@@ -1,3 +1,23 @@
+ ## 0.2.6 (2021-10-27)
+
+ - Added support for `nil` headers to `read_csv` and `parse_csv`
+ - Added `read_parquet`, `parse_parquet`, and `to_parquet` methods
+
+ ## 0.2.5 (2021-09-25)
+
+ - Fixed column types with joins
+
+ ## 0.2.4 (2021-06-03)
+
+ - Added grouping for `std` and `var`
+ - Fixed `==` for data frames
+ - Fixed error with `first` and `last` for data frames
+ - Fixed error with `last` when vector size is smaller than `n`
+
+ ## 0.2.3 (2021-02-08)
+
+ - Added `select`, `reject`, and `map!` methods to vectors
+
  ## 0.2.2 (2021-01-01)
 
  - Added line, pie, area, and bar charts
data/README.md CHANGED
@@ -20,7 +20,7 @@ gem 'rover-df'
 
  A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
 
- Try it out for forecasting by clicking the button below:
+ Try it out for forecasting by clicking the button below (it can take a few minutes to start):
 
  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)
 
@@ -61,6 +61,14 @@ Rover.read_csv("file.csv")
  Rover.parse_csv("CSV,data,string")
  ```
 
+ From Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem) [unreleased]
+
+ ```ruby
+ Rover.read_parquet("file.parquet")
+ # or
+ Rover.parse_parquet("PAR1...")
+ ```
+
  ## Attributes
 
  Get number of rows
@@ -89,7 +97,7 @@ Select a column
  df[:a]
  ```
 
- > Note that strings and symbols are different keys, just like hashes
+ > Note that strings and symbols are different keys, just like hashes. Creating a data frame from Active Record, a CSV, or Parquet uses strings.
 
  Select multiple columns
 
@@ -123,6 +131,20 @@ df[1..3]
  df[[1, 4, 5]]
  ```
 
+ Iterate over rows
+
+ ```ruby
+ df.each_row { |row| ... }
+ ```
+
+ Iterate over a column
+
+ ```ruby
+ df[:a].each { |item| ... }
+ # or
+ df[:a].each_with_index { |item, index| ... }
+ ```
+
  ## Filtering
 
  Filter on a condition
@@ -181,6 +203,8 @@ df[:a].median
  df[:a].percentile(90)
  df[:a].min
  df[:a].max
+ df[:a].std
+ df[:a].var
  ```
 
  Count occurrences
@@ -259,6 +283,14 @@ df[:a][0..2] = 1
  df[:a][0..2] = [1, 2, 3]
  ```
 
+ Update all elements
+
+ ```ruby
+ df[:a] = df[:a].map { |v| v.gsub("a", "b") }
+ # or
+ df[:a].map! { |v| v.gsub("a", "b") }
+ ```
+
  Update elements matching a condition
 
  ```ruby
@@ -369,6 +401,12 @@ CSV
  df.to_csv
  ```
 
+ Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem) [unreleased]
+
+ ```ruby
+ df.to_parquet
+ ```
+
  ## Types
 
  You can specify column types when creating a data frame
data/lib/rover/data_frame.rb CHANGED
@@ -163,7 +163,7 @@ module Rover
        last(n)
      end
 
-     def first(n = nil)
+     def first(n = 1)
        new_vectors = {}
        @vectors.each do |k, v|
          new_vectors[k] = v.first(n)
@@ -171,7 +171,7 @@ module Rover
        DataFrame.new(new_vectors)
      end
 
-     def last(n = nil)
+     def last(n = 1)
        new_vectors = {}
        @vectors.each do |k, v|
          new_vectors[k] = v.last(n)
@@ -235,6 +235,42 @@ module Rover
        end
      end
 
+     def to_parquet
+       require "parquet"
+
+       schema = {}
+       types.each do |name, type|
+         schema[name] =
+           case type
+           when :int
+             :int64
+           when :uint
+             :uint64
+           when :float
+             :double
+           when :float32
+             :float
+           when :object
+             if @vectors[name].all? { |v| v.is_a?(String) }
+               :string
+             else
+               raise "Unknown type"
+             end
+           else
+             type
+           end
+       end
+       # TODO improve performance
+       raw_records = []
+       size.times do |i|
+         raw_records << @vectors.map { |_, v| v[i] }
+       end
+       table = Arrow::Table.new(schema, raw_records)
+       buffer = Arrow::ResizableBuffer.new(1024)
+       table.save(buffer, format: :parquet)
+       buffer.data.to_s
+     end
+
      # for IRuby
      def to_html
        require "iruby"
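The type translation in `to_parquet` can be tried without Arrow or Numo installed. The sketch below copies only the `case` from the hunk above into a hypothetical helper, `arrow_type_for`, which is not part of the gem; `values` is a plain Array standing in for a column's contents.

```ruby
# Hypothetical standalone helper mirroring the `case` in to_parquet.
# It only computes the schema symbol; nothing is written to Parquet here.
def arrow_type_for(type, values)
  case type
  when :int then :int64
  when :uint then :uint64
  when :float then :double
  when :float32 then :float
  when :object
    # object columns are accepted only when every value is a String
    raise "Unknown type" unless values.all? { |v| v.is_a?(String) }
    :string
  else
    type # any other type symbol (for example :int32) is passed through
  end
end

schema = {
  "id" => arrow_type_for(:int, [1, 2]),
  "score" => arrow_type_for(:float32, [0.5, 1.5]),
  "name" => arrow_type_for(:object, ["a", "b"]),
  "small" => arrow_type_for(:int32, [1, 2])
}
# schema == { "id" => :int64, "score" => :float, "name" => :string, "small" => :int32 }
```

One consequence worth noting: an object column holding anything other than strings (mixed values, `nil`, dates) raises `"Unknown type"` rather than being coerced.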
@@ -301,7 +337,7 @@ module Rover
        Group.new(self, columns.flatten)
      end
 
-     [:max, :min, :median, :mean, :percentile, :sum].each do |name|
+     [:max, :min, :median, :mean, :percentile, :sum, :std, :var].each do |name|
        define_method(name) do |column, *args|
          check_column(column)
          self[column].send(name, *args)
@@ -360,7 +396,7 @@ module Rover
      def ==(other)
        size == other.size &&
        keys == other.keys &&
-       keys.all? { |k| self[k] == other[k] }
+       keys.all? { |k| self[k].to_numo == other[k].to_numo }
      end
 
      def plot(x = nil, y = nil, type: nil)
@@ -475,10 +511,12 @@ module Rover
 
        left = how == "left"
 
+       types = {}
        vectors = {}
        keys = (self.keys + other.keys).uniq
        keys.each do |k|
          vectors[k] = []
+         types[k] = join_type(self.types[k], other.types[k])
        end
 
        each_row do |r|
@@ -498,7 +536,7 @@ module Rover
          end
        end
 
-       DataFrame.new(vectors)
+       DataFrame.new(vectors, types: types)
      end
 
      def check_join_keys(df, keys)
@@ -523,6 +561,19 @@ module Rover
        end
      end
 
+     def join_type(a, b)
+       if a.nil?
+         b
+       elsif b.nil?
+         a
+       elsif a == b
+         a
+       else
+         # TODO specify
+         nil
+       end
+     end
+
      def to_vector(v, size: nil, type: nil)
        if v.is_a?(Vector)
          v = v.to(type) if type && v.type != type
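The `join_type` rule is small enough to exercise on plain hashes. The sketch below restates it as a free-standing method and merges two hypothetical type hashes the way the join hunk builds `types`; the column names and type symbols are made up for illustration.

```ruby
# Same rule as join_type in the diff, as a free-standing method.
def join_type(a, b)
  if a.nil?
    b
  elsif b.nil?
    a
  elsif a == b
    a
  else
    nil # conflicting types: no explicit type is chosen
  end
end

# Hypothetical type hashes for the two sides of a join.
left_types = { "id" => :int, "name" => :object }
right_types = { "id" => :int, "score" => :float }

keys = (left_types.keys + right_types.keys).uniq
merged = keys.to_h { |k| [k, join_type(left_types[k], right_types[k])] }
# merged == { "id" => :int, "name" => :object, "score" => :float }
```

When the two sides disagree on a shared column, for example `join_type(:int, :float)`, the result is `nil`. The diff marks that branch with a TODO, so how the resulting column type is then decided is not specified here.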
data/lib/rover/group.rb CHANGED
@@ -9,7 +9,7 @@ module Rover
        Group.new(@df, @columns + columns.flatten)
      end
 
-     [:count, :max, :min, :mean, :median, :percentile, :sum].each do |name|
+     [:count, :max, :min, :mean, :median, :percentile, :sum, :std, :var].each do |name|
        define_method(name) do |*args|
          n = [name, args.first].compact.join("_")
 
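The hunk is cut off right after `n` is computed, so how the label is used is not visible here. The expression itself can still be checked in isolation. In the sketch below, `aggregate_label` is a hypothetical wrapper around that one line.

```ruby
# Hypothetical wrapper around the expression that computes `n` in the hunk.
def aggregate_label(name, *args)
  [name, args.first].compact.join("_")
end

std_label = aggregate_label(:std)                   # "std"
percentile_label = aggregate_label(:percentile, 90) # "percentile_90"
```

Since `std` and `var` are added to the same list, they pass through the same labelling line as the existing aggregates.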
data/lib/rover/vector.rb CHANGED
@@ -166,6 +166,19 @@ module Rover
        Vector.new(@data.to_a.map(&block))
      end
 
+     def map!(&block)
+       @data = cast_data(@data.to_a.map(&block))
+       self
+     end
+
+     def select(&block)
+       Vector.new(@data.to_a.select(&block))
+     end
+
+     def reject(&block)
+       Vector.new(@data.to_a.reject(&block))
+     end
+
      def tally
        result = Hash.new(0)
        @data.each do |v|
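The three new vector methods differ in one respect: `map!` mutates and returns the receiver, while `select` and `reject` build a new vector. A minimal sketch of that contract follows, using a hypothetical `TinyVector` backed by a plain Array. It skips the Numo storage and the `cast_data` step that the real method applies.

```ruby
# TinyVector is a hypothetical stand-in for Rover::Vector that stores a
# plain Array, so no type casting takes place here.
class TinyVector
  def initialize(data)
    @data = data.to_a
  end

  def to_a
    @data.dup
  end

  # replaces the contents in place and returns the receiver
  def map!(&block)
    @data = @data.map(&block)
    self
  end

  # select and reject return a new vector and leave the receiver untouched
  def select(&block)
    TinyVector.new(@data.select(&block))
  end

  def reject(&block)
    TinyVector.new(@data.reject(&block))
  end
end

v = TinyVector.new([1, 2, 3, 4])
evens = v.select(&:even?)        # new vector holding [2, 4]
odds = v.reject(&:even?)         # new vector holding [1, 3]
same = v.map! { |x| x * 10 }     # v itself now holds [10, 20, 30, 40]
```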
@@ -250,7 +263,11 @@ module Rover
      end
 
      def last(n = 1)
-       Vector.new(@data[-n..-1])
+       if n >= size
+         Vector.new(@data)
+       else
+         Vector.new(@data[-n..-1])
+       end
      end
 
      def take(n)
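The reason `-n..-1` needs a guard can be seen with a plain Ruby Array. This is only an analogy: the gem indexes a Numo array, which may fail differently (the changelog describes the old behavior as an error). `last_n` is a hypothetical helper that follows the shape of the fix.

```ruby
data = [1, 2, 3]

# Old expression: the negative start falls before the first element,
# so a plain Array returns nil instead of the available elements.
old_result = data[-5..-1]

# Guarded version, following the shape of the fix in the hunk above.
def last_n(data, n)
  n >= data.size ? data : data[-n..-1]
end

all_rows = last_n(data, 5) # [1, 2, 3]
tail = last_n(data, 2)     # [2, 3]
```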
@@ -306,7 +323,12 @@ module Rover
      # for IRuby
      def to_html
        require "iruby"
-       IRuby::HTML.table(to_a)
+       if size > 7
+         # pass 8 rows so maxrows is applied
+         IRuby::HTML.table(first(4).to_a + last(4).to_a, maxrows: 7)
+       else
+         IRuby::HTML.table(to_a)
+       end
      end
 
      private
data/lib/rover/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Rover
-   VERSION = "0.2.2"
+   VERSION = "0.2.6"
  end
data/lib/rover.rb CHANGED
@@ -19,6 +19,16 @@ module Rover
        csv_to_df(CSV.parse(str, **csv_options(options)), types: types, headers: options[:headers])
      end
 
+     def read_parquet(path)
+       require "parquet"
+       parquet_to_df(Arrow::Table.load(path))
+     end
+
+     def parse_parquet(str)
+       require "parquet"
+       parquet_to_df(Arrow::Table.load(Arrow::Buffer.new(str), format: :parquet))
+     end
+
      private
 
      # TODO use date converter
@@ -35,10 +45,49 @@ module Rover
 
        table.by_col!
        data = {}
+       keys = table.map { |k, _| [k, true] }.to_h
+       unnamed_suffix = 1
        table.each do |k, v|
+         # TODO do same for empty string in 0.3.0
+         if k.nil?
+           k = "unnamed"
+           while keys.include?(k)
+             unnamed_suffix += 1
+             k = "unnamed#{unnamed_suffix}"
+           end
+           keys[k] = true
+         end
          data[k] = v
        end
+
        DataFrame.new(data, types: types)
      end
+
+     PARQUET_TYPE_MAPPING = {
+       "float" => Numo::SFloat,
+       "double" => Numo::DFloat,
+       "int8" => Numo::Int8,
+       "int16" => Numo::Int16,
+       "int32" => Numo::Int32,
+       "int64" => Numo::Int64,
+       "string" => Numo::RObject,
+       "uint8" => Numo::UInt8,
+       "uint16" => Numo::UInt16,
+       "uint32" => Numo::UInt32,
+       "uint64" => Numo::UInt64
+     }
+
+     def parquet_to_df(table)
+       data = {}
+       table.each_column do |column|
+         k = column.field.name
+         type = column.field.data_type.to_s
+         numo_type = PARQUET_TYPE_MAPPING[type]
+         raise "Unknown type: #{type}" unless numo_type
+         # TODO improve performance
+         data[k] = numo_type.cast(column.data.values)
+       end
+       DataFrame.new(data)
+     end
    end
  end
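The naming of `nil` CSV headers is easiest to follow on a bare header row. The sketch below lifts the `k.nil?` branch from the hunk above into a hypothetical helper, `fill_nil_headers`, that takes an Array of headers instead of a `CSV::Table`.

```ruby
# Hypothetical helper: takes the header row as an Array and returns the
# column names the loop in csv_to_df would assign.
def fill_nil_headers(headers)
  keys = headers.map { |k| [k, true] }.to_h
  unnamed_suffix = 1
  headers.map do |k|
    if k.nil?
      k = "unnamed"
      # bump the shared suffix until the candidate name is unused
      while keys.include?(k)
        unnamed_suffix += 1
        k = "unnamed#{unnamed_suffix}"
      end
      keys[k] = true
    end
    k
  end
end

two_blank = fill_nil_headers(["a", nil, nil])  # ["a", "unnamed", "unnamed2"]
collision = fill_nil_headers(["unnamed", nil]) # ["unnamed", "unnamed2"]
```

The suffix counter is shared across the whole row and starts at 1, so the first generated name has no number and the next one is `unnamed2`. Existing headers that already use one of these names are skipped over rather than overwritten.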
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: rover-df
  version: !ruby/object:Gem::Version
-   version: 0.2.2
+   version: 0.2.6
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-01-02 00:00:00.000000000 Z
+ date: 2021-10-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: numo-narray
@@ -25,7 +25,7 @@ dependencies:
        - !ruby/object:Gem::Version
          version: 0.9.1.9
  description:
- email: andrew@chartkick.com
+ email: andrew@ankane.org
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -58,7 +58,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.2.3
+ rubygems_version: 3.2.22
  signing_key:
  specification_version: 4
  summary: Simple, powerful data frames for Ruby