rover-df 0.1.0 → 0.2.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4588d0b3b5633a3821a4c07e7102e5933edca92179836db041f2400d8be88538
4
- data.tar.gz: 9b01cd2bae5fb6ba9f426fe0d347752cd30c63619b00284fb68e8f711ec38ddf
3
+ metadata.gz: b8ac8c0dda5ee8ea5482b5d52927446e52a60151c05959324970b6b420c6b825
4
+ data.tar.gz: cbabf42c40195303fa62a85b40c3d516dff7cb56a4059c2ab6867921fae62bb9
5
5
  SHA512:
6
- metadata.gz: b2d35866786a7fbe17b274585419c752b08c817b2db1bf939a6c3f92a7ae2cd282d725614f96db730fd2590cbb8c24710d0fb1f713255d2c348c0fed0b874a35
7
- data.tar.gz: 4bf0ba38ce2c3ef4765d702591948af18fddf142efb7e559e26cc4ab504538775a1771c839f1570230f7d101fa20bfbbeb5044f6bf567637790575ee9b95be87
6
+ metadata.gz: 2b906f49a0accbbf4682216808faf3113c3f31c24e9e434a03f996d8e8e9b4db1c8ca0ccfb3f604e798261f97d88b26a5376bace349f230b5eda5949b492fb88
7
+ data.tar.gz: 8f3d590c6df3d588f92c6c84b327211a3dce6b27452b4a1161492ca90dc87cfd6aad02a3c7ef038a9c6cb69155558f2a332acbfd65a9bb4ba1d220333b051872
data/CHANGELOG.md CHANGED
@@ -1,3 +1,35 @@
1
+ ## 0.2.3 (2021-02-08)
2
+
3
+ - Added `select`, `reject`, and `map!` methods to vectors
4
+
5
+ ## 0.2.2 (2021-01-01)
6
+
7
+ - Added line, pie, area, and bar charts
8
+ - Added `|` and `^` for vectors
9
+ - Fixed typecasting with `map`
10
+
11
+ ## 0.2.1 (2020-11-23)
12
+
13
+ - Added `plot` method to data frames
14
+ - Improved error message when too few headers
15
+
16
+ ## 0.2.0 (2020-08-17)
17
+
18
+ - Added `numeric?` and `zip` methods to vectors
19
+ - Changed group calculations to return a data frame instead of a hash
20
+ - Changed `each_row` to return enumerator
21
+ - Improved inspect
22
+ - Fixed `any?`, `all?`, and `uniq` for boolean vectors
23
+
24
+ ## 0.1.1 (2020-06-10)
25
+
26
+ - Added methods and options for types
27
+ - Added grouping
28
+ - Added one-hot encoding
29
+ - Added `sample` to data frames
30
+ - Added `tally`, `var`, `std`, `take`, `count`, and `length` to vectors
31
+ - Improved error message for `read_csv` with no headers
32
+
1
33
  ## 0.1.0 (2020-05-13)
2
34
 
3
35
  - First release
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2020 Andrew Kane
1
+ Copyright (c) 2020-2021 Andrew Kane
2
2
 
3
3
  MIT License
4
4
 
data/README.md CHANGED
@@ -2,7 +2,11 @@
2
2
 
3
3
  Simple, powerful data frames for Ruby
4
4
 
5
- :mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray) for blazing performance
5
+ :mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)
6
+
7
+ :evergreen_tree: Uses [Vega](https://github.com/ankane/vega) for visualization
8
+
9
+ [![Build Status](https://github.com/ankane/rover/workflows/build/badge.svg?branch=master)](https://github.com/ankane/rover/actions)
6
10
 
7
11
  ## Installation
8
12
 
@@ -16,12 +20,22 @@ gem 'rover-df'
16
20
 
17
21
  A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
18
22
 
23
+ Try it out for forecasting by clicking the button below:
24
+
25
+ [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)
26
+
27
+ Use the `Run` button (or `SHIFT` + `ENTER`) to run each line.
28
+
19
29
  ## Creating Data Frames
20
30
 
21
31
  From an array
22
32
 
23
33
  ```ruby
24
- Rover::DataFrame.new([{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}])
34
+ Rover::DataFrame.new([
35
+ {a: 1, b: "one"},
36
+ {a: 2, b: "two"},
37
+ {a: 3, b: "three"}
38
+ ])
25
39
  ```
26
40
 
27
41
  From a hash
@@ -33,7 +47,7 @@ Rover::DataFrame.new({
33
47
  })
34
48
  ```
35
49
 
36
- From an Active Record relation
50
+ From Active Record
37
51
 
38
52
  ```ruby
39
53
  Rover::DataFrame.new(User.all)
@@ -75,6 +89,8 @@ Select a column
75
89
  df[:a]
76
90
  ```
77
91
 
92
+ > Note that strings and symbols are different keys, just like hashes
93
+
78
94
  Select multiple columns
79
95
 
80
96
  ```ruby
@@ -112,25 +128,34 @@ df[[1, 4, 5]]
112
128
  Filter on a condition
113
129
 
114
130
  ```ruby
131
+ df[df[:a] == 100]
132
+ df[df[:a] != 100]
115
133
  df[df[:a] > 100]
134
+ df[df[:a] >= 100]
135
+ df[df[:a] < 100]
136
+ df[df[:a] <= 100]
116
137
  ```
117
138
 
118
- And
139
+ In
119
140
 
120
141
  ```ruby
121
- df[df[:a] > 100 & df[:b] == "one"]
142
+ df[df[:a].in?([1, 2, 3])]
143
+ df[df[:a].in?(1..3)]
144
+ df[df[:a].in?(["a", "b", "c"])]
122
145
  ```
123
146
 
124
- Or
147
+ Not in
125
148
 
126
149
  ```ruby
127
- df[df[:a] > 100 | df[:b] == "one"]
150
+ df[!df[:a].in?([1, 2, 3])]
128
151
  ```
129
152
 
130
- Not
153
+ And, or, and exclusive or
131
154
 
132
155
  ```ruby
133
- df[df[:a] != 100]
156
+ df[(df[:a] > 100) & (df[:b] == "one")] # and
157
+ df[(df[:a] > 100) | (df[:b] == "one")] # or
158
+ df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
134
159
  ```
135
160
 
136
161
  ## Operations
@@ -158,13 +183,59 @@ df[:a].min
158
183
  df[:a].max
159
184
  ```
160
185
 
186
+ Count occurrences
187
+
188
+ ```ruby
189
+ df[:a].tally
190
+ ```
191
+
161
192
  Cross tabulation
162
193
 
163
194
  ```ruby
164
195
  df[:a].crosstab(df[:b])
165
196
  ```
166
197
 
167
- ## Updates
198
+ ## Grouping
199
+
200
+ Group
201
+
202
+ ```ruby
203
+ df.group(:a).count
204
+ ```
205
+
206
+ Works with all summary statistics
207
+
208
+ ```ruby
209
+ df.group(:a).max(:b)
210
+ ```
211
+
212
+ Multiple groups
213
+
214
+ ```ruby
215
+ df.group([:a, :b]).count
216
+ ```
217
+
218
+ ## Visualization
219
+
220
+ Add [Vega](https://github.com/ankane/vega) to your application’s Gemfile:
221
+
222
+ ```ruby
223
+ gem 'vega'
224
+ ```
225
+
226
+ And use:
227
+
228
+ ```ruby
229
+ df.plot(:a, :b)
230
+ ```
231
+
232
+ Specify the chart type (`line`, `pie`, `column`, `bar`, `area`, or `scatter`)
233
+
234
+ ```ruby
235
+ df.plot(:a, :b, type: "pie")
236
+ ```
237
+
238
+ ## Updating Data
168
239
 
169
240
  Add a new column
170
241
 
@@ -214,7 +285,7 @@ Rename a column
214
285
  df[:new_a] = df.delete(:a)
215
286
  ```
216
287
 
217
- Sort data
288
+ Sort rows
218
289
 
219
290
  ```ruby
220
291
  df.sort_by! { |r| r[:a] }
@@ -258,6 +329,20 @@ Left join
258
329
  df.left_join(other_df)
259
330
  ```
260
331
 
332
+ ## Encoding
333
+
334
+ One-hot encoding
335
+
336
+ ```ruby
337
+ df.one_hot
338
+ ```
339
+
340
+ Drop a variable in each category to avoid the dummy variable trap
341
+
342
+ ```ruby
343
+ df.one_hot(drop: true)
344
+ ```
345
+
261
346
  ## Conversion
262
347
 
263
348
  Array of hashes
@@ -284,6 +369,46 @@ CSV
284
369
  df.to_csv
285
370
  ```
286
371
 
372
+ ## Types
373
+
374
+ You can specify column types when creating a data frame
375
+
376
+ ```ruby
377
+ Rover::DataFrame.new(data, types: {"a" => :int, "b" => :float})
378
+ ```
379
+
380
+ Or
381
+
382
+ ```ruby
383
+ Rover.read_csv("data.csv", types: {"a" => :int, "b" => :float})
384
+ ```
385
+
386
+ Supported types are:
387
+
388
+ - boolean - `bool`
389
+ - float - `float`, `float32`
390
+ - integer - `int`, `int32`, `int16`, `int8`
391
+ - unsigned integer - `uint`, `uint32`, `uint16`, `uint8`
392
+ - object - `object`
393
+
394
+ Get column types
395
+
396
+ ```ruby
397
+ df.types
398
+ ```
399
+
400
+ For a specific column
401
+
402
+ ```ruby
403
+ df[:a].type
404
+ ```
405
+
406
+ Change the type of a column
407
+
408
+ ```ruby
409
+ df[:a] = df[:a].to(:int)
410
+ ```
411
+
287
412
  ## History
288
413
 
289
414
  View the [changelog](https://github.com/ankane/rover/blob/master/CHANGELOG.md)
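Note on the README filter hunk above: the old `And`/`Or` examples were unparenthesized and the new ones wrap each comparison in parentheses. A minimal sketch of why that matters, using plain integers instead of Rover vectors (an assumption, so the snippet runs without the gem): in Ruby, `&` binds tighter than `>` and `==`, so the old form groups differently from what the heading suggests.

```ruby
# Plain integers stand in for vector elements; this is not Rover code.
a = 5
b = 1

# Without parentheses Ruby parses this as (a > (3 & b)) == 1.
unparenthesized = a > 3 & b == 1

# With parentheses each comparison is evaluated first, then combined.
parenthesized = (a > 3) & (b == 1)

puts unparenthesized
puts parenthesized
```

The same grouping rule applies to `|` and `^`, which is presumably why all three README examples are now parenthesized.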
data/lib/rover.rb CHANGED
@@ -3,30 +3,42 @@ require "numo/narray"
3
3
 
4
4
  # modules
5
5
  require "rover/data_frame"
6
+ require "rover/group"
6
7
  require "rover/vector"
7
8
  require "rover/version"
8
9
 
9
10
  module Rover
10
11
  class << self
11
- def read_csv(path, **options)
12
+ def read_csv(path, types: nil, **options)
12
13
  require "csv"
13
- csv_to_df(CSV.read(path, headers: true, converters: :numeric, **options))
14
+ csv_to_df(CSV.read(path, **csv_options(options)), types: types, headers: options[:headers])
14
15
  end
15
16
 
16
- def parse_csv(str, **options)
17
+ def parse_csv(str, types: nil, **options)
17
18
  require "csv"
18
- csv_to_df(CSV.parse(str, headers: true, converters: :numeric, **options))
19
+ csv_to_df(CSV.parse(str, **csv_options(options)), types: types, headers: options[:headers])
19
20
  end
20
21
 
21
22
  private
22
23
 
23
- def csv_to_df(table)
24
+ # TODO use date converter
25
+ def csv_options(options)
26
+ options = {headers: true, converters: :numeric}.merge(options)
27
+ raise ArgumentError, "Must specify headers" unless options[:headers]
28
+ options
29
+ end
30
+
31
+ def csv_to_df(table, types: nil, headers: nil)
32
+ if headers && headers.size < table.headers.size
33
+ raise ArgumentError, "Expected #{table.headers.size} headers, got #{headers.size}"
34
+ end
35
+
24
36
  table.by_col!
25
37
  data = {}
26
38
  table.each do |k, v|
27
39
  data[k] = v
28
40
  end
29
- DataFrame.new(data)
41
+ DataFrame.new(data, types: types)
30
42
  end
31
43
  end
32
44
  end
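Note on the `rover.rb` hunk above: `csv_options` merges the caller's options over the defaults and rejects a falsy `headers` value. The sketch below copies that helper into a standalone script and feeds a small in-memory CSV through the same `by_col!` loop; the sample string and the top-level method placement are assumptions for illustration, and no `DataFrame` is built.

```ruby
require "csv"

# Standalone copy of the option-merging step from the diff.
def csv_options(options)
  options = {headers: true, converters: :numeric}.merge(options)
  raise ArgumentError, "Must specify headers" unless options[:headers]
  options
end

# Hypothetical sample input.
table = CSV.parse("a,b\n1,one\n2,two\n", **csv_options({}))

# Same column-wise walk as csv_to_df: each pair is [header, values].
table.by_col!
data = {}
table.each do |k, v|
  data[k] = v
end

# Passing headers: false is rejected before CSV is ever called.
rejected =
  begin
    csv_options(headers: false)
    false
  rescue ArgumentError
    true
  end
```

Because the defaults are merged first, a caller can still override `converters` or supply an explicit array of header names.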
data/lib/rover/data_frame.rb CHANGED
@@ -1,7 +1,10 @@
1
1
  module Rover
2
2
  class DataFrame
3
- def initialize(data = {})
3
+ def initialize(*args)
4
+ data, options = process_args(args)
5
+
4
6
  @vectors = {}
7
+ types = options[:types] || {}
5
8
 
6
9
  if data.is_a?(DataFrame)
7
10
  data.vectors.each do |k, v|
@@ -11,7 +14,7 @@ module Rover
11
14
  data.to_h.each do |k, v|
12
15
  @vectors[k] =
13
16
  if v.respond_to?(:to_a)
14
- Vector.new(v)
17
+ Vector.new(v, type: types[k])
15
18
  else
16
19
  v
17
20
  end
@@ -20,7 +23,7 @@ module Rover
20
23
  # handle scalars
21
24
  size = @vectors.values.find { |v| v.is_a?(Vector) }&.size || 1
22
25
  @vectors.each_key do |k|
23
- @vectors[k] = to_vector(@vectors[k], size)
26
+ @vectors[k] = to_vector(@vectors[k], size: size, type: types[k])
24
27
  end
25
28
  elsif data.is_a?(Array)
26
29
  vectors = {}
@@ -35,12 +38,12 @@ module Rover
35
38
  end
36
39
  end
37
40
  vectors.each do |k, v|
38
- @vectors[k] = to_vector(v)
41
+ @vectors[k] = to_vector(v, type: types[k])
39
42
  end
40
43
  elsif defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || (data.is_a?(Class) && data < ActiveRecord::Base))
41
44
  result = data.connection.select_all(data.all.to_sql)
42
45
  result.columns.each_with_index do |k, i|
43
- @vectors[k] = to_vector(result.rows.map { |r| r[i] })
46
+ @vectors[k] = to_vector(result.rows.map { |r| r[i] }, type: types[k])
44
47
  end
45
48
  else
46
49
  raise ArgumentError, "Cannot cast to data frame: #{data.class.name}"
@@ -69,6 +72,7 @@ module Rover
69
72
  # multiple columns
70
73
  df = DataFrame.new
71
74
  where.each do |k|
75
+ check_column(k, true)
72
76
  df[k] = @vectors[k]
73
77
  end
74
78
  df
@@ -78,8 +82,9 @@ module Rover
78
82
  end
79
83
  end
80
84
 
81
- # return each row as a hash
82
85
  def each_row
86
+ return enum_for(:each_row) unless block_given?
87
+
83
88
  size.times do |i|
84
89
  yield @vectors.map { |k, v| [k, v[i]] }.to_h
85
90
  end
@@ -90,9 +95,13 @@ module Rover
90
95
  @vectors.dup
91
96
  end
92
97
 
98
+ def types
99
+ @vectors.map { |k, v| [k, v.type] }.to_h
100
+ end
101
+
93
102
  def []=(k, v)
94
103
  check_key(k)
95
- v = to_vector(v, size)
104
+ v = to_vector(v, size: size)
96
105
  raise ArgumentError, "Size mismatch: expected #{size}, got #{v.size}" if @vectors.any? && v.size != size
97
106
  @vectors[k] = v
98
107
  end
@@ -170,6 +179,12 @@ module Rover
170
179
  DataFrame.new(new_vectors)
171
180
  end
172
181
 
182
+ def sample(*args, **kwargs)
183
+ # TODO make more efficient
184
+ indexes = (0...size).to_a.sample(*args, **kwargs)
185
+ self[indexes]
186
+ end
187
+
173
188
  def to_a
174
189
  a = []
175
190
  each_row do |row|
@@ -190,6 +205,25 @@ module Rover
190
205
  Numo::NArray.column_stack(vectors.values.map(&:to_numo))
191
206
  end
192
207
 
208
+ # TODO raise error when collision
209
+ def one_hot(drop: false)
210
+ df = DataFrame.new
211
+ vectors.each do |k, v|
212
+ if v.to_numo.is_a?(Numo::RObject)
213
+ df.merge!(v.one_hot(drop: drop, prefix: "#{k}_"))
214
+ else
215
+ df[k] = v
216
+ end
217
+ end
218
+ df
219
+ rescue ArgumentError => e
220
+ if e.message == "All elements must be strings"
221
+ # better error message
222
+ raise ArgumentError, "All elements must be numeric or strings"
223
+ end
224
+ raise e
225
+ end
226
+
193
227
  def to_csv
194
228
  require "csv"
195
229
  CSV.generate do |csv|
@@ -204,7 +238,12 @@ module Rover
204
238
  # for IRuby
205
239
  def to_html
206
240
  require "iruby"
207
- IRuby::HTML.table(to_h)
241
+ if size > 7
242
+ # pass 8 rows so maxrows is applied
243
+ IRuby::HTML.table((self[0..4] + self[-4..-1]).to_h, maxrows: 7)
244
+ else
245
+ IRuby::HTML.table(to_h)
246
+ end
208
247
  end
209
248
 
210
249
  # TODO handle long text better
@@ -215,18 +254,19 @@ module Rover
215
254
  line_start = 0
216
255
  spaces = 2
217
256
 
257
+ summarize = size >= 30
258
+
218
259
  @vectors.each do |k, v|
219
- v = v.first(5).to_a
260
+ v = summarize ? v.first(5).to_a + ["..."] + v.last(5).to_a : v.to_a
220
261
  width = ([k] + v).map(&:to_s).map(&:size).max
221
262
  width = 3 if width < 3
222
263
 
223
264
  if lines.empty? || lines[-2].map { |l| l.size + spaces }.sum + width > 120
224
265
  line_start = lines.size
225
266
  lines << []
226
- [size, 5].min.times do |i|
267
+ v.size.times do |i|
227
268
  lines << []
228
269
  end
229
- lines << [] if size > 5
230
270
  lines << []
231
271
  end
232
272
 
@@ -234,7 +274,6 @@ module Rover
234
274
  v.each_with_index do |v2, i|
235
275
  lines[line_start + 1 + i] << "%#{width}s" % v2.to_s
236
276
  end
237
- lines[line_start + 6] << "%#{width}s" % "..." if size > 5
238
277
  end
239
278
 
240
279
  lines.pop
@@ -258,6 +297,17 @@ module Rover
258
297
  dup.sort_by!(&block)
259
298
  end
260
299
 
300
+ def group(*columns)
301
+ Group.new(self, columns.flatten)
302
+ end
303
+
304
+ [:max, :min, :median, :mean, :percentile, :sum].each do |name|
305
+ define_method(name) do |column, *args|
306
+ check_column(column)
307
+ self[column].send(name, *args)
308
+ end
309
+ end
310
+
261
311
  def dup
262
312
  df = DataFrame.new
263
313
  @vectors.each do |k, v|
@@ -313,6 +363,88 @@ module Rover
313
363
  keys.all? { |k| self[k] == other[k] }
314
364
  end
315
365
 
366
+ def plot(x = nil, y = nil, type: nil)
367
+ require "vega"
368
+
369
+ raise ArgumentError, "Must specify columns" if keys.size != 2 && (!x || !y)
370
+ x ||= keys[0]
371
+ y ||= keys[1]
372
+ type ||= begin
373
+ if self[x].numeric? && self[y].numeric?
374
+ "scatter"
375
+ elsif types[x] == :object && self[y].numeric?
376
+ "column"
377
+ else
378
+ raise "Cannot determine type. Use the type option."
379
+ end
380
+ end
381
+ data = self[[x, y]]
382
+
383
+ case type
384
+ when "line", "area"
385
+ x_type =
386
+ if data[x].numeric?
387
+ "quantitative"
388
+ elsif data[x].all? { |v| v.is_a?(Date) || v.is_a?(Time) }
389
+ "temporal"
390
+ else
391
+ "nominal"
392
+ end
393
+
394
+ scale = x_type == "temporal" ? {type: "utc"} : {}
395
+
396
+ Vega.lite
397
+ .data(data)
398
+ .mark(type: type, tooltip: true, interpolate: "cardinal", point: {size: 60})
399
+ .encoding(
400
+ x: {field: x, type: x_type, scale: scale},
401
+ y: {field: y, type: "quantitative"}
402
+ )
403
+ .config(axis: {labelFontSize: 12})
404
+ when "pie"
405
+ Vega.lite
406
+ .data(data)
407
+ .mark(type: "arc", tooltip: true)
408
+ .encoding(
409
+ color: {field: x, type: "nominal", sort: "none", axis: {title: nil}, legend: {labelFontSize: 12}},
410
+ theta: {field: y, type: "quantitative"}
411
+ )
412
+ .view(stroke: nil)
413
+ when "column"
414
+ Vega.lite
415
+ .data(data)
416
+ .mark(type: "bar", tooltip: true)
417
+ .encoding(
418
+ # TODO determine label angle
419
+ x: {field: x, type: "nominal", sort: "none", axis: {labelAngle: 0}},
420
+ y: {field: y, type: "quantitative"}
421
+ )
422
+ .config(axis: {labelFontSize: 12})
423
+ when "bar"
424
+ Vega.lite
425
+ .data(data)
426
+ .mark(type: "bar", tooltip: true)
427
+ .encoding(
428
+ # TODO determine label angle
429
+ y: {field: x, type: "nominal", sort: "none", axis: {labelAngle: 0}},
430
+ x: {field: y, type: "quantitative"}
431
+ )
432
+ .config(axis: {labelFontSize: 12})
433
+ when "scatter"
434
+ Vega.lite
435
+ .data(data)
436
+ .mark(type: "circle", tooltip: true)
437
+ .encoding(
438
+ x: {field: x, type: "quantitative", scale: {zero: false}},
439
+ y: {field: y, type: "quantitative", scale: {zero: false}},
440
+ size: {value: 60}
441
+ )
442
+ .config(axis: {labelFontSize: 12})
443
+ else
444
+ raise ArgumentError, "Invalid type: #{type}"
445
+ end
446
+ end
447
+
316
448
  private
317
449
 
318
450
  def check_key(key)
@@ -375,8 +507,27 @@ module Rover
375
507
  raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
376
508
  end
377
509
 
378
- def to_vector(v, size = nil)
379
- return v if v.is_a?(Vector)
510
+ # TODO in 0.3.0
511
+ # always use did_you_mean
512
+ def check_column(key, did_you_mean = false)
513
+ unless include?(key)
514
+ if did_you_mean
515
+ if RUBY_VERSION.to_f >= 2.6
516
+ raise KeyError.new("Missing column: #{key}", receiver: self, key: key)
517
+ else
518
+ raise KeyError.new("Missing column: #{key}")
519
+ end
520
+ else
521
+ raise ArgumentError, "Missing column: #{key}"
522
+ end
523
+ end
524
+ end
525
+
526
+ def to_vector(v, size: nil, type: nil)
527
+ if v.is_a?(Vector)
528
+ v = v.to(type) if type && v.type != type
529
+ return v
530
+ end
380
531
 
381
532
  if size && !v.respond_to?(:to_a)
382
533
  v =
@@ -392,7 +543,31 @@ module Rover
392
543
  end
393
544
  end
394
545
 
395
- Vector.new(v)
546
+ Vector.new(v, type: type)
547
+ end
548
+
549
+ # can't use data = {} and keyword arguments
550
+ # as this causes an unknown keyword error when data is passed as
551
+ # DataFrame.new({a: ..., b: ...})
552
+ #
553
+ # at the moment, there doesn't appear to be a way to distinguish between
554
+ # DataFrame.new({types: ...}) which should set data, and
555
+ # DataFrame.new(types: ...) which should set options
556
+ # https://bugs.ruby-lang.org/issues/16891
557
+ #
558
+ # there aren't currently options that should be used without data
559
+ # if this is ever the case, we should still require data
560
+ # to prevent new options from breaking existing code
561
+ def process_args(args)
562
+ data = args[0] || {}
563
+ options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
564
+ raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)" if args.size > 1
565
+
566
+ known_keywords = [:types]
567
+ unknown_keywords = options.keys - known_keywords
568
+ raise ArgumentError, "unknown keywords: #{unknown_keywords.join(", ")}" if unknown_keywords.any?
569
+
570
+ [data, options]
396
571
  end
397
572
  end
398
573
  end
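Note on `process_args` above: the comment explains that a trailing hash is treated as options only when a data argument precedes it. A minimal sketch of that behaviour, with the method copied out of the class and a hypothetical `new_like` wrapper standing in for `DataFrame#initialize(*args)`:

```ruby
# Standalone copy of the argument handling from the diff, for illustration.
def process_args(args)
  data = args[0] || {}
  options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
  raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)" if args.size > 1

  unknown_keywords = options.keys - [:types]
  raise ArgumentError, "unknown keywords: #{unknown_keywords.join(", ")}" if unknown_keywords.any?

  [data, options]
end

# Hypothetical stand-in for DataFrame#initialize(*args).
def new_like(*args)
  process_args(args)
end

with_options = new_like({a: [1, 2]}, types: {a: :float})
data_only = new_like({a: [1, 2]})

# A lone hash is always treated as data, even if its only key is :types.
ambiguous = new_like(types: {a: :int})
```

The last call shows the trade-off the comment describes: with a single argument there is no way to tell data from options, so the hash is kept as data.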
data/lib/rover/group.rb ADDED
@@ -0,0 +1,50 @@
1
+ module Rover
2
+ class Group
3
+ def initialize(df, columns)
4
+ @df = df
5
+ @columns = columns
6
+ end
7
+
8
+ def group(*columns)
9
+ Group.new(@df, @columns + columns.flatten)
10
+ end
11
+
12
+ [:count, :max, :min, :mean, :median, :percentile, :sum].each do |name|
13
+ define_method(name) do |*args|
14
+ n = [name, args.first].compact.join("_")
15
+
16
+ rows = []
17
+ grouped_dfs.each do |k, df|
18
+ rows << k.merge(n => df.send(name, *args))
19
+ end
20
+
21
+ DataFrame.new(rows)
22
+ end
23
+ end
24
+
25
+ private
26
+
27
+ # TODO make more efficient
28
+ def grouped_dfs
29
+ # cache here so we can reuse for multiple calculations if needed
30
+ @grouped_dfs ||= begin
31
+ raise ArgumentError, "No columns given" if @columns.empty?
32
+ missing_keys = @columns - @df.keys
33
+ raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
34
+
35
+ groups = Hash.new { |hash, key| hash[key] = [] }
36
+ i = 0
37
+ @df.each_row do |row|
38
+ groups[row.slice(*@columns)] << i
39
+ i += 1
40
+ end
41
+
42
+ result = {}
43
+ groups.keys.each do |k|
44
+ result[k] = @df[groups[k]]
45
+ end
46
+ result
47
+ end
48
+ end
49
+ end
50
+ end
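Note on `Group` above: rows are bucketed by the sliced key columns, and each aggregate is then merged into its key hash under a name such as `count` or `max_b`. A minimal sketch of that approach on plain hashes; the sample rows and the `group_indexes` helper are assumptions, and the real class works on data frames rather than arrays.

```ruby
# Hypothetical sample rows standing in for DataFrame#each_row output.
rows = [
  {a: 1, b: "x", c: 10},
  {a: 1, b: "y", c: 20},
  {a: 2, b: "x", c: 30}
]

# Bucket row indexes by the values of the grouping columns,
# mirroring the row.slice(*@columns) step in grouped_dfs.
def group_indexes(rows, columns)
  groups = Hash.new { |hash, key| hash[key] = [] }
  rows.each_with_index { |row, i| groups[row.slice(*columns)] << i }
  groups
end

groups = group_indexes(rows, [:a])

# One output row per group: the key columns plus the aggregate.
counts = groups.map { |key, indexes| key.merge("count" => indexes.size) }
max_c = groups.map { |key, indexes| key.merge("max_c" => indexes.map { |i| rows[i][:c] }.max) }
```

Each output row is a hash, which matches the 0.2.0 changelog entry about group calculations returning a data frame instead of a hash.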
data/lib/rover/vector.rb CHANGED
@@ -1,27 +1,39 @@
1
1
  module Rover
2
2
  class Vector
3
- def initialize(data)
4
- @data =
5
- if data.is_a?(Vector)
6
- data.to_numo
7
- elsif data.is_a?(Numo::NArray)
8
- data
9
- else
10
- data = data.to_a
11
- if data.all? { |v| v.is_a?(Integer) }
12
- Numo::Int64.cast(data)
13
- elsif data.all? { |v| v.is_a?(Numeric) || v.nil? }
14
- Numo::DFloat.cast(data.map { |v| v || Float::NAN })
15
- elsif data.all? { |v| v == true || v == false }
16
- Numo::Bit.cast(data)
17
- else
18
- Numo::RObject.cast(data)
19
- end
20
- end
21
-
3
+ # if a user never specifies types,
4
+ # the defaults are bool, float, int, and object
5
+ # keep these simple
6
+ #
7
+ # we could create aliases for float64, int64, uint64
8
+ # if so, type should still return the simple type
9
+ TYPE_CAST_MAPPING = {
10
+ bool: Numo::Bit,
11
+ float32: Numo::SFloat,
12
+ float: Numo::DFloat,
13
+ int8: Numo::Int8,
14
+ int16: Numo::Int16,
15
+ int32: Numo::Int32,
16
+ int: Numo::Int64,
17
+ object: Numo::RObject,
18
+ uint8: Numo::UInt8,
19
+ uint16: Numo::UInt16,
20
+ uint32: Numo::UInt32,
21
+ uint: Numo::UInt64
22
+ }
23
+
24
+ def initialize(data, type: nil)
25
+ @data = cast_data(data, type: type)
22
26
  raise ArgumentError, "Bad size: #{@data.shape}" unless @data.ndim == 1
23
27
  end
24
28
 
29
+ def type
30
+ TYPE_CAST_MAPPING.find { |_, v| @data.is_a?(v) }[0]
31
+ end
32
+
33
+ def to(type)
34
+ Vector.new(self, type: type)
35
+ end
36
+
25
37
  def to_numo
26
38
  @data
27
39
  end
@@ -32,12 +44,18 @@ module Rover
32
44
  a
33
45
  end
34
46
 
47
+ def numeric?
48
+ ![:object, :bool].include?(type)
49
+ end
50
+
35
51
  def size
36
52
  @data.size
37
53
  end
54
+ alias_method :length, :size
55
+ alias_method :count, :size
38
56
 
39
57
  def uniq
40
- Vector.new(@data.to_a.uniq)
58
+ Vector.new(to_a.uniq)
41
59
  end
42
60
 
43
61
  def missing
@@ -73,11 +91,11 @@ module Rover
73
91
  @data[k] = v
74
92
  end
75
93
 
76
- %w(+ - * / % ** &).each do |op|
94
+ %w(+ - * / % ** & | ^).each do |op|
77
95
  define_method(op) do |other|
78
96
  other = other.to_numo if other.is_a?(Vector)
79
97
  # TODO better logic
80
- if @data.is_a?(Numo::RObject)
98
+ if @data.is_a?(Numo::RObject) && !other.is_a?(Numo::RObject)
81
99
  map { |v| v.send(op, other) }
82
100
  else
83
101
  Vector.new(@data.send(op, other))
@@ -143,9 +161,31 @@ module Rover
143
161
  end
144
162
 
145
163
  def map(&block)
146
- mapped = @data.map(&block)
147
- mapped = mapped.to_a if mapped.is_a?(Numo::RObject) # re-evaluate cast
148
- Vector.new(mapped)
164
+ # convert to Ruby first to cast properly
165
+ # https://github.com/ruby-numo/numo-narray/issues/181
166
+ Vector.new(@data.to_a.map(&block))
167
+ end
168
+
169
+ def map!(&block)
170
+ @data = cast_data(@data.to_a.map(&block))
171
+ self
172
+ end
173
+
174
+ def select(&block)
175
+ Vector.new(@data.to_a.select(&block))
176
+ end
177
+
178
+ def reject(&block)
179
+ Vector.new(@data.to_a.reject(&block))
180
+ end
181
+
182
+ def tally
183
+ result = Hash.new(0)
184
+ @data.each do |v|
185
+ result[v] += 1
186
+ end
187
+ result.default = nil
188
+ result
149
189
  end
150
190
 
151
191
  def sort
@@ -157,7 +197,11 @@ module Rover
157
197
  end
158
198
 
159
199
  def each(&block)
160
- to_a.each(&block)
200
+ @data.each(&block)
201
+ end
202
+
203
+ def each_with_index(&block)
204
+ @data.each_with_index(&block)
161
205
  end
162
206
 
163
207
  def max
@@ -176,7 +220,7 @@ module Rover
176
220
 
177
221
  def median
178
222
  # need to cast to get correct result
179
- # TODO file bug with Numo
223
+ # https://github.com/ruby-numo/numo-narray/issues/165
180
224
  @data.cast_to(Numo::DFloat).median
181
225
  end
182
226
 
@@ -188,12 +232,26 @@ module Rover
188
232
  @data.sum
189
233
  end
190
234
 
235
+ # uses Bessel's correction for now since that's all Numo supports
236
+ def std
237
+ @data.cast_to(Numo::DFloat).stddev
238
+ end
239
+
240
+ # uses Bessel's correction for now since that's all Numo supports
241
+ def var
242
+ @data.cast_to(Numo::DFloat).var
243
+ end
244
+
191
245
  def all?(&block)
192
- @data.to_a.all?(&block)
246
+ to_a.all?(&block)
193
247
  end
194
248
 
195
249
  def any?(&block)
196
- @data.to_a.any?(&block)
250
+ to_a.any?(&block)
251
+ end
252
+
253
+ def zip(other, &block)
254
+ to_a.zip(other.to_a, &block)
197
255
  end
198
256
 
199
257
  def first(n = 1)
@@ -208,6 +266,11 @@ module Rover
208
266
  Vector.new(@data[-n..-1])
209
267
  end
210
268
 
269
+ def take(n)
270
+ raise ArgumentError, "attempt to take negative size" if n < 0
271
+ first(n)
272
+ end
273
+
211
274
  def crosstab(other)
212
275
  index = uniq.sort
213
276
  index_pos = index.to_a.map.with_index.to_h
@@ -231,6 +294,20 @@ module Rover
231
294
  last(n)
232
295
  end
233
296
 
297
+ def one_hot(drop: false, prefix: nil)
298
+ raise ArgumentError, "All elements must be strings" unless all? { |vi| vi.is_a?(String) }
299
+
300
+ new_vectors = {}
301
+ # maybe sort values first
302
+ values = uniq.to_a
303
+ values.shift if drop
304
+ values.each do |v2|
305
+ # TODO use types
306
+ new_vectors["#{prefix}#{v2}"] = (self == v2).to_numo.cast_to(Numo::Int64)
307
+ end
308
+ DataFrame.new(new_vectors)
309
+ end
310
+
234
311
  # TODO add type and size?
235
312
  def inspect
236
313
  elements = first(5).to_a.map(&:inspect)
@@ -242,7 +319,64 @@ module Rover
242
319
  # for IRuby
243
320
  def to_html
244
321
  require "iruby"
245
- IRuby::HTML.table(to_a)
322
+ if size > 7
323
+ # pass 8 rows so maxrows is applied
324
+ IRuby::HTML.table(first(4).to_a + last(4).to_a, maxrows: 7)
325
+ else
326
+ IRuby::HTML.table(to_a)
327
+ end
328
+ end
329
+
330
+ private
331
+
332
+ def cast_data(data, type: nil)
333
+ numo_type = numo_type(type) if type
334
+
335
+ data = data.to_numo if data.is_a?(Vector)
336
+
337
+ if data.is_a?(Numo::NArray)
338
+ raise ArgumentError, "Complex types not supported yet" if data.is_a?(Numo::DComplex) || data.is_a?(Numo::SComplex)
339
+
340
+ if type
341
+ case type
342
+ when /int/
343
+ # Numo does not check these when casting
344
+ raise RangeError, "float NaN out of range of integer" if data.respond_to?(:isnan) && data.isnan.any?
345
+ raise RangeError, "float Inf out of range of integer" if data.respond_to?(:isinf) && data.isinf.any?
346
+
347
+ data = data.to_a.map { |v| v.nil? ? nil : v.to_i } if data.is_a?(Numo::RObject)
348
+ when /float/
349
+ data = data.to_a.map { |v| v.nil? ? Float::NAN : v.to_f } if data.is_a?(Numo::RObject)
350
+ end
351
+
352
+ data = numo_type.cast(data)
353
+ end
354
+ else
355
+ data = data.to_a
356
+
357
+ if type
358
+ data = numo_type.cast(data)
359
+ else
360
+ data =
361
+ if data.all? { |v| v.is_a?(Integer) }
362
+ Numo::Int64.cast(data)
363
+ elsif data.all? { |v| v.is_a?(Numeric) || v.nil? }
364
+ Numo::DFloat.cast(data.map { |v| v || Float::NAN })
365
+ elsif data.all? { |v| v == true || v == false }
366
+ Numo::Bit.cast(data)
367
+ else
368
+ Numo::RObject.cast(data)
369
+ end
370
+ end
371
+ end
372
+
373
+ data
374
+ end
375
+
376
+ def numo_type(type)
377
+ numo_type = TYPE_CAST_MAPPING[type]
378
+ raise ArgumentError, "Invalid type: #{type}" unless numo_type
379
+ numo_type
246
380
  end
247
381
  end
248
382
  end
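Note on `cast_data` above: when no `type` is given, the default is chosen by checking the elements in a fixed order (integers, then numerics or `nil`, then booleans, then everything else). A minimal sketch of that decision order, returning the type symbol instead of a Numo array so it runs without numo-narray; the `infer_type` name is hypothetical.

```ruby
# Mirrors the branch order of the untyped path in cast_data.
# Returns a symbol from TYPE_CAST_MAPPING rather than casting.
def infer_type(values)
  if values.all? { |v| v.is_a?(Integer) }
    :int
  elsif values.all? { |v| v.is_a?(Numeric) || v.nil? }
    :float
  elsif values.all? { |v| v == true || v == false }
    :bool
  else
    :object
  end
end

ints = infer_type([1, 2, 3])
with_nil = infer_type([1, nil, 3])
mixed_numeric = infer_type([1, 2.5])
bools = infer_type([true, false])
strings = infer_type(["a", "b"])
```

The `nil` case is why a column of integers with missing values ends up as a float column: in the diff, `nil` is replaced with `Float::NAN` before the cast.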
data/lib/rover/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Rover
2
- VERSION = "0.1.0"
2
+ VERSION = "0.2.3"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rover-df
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-05-14 00:00:00.000000000 Z
11
+ date: 2021-02-08 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: numo-narray
@@ -16,100 +16,16 @@ dependencies:
16
16
  requirements:
17
17
  - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: 0.9.1.7
19
+ version: 0.9.1.9
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: 0.9.1.7
27
- - !ruby/object:Gem::Dependency
28
- name: bundler
29
- requirement: !ruby/object:Gem::Requirement
30
- requirements:
31
- - - ">="
32
- - !ruby/object:Gem::Version
33
- version: '0'
34
- type: :development
35
- prerelease: false
36
- version_requirements: !ruby/object:Gem::Requirement
37
- requirements:
38
- - - ">="
39
- - !ruby/object:Gem::Version
40
- version: '0'
41
- - !ruby/object:Gem::Dependency
42
- name: rake
43
- requirement: !ruby/object:Gem::Requirement
44
- requirements:
45
- - - ">="
46
- - !ruby/object:Gem::Version
47
- version: '0'
48
- type: :development
49
- prerelease: false
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - ">="
53
- - !ruby/object:Gem::Version
54
- version: '0'
55
- - !ruby/object:Gem::Dependency
56
- name: minitest
57
- requirement: !ruby/object:Gem::Requirement
58
- requirements:
59
- - - ">="
60
- - !ruby/object:Gem::Version
61
- version: '5'
62
- type: :development
63
- prerelease: false
64
- version_requirements: !ruby/object:Gem::Requirement
65
- requirements:
66
- - - ">="
67
- - !ruby/object:Gem::Version
68
- version: '5'
69
- - !ruby/object:Gem::Dependency
70
- name: activerecord
71
- requirement: !ruby/object:Gem::Requirement
72
- requirements:
73
- - - ">="
74
- - !ruby/object:Gem::Version
75
- version: '5'
76
- type: :development
77
- prerelease: false
78
- version_requirements: !ruby/object:Gem::Requirement
79
- requirements:
80
- - - ">="
81
- - !ruby/object:Gem::Version
82
- version: '5'
83
- - !ruby/object:Gem::Dependency
84
- name: sqlite3
85
- requirement: !ruby/object:Gem::Requirement
86
- requirements:
87
- - - ">="
88
- - !ruby/object:Gem::Version
89
- version: '0'
90
- type: :development
91
- prerelease: false
92
- version_requirements: !ruby/object:Gem::Requirement
93
- requirements:
94
- - - ">="
95
- - !ruby/object:Gem::Version
96
- version: '0'
97
- - !ruby/object:Gem::Dependency
98
- name: iruby
99
- requirement: !ruby/object:Gem::Requirement
100
- requirements:
101
- - - ">="
102
- - !ruby/object:Gem::Version
103
- version: '0'
104
- type: :development
105
- prerelease: false
106
- version_requirements: !ruby/object:Gem::Requirement
107
- requirements:
108
- - - ">="
109
- - !ruby/object:Gem::Version
110
- version: '0'
111
- description:
112
- email: andrew@chartkick.com
26
+ version: 0.9.1.9
27
+ description:
28
+ email: andrew@ankane.org
113
29
  executables: []
114
30
  extensions: []
115
31
  extra_rdoc_files: []
@@ -120,13 +36,14 @@ files:
120
36
  - lib/rover-df.rb
121
37
  - lib/rover.rb
122
38
  - lib/rover/data_frame.rb
39
+ - lib/rover/group.rb
123
40
  - lib/rover/vector.rb
124
41
  - lib/rover/version.rb
125
42
  homepage: https://github.com/ankane/rover
126
43
  licenses:
127
44
  - MIT
128
45
  metadata: {}
129
- post_install_message:
46
+ post_install_message:
130
47
  rdoc_options: []
131
48
  require_paths:
132
49
  - lib
@@ -141,8 +58,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
141
58
  - !ruby/object:Gem::Version
142
59
  version: '0'
143
60
  requirements: []
144
- rubygems_version: 3.1.2
145
- signing_key:
61
+ rubygems_version: 3.2.3
62
+ signing_key:
146
63
  specification_version: 4
147
64
  summary: Simple, powerful data frames for Ruby
148
65
  test_files: []