rover-df 0.1.0 → 0.2.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +32 -0
- data/LICENSE.txt +1 -1
- data/README.md +136 -11
- data/lib/rover.rb +18 -6
- data/lib/rover/data_frame.rb +190 -15
- data/lib/rover/group.rb +50 -0
- data/lib/rover/vector.rb +164 -30
- data/lib/rover/version.rb +1 -1
- metadata +11 -94
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: b8ac8c0dda5ee8ea5482b5d52927446e52a60151c05959324970b6b420c6b825
+ data.tar.gz: cbabf42c40195303fa62a85b40c3d516dff7cb56a4059c2ab6867921fae62bb9
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 2b906f49a0accbbf4682216808faf3113c3f31c24e9e434a03f996d8e8e9b4db1c8ca0ccfb3f604e798261f97d88b26a5376bace349f230b5eda5949b492fb88
+ data.tar.gz: 8f3d590c6df3d588f92c6c84b327211a3dce6b27452b4a1161492ca90dc87cfd6aad02a3c7ef038a9c6cb69155558f2a332acbfd65a9bb4ba1d220333b051872
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,35 @@
+ ## 0.2.3 (2021-02-08)
+
+ - Added `select`, `reject`, and `map!` methods to vectors
+
+ ## 0.2.2 (2021-01-01)
+
+ - Added line, pie, area, and bar charts
+ - Added `|` and `^` for vectors
+ - Fixed typecasting with `map`
+
+ ## 0.2.1 (2020-11-23)
+
+ - Added `plot` method to data frames
+ - Improved error message when too few headers
+
+ ## 0.2.0 (2020-08-17)
+
+ - Added `numeric?` and `zip` methods to vectors
+ - Changed group calculations to return a data frame instead of a hash
+ - Changed `each_row` to return enumerator
+ - Improved inspect
+ - Fixed `any?`, `all?`, and `uniq` for boolean vectors
+
+ ## 0.1.1 (2020-06-10)
+
+ - Added methods and options for types
+ - Added grouping
+ - Added one-hot encoding
+ - Added `sample` to data frames
+ - Added `tally`, `var`, `std`, `take`, `count`, and `length` to vectors
+ - Improved error message for `read_csv` with no headers
+
  ## 0.1.0 (2020-05-13)

  - First release
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -2,7 +2,11 @@
|
|
2
2
|
|
3
3
|
Simple, powerful data frames for Ruby
|
4
4
|
|
5
|
-
:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)
|
5
|
+
:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)
|
6
|
+
|
7
|
+
:evergreen_tree: Uses [Vega](https://github.com/ankane/vega) for visualization
|
8
|
+
|
9
|
+
[![Build Status](https://github.com/ankane/rover/workflows/build/badge.svg?branch=master)](https://github.com/ankane/rover/actions)
|
6
10
|
|
7
11
|
## Installation
|
8
12
|
|
@@ -16,12 +20,22 @@ gem 'rover-df'
|
|
16
20
|
|
17
21
|
A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
|
18
22
|
|
23
|
+
Try it out for forecasting by clicking the button below:
|
24
|
+
|
25
|
+
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)
|
26
|
+
|
27
|
+
Use the `Run` button (or `SHIFT` + `ENTER`) to run each line.
|
28
|
+
|
19
29
|
## Creating Data Frames
|
20
30
|
|
21
31
|
From an array
|
22
32
|
|
23
33
|
```ruby
|
24
|
-
Rover::DataFrame.new([
|
34
|
+
Rover::DataFrame.new([
|
35
|
+
{a: 1, b: "one"},
|
36
|
+
{a: 2, b: "two"},
|
37
|
+
{a: 3, b: "three"}
|
38
|
+
])
|
25
39
|
```
|
26
40
|
|
27
41
|
From a hash
|
@@ -33,7 +47,7 @@ Rover::DataFrame.new({
|
|
33
47
|
})
|
34
48
|
```
|
35
49
|
|
36
|
-
From
|
50
|
+
From Active Record
|
37
51
|
|
38
52
|
```ruby
|
39
53
|
Rover::DataFrame.new(User.all)
|
@@ -75,6 +89,8 @@ Select a column
|
|
75
89
|
df[:a]
|
76
90
|
```
|
77
91
|
|
92
|
+
> Note that strings and symbols are different keys, just like hashes
|
93
|
+
|
78
94
|
Select multiple columns
|
79
95
|
|
80
96
|
```ruby
|
@@ -112,25 +128,34 @@ df[[1, 4, 5]]
|
|
112
128
|
Filter on a condition
|
113
129
|
|
114
130
|
```ruby
|
131
|
+
df[df[:a] == 100]
|
132
|
+
df[df[:a] != 100]
|
115
133
|
df[df[:a] > 100]
|
134
|
+
df[df[:a] >= 100]
|
135
|
+
df[df[:a] < 100]
|
136
|
+
df[df[:a] <= 100]
|
116
137
|
```
|
117
138
|
|
118
|
-
|
139
|
+
In
|
119
140
|
|
120
141
|
```ruby
|
121
|
-
df[df[:a]
|
142
|
+
df[df[:a].in?([1, 2, 3])]
|
143
|
+
df[df[:a].in?(1..3)]
|
144
|
+
df[df[:a].in?(["a", "b", "c"])]
|
122
145
|
```
|
123
146
|
|
124
|
-
|
147
|
+
Not in
|
125
148
|
|
126
149
|
```ruby
|
127
|
-
df[df[:a]
|
150
|
+
df[!df[:a].in?([1, 2, 3])]
|
128
151
|
```
|
129
152
|
|
130
|
-
|
153
|
+
And, or, and exclusive or
|
131
154
|
|
132
155
|
```ruby
|
133
|
-
df[df[:a]
|
156
|
+
df[(df[:a] > 100) & (df[:b] == "one")] # and
|
157
|
+
df[(df[:a] > 100) | (df[:b] == "one")] # or
|
158
|
+
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
|
134
159
|
```
|
135
160
|
|
136
161
|
## Operations
|
@@ -158,13 +183,59 @@ df[:a].min
|
|
158
183
|
df[:a].max
|
159
184
|
```
|
160
185
|
|
186
|
+
Count occurrences
|
187
|
+
|
188
|
+
```ruby
|
189
|
+
df[:a].tally
|
190
|
+
```
|
191
|
+
|
161
192
|
Cross tabulation
|
162
193
|
|
163
194
|
```ruby
|
164
195
|
df[:a].crosstab(df[:b])
|
165
196
|
```
|
166
197
|
|
167
|
-
##
|
198
|
+
## Grouping
|
199
|
+
|
200
|
+
Group
|
201
|
+
|
202
|
+
```ruby
|
203
|
+
df.group(:a).count
|
204
|
+
```
|
205
|
+
|
206
|
+
Works with all summary statistics
|
207
|
+
|
208
|
+
```ruby
|
209
|
+
df.group(:a).max(:b)
|
210
|
+
```
|
211
|
+
|
212
|
+
Multiple groups
|
213
|
+
|
214
|
+
```ruby
|
215
|
+
df.group([:a, :b]).count
|
216
|
+
```
|
217
|
+
|
218
|
+
## Visualization
|
219
|
+
|
220
|
+
Add [Vega](https://github.com/ankane/vega) to your application’s Gemfile:
|
221
|
+
|
222
|
+
```ruby
|
223
|
+
gem 'vega'
|
224
|
+
```
|
225
|
+
|
226
|
+
And use:
|
227
|
+
|
228
|
+
```ruby
|
229
|
+
df.plot(:a, :b)
|
230
|
+
```
|
231
|
+
|
232
|
+
Specify the chart type (`line`, `pie`, `column`, `bar`, `area`, or `scatter`)
|
233
|
+
|
234
|
+
```ruby
|
235
|
+
df.plot(:a, :b, type: "pie")
|
236
|
+
```
|
237
|
+
|
238
|
+
## Updating Data
|
168
239
|
|
169
240
|
Add a new column
|
170
241
|
|
@@ -214,7 +285,7 @@ Rename a column
|
|
214
285
|
df[:new_a] = df.delete(:a)
|
215
286
|
```
|
216
287
|
|
217
|
-
Sort
|
288
|
+
Sort rows
|
218
289
|
|
219
290
|
```ruby
|
220
291
|
df.sort_by! { |r| r[:a] }
|
@@ -258,6 +329,20 @@ Left join
|
|
258
329
|
df.left_join(other_df)
|
259
330
|
```
|
260
331
|
|
332
|
+
## Encoding
|
333
|
+
|
334
|
+
One-hot encoding
|
335
|
+
|
336
|
+
```ruby
|
337
|
+
df.one_hot
|
338
|
+
```
|
339
|
+
|
340
|
+
Drop a variable in each category to avoid the dummy variable trap
|
341
|
+
|
342
|
+
```ruby
|
343
|
+
df.one_hot(drop: true)
|
344
|
+
```
|
345
|
+
|
261
346
|
## Conversion
|
262
347
|
|
263
348
|
Array of hashes
|
@@ -284,6 +369,46 @@ CSV
|
|
284
369
|
df.to_csv
|
285
370
|
```
|
286
371
|
|
372
|
+
## Types
|
373
|
+
|
374
|
+
You can specify column types when creating a data frame
|
375
|
+
|
376
|
+
```ruby
|
377
|
+
Rover::DataFrame.new(data, types: {"a" => :int, "b" => :float})
|
378
|
+
```
|
379
|
+
|
380
|
+
Or
|
381
|
+
|
382
|
+
```ruby
|
383
|
+
Rover.read_csv("data.csv", types: {"a" => :int, "b" => :float})
|
384
|
+
```
|
385
|
+
|
386
|
+
Supported types are:
|
387
|
+
|
388
|
+
- boolean - `bool`
|
389
|
+
- float - `float`, `float32`
|
390
|
+
- integer - `int`, `int32`, `int16`, `int8`
|
391
|
+
- unsigned integer - `uint`, `uint32`, `uint16`, `uint8`
|
392
|
+
- object - `object`
|
393
|
+
|
394
|
+
Get column types
|
395
|
+
|
396
|
+
```ruby
|
397
|
+
df.types
|
398
|
+
```
|
399
|
+
|
400
|
+
For a specific column
|
401
|
+
|
402
|
+
```ruby
|
403
|
+
df[:a].type
|
404
|
+
```
|
405
|
+
|
406
|
+
Change the type of a column
|
407
|
+
|
408
|
+
```ruby
|
409
|
+
df[:a] = df[:a].to(:int)
|
410
|
+
```
|
411
|
+
|
287
412
|
## History
|
288
413
|
|
289
414
|
View the [changelog](https://github.com/ankane/rover/blob/master/CHANGELOG.md)
|
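The README changes above add one-hot encoding with a `drop: true` option for avoiding the dummy variable trap. As a rough illustration of what that option does, here is a minimal sketch in plain Ruby. The helper `one_hot_column` is hypothetical and is not part of Rover; it assumes an array of strings and drops the first distinct value when `drop` is true, which matches the `values.shift if drop` line in the `vector.rb` changes.

```ruby
# Hypothetical helper, not Rover's API: one-hot encode an array of strings.
# Returns a hash of column name => array of 0/1 indicators.
def one_hot_column(values, prefix: "", drop: false)
  categories = values.uniq
  # Dropping one category removes the redundant column
  # (the dropped category is implied when all others are 0).
  categories.shift if drop
  categories.to_h do |category|
    ["#{prefix}#{category}", values.map { |v| v == category ? 1 : 0 }]
  end
end

colors = ["red", "blue", "red", "green"]

full = one_hot_column(colors, prefix: "color_")
# full has color_red, color_blue, and color_green; each row sums to 1

reduced = one_hot_column(colors, prefix: "color_", drop: true)
# reduced has only color_blue and color_green
```

With all three columns present, any one column can be computed from the other two, which is the redundancy that `drop: true` removes.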
data/lib/rover.rb
CHANGED
@@ -3,30 +3,42 @@ require "numo/narray"

  # modules
  require "rover/data_frame"
+ require "rover/group"
  require "rover/vector"
  require "rover/version"

  module Rover
    class << self
-     def read_csv(path, **options)
+     def read_csv(path, types: nil, **options)
        require "csv"
-       csv_to_df(CSV.read(path,
+       csv_to_df(CSV.read(path, **csv_options(options)), types: types, headers: options[:headers])
      end

-     def parse_csv(str, **options)
+     def parse_csv(str, types: nil, **options)
        require "csv"
-       csv_to_df(CSV.parse(str,
+       csv_to_df(CSV.parse(str, **csv_options(options)), types: types, headers: options[:headers])
      end

      private

-
+     # TODO use date converter
+     def csv_options(options)
+       options = {headers: true, converters: :numeric}.merge(options)
+       raise ArgumentError, "Must specify headers" unless options[:headers]
+       options
+     end
+
+     def csv_to_df(table, types: nil, headers: nil)
+       if headers && headers.size < table.headers.size
+         raise ArgumentError, "Expected #{table.headers.size} headers, got #{headers.size}"
+       end
+
        table.by_col!
        data = {}
        table.each do |k, v|
          data[k] = v
        end
-       DataFrame.new(data)
+       DataFrame.new(data, types: types)
      end
    end
  end
|
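The `csv_options` helper added above merges the caller's options over two defaults and refuses to continue without headers. The following sketch reproduces that behavior using only Ruby's `csv` library so it can be run without the gem. The top-level `csv_options` method here is a standalone copy for illustration, and the final hash stands in for what would be handed to `DataFrame.new`.

```ruby
require "csv"

# Standalone copy of the option handling shown in the diff, for illustration.
def csv_options(options)
  options = {headers: true, converters: :numeric}.merge(options)
  raise ArgumentError, "Must specify headers" unless options[:headers]
  options
end

table = CSV.parse("a,b\n1,one\n2,two\n", **csv_options({}))

# Switch to column mode so each iteration yields [header, values].
table.by_col!
data = {}
table.each do |k, v|
  data[k] = v
end
# data maps each header to its column; the :numeric converter
# has turned "1" and "2" into integers
```

Because the defaults are merged first, a caller can still override `converters`, but passing `headers: false` raises an `ArgumentError`.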
data/lib/rover/data_frame.rb
CHANGED
@@ -1,7 +1,10 @@
|
|
1
1
|
module Rover
|
2
2
|
class DataFrame
|
3
|
-
def initialize(
|
3
|
+
def initialize(*args)
|
4
|
+
data, options = process_args(args)
|
5
|
+
|
4
6
|
@vectors = {}
|
7
|
+
types = options[:types] || {}
|
5
8
|
|
6
9
|
if data.is_a?(DataFrame)
|
7
10
|
data.vectors.each do |k, v|
|
@@ -11,7 +14,7 @@ module Rover
|
|
11
14
|
data.to_h.each do |k, v|
|
12
15
|
@vectors[k] =
|
13
16
|
if v.respond_to?(:to_a)
|
14
|
-
Vector.new(v)
|
17
|
+
Vector.new(v, type: types[k])
|
15
18
|
else
|
16
19
|
v
|
17
20
|
end
|
@@ -20,7 +23,7 @@ module Rover
|
|
20
23
|
# handle scalars
|
21
24
|
size = @vectors.values.find { |v| v.is_a?(Vector) }&.size || 1
|
22
25
|
@vectors.each_key do |k|
|
23
|
-
@vectors[k] = to_vector(@vectors[k], size)
|
26
|
+
@vectors[k] = to_vector(@vectors[k], size: size, type: types[k])
|
24
27
|
end
|
25
28
|
elsif data.is_a?(Array)
|
26
29
|
vectors = {}
|
@@ -35,12 +38,12 @@ module Rover
|
|
35
38
|
end
|
36
39
|
end
|
37
40
|
vectors.each do |k, v|
|
38
|
-
@vectors[k] = to_vector(v)
|
41
|
+
@vectors[k] = to_vector(v, type: types[k])
|
39
42
|
end
|
40
43
|
elsif defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || (data.is_a?(Class) && data < ActiveRecord::Base))
|
41
44
|
result = data.connection.select_all(data.all.to_sql)
|
42
45
|
result.columns.each_with_index do |k, i|
|
43
|
-
@vectors[k] = to_vector(result.rows.map { |r| r[i] })
|
46
|
+
@vectors[k] = to_vector(result.rows.map { |r| r[i] }, type: types[k])
|
44
47
|
end
|
45
48
|
else
|
46
49
|
raise ArgumentError, "Cannot cast to data frame: #{data.class.name}"
|
@@ -69,6 +72,7 @@ module Rover
|
|
69
72
|
# multiple columns
|
70
73
|
df = DataFrame.new
|
71
74
|
where.each do |k|
|
75
|
+
check_column(k, true)
|
72
76
|
df[k] = @vectors[k]
|
73
77
|
end
|
74
78
|
df
|
@@ -78,8 +82,9 @@ module Rover
|
|
78
82
|
end
|
79
83
|
end
|
80
84
|
|
81
|
-
# return each row as a hash
|
82
85
|
def each_row
|
86
|
+
return enum_for(:each_row) unless block_given?
|
87
|
+
|
83
88
|
size.times do |i|
|
84
89
|
yield @vectors.map { |k, v| [k, v[i]] }.to_h
|
85
90
|
end
|
@@ -90,9 +95,13 @@ module Rover
|
|
90
95
|
@vectors.dup
|
91
96
|
end
|
92
97
|
|
98
|
+
def types
|
99
|
+
@vectors.map { |k, v| [k, v.type] }.to_h
|
100
|
+
end
|
101
|
+
|
93
102
|
def []=(k, v)
|
94
103
|
check_key(k)
|
95
|
-
v = to_vector(v, size)
|
104
|
+
v = to_vector(v, size: size)
|
96
105
|
raise ArgumentError, "Size mismatch: expected #{size}, got #{v.size}" if @vectors.any? && v.size != size
|
97
106
|
@vectors[k] = v
|
98
107
|
end
|
@@ -170,6 +179,12 @@ module Rover
|
|
170
179
|
DataFrame.new(new_vectors)
|
171
180
|
end
|
172
181
|
|
182
|
+
def sample(*args, **kwargs)
|
183
|
+
# TODO make more efficient
|
184
|
+
indexes = (0...size).to_a.sample(*args, **kwargs)
|
185
|
+
self[indexes]
|
186
|
+
end
|
187
|
+
|
173
188
|
def to_a
|
174
189
|
a = []
|
175
190
|
each_row do |row|
|
@@ -190,6 +205,25 @@ module Rover
|
|
190
205
|
Numo::NArray.column_stack(vectors.values.map(&:to_numo))
|
191
206
|
end
|
192
207
|
|
208
|
+
# TODO raise error when collision
|
209
|
+
def one_hot(drop: false)
|
210
|
+
df = DataFrame.new
|
211
|
+
vectors.each do |k, v|
|
212
|
+
if v.to_numo.is_a?(Numo::RObject)
|
213
|
+
df.merge!(v.one_hot(drop: drop, prefix: "#{k}_"))
|
214
|
+
else
|
215
|
+
df[k] = v
|
216
|
+
end
|
217
|
+
end
|
218
|
+
df
|
219
|
+
rescue ArgumentError => e
|
220
|
+
if e.message == "All elements must be strings"
|
221
|
+
# better error message
|
222
|
+
raise ArgumentError, "All elements must be numeric or strings"
|
223
|
+
end
|
224
|
+
raise e
|
225
|
+
end
|
226
|
+
|
193
227
|
def to_csv
|
194
228
|
require "csv"
|
195
229
|
CSV.generate do |csv|
|
@@ -204,7 +238,12 @@ module Rover
|
|
204
238
|
# for IRuby
|
205
239
|
def to_html
|
206
240
|
require "iruby"
|
207
|
-
|
241
|
+
if size > 7
|
242
|
+
# pass 8 rows so maxrows is applied
|
243
|
+
IRuby::HTML.table((self[0..4] + self[-4..-1]).to_h, maxrows: 7)
|
244
|
+
else
|
245
|
+
IRuby::HTML.table(to_h)
|
246
|
+
end
|
208
247
|
end
|
209
248
|
|
210
249
|
# TODO handle long text better
|
@@ -215,18 +254,19 @@ module Rover
|
|
215
254
|
line_start = 0
|
216
255
|
spaces = 2
|
217
256
|
|
257
|
+
summarize = size >= 30
|
258
|
+
|
218
259
|
@vectors.each do |k, v|
|
219
|
-
v = v.first(5).to_a
|
260
|
+
v = summarize ? v.first(5).to_a + ["..."] + v.last(5).to_a : v.to_a
|
220
261
|
width = ([k] + v).map(&:to_s).map(&:size).max
|
221
262
|
width = 3 if width < 3
|
222
263
|
|
223
264
|
if lines.empty? || lines[-2].map { |l| l.size + spaces }.sum + width > 120
|
224
265
|
line_start = lines.size
|
225
266
|
lines << []
|
226
|
-
|
267
|
+
v.size.times do |i|
|
227
268
|
lines << []
|
228
269
|
end
|
229
|
-
lines << [] if size > 5
|
230
270
|
lines << []
|
231
271
|
end
|
232
272
|
|
@@ -234,7 +274,6 @@ module Rover
|
|
234
274
|
v.each_with_index do |v2, i|
|
235
275
|
lines[line_start + 1 + i] << "%#{width}s" % v2.to_s
|
236
276
|
end
|
237
|
-
lines[line_start + 6] << "%#{width}s" % "..." if size > 5
|
238
277
|
end
|
239
278
|
|
240
279
|
lines.pop
|
@@ -258,6 +297,17 @@ module Rover
|
|
258
297
|
dup.sort_by!(&block)
|
259
298
|
end
|
260
299
|
|
300
|
+
def group(*columns)
|
301
|
+
Group.new(self, columns.flatten)
|
302
|
+
end
|
303
|
+
|
304
|
+
[:max, :min, :median, :mean, :percentile, :sum].each do |name|
|
305
|
+
define_method(name) do |column, *args|
|
306
|
+
check_column(column)
|
307
|
+
self[column].send(name, *args)
|
308
|
+
end
|
309
|
+
end
|
310
|
+
|
261
311
|
def dup
|
262
312
|
df = DataFrame.new
|
263
313
|
@vectors.each do |k, v|
|
@@ -313,6 +363,88 @@ module Rover
|
|
313
363
|
keys.all? { |k| self[k] == other[k] }
|
314
364
|
end
|
315
365
|
|
366
|
+
def plot(x = nil, y = nil, type: nil)
|
367
|
+
require "vega"
|
368
|
+
|
369
|
+
raise ArgumentError, "Must specify columns" if keys.size != 2 && (!x || !y)
|
370
|
+
x ||= keys[0]
|
371
|
+
y ||= keys[1]
|
372
|
+
type ||= begin
|
373
|
+
if self[x].numeric? && self[y].numeric?
|
374
|
+
"scatter"
|
375
|
+
elsif types[x] == :object && self[y].numeric?
|
376
|
+
"column"
|
377
|
+
else
|
378
|
+
raise "Cannot determine type. Use the type option."
|
379
|
+
end
|
380
|
+
end
|
381
|
+
data = self[[x, y]]
|
382
|
+
|
383
|
+
case type
|
384
|
+
when "line", "area"
|
385
|
+
x_type =
|
386
|
+
if data[x].numeric?
|
387
|
+
"quantitative"
|
388
|
+
elsif data[x].all? { |v| v.is_a?(Date) || v.is_a?(Time) }
|
389
|
+
"temporal"
|
390
|
+
else
|
391
|
+
"nominal"
|
392
|
+
end
|
393
|
+
|
394
|
+
scale = x_type == "temporal" ? {type: "utc"} : {}
|
395
|
+
|
396
|
+
Vega.lite
|
397
|
+
.data(data)
|
398
|
+
.mark(type: type, tooltip: true, interpolate: "cardinal", point: {size: 60})
|
399
|
+
.encoding(
|
400
|
+
x: {field: x, type: x_type, scale: scale},
|
401
|
+
y: {field: y, type: "quantitative"}
|
402
|
+
)
|
403
|
+
.config(axis: {labelFontSize: 12})
|
404
|
+
when "pie"
|
405
|
+
Vega.lite
|
406
|
+
.data(data)
|
407
|
+
.mark(type: "arc", tooltip: true)
|
408
|
+
.encoding(
|
409
|
+
color: {field: x, type: "nominal", sort: "none", axis: {title: nil}, legend: {labelFontSize: 12}},
|
410
|
+
theta: {field: y, type: "quantitative"}
|
411
|
+
)
|
412
|
+
.view(stroke: nil)
|
413
|
+
when "column"
|
414
|
+
Vega.lite
|
415
|
+
.data(data)
|
416
|
+
.mark(type: "bar", tooltip: true)
|
417
|
+
.encoding(
|
418
|
+
# TODO determine label angle
|
419
|
+
x: {field: x, type: "nominal", sort: "none", axis: {labelAngle: 0}},
|
420
|
+
y: {field: y, type: "quantitative"}
|
421
|
+
)
|
422
|
+
.config(axis: {labelFontSize: 12})
|
423
|
+
when "bar"
|
424
|
+
Vega.lite
|
425
|
+
.data(data)
|
426
|
+
.mark(type: "bar", tooltip: true)
|
427
|
+
.encoding(
|
428
|
+
# TODO determine label angle
|
429
|
+
y: {field: x, type: "nominal", sort: "none", axis: {labelAngle: 0}},
|
430
|
+
x: {field: y, type: "quantitative"}
|
431
|
+
)
|
432
|
+
.config(axis: {labelFontSize: 12})
|
433
|
+
when "scatter"
|
434
|
+
Vega.lite
|
435
|
+
.data(data)
|
436
|
+
.mark(type: "circle", tooltip: true)
|
437
|
+
.encoding(
|
438
|
+
x: {field: x, type: "quantitative", scale: {zero: false}},
|
439
|
+
y: {field: y, type: "quantitative", scale: {zero: false}},
|
440
|
+
size: {value: 60}
|
441
|
+
)
|
442
|
+
.config(axis: {labelFontSize: 12})
|
443
|
+
else
|
444
|
+
raise ArgumentError, "Invalid type: #{type}"
|
445
|
+
end
|
446
|
+
end
|
447
|
+
|
316
448
|
private
|
317
449
|
|
318
450
|
def check_key(key)
|
@@ -375,8 +507,27 @@ module Rover
|
|
375
507
|
raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
|
376
508
|
end
|
377
509
|
|
378
|
-
|
379
|
-
|
510
|
+
# TODO in 0.3.0
|
511
|
+
# always use did_you_mean
|
512
|
+
def check_column(key, did_you_mean = false)
|
513
|
+
unless include?(key)
|
514
|
+
if did_you_mean
|
515
|
+
if RUBY_VERSION.to_f >= 2.6
|
516
|
+
raise KeyError.new("Missing column: #{key}", receiver: self, key: key)
|
517
|
+
else
|
518
|
+
raise KeyError.new("Missing column: #{key}")
|
519
|
+
end
|
520
|
+
else
|
521
|
+
raise ArgumentError, "Missing column: #{key}"
|
522
|
+
end
|
523
|
+
end
|
524
|
+
end
|
525
|
+
|
526
|
+
def to_vector(v, size: nil, type: nil)
|
527
|
+
if v.is_a?(Vector)
|
528
|
+
v = v.to(type) if type && v.type != type
|
529
|
+
return v
|
530
|
+
end
|
380
531
|
|
381
532
|
if size && !v.respond_to?(:to_a)
|
382
533
|
v =
|
@@ -392,7 +543,31 @@ module Rover
|
|
392
543
|
end
|
393
544
|
end
|
394
545
|
|
395
|
-
Vector.new(v)
|
546
|
+
Vector.new(v, type: type)
|
547
|
+
end
|
548
|
+
|
549
|
+
# can't use data = {} and keyword arguments
|
550
|
+
# as this causes an unknown keyword error when data is passed as
|
551
|
+
# DataFrame.new({a: ..., b: ...})
|
552
|
+
#
|
553
|
+
# at the moment, there doesn't appear to be a way to distinguish between
|
554
|
+
# DataFrame.new({types: ...}) which should set data, and
|
555
|
+
# DataFrame.new(types: ...) which should set options
|
556
|
+
# https://bugs.ruby-lang.org/issues/16891
|
557
|
+
#
|
558
|
+
# there aren't currently options that should be used without data
|
559
|
+
# if this is ever the case, we should still require data
|
560
|
+
# to prevent new options from breaking existing code
|
561
|
+
def process_args(args)
|
562
|
+
data = args[0] || {}
|
563
|
+
options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
|
564
|
+
raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)" if args.size > 1
|
565
|
+
|
566
|
+
known_keywords = [:types]
|
567
|
+
unknown_keywords = options.keys - known_keywords
|
568
|
+
raise ArgumentError, "unknown keywords: #{unknown_keywords.join(", ")}" if unknown_keywords.any?
|
569
|
+
|
570
|
+
[data, options]
|
396
571
|
end
|
397
572
|
end
|
398
573
|
end
|
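The comments above `process_args` explain why `initialize` takes `*args` instead of a default hash plus keyword arguments. A minimal sketch of that argument handling follows. `TinyFrame` is a hypothetical stand-in class, not `Rover::DataFrame`, and it keeps only the parsing logic from the diff.

```ruby
# Hypothetical stand-in that mirrors process_args from the diff.
class TinyFrame
  KNOWN_KEYWORDS = [:types]

  attr_reader :data, :options

  def initialize(*args)
    @data = args[0] || {}
    # A trailing hash counts as options only when data was also given.
    @options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
    if args.size > 1
      raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)"
    end

    unknown = @options.keys - KNOWN_KEYWORDS
    raise ArgumentError, "unknown keywords: #{unknown.join(", ")}" if unknown.any?
  end
end

a = TinyFrame.new({x: [1, 2]})                    # braces: treated as data
b = TinyFrame.new(x: [1, 2])                      # bare keywords: also data
c = TinyFrame.new({x: [1, 2]}, types: {x: :int})  # second hash: options
```

The trade-off is the one the comments describe: a lone `types: ...` argument is always read as data, so options only take effect when data is passed as well.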
data/lib/rover/group.rb
ADDED
@@ -0,0 +1,50 @@
+ module Rover
+   class Group
+     def initialize(df, columns)
+       @df = df
+       @columns = columns
+     end
+
+     def group(*columns)
+       Group.new(@df, @columns + columns.flatten)
+     end
+
+     [:count, :max, :min, :mean, :median, :percentile, :sum].each do |name|
+       define_method(name) do |*args|
+         n = [name, args.first].compact.join("_")
+
+         rows = []
+         grouped_dfs.each do |k, df|
+           rows << k.merge(n => df.send(name, *args))
+         end
+
+         DataFrame.new(rows)
+       end
+     end
+
+     private
+
+     # TODO make more efficient
+     def grouped_dfs
+       # cache here so we can reuse for multiple calculations if needed
+       @grouped_dfs ||= begin
+         raise ArgumentError, "No columns given" if @columns.empty?
+         missing_keys = @columns - @df.keys
+         raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
+
+         groups = Hash.new { |hash, key| hash[key] = [] }
+         i = 0
+         @df.each_row do |row|
+           groups[row.slice(*@columns)] << i
+           i += 1
+         end
+
+         result = {}
+         groups.keys.each do |k|
+           result[k] = @df[groups[k]]
+         end
+         result
+       end
+     end
+   end
+ end
|
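The grouping in `group.rb` works in two steps: bucket row indexes by the values of the grouping columns, then compute one summary row per bucket. The sketch below walks through the same steps in plain Ruby. It uses an array of hashes in place of a data frame, and the `rows`, `columns`, and `result` names are illustrative only.

```ruby
# Rows as an array of hashes, standing in for a data frame.
rows = [
  {a: "x", b: 1},
  {a: "y", b: 5},
  {a: "x", b: 3}
]
columns = [:a]

# Step 1: bucket row indexes by the values of the grouping columns.
groups = Hash.new { |hash, key| hash[key] = [] }
rows.each_with_index do |row, i|
  groups[row.slice(*columns)] << i
end

# Step 2: one summary row per group, with the result column named
# "<calculation>_<column>" as in the diff (here "max_b").
result = groups.map do |key, indexes|
  key.merge("max_b" => indexes.map { |i| rows[i][:b] }.max)
end
```

Using the sliced row hash as the bucket key is what lets the same code handle one grouping column or several, since `row.slice(:a, :b)` produces a composite key in exactly the same way.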
data/lib/rover/vector.rb
CHANGED
@@ -1,27 +1,39 @@
|
|
1
1
|
module Rover
|
2
2
|
class Vector
|
3
|
-
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
3
|
+
# if a user never specifies types,
|
4
|
+
# the defaults are bool, float, int, and object
|
5
|
+
# keep these simple
|
6
|
+
#
|
7
|
+
# we could create aliases for float64, int64, uint64
|
8
|
+
# if so, type should still return the simple type
|
9
|
+
TYPE_CAST_MAPPING = {
|
10
|
+
bool: Numo::Bit,
|
11
|
+
float32: Numo::SFloat,
|
12
|
+
float: Numo::DFloat,
|
13
|
+
int8: Numo::Int8,
|
14
|
+
int16: Numo::Int16,
|
15
|
+
int32: Numo::Int32,
|
16
|
+
int: Numo::Int64,
|
17
|
+
object: Numo::RObject,
|
18
|
+
uint8: Numo::UInt8,
|
19
|
+
uint16: Numo::UInt16,
|
20
|
+
uint32: Numo::UInt32,
|
21
|
+
uint: Numo::UInt64
|
22
|
+
}
|
23
|
+
|
24
|
+
def initialize(data, type: nil)
|
25
|
+
@data = cast_data(data, type: type)
|
22
26
|
raise ArgumentError, "Bad size: #{@data.shape}" unless @data.ndim == 1
|
23
27
|
end
|
24
28
|
|
29
|
+
def type
|
30
|
+
TYPE_CAST_MAPPING.find { |_, v| @data.is_a?(v) }[0]
|
31
|
+
end
|
32
|
+
|
33
|
+
def to(type)
|
34
|
+
Vector.new(self, type: type)
|
35
|
+
end
|
36
|
+
|
25
37
|
def to_numo
|
26
38
|
@data
|
27
39
|
end
|
@@ -32,12 +44,18 @@ module Rover
|
|
32
44
|
a
|
33
45
|
end
|
34
46
|
|
47
|
+
def numeric?
|
48
|
+
![:object, :bool].include?(type)
|
49
|
+
end
|
50
|
+
|
35
51
|
def size
|
36
52
|
@data.size
|
37
53
|
end
|
54
|
+
alias_method :length, :size
|
55
|
+
alias_method :count, :size
|
38
56
|
|
39
57
|
def uniq
|
40
|
-
Vector.new(
|
58
|
+
Vector.new(to_a.uniq)
|
41
59
|
end
|
42
60
|
|
43
61
|
def missing
|
@@ -73,11 +91,11 @@ module Rover
|
|
73
91
|
@data[k] = v
|
74
92
|
end
|
75
93
|
|
76
|
-
%w(+ - * / % ** &).each do |op|
|
94
|
+
%w(+ - * / % ** & | ^).each do |op|
|
77
95
|
define_method(op) do |other|
|
78
96
|
other = other.to_numo if other.is_a?(Vector)
|
79
97
|
# TODO better logic
|
80
|
-
if @data.is_a?(Numo::RObject)
|
98
|
+
if @data.is_a?(Numo::RObject) && !other.is_a?(Numo::RObject)
|
81
99
|
map { |v| v.send(op, other) }
|
82
100
|
else
|
83
101
|
Vector.new(@data.send(op, other))
|
@@ -143,9 +161,31 @@ module Rover
|
|
143
161
|
end
|
144
162
|
|
145
163
|
def map(&block)
|
146
|
-
|
147
|
-
|
148
|
-
Vector.new(
|
164
|
+
# convert to Ruby first to cast properly
|
165
|
+
# https://github.com/ruby-numo/numo-narray/issues/181
|
166
|
+
Vector.new(@data.to_a.map(&block))
|
167
|
+
end
|
168
|
+
|
169
|
+
def map!(&block)
|
170
|
+
@data = cast_data(@data.to_a.map(&block))
|
171
|
+
self
|
172
|
+
end
|
173
|
+
|
174
|
+
def select(&block)
|
175
|
+
Vector.new(@data.to_a.select(&block))
|
176
|
+
end
|
177
|
+
|
178
|
+
def reject(&block)
|
179
|
+
Vector.new(@data.to_a.reject(&block))
|
180
|
+
end
|
181
|
+
|
182
|
+
def tally
|
183
|
+
result = Hash.new(0)
|
184
|
+
@data.each do |v|
|
185
|
+
result[v] += 1
|
186
|
+
end
|
187
|
+
result.default = nil
|
188
|
+
result
|
149
189
|
end
|
150
190
|
|
151
191
|
def sort
|
@@ -157,7 +197,11 @@ module Rover
|
|
157
197
|
end
|
158
198
|
|
159
199
|
def each(&block)
|
160
|
-
|
200
|
+
@data.each(&block)
|
201
|
+
end
|
202
|
+
|
203
|
+
def each_with_index(&block)
|
204
|
+
@data.each_with_index(&block)
|
161
205
|
end
|
162
206
|
|
163
207
|
def max
|
@@ -176,7 +220,7 @@ module Rover
|
|
176
220
|
|
177
221
|
def median
|
178
222
|
# need to cast to get correct result
|
179
|
-
#
|
223
|
+
# https://github.com/ruby-numo/numo-narray/issues/165
|
180
224
|
@data.cast_to(Numo::DFloat).median
|
181
225
|
end
|
182
226
|
|
@@ -188,12 +232,26 @@ module Rover
|
|
188
232
|
@data.sum
|
189
233
|
end
|
190
234
|
|
235
|
+
# uses Bessel's correction for now since that's all Numo supports
|
236
|
+
def std
|
237
|
+
@data.cast_to(Numo::DFloat).stddev
|
238
|
+
end
|
239
|
+
|
240
|
+
# uses Bessel's correction for now since that's all Numo supports
|
241
|
+
def var
|
242
|
+
@data.cast_to(Numo::DFloat).var
|
243
|
+
end
|
244
|
+
|
191
245
|
def all?(&block)
|
192
|
-
|
246
|
+
to_a.all?(&block)
|
193
247
|
end
|
194
248
|
|
195
249
|
def any?(&block)
|
196
|
-
|
250
|
+
to_a.any?(&block)
|
251
|
+
end
|
252
|
+
|
253
|
+
def zip(other, &block)
|
254
|
+
to_a.zip(other.to_a, &block)
|
197
255
|
end
|
198
256
|
|
199
257
|
def first(n = 1)
|
@@ -208,6 +266,11 @@ module Rover
|
|
208
266
|
Vector.new(@data[-n..-1])
|
209
267
|
end
|
210
268
|
|
269
|
+
def take(n)
|
270
|
+
raise ArgumentError, "attempt to take negative size" if n < 0
|
271
|
+
first(n)
|
272
|
+
end
|
273
|
+
|
211
274
|
def crosstab(other)
|
212
275
|
index = uniq.sort
|
213
276
|
index_pos = index.to_a.map.with_index.to_h
|
@@ -231,6 +294,20 @@ module Rover
|
|
231
294
|
last(n)
|
232
295
|
end
|
233
296
|
|
297
|
+
def one_hot(drop: false, prefix: nil)
|
298
|
+
raise ArgumentError, "All elements must be strings" unless all? { |vi| vi.is_a?(String) }
|
299
|
+
|
300
|
+
new_vectors = {}
|
301
|
+
# maybe sort values first
|
302
|
+
values = uniq.to_a
|
303
|
+
values.shift if drop
|
304
|
+
values.each do |v2|
|
305
|
+
# TODO use types
|
306
|
+
new_vectors["#{prefix}#{v2}"] = (self == v2).to_numo.cast_to(Numo::Int64)
|
307
|
+
end
|
308
|
+
DataFrame.new(new_vectors)
|
309
|
+
end
|
310
|
+
|
234
311
|
# TODO add type and size?
|
235
312
|
def inspect
|
236
313
|
elements = first(5).to_a.map(&:inspect)
|
@@ -242,7 +319,64 @@ module Rover
|
|
242
319
|
# for IRuby
|
243
320
|
def to_html
|
244
321
|
require "iruby"
|
245
|
-
|
322
|
+
if size > 7
|
323
|
+
# pass 8 rows so maxrows is applied
|
324
|
+
IRuby::HTML.table(first(4).to_a + last(4).to_a, maxrows: 7)
|
325
|
+
else
|
326
|
+
IRuby::HTML.table(to_a)
|
327
|
+
end
|
328
|
+
end
|
329
|
+
|
330
|
+
private
|
331
|
+
|
332
|
+
def cast_data(data, type: nil)
|
333
|
+
numo_type = numo_type(type) if type
|
334
|
+
|
335
|
+
data = data.to_numo if data.is_a?(Vector)
|
336
|
+
|
337
|
+
if data.is_a?(Numo::NArray)
|
338
|
+
raise ArgumentError, "Complex types not supported yet" if data.is_a?(Numo::DComplex) || data.is_a?(Numo::SComplex)
|
339
|
+
|
340
|
+
if type
|
341
|
+
case type
|
342
|
+
when /int/
|
343
|
+
# Numo does not check these when casting
|
344
|
+
raise RangeError, "float NaN out of range of integer" if data.respond_to?(:isnan) && data.isnan.any?
|
345
|
+
raise RangeError, "float Inf out of range of integer" if data.respond_to?(:isinf) && data.isinf.any?
|
346
|
+
|
347
|
+
data = data.to_a.map { |v| v.nil? ? nil : v.to_i } if data.is_a?(Numo::RObject)
|
348
|
+
when /float/
|
349
|
+
data = data.to_a.map { |v| v.nil? ? Float::NAN : v.to_f } if data.is_a?(Numo::RObject)
|
350
|
+
end
|
351
|
+
|
352
|
+
data = numo_type.cast(data)
|
353
|
+
end
|
354
|
+
else
|
355
|
+
data = data.to_a
|
356
|
+
|
357
|
+
if type
|
358
|
+
data = numo_type.cast(data)
|
359
|
+
else
|
360
|
+
data =
|
361
|
+
if data.all? { |v| v.is_a?(Integer) }
|
362
|
+
Numo::Int64.cast(data)
|
363
|
+
elsif data.all? { |v| v.is_a?(Numeric) || v.nil? }
|
364
|
+
Numo::DFloat.cast(data.map { |v| v || Float::NAN })
|
365
|
+
elsif data.all? { |v| v == true || v == false }
|
366
|
+
Numo::Bit.cast(data)
|
367
|
+
else
|
368
|
+
Numo::RObject.cast(data)
|
369
|
+
end
|
370
|
+
end
|
371
|
+
end
|
372
|
+
|
373
|
+
data
|
374
|
+
end
|
375
|
+
|
376
|
+
def numo_type(type)
|
377
|
+
numo_type = TYPE_CAST_MAPPING[type]
|
378
|
+
raise ArgumentError, "Invalid type: #{type}" unless numo_type
|
379
|
+
numo_type
|
246
380
|
end
|
247
381
|
end
|
248
382
|
end
|
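When no `type:` is given, the `cast_data` method above picks a Numo class by inspecting the values. The order of the checks matters, and a small sketch makes it easier to see. `infer_type` below is a hypothetical helper, not part of Rover; it returns the type symbol that corresponds to each branch rather than building a Numo array.

```ruby
# Hypothetical helper mirroring the untyped branch of cast_data.
def infer_type(values)
  if values.all? { |v| v.is_a?(Integer) }
    :int
  elsif values.all? { |v| v.is_a?(Numeric) || v.nil? }
    # In the diff, nil is replaced with Float::NAN in this branch.
    :float
  elsif values.all? { |v| v == true || v == false }
    :bool
  else
    :object
  end
end

infer_type([1, 2, 3])      # => :int
infer_type([1, nil, 2.5])  # => :float
infer_type([true, false])  # => :bool
infer_type(["a", 1])       # => :object
```

One consequence of this ordering is that a single `nil` in an otherwise integer column moves it to the float branch, because an integer array has no way to represent a missing value.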
data/lib/rover/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rover-df
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.2.3
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Andrew Kane
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2021-02-08 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: numo-narray
|
@@ -16,100 +16,16 @@ dependencies:
|
|
16
16
|
requirements:
|
17
17
|
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: 0.9.1.
|
19
|
+
version: 0.9.1.9
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
24
|
- - ">="
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: 0.9.1.
|
27
|
-
|
28
|
-
|
29
|
-
requirement: !ruby/object:Gem::Requirement
|
30
|
-
requirements:
|
31
|
-
- - ">="
|
32
|
-
- !ruby/object:Gem::Version
|
33
|
-
version: '0'
|
34
|
-
type: :development
|
35
|
-
prerelease: false
|
36
|
-
version_requirements: !ruby/object:Gem::Requirement
|
37
|
-
requirements:
|
38
|
-
- - ">="
|
39
|
-
- !ruby/object:Gem::Version
|
40
|
-
version: '0'
|
41
|
-
- !ruby/object:Gem::Dependency
|
42
|
-
name: rake
|
43
|
-
requirement: !ruby/object:Gem::Requirement
|
44
|
-
requirements:
|
45
|
-
- - ">="
|
46
|
-
- !ruby/object:Gem::Version
|
47
|
-
version: '0'
|
48
|
-
type: :development
|
49
|
-
prerelease: false
|
50
|
-
version_requirements: !ruby/object:Gem::Requirement
|
51
|
-
requirements:
|
52
|
-
- - ">="
|
53
|
-
- !ruby/object:Gem::Version
|
54
|
-
version: '0'
|
55
|
-
- !ruby/object:Gem::Dependency
|
56
|
-
name: minitest
|
57
|
-
requirement: !ruby/object:Gem::Requirement
|
58
|
-
requirements:
|
59
|
-
- - ">="
|
60
|
-
- !ruby/object:Gem::Version
|
61
|
-
version: '5'
|
62
|
-
type: :development
|
63
|
-
prerelease: false
|
64
|
-
version_requirements: !ruby/object:Gem::Requirement
|
65
|
-
requirements:
|
66
|
-
- - ">="
|
67
|
-
- !ruby/object:Gem::Version
|
68
|
-
version: '5'
|
69
|
-
- !ruby/object:Gem::Dependency
|
70
|
-
name: activerecord
|
71
|
-
requirement: !ruby/object:Gem::Requirement
|
72
|
-
requirements:
|
73
|
-
- - ">="
|
74
|
-
- !ruby/object:Gem::Version
|
75
|
-
version: '5'
|
76
|
-
type: :development
|
77
|
-
prerelease: false
|
78
|
-
version_requirements: !ruby/object:Gem::Requirement
|
79
|
-
requirements:
|
80
|
-
- - ">="
|
81
|
-
- !ruby/object:Gem::Version
|
82
|
-
version: '5'
|
83
|
-
- !ruby/object:Gem::Dependency
|
84
|
-
name: sqlite3
|
85
|
-
requirement: !ruby/object:Gem::Requirement
|
86
|
-
requirements:
|
87
|
-
- - ">="
|
88
|
-
- !ruby/object:Gem::Version
|
89
|
-
version: '0'
|
90
|
-
type: :development
|
91
|
-
prerelease: false
|
92
|
-
version_requirements: !ruby/object:Gem::Requirement
|
93
|
-
requirements:
|
94
|
-
- - ">="
|
95
|
-
- !ruby/object:Gem::Version
|
96
|
-
version: '0'
|
97
|
-
- !ruby/object:Gem::Dependency
|
98
|
-
name: iruby
|
99
|
-
requirement: !ruby/object:Gem::Requirement
|
100
|
-
requirements:
|
101
|
-
- - ">="
|
102
|
-
- !ruby/object:Gem::Version
|
103
|
-
version: '0'
|
104
|
-
type: :development
|
105
|
-
prerelease: false
|
106
|
-
version_requirements: !ruby/object:Gem::Requirement
|
107
|
-
requirements:
|
108
|
-
- - ">="
|
109
|
-
- !ruby/object:Gem::Version
|
110
|
-
version: '0'
|
111
|
-
description:
|
112
|
-
email: andrew@chartkick.com
|
26
|
+
version: 0.9.1.9
|
27
|
+
description:
|
28
|
+
email: andrew@ankane.org
|
113
29
|
executables: []
|
114
30
|
extensions: []
|
115
31
|
extra_rdoc_files: []
|
@@ -120,13 +36,14 @@ files:
|
|
120
36
|
- lib/rover-df.rb
|
121
37
|
- lib/rover.rb
|
122
38
|
- lib/rover/data_frame.rb
|
39
|
+
- lib/rover/group.rb
|
123
40
|
- lib/rover/vector.rb
|
124
41
|
- lib/rover/version.rb
|
125
42
|
homepage: https://github.com/ankane/rover
|
126
43
|
licenses:
|
127
44
|
- MIT
|
128
45
|
metadata: {}
|
129
|
-
post_install_message:
|
46
|
+
post_install_message:
|
130
47
|
rdoc_options: []
|
131
48
|
require_paths:
|
132
49
|
- lib
|
@@ -141,8 +58,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
141
58
|
- !ruby/object:Gem::Version
|
142
59
|
version: '0'
|
143
60
|
requirements: []
|
144
|
-
rubygems_version: 3.
|
145
|
-
signing_key:
|
61
|
+
rubygems_version: 3.2.3
|
62
|
+
signing_key:
|
146
63
|
specification_version: 4
|
147
64
|
summary: Simple, powerful data frames for Ruby
|
148
65
|
test_files: []
|