red_amber 0.1.5 → 0.1.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4d18eedf5de7fd06fe52e8a82ad38fe12d590dc10929c96872e557b9e946f785
4
- data.tar.gz: dda93f0af421096410e00ecf2261e8846a236634bd96ae9941d1b5cd49cd5eb2
3
+ metadata.gz: ae6a6696e0f01ae7d621d11542e203803ba117fc1ee3d286a1444b3c4ac746fc
4
+ data.tar.gz: 722d4ad538fe4f0c85db4911e773e1f87eb03f47fa63c954529bf04babc55d8c
5
5
  SHA512:
6
- metadata.gz: 7c1b1edd6c1f6f3f275ea765c4bc8765327c88a36120a4c5a66dd8afa59f5913db4a5b436d80378554e03403bab823edf7467beea0f44e2803e36f3e9677a065
7
- data.tar.gz: 949fd15d2076d4e53fb141375bde282228c7f6566e137047344134c54964fe77fd2f9757b0bdc324eb3cfa14091f2ae928e0e844d28f3ebbcfa17fc7d388bbd0
6
+ metadata.gz: 96887abfbdd44330e80a6a97f91597c00706fc99492d086683702f1d3e757331e90fb275e5796a2b7b3228b476f8c7799ab22727411baedb79ae39acafd2d3f0
7
+ data.tar.gz: c020bba60734fccdeb4a18efecb98260f70fa76c5b8ba2c7f2830d2dac6de66e9776a1fa2ffc5d396e7270b29c5e4dd9737dee70b5dcc47dd3113aaafe4f4d22
data/.rubocop.yml CHANGED
@@ -56,7 +56,9 @@ Metrics/AbcSize:
56
56
  Max: 30
57
57
  Exclude:
58
58
  - 'lib/red_amber/data_frame_displayable.rb' # Max: 55
59
- - 'lib/red_amber/vector_compensable.rb' # Max: 36
59
+ - 'lib/red_amber/data_frame_selectable.rb' # Max: 51
60
+ - 'lib/red_amber/vector_updatable.rb' # Max: 36
61
+ - 'lib/red_amber/vector_selectable.rb' # Max: 33
60
62
 
61
63
  # Max: 25
62
64
  Metrics/BlockLength:
@@ -66,15 +68,18 @@ Metrics/BlockLength:
66
68
 
67
69
  # Max: 100
68
70
  Metrics/ClassLength:
69
- Max: 120
71
+ Max: 100
70
72
  Exclude:
71
73
  - 'test/**/*'
74
+ - 'lib/red_amber/data_frame.rb' #Max: 131
75
+ - 'lib/red_amber/vector.rb' #Max: 102
72
76
 
73
77
  # Max: 7
74
78
  Metrics/CyclomaticComplexity:
75
79
  Max: 12
76
80
  Exclude:
77
- - 'lib/red_amber/vector_compensable.rb' # Max: 14
81
+ - 'lib/red_amber/data_frame_selectable.rb' # Max: 14
82
+ - 'lib/red_amber/vector_updatable.rb' # Max: 14
78
83
 
79
84
  # Max: 10
80
85
  Metrics/MethodLength:
@@ -86,20 +91,34 @@ Metrics/MethodLength:
86
91
  Metrics/ModuleLength:
87
92
  Max: 100
88
93
  Exclude:
94
+ - 'lib/red_amber/data_frame_selectable.rb' # Max: 141
89
95
  - 'lib/red_amber/vector_functions.rb' # Max: 114
90
96
 
91
97
  # Max: 8
92
98
  Metrics/PerceivedComplexity:
93
99
  Max: 13
94
100
  Exclude:
95
- - 'lib/red_amber/vector_compensable.rb' # Max: 15
101
+ - 'lib/red_amber/data_frame_selectable.rb' # Max: 14
102
+ - 'lib/red_amber/vector_updatable.rb' # Max: 15
103
+
104
+ Naming/FileName:
105
+ Exclude:
106
+ - 'lib/red-amber.rb'
96
107
 
97
- # Necessary to define is_na
108
+ # Necessary to define is_na, is_in, etc.
98
109
  Naming/PredicateName:
99
110
  Exclude:
100
111
  - 'lib/red_amber/vector_functions.rb'
112
+ - 'lib/red_amber/vector.rb'
113
+ - 'lib/red_amber/vector_selectable.rb'
101
114
 
102
115
  # Necessary to test when range.end == -1
103
116
  Style/SlicingWithRange:
104
117
  Exclude:
105
118
  - 'test/test_data_frame_selectable.rb'
119
+
120
+ # Necessary for element-wise comparison such as Vector < 0
121
+ Style/NumericPredicate:
122
+ Exclude:
123
+ - 'lib/red_amber/data_frame_selectable.rb'
124
+ - 'lib/red_amber/vector_selectable.rb'
data/CHANGELOG.md CHANGED
@@ -1,30 +1,115 @@
1
- ## [0.2.0] - unreleased
1
+ ## - unreleased
2
2
 
3
3
  - Document
4
4
  - YARD support
5
5
 
6
- - DataFrame#join features
6
+ - `datasets-red-amber` gem
7
+ - `red-amber` gem
7
8
 
8
- ## [0.1.6] - Unreleased
9
+ - `Vector#divmod`
10
+ - Introduce if Arrow's function is ready
9
11
 
10
- - Feedback something to Red Data Tools
12
+ ## - Unreleased, will be after Arrow 9.0.0 released
11
13
 
12
14
  - `DataFrame`
13
15
  - Introduce `summary` or `describe`
14
- - Add `Quantile` by own code?
15
- - Improve dataframe obs. manipuration methods to accept float as a index (#10)
16
- - Improve as more performant by benchmark check.
16
+ - `Quantile` will be available
17
17
 
18
- - `Vector`
19
- - Support more functions
20
- - Support coerece
18
+ ## [0.1.7] - Unreleased, may be 2022-07-10
21
19
 
20
+ - Feedback something to Red Data Tools
21
+ - Support more functions
22
+ - Improve as more performant
22
23
  - More examples of frequently needed tasks
23
24
 
25
+ - New `Group` API
26
+ - `DataFrame#join` features
27
+
28
+ ## [0.1.6] - 2022-06-26 (experimental)
29
+
30
+ - Bug fixes
31
+ - Fix mime-type of empty DataFrame in `#to_iruby` (#31)
32
+ - Fix mime setting in `DataFrame#to_iruby` (#36)
33
+ - Fix unmatched return val in Selectable (#34)
34
+ - Fix to return same error as `#[]` in `DataFrame#slice` (#34)
35
+
36
+ - New features and improvements
37
+ - Introduce Jupyter support (#29, #30, #31, #32)
38
+ - Add `DataFrame#to_html` (changed to use `#to_iruby`)
39
+ - Add feature to show nil in to_iruby
40
+ - nil is expressed as (nil)
41
+ - empty string('') is ""
42
+ - blank spaces are " "
43
+
44
+ - Enable to change DataFrame display mode by ENV (#36)
45
+ - Support ENV['RED_AMBER_OUTPUT_STYLE'] to change display mode in `#inspect` and `#to_iruby`
46
+ - ENV['RED_AMBER_OUTPUT_STYLE'] = 'table' # => Table mode
47
+ - ENV['RED_AMBER_OUTPUT_STYLE'] = nil or other than 'table' # => TDR mode
48
+
49
+ - Support `require 'red-amber'`, as well (#34)
50
+
51
+ - Refine Vector slicing methods (#31)
52
+ - Introduce `Vector#take` method
53
+ - Introduce `Vector#filter` method
54
+ - Improve `Vector#[]` to overload take and filter
55
+ - Introduce `Vector#drop_nil` method
56
+ - Introduce `Vector#if_else` method
57
+ - Introduce `Vector#is_in` method
58
+ - Add alias `Vector#all?`, `#any?` methods (#32)
59
+ - Add `Vector#has_nil?` method (#32)
60
+ - Add `Vector#empty?` method
61
+ - Add `Vector#primitive_invert` method
62
+ - Refactor `Vector#take`, `#filter`
63
+ - Move `Vector#if_else` from function to Updatable
64
+ - Move if_else test to updatable
65
+ - Rename updatable in test
66
+ - Remove method `Vector#take_out_element_wise`
67
+ - Rename inner method name
68
+
69
+ - Refine DataFrame slicing methods (#31)
70
+ - Introduce `DataFrame#take` method
71
+ - #take is implemented as vector calculation by #if_else
72
+ - Introduce `DataFrame#filter` method
73
+ - Change `DataFrame#[]` to use take and filter
74
+ - Float indices are acceptable (#10)
75
+ - Negative index (like Array) is also acceptable
76
+
77
+ - Further refinement in DataFrame slicing methods (#34)
78
+ - Improve `DataFrame#[]`, `#slice`, `#remove` by a new engine
79
+ - It parses arguments to Vector internally.
80
+ - Used Kernel#Array to simplify code (#16).
81
+ - recycle: Move `DataFrame#slice`, `#remove` to Selectable
82
+ - Refine `DataFrame#take`, `#filter` (undocumented)
83
+
84
+ - Introduce coerce in Vector (#35)
85
+ - Introduce `Vector#coerce`
86
+ - Now we can `-1 * Vector.new([1, 2, 3])`
87
+ - Add `Vector#to_ary` method
88
+ - Now we can `[1, 2] + Vector.new([3, 4, 5])`
89
+
90
+ - Other new features or refinements
91
+ - Common
92
+ - Refactor helper as common for DataFrame and Vector (#35)
93
+ - Change name row/col to obs/var (#34)
94
+ - Rename internal function name (#34)
95
+ - Delete unused methods (#34)
96
+ - DataFrame
97
+ - Change to return instance variable in `#to_arrow`, `#keys` and `#key_index` (#34)
98
+ - Change to return an Array in `DataFrame#indices` (#35)
99
+ - Vector
100
+ - Introduce `Vector#replace` method
101
+ - Accept Range and expanded Array in `Vector#new`
102
+ - Add `Vector#indices` method (#35)
103
+ - Add `Vector#index` method (#35)
104
+ - Rename VectorCompensable to *Updatable (#33)
105
+
106
+ - Documentation
107
+ - Fix typo in DataFrame.md
108
+
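The 0.1.6 entry above says that `Vector#coerce` enables `-1 * Vector.new([1, 2, 3])` and that `Vector#to_ary` enables `[1, 2] + Vector.new([3, 4, 5])`. The sketch below shows the two Ruby protocols involved, using a toy `TinyVector` class. The class and its internals are hypothetical and stand in for illustration only; this is not RedAmber's implementation, which is built on Arrow arrays.

```ruby
# Toy stand-in for a vector type (hypothetical; NOT RedAmber's implementation).
class TinyVector
  attr_reader :values

  def initialize(values)
    @values = values.to_a
  end

  # Element-wise multiplication by a scalar or by another TinyVector.
  def *(other)
    case other
    when Numeric
      TinyVector.new(@values.map { |v| v * other })
    when TinyVector
      TinyVector.new(@values.zip(other.values).map { |a, b| a * b })
    else
      raise TypeError, "can't multiply by #{other.class}"
    end
  end

  # Ruby calls this when a Numeric is the receiver, as in `-1 * vec`.
  # It returns [wrapped_scalar, self]; Ruby then evaluates wrapped_scalar * self.
  def coerce(other)
    [TinyVector.new([other] * @values.size), self]
  end

  # Implicit Array conversion; Array#+ calls it on its argument.
  def to_ary
    @values
  end
end

vec = TinyVector.new([1, 2, 3])
negated = -1 * vec                           # dispatched through #coerce
joined = [1, 2] + TinyVector.new([3, 4, 5])  # dispatched through #to_ary

p negated.values # => [-1, -2, -3]
p joined         # => [1, 2, 3, 4, 5]
```

Without `coerce`, `-1 * vec` raises a `TypeError` because `Integer#*` does not know how to handle the right-hand operand; without `to_ary`, `Array#+` raises for the same reason.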
24
109
  ## [0.1.5] - 2022-06-12 (experimental)
25
110
 
26
111
  - Bug fixes
27
- - Fix DF#tdr to display timestamp type (#19)
112
+ - Fix DataFrame#tdr to display timestamp type (#19)
28
113
  - Add TZ setting in CI test to pass temporal tests (#19)
29
114
  - Fix example in document of #load(csv_from_URI) (#23)
30
115
 
@@ -38,7 +123,7 @@
38
123
  - Add `Vector#temporal?` to check if temporal type
39
124
  - Refine around DataFrame#variables
40
125
  - Refine init of instance variables
41
- - Refine DataFrame#type_classes, V#ectortype_class
126
+ - Refine DataFrame#type_classes, Vector#type_class
42
127
  - Refine DataFrame#tdr to shorten temporal data
43
128
 
44
129
  - Add supports to make up for missing values (#20)
@@ -86,7 +171,7 @@
86
171
 
87
172
  - Bug fixes
88
173
  - Fix missing support for scalar argument (#1)
89
- - Fix type name of boolean in DF#types to be same as Vector#type (#6, #7)
174
+ - Fix type name of boolean in DataFrame#types to be same as Vector#type (#6, #7)
90
175
  - Fix zero picking to return empty DataFrame (#8)
91
176
  - Fix code at both args and a block given (#8)
92
177
 
data/Gemfile CHANGED
@@ -12,6 +12,7 @@ group :test do
12
12
  gem 'rubocop-rake'
13
13
  gem 'rubocop-rubycw', require: false
14
14
 
15
+ gem 'iruby'
15
16
  gem 'test-unit'
16
17
  gem 'webrick'
17
18
 
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # RedAmber
2
2
 
3
- A simple dataframe library for Ruby (experimental)
3
+ A simple dataframe library for Ruby (experimental).
4
4
 
5
5
  - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
6
6
  - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
@@ -42,17 +42,56 @@ Or install it yourself as:
42
42
  gem install red_amber
43
43
  ```
44
44
 
45
+ (From v0.1.6)
46
+
47
+ RedAmber uses TDR mode for `#inspect` and `#to_iruby` by default. If you prefer Table mode, please set the environment variable
48
+ `RED_AMBER_OUTPUT_MODE` to `"table"`. See [TDR section](#TDR) for details.
49
+
45
50
  ## `RedAmber::DataFrame`
46
51
 
47
- Represents a set of data in 2D-shape.
52
+ Represents a set of data in a 2D shape. The underlying entity is a Red Arrow Table object.
48
53
 
49
54
  ```ruby
50
- require 'red_amber'
55
+ require 'red_amber' # require 'red-amber' is also OK.
51
56
  require 'datasets-arrow'
52
57
 
53
58
  arrow = Datasets::Penguins.new.to_arrow
54
59
  penguins = RedAmber::DataFrame.new(arrow)
55
- penguins.tdr
60
+ penguins.table
61
+
62
+ # =>
63
+ #<Arrow::Table:0x111271098 ptr=0x7f9118b3e0b0>
64
+ species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
65
+ 0 Adelie Torgersen 39.100000 18.700000 181 3750 male 2007
66
+ 1 Adelie Torgersen 39.500000 17.400000 186 3800 female 2007
67
+ 2 Adelie Torgersen 40.300000 18.000000 195 3250 female 2007
68
+ 3 Adelie Torgersen (null) (null) (null) (null) (null) 2007
69
+ 4 Adelie Torgersen 36.700000 19.300000 193 3450 female 2007
70
+ 5 Adelie Torgersen 39.300000 20.600000 190 3650 male 2007
71
+ 6 Adelie Torgersen 38.900000 17.800000 181 3625 female 2007
72
+ 7 Adelie Torgersen 39.200000 19.600000 195 4675 male 2007
73
+ 8 Adelie Torgersen 34.100000 18.100000 193 3475 (null) 2007
74
+ 9 Adelie Torgersen 42.000000 20.200000 190 4250 (null) 2007
75
+ ...
76
+ 334 Gentoo Biscoe 46.200000 14.100000 217 4375 female 2009
77
+ 335 Gentoo Biscoe 55.100000 16.000000 230 5850 male 2009
78
+ 336 Gentoo Biscoe 44.500000 15.700000 217 4875 (null) 2009
79
+ 337 Gentoo Biscoe 48.800000 16.200000 222 6000 male 2009
80
+ 338 Gentoo Biscoe 47.200000 13.700000 214 4925 female 2009
81
+ 339 Gentoo Biscoe (null) (null) (null) (null) (null) 2009
82
+ 340 Gentoo Biscoe 46.800000 14.300000 215 4850 female 2009
83
+ 341 Gentoo Biscoe 50.400000 15.700000 222 5750 male 2009
84
+ 342 Gentoo Biscoe 45.200000 14.800000 212 5200 female 2009
85
+ 343 Gentoo Biscoe 49.900000 16.100000 213 5400 male 2009
86
+ ```
87
+
88
+ By default, RedAmber displays a DataFrame in a compact transposed style. This unfamiliar style (TDR) is designed for
89
+ exploratory data processing. It keeps Vectors as row vectors, shows keys and types at a glance, shows levels
90
+ for the 'factor-like' variables and shows the number of abnormal values like NaN and nil.
91
+
92
+ ```ruby
93
+ penguins
94
+
56
95
  # =>
57
96
  RedAmber::DataFrame : 344 x 8 Vectors
58
97
  Vectors : 5 numeric, 3 strings
@@ -139,9 +178,19 @@ Vectors accepts some [functional methods from Arrow](https://arrow.apache.org/do
139
178
 
140
179
  See [Vector.md](doc/Vector.md) for details.
141
180
 
142
- ## TDR concept
181
+ ## TDR
182
+
183
+ I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation).
184
+
185
+ This library supports both TDR mode and the usual Table mode.
186
+ If you set the environment variable `RED_AMBER_OUTPUT_MODE` to `"table"`, `inspect` and `to_iruby` use Table mode. Any other value, including nil, selects TDR mode.
187
+
188
+ You can switch the mode in Ruby like this.
189
+ ```ruby
190
+ ENV['RED_AMBER_OUTPUT_STYLE'] = 'table' # => Table mode
191
+ ```
143
192
 
144
- I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
193
+ For more detailed information about TDR, see [TDR.md](doc/tdr.md).
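The mode selection described in this section can be sketched as a small pure-Ruby helper. `display_mode` is a hypothetical name used only for illustration and is not part of RedAmber. The variable name is passed in as a parameter because this diff spells it both `RED_AMBER_OUTPUT_MODE` and `RED_AMBER_OUTPUT_STYLE`, and the source does not settle which one the library reads.

```ruby
# Hypothetical helper (not RedAmber's API): pick a display mode from an
# ENV-like object. Only the exact string 'table' selects Table mode;
# nil or any other value falls back to TDR mode, as the README states.
def display_mode(env, var_name)
  env[var_name] == 'table' ? :table : :tdr
end

name = 'RED_AMBER_OUTPUT_STYLE' # assumption: one of the two spellings above

p display_mode({ name => 'table' }, name) # => :table
p display_mode({}, name)                  # => :tdr
p display_mode({ name => 'plain' }, name) # => :tdr
```

Passing `ENV` itself as the first argument works as well, since `ENV` responds to `[]` like a Hash.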
145
194
 
146
195
  ## Development
147
196
 
data/doc/DataFrame.md CHANGED
@@ -4,11 +4,13 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
4
4
  - A collection of data which have same data type within. We call it `Vector`.
5
5
  - A label is attached to `Vector`. We call it `key`.
6
6
  - A `Vector` and associated `key` is grouped as a `variable`.
7
- - `variable`s with same vector length are aligned and arranged to be a `DaTaFrame`.
7
+ - `variable`s with same vector length are aligned and arranged to be a `DataFrame`.
8
8
  - Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
9
9
 
10
10
  ![dataframe model image](doc/../image/dataframe_model.png)
11
11
 
12
+ (No change in this model in v0.1.6.)
13
+
12
14
  ## Constructors and saving
13
15
 
14
16
  ### `new` from a Hash
@@ -52,7 +54,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
52
54
  - from a URI
53
55
 
54
56
  ```ruby
55
- uri = URI("uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
57
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
56
58
  RedAmber::DataFrame.load(uri)
57
59
  ```
58
60
 
@@ -147,9 +149,9 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
147
149
 
148
150
  - Returns an Array of Vectors.
149
151
 
150
- ### `indexes`, `indices`
152
+ ### `indices`, `indexes`
151
153
 
152
- - Returns all indexes in a Range.
154
+ - Returns all indexes in an Array.
153
155
 
154
156
  ### `to_h`
155
157
 
@@ -179,6 +181,10 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
179
181
 
180
182
  - Returns a `Rover::DataFrame`.
181
183
 
184
+ ### `to_iruby`
185
+
186
+ - Shows the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
187
+
182
188
  ### `tdr(limit = 10, tally: 5, elements: 5)`
183
189
 
184
190
  - Shows some information about self in a transposed style.
@@ -280,6 +286,9 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
280
286
  An end-less or a begin-less Range can be used to represent indices.
281
287
 
282
288
  - Select obs. by indices in an Array: `df[1, 2]`
289
+
290
+ - You can use float indices.
291
+
283
292
  - Mixed case: `df[2, 0..]`
284
293
 
285
294
  ```ruby
@@ -423,9 +432,11 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
423
432
 
424
433
  ![slice method image](doc/../image/dataframe/slice.png)
425
434
 
426
- - Keys as arguments
435
+ - Indices as arguments
427
436
 
428
- `slice(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer.
437
+ `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
438
+
439
+ A negative index, counted from the tail like in Ruby's Array, is also acceptable.
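Since the text compares these indices to Ruby's Array, the sketch below shows what plain `Array#[]` does with each kind of index. It uses only core Ruby and is meant as a point of comparison; how RedAmber itself rounds Float indices is not stated in this diff, so the truncation shown here is Array's behavior, not a claim about `DataFrame#slice`.

```ruby
# Plain Ruby Array indexing, as the point of comparison for slice arguments.
letters = %w[a b c d e]

p letters[-1]   # => "e"         (negative index counts from the tail)
p letters[1.9]  # => "b"         (Array#[] truncates a Float index)
p letters[0..1] # => ["a", "b"]  (Range of Integers)
p letters[-2..] # => ["d", "e"]  (endless Range starting from the tail)

# Combining several ranges, similar in spirit to passing a head range
# and a tail range in one call.
picked = [0...2, -2..-1].flat_map { |range| letters[range] }
p picked        # => ["a", "b", "d", "e"]
```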
429
440
 
430
441
  ```ruby
431
442
  # returns 5 obs. at start and 5 obs. from end
@@ -457,7 +468,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
457
468
  ... 5 more Vectors ...
458
469
  ```
459
470
 
460
- - Keys or booleans by a block
471
+ - Indices or booleans by a block
461
472
 
462
473
  `slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
463
474
 
@@ -469,6 +480,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
469
480
  max = vector.mean + vector.std
470
481
  vector.to_a.map { |e| (min..max).include? e }
471
482
  end
483
+
472
484
  # =>
473
485
  #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
474
486
  Vectors : 5 numeric, 3 strings
@@ -509,7 +521,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
509
521
 
510
522
  ![remove method image](doc/../image/dataframe/remove.png)
511
523
 
512
- - Keys as arguments
524
+ - Indices as arguments
513
525
 
514
526
  `remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
515
527
 
@@ -548,7 +560,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
548
560
  8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
549
561
  ```
550
562
 
551
- - Keys or booleans by a block
563
+ - Indices or booleans by a block
552
564
 
553
565
  `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
554
566
 
@@ -748,6 +760,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
748
760
 
749
761
  ### `group(aggregating_keys, function, target_keys)`
750
762
 
763
+ (This is a temporary API and may change in a future version.)
764
+
751
765
  Creates a grouped dataframe by `aggregating_keys`, applies `function` to each group, and returns the result in `target_keys`. The aggregated key name is in `function(key)` style.
752
766
 
753
767
  (The current implementation is not intuitive. Needs improvement.)