RubyGems - red_amber - Versions diffs - 0.1.1 → 0.1.4 - Mend

red_amber 0.1.1 → 0.1.4

Files changed (39)

checksums.yaml +4 -4
data/.rubocop.yml +26 -10
data/.rubocop_todo.yml +1 -7
data/CHANGELOG.md +109 -8
data/README.md +66 -279
data/doc/DataFrame.md +690 -0
data/doc/Vector.md +195 -0
data/doc/image/TDR_operations.pdf +0 -0
data/doc/image/arrow_table_new.png +0 -0
data/doc/image/dataframe/assign.png +0 -0
data/doc/image/dataframe/drop.png +0 -0
data/doc/image/dataframe/pick.png +0 -0
data/doc/image/dataframe/remove.png +0 -0
data/doc/image/dataframe/rename.png +0 -0
data/doc/image/dataframe/slice.png +0 -0
data/doc/image/dataframe_model.png +0 -0
data/doc/image/example_in_red_arrow.png +0 -0
data/doc/image/tdr.png +0 -0
data/doc/image/tdr_and_table.png +0 -0
data/doc/image/tidy_data_in_TDR.png +0 -0
data/doc/image/vector/binary_element_wise.png +0 -0
data/doc/image/vector/unary_aggregation.png +0 -0
data/doc/image/vector/unary_aggregation_w_option.png +0 -0
data/doc/image/vector/unary_element_wise.png +0 -0
data/doc/tdr.md +53 -0
data/doc/tdr_ja.md +53 -0
data/lib/red_amber/data_frame.rb +42 -21
data/lib/red_amber/data_frame_displayable.rb +131 -0
data/lib/red_amber/data_frame_helper.rb +64 -0
data/lib/red_amber/data_frame_observation_operation.rb +72 -0
data/lib/red_amber/data_frame_selectable.rb +29 -35
data/lib/red_amber/data_frame_variable_operation.rb +133 -0
data/lib/red_amber/vector.rb +35 -2
data/lib/red_amber/vector_functions.rb +134 -58
data/lib/red_amber/version.rb +1 -1
data/lib/red_amber.rb +4 -1
data/red_amber.gemspec +5 -5
metadata +35 -10
data/lib/red_amber/data_frame_output.rb +0 -116

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 00ba2e99b2b1d6f977b2e2e5c7d60b9313972cf3e831918606e5388d51442137
-  data.tar.gz: f0fc831937bff5fede4ee0f0537b0ef5fdfb8a1faa8a57082a197a627562252c
+  metadata.gz: 6ceace9db54b82c03ccf00fcd1b7bf2af57d94ea4e54183dc6af1da47e21ef00
+  data.tar.gz: f30578dcec45fd5efec9219c6438fd0108a0690b1cd69b1c398dffacd38aeba1
 SHA512:
-  metadata.gz: 7bc020b8663c3523426461e3bd54642d4eb85a86296a8db3f5d94315091ee4475ec8b910fb87165c5d029e35fa9dc45f119bea6278e023d3cc63ad011388fbfb
-  data.tar.gz: 78dd55182b40ee9bec769efdbcac23adb85ad93bbafe3f74c4ded9d56ab40e39da0ce1e34a841d5705e6b94fea85312057e360140965fa217243eede0d238eb5
+  metadata.gz: ee26fd212d0cb0758bc4611c5b43b302fe5c1b958239b5a9ac81ee09e936bdded733a719507e24e5434c33fc5d7ece43c973dd66d51413f23cc435ea0bd7570c
+  data.tar.gz: 674f56a11ddf906f608ecf7d7c852bec654a749e9052092553d19be967072d5acec95a096fbecc60ffd4b33fad3f4322354d93fade67230078fff15b6b7398dd
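
The digests above cover the two archives packed inside the `.gem` file. As a minimal sketch (the helper name `digests_for` and the throwaway sample file are hypothetical, not part of RubyGems), the same kind of digest can be recomputed with Ruby's standard `digest` library and compared against a `checksums.yaml` entry:

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: returns the SHA256 and SHA512 hex digests of a file.
def digests_for(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Demonstration on a throwaway file; for a real check, point the helper at
# the metadata.gz or data.tar.gz extracted from the gem archive.
Tempfile.create('sample') do |f|
  f.write('hello')
  f.flush
  sums = digests_for(f.path)
  puts sums['SHA256']
  puts sums['SHA512']
end
```

A mismatch between a recomputed digest and the recorded one would suggest the archive was altered or corrupted after packaging.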

data/.rubocop.yml CHANGED

@@ -45,7 +45,7 @@ Lint/BinaryOperatorWithIdenticalOperands:
 # Max: 120
 Layout/LineLength:
-  Max: 100
+  Max: 118
   Exclude:
     - 'test/**/*'
@@ -53,16 +53,18 @@ Layout/LineLength:
 # 18..30 unsatisfactory
 # > 30 dangerous
 Metrics/AbcSize:
-  Max: 19
+  Max: 23
   Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 78
+    - 'lib/red_amber/data_frame_displayable.rb' # Max: 55
+    - 'lib/red_amber/data_frame_selectable.rb' # Max: 27
+    - 'lib/red_amber/data_frame_observation_operation.rb' # Max: 29
+    - 'lib/red_amber/data_frame_variable_operation.rb' # Max: 26
 # Max: 25
 Metrics/BlockLength:
   Max: 25
   Exclude:
     - 'test/**/*'
-    - '*.gemspec'
 # Max: 100
 Metrics/ClassLength:
@@ -72,18 +74,32 @@ Metrics/ClassLength:
 # Max: 7
 Metrics/CyclomaticComplexity:
-  Max: 10
-  Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 11
+  Max: 12
 # Max: 10
 Metrics/MethodLength:
   Max: 18
   Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 35
+    - 'lib/red_amber/data_frame_displayable.rb' # Max: 33
+    - 'lib/red_amber/data_frame_observation_operation.rb' # Max: 21
+    - 'lib/red_amber/data_frame_variable_operation.rb' # Max: 20
+# Max: 100
+Metrics/ModuleLength:
+  Max: 100
+  Exclude:
+    - 'lib/red_amber/vector_functions.rb' # Max: 114
 # Max: 8
 Metrics/PerceivedComplexity:
-  Max: 9
+  Max: 13
+# Necessary to define is_na
+Naming/PredicateName:
+  Exclude:
+    - 'lib/red_amber/vector_functions.rb'
+# Necessary to test when range.end == -1
+Style/SlicingWithRange:
   Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 12
+    - 'test/test_data_frame_selectable.rb'
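
The `Style/SlicingWithRange` exclusion is presumably needed because that cop flags `[n..-1]` in favor of the endless form `[n..]`, while the selectable tests have to exercise a range whose `end` is literally `-1`. A small plain-Ruby illustration (no RedAmber involved) of why the two forms slice identically yet are distinct `Range` objects:

```ruby
values = [10, 20, 30, 40]

explicit = (1..-1)  # end is the integer -1
endless  = (1..)    # end is nil

# Both forms slice an Array identically...
p values[explicit]  # => [20, 30, 40]
p values[endless]   # => [20, 30, 40]

# ...but code that inspects range.end sees different values.
p explicit.end      # => -1
p endless.end       # => nil
```

Any selection code that branches on `range.end` therefore needs a test for each form, which is what the exclusion allows.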

data/.rubocop_todo.yml CHANGED

@@ -1,17 +1,11 @@
 # This configuration was generated by
 # `rubocop --auto-gen-config`
-# on 2022-04-27 00:29:57 UTC using RuboCop version 1.27.0.
+# on 2022-05-08 02:37:36 UTC using RuboCop version 1.27.0.
 # The point is for the user to remove these configuration records
 # one by one as the offenses are removed from the code base.
 # Note that changes in the inspected code, or installation of new
 # versions of RuboCop, may require this file to be generated again.
-# Offense count: 1
-# This cop supports unsafe auto-correction (--auto-correct-all).
-Style/SlicingWithRange:
-  Exclude:
-    - 'lib/red_amber/data_frame_selectable.rb'
 # Offense count: 1
 # This cop supports unsafe auto-correction (--auto-correct-all).
 # Configuration parameters: EnforcedStyle.

data/CHANGELOG.md CHANGED

@@ -1,17 +1,118 @@
-## [0.1.2] - Unreleased
+##  - Unreleased
+- Feedback something to Red Arrow
-- Add support for Arrow 8.0.0
 - `DataFrame`
-  - Introduce updating
-  - Introduce NA support
-  - Add slice method
+  - Introduce `group_by`
+  - Introduce `summarize`
+  - Introduce `summary` or `describe`
+  - Improve dataframe obs. manipulation methods to accept float as an index (#10)
+  - More performant
 - `Vector`
-  - Add NaN support for functions
-  - More functions
+  - Support more functions
+- Document
+  - YARD support
+## [0.1.4] - 2022-05-29 (experimental)
+- Bug fixes
+  - Fix missing support for scalar argument (#1)
+  - Fix type name of boolean in DF#types to be same as Vector#type (#6, #7)
+  - Fix zero picking to return empty DataFrame (#8)
+  - Fix code at both args and a block given (#8)
+- New features and improvements
+  - `DataFrame`
+    - Refine module name `Displayable`
+    - Rename nrow/ncol methods to `size`/`n_keys` to align with TDR concept (#4)
+      - Remain `n_row`/`n_col` for compatibility
+    - Rename `ls` method to `tdr` (#4)
+      - Add limit option to `tdr`
+      - Shorten option name (#11)
+    - Introduce `pick` method to create sub DataFrame (#8)
+      - Add boolean support (#8)
+      - Refactor `pick` (#9)
+    - Introduce `drop` method to create sub DataFrame (#8)
+      - Add boolean support (#8)
+      - Refactor `drop` (#9)
+    - Add boolean array support for `[]` (#9)
+    - Add `indexes`/`indices` to use with selecting observations (#9)
+    - Introduce `slice` method to create sub DataFrame (#8)
+      - Refactor `slice` (#9)
+    - Introduce `remove` method to create sub DataFrame (#9)
+    - Introduce `rename` method to create sub DataFrame (#14)
+    - Introduce `assign` method to create sub DataFrame (#14)
+    - Improve to call block by instance_eval (#13)
+  - `Vector`
+    - Refine `find(function)`
+    - Add `min_max` method (#2)
+    - Add `std`/`sd` method (ddof=0 version: `stddev`) (#2)
+    - Add `var` method (ddof=0 version: `variance`) (#2)
+    - Add `VectorFunctions.arrow_doc(func_name)` (temporarily)
+  - Documentation
+    - Show code in README
+    - Change row/column names for **TDR** concept (#4)
+    - Add documents about **TDR** concept (#4)
+    - Add example about TDR (#4)
+    - Separate README to create DataFrame and Vector documents (#12)
+    - Add DataFrame model concept image to README (#12)
+  - GitHub site
+    - Switched to use merge on GitHub (not to push merged master) (#1)
+    - Create lifetime issue #3 to show the goal of this project (#3)
+## [0.1.3] - 2022-05-15 (experimental)
+- Bug fixes
+  - Fix boolean functions in `Vector` to align with Ruby's behavior
+    - `&` == `and_kleene`
+    - `|` == `or_kleene`
+  - Quote strings of data-preview in `DataFrame#inspect`
+  - Quote empty and blank keys in `DataFrame#inspect`
+  - Respond to error for a wrong key in `DataFrame#[]`
+- New features and improvements
+  - `DataFrame`
+    - Display nil elements in `inspect`
+    - Show NaN and nil counts in `inspect`
+    - Refactor `inspect`
+    - Add method `key` and `key_index`
+    - Add how to load/save Parquet to README
+  - `Vector`
+    - Add categorization functions
+      This is an important step to support `slice` method and NA treatment features.
+      -  `is_finite`
+      -  `is_inf`
+      -  `is_na` (RedAmber original)
+      -  `is_nan`
+      -  `is_nil`, `is_null`
+      -  `is_valid`
+    - Show in a reduced representation for long array in `inspect`
+    - Support options in aggregation functions
+    - Return values in non-arrow object for scalar aggregation functions
+## [0.1.2] - 2022-05-08 (experimental)
+- Bug fixes:
+  - `DataFrame`
+    - Fix bug in `#[]` with end-less Range
+- New features and improvements
+  - Add support for Arrow 8.0.0
+  - `DataFrame`
+    - `types` and `data_types`
+    - Range is usable to specify columns in `#[]`
+  - `Vector`
+    - `type` and `data_type`
 ## [0.1.1] - 2022-05-06 (experimental)
-- Release on rubygem.org
+- Release on rubygems.org
 - Introduce class `DataFrame`
   -  New from Hash, schema/rows, `Arrow::Table`, `Rover::DataFrame`
   -  Load from file, string, URI

data/README.md CHANGED

@@ -8,8 +8,8 @@ A simple dataframe library for Ruby (experimental)
 ## Requirements
 ```ruby
-gem 'red-arrow',   '~> 7.0.0'
-gem 'red-parquet', '~> 7.0.0' # if you use IO from/to parquet
+gem 'red-arrow',   '>= 7.0.0'
+gem 'red-parquet', '>= 7.0.0' # if you use IO from/to parquet
 gem 'rover-df',    '~> 0.3.0' # if you use IO from/to Rover::DataFrame
 ```
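
Relaxing the constraint from `~> 7.0.0` to `>= 7.0.0` appears to be what allows the gem to resolve against Arrow 8.0.0, matching the changelog entry for 0.1.2. A quick check with RubyGems' own `Gem::Requirement` class shows the difference between the two operators:

```ruby
require 'rubygems'

pessimistic = Gem::Requirement.new('~> 7.0.0')  # means >= 7.0.0 and < 7.1
open_ended  = Gem::Requirement.new('>= 7.0.0')

arrow8 = Gem::Version.new('8.0.0')

p pessimistic.satisfied_by?(arrow8)                     # => false
p open_ended.satisfied_by?(arrow8)                      # => true
p pessimistic.satisfied_by?(Gem::Version.new('7.0.1'))  # => true
```

The open-ended form trades the safety of a pinned minor series for compatibility with future Arrow releases.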
@@ -23,308 +23,95 @@ gem 'red_amber'
 And then execute:
-    $ bundle install
+```shell
+bundle install
+```
 Or install it yourself as:
-    $ gem install red_amber
+```shell
+gem install red_amber
+```
 ## `RedAmber::DataFrame`
-### Constructors and saving
-- [x] `new` from a columnar Hash
-  - `RedAmber::DataFrame.new(x: [1, 2, 3])`
-- [x] `new` from a schema (by Hash) and rows (by Array)
-  - `RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])`
-- [x] `new` from an Arrow::Table
-  - `RedAmber::DataFrame.new(Arrow::Table.new(x: [1, 2, 3]))`
-- [x] `new` from a Rover::DataFrame
-  - `RedAmber::DataFrame.new(Rover::DataFrame.new(x: [1, 2, 3]))`
-- [ ] `load` (class method)
-     - [x] from a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
-       - `RedAmber::DataFrame.load("test/entity/with_header.csv")`
-     - [x] from a string buffer
-     - [x] from a URI
-       - `RedAmber::DataFrame.load(URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv"))`
-     - [ ] from a parquet file
-- [ ] `save` (instance method)
-     - [x] to a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
-     - [x] to a string buffer
-     - [x] to a URI
-     - [ ] to a parquet file
-### Properties
-- [x] `table`
-  Reader of Arrow::Table object inside.
-- [x] `n_rows`, `nrow`, `size`, `length`
-  Returns num of rows (data size).
-- [x] `n_columns`, `ncol`, `width`
-  Returns num of columns (num of vectors).
-- [x] `shape`
-  Returns shape in an Array[n_rows, n_cols].
-- [x] `column_names`, `keys`
-  Returns num of column names by an Array.
-- [x] `types(class_name: false)`
-  Returns types of columns by an Array.
-  If `class_name: true` returns an Array of `Arrow::DataType`.
-- [x] `vectors`
-  Returns an Array of Vectors.
-- [x] `to_h`
-  Returns column-oriented data in a Hash.
-- [x] `to_a`, `raw_records`
-  Returns an array of row-oriented data without header. If you need a column-oriented full array, use `.to_h.to_a`
-- [x] `schema`
-  Returns column name and data type in a Hash.
-- [x] `==`
-- [x] `empty?`
-### Output
-- [x] `to_s`
-- [ ] summary, describe
-- [x] `to_rover`
+Represents a set of data in 2D-shape.
-  Returns a `Rover::DataFrame`.
-- [x] `inspect(tally_level: 5, max_element: 5)`
-  Shows some information about self.
-  - tally_level: max level to use tally mode
-  - max_element: max num of element to show values in each row
-### Selecting
-- [x] Selecting columns by `[]`
-  `[key]`, `[keys]`, `[keys[index]]`
-- [x] Selecting rows by `[]`
-  `[index]`, `[range]`, `[array]`
-- [x] Selecting rows from top or bottom
-  `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
-- [ ] slice
-### Updating
-- [ ] Add a new column
-- [ ] Update a single element
-- [ ] Update multiple elements
-- [ ] Update all elements
-- [ ] Update elements matching a condition
-- [ ] Clamp
-- [ ] Delete columns
-- [ ] Rename a column
-- [ ] Sort rows
-- [ ] Clear data
-### Treat na data
+```ruby
+require 'red_amber'
+require 'datasets-arrow'
+penguins = Datasets::Penguins.new.to_arrow
+puts RedAmber::DataFrame.new(penguins).tdr
+# =>
+RedAmber::DataFrame : 344 x 8 Vectors
+Vectors : 5 numeric, 3 strings
+# key                type   level data_preview
+1 :species           string     3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
+2 :island            string     3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
+3 :bill_length_mm    double   165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
+4 :bill_depth_mm     double    81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
+5 :flipper_length_mm uint8     56 [181, 186, 195, nil, 193, ... ], 2 nils
+6 :body_mass_g       uint16    95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+7 :sex               string     3 {"male"=>168, "female"=>165, nil=>11}
+8 :year              uint16     3 {2007=>110, 2008=>114, 2009=>120}
+```
-- [ ] Drop na (NaN, nil)
+### DataFrame model
+![dataframe model of RedAmber](doc/image/dataframe_model.png)
-- [ ] Replace na with value
+For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
-- [ ] Interpolate na with convolution array
+```ruby
+df = penguins.pick(:body_mass_g)
+# =>
+#<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
+Vector : 1 numeric
+# key          type  level data_preview
+1 :body_mass_g int64    95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+```
-### Combining DataFrames
+`DataFrame#assign` can accept a block and create new variables.
-- [ ] Add rows
+```ruby
+df.assign do
+  {:body_mass_kg => penguins[:body_mass_g] / 1000.0}
+end
+# =>
+#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
+Vectors : 2 numeric
+# key           type   level data_preview
+1 :body_mass_g  int64     95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+2 :body_mass_kg double    95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
+```
-- [ ] Add columns
+Other DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove` and `rename` also accept a block.
-- [ ] Inner join
+See [DataFrame.md](doc/DataFrame.md) for details.
-- [ ] Left join
-### Encoding
+## `RedAmber::Vector`
-- [ ] One-hot encoding
+Class `RedAmber::Vector` represents a series of data in the DataFrame.
-### Iteration (not implemented)
+```ruby
+penguins[:species]
+# =>
+#<RedAmber::Vector(:string, size=344):0x000000000000f8e8>
+["Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", ... ]
+```
-### Filtering (not implemented)
+Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
+See [Vector.md](doc/Vector.md) for details.
-## `RedAmber::Vector`
-### Constructor
-- [x] Create from a column in a DataFrame
-- [x] New from an Array
-### Properties
-- [x] `to_s`
-- [x] `values`, `to_a`, `entries`
-- [x] `size`, `length`, `n_rows`, `nrow`
-- [x] `type`
-- [ ] `each`
-- [ ] `chunked?`
-- [ ] `n_chunks`
-- [ ] `each_chunk`
-- [x] `tally`
-- [ ] `n_nulls`
-### Functions
-#### Unary aggregations: vector.func => Scalar
-| Method    |Boolean|Numeric|String|Remarks|
-| ------------ | --- | --- | --- | ----- |
-|[x] `all`     | [x] |     |     |       |
-|[x] `any`     | [x] |     |     |       |
-|[x] `approximate_median`| | [x] |     |     |
-|[x] `count`         | [x] | [x] | [x] |     |
-|[x] `count_distinct`| [x] | [x] | [x] |     |
-|[x] `count_uniq`    | [x] | [x] | [x] |an alias of `count_distinct`|
-|[ ] `index`   |     |     |     |       |
-|[x] `max`     | [x] | [x] | [x] |       |
-|[x] `mean`    | [x] | [x] |     |       |
-|[x] `min`     | [x] | [x] | [x] |       |
-|[ ] `min_max` |     |     |     |       |
-|[ ] `mode`    |     |     |     |       |
-|[x] `product` | [x] | [x] |     |       |
-|[ ] `quantile`|     |     |     |       |
-|[x] `stddev`  |     | [x] |     |       |
-|[x] `sum`     | [x] | [x] |     |       |
-|[ ] `tdigest` |     |     |     |       |
-|[x] `variance`|     | [x] |     |       |
-#### Unary element-wise: vector.func => Vector
-| Method    |Boolean|Numeric|String|Remarks|
-| ------------ | --- | --- | --- | ----- |
-|[x] `-@`      |     | [x] |     |as `-vector`|
-|[x] `negate`  |     | [x] |     |`-@`   |
-|[x] `abs`     |     | [x] |     |       |
-|[ ] `acos`    |     | [ ] |     |       |
-|[ ] `asin`    |     | [ ] |     |       |
-|[x] `atan`    |     | [x] |     |       |
-|[ ] `ceil`    |     | [x] |     |       |
-|[x] `cos`     |     | [x] |     |       |
-|[ ] `floor`   |     | [x] |     |       |
-|[ ] `ln`      |     | [ ] |     |       |
-|[ ] `log10`   |     | [ ] |     |       |
-|[ ] `log1p`   |     | [ ] |     |       |
-|[ ] `log2`    |     | [ ] |     |       |
-|[x] `sign`    |     | [x] |     |       |
-|[x] `sin`     |     | [x] |     |       |
-|[x] `tan`     |     | [x] |     |       |
-|[ ] `trunc`   |     | [x] |     |       |
-#### Binary element-wise: vector.func(vector) => Vector
-| Method          |Boolean|Numeric|String|Remarks|
-| ------------------ | --- | --- | --- | ----- |
-|[x] `add`           |     | [x] |     | `+`   |
-|[x] `atan2`         |     | [x] |     |       |
-|[x] `and`           | [x] |     |     |       |
-|[x] `and_kleene`    | [x] |     |     |       |
-|[x] `and_not`       | [x] |     |     |       |
-|[x] `and_not_kleene`| [x] |     |     |       |
-|[x] `bit_wise_and`  |     |([x])|     |`&`, integer only|
-|[ ] `bit_wise_not`  |     |([x])|     |`!`, integer only|
-|[x] `bit_wise_or`   |     |([x])|     |`|`, integer only|
-|[x] `bit_wise_xor`  |     |([x])|     |`^`, integer only|
-|[x] `divide`        |     | [x] |     | `/`   |
-|[x] `equal`         | [x] | [x] | [x] |`==`, alias `eq`|
-|[x] `greater`       | [x] | [x] | [x] |`>`, alias `gt`|
-|[x] `greater_equal` | [x] | [x] | [x] |`>=`, alias `ge`|
-|[x] `less`          | [x] | [x] | [x] |`<`, alias `lt`|
-|[x] `less_equal`    | [x] | [x] | [x] |`<=`, alias `le`|
-|[ ] `logb`          |     | [ ] |     |       |
-|[ ] `mod`           |     | [ ] |     |       |
-|[x] `multiply`      |     | [x] |     | `*`   |
-|[x] `not_equal`     | [x] | [x] | [x] |`!=`, alias `ne`|
-|[x] `or`            | [x] |     |     |       |
-|[x] `or_kleene`     | [x] |     |     |       |
-|[x] `power`         |     | [x] |     | `**`  |
-|[x] `subtract`      |     | [x] |     | `-`   |
-|[x] `shift_left`    |     |([x])|     |`<<`, integer only|
-|[x] `shift_right`   |     |([x])|     |`>>`, integer only|
-|[x] `xor`           | [x] |     |     |       |
-##### (Not implemented)
-- [ ] invert, round, round_to_multiple
-- [ ] sort, sort_index
-- [ ] minmax, var, median, quantile
-- [ ] argmin, argmax
-- [ ] (array functions)
-- [ ] (strings functions)
-- [ ] (temporal functions)
-- [ ] (conditional functions)
-- [ ] (index functions)
-- [ ] (other functions)
-### Coerce (not implemented)
-### Updating (not implemented)
-### DSL in a block for faster calculation ?
+## TDR concept
+I named the data frame representation style shown in the model above TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
 ## Development
-```
+```shell
 git clone https://github.com/heronshoes/red_amber.git
 cd red_amber
 bundle install