RubyGems - red_amber - Versions diffs - 0.1.3 → 0.1.4 - Mend

red_amber 0.1.3 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (35)

checksums.yaml +4 -4
data/.rubocop.yml +9 -4
data/CHANGELOG.md +60 -8
data/README.md +41 -349
data/doc/DataFrame.md +690 -0
data/doc/Vector.md +195 -0
data/doc/image/TDR_operations.pdf +0 -0
data/doc/image/arrow_table_new.png +0 -0
data/doc/image/dataframe/assign.png +0 -0
data/doc/image/dataframe/drop.png +0 -0
data/doc/image/dataframe/pick.png +0 -0
data/doc/image/dataframe/remove.png +0 -0
data/doc/image/dataframe/rename.png +0 -0
data/doc/image/dataframe/slice.png +0 -0
data/doc/image/dataframe_model.png +0 -0
data/doc/image/example_in_red_arrow.png +0 -0
data/doc/image/tdr.png +0 -0
data/doc/image/tdr_and_table.png +0 -0
data/doc/image/tidy_data_in_TDR.png +0 -0
data/doc/image/vector/binary_element_wise.png +0 -0
data/doc/image/vector/unary_aggregation.png +0 -0
data/doc/image/vector/unary_aggregation_w_option.png +0 -0
data/doc/image/vector/unary_element_wise.png +0 -0
data/doc/tdr.md +53 -0
data/doc/tdr_ja.md +53 -0
data/lib/red_amber/data_frame.rb +22 -15
data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +44 -37
data/lib/red_amber/data_frame_helper.rb +64 -0
data/lib/red_amber/data_frame_observation_operation.rb +72 -0
data/lib/red_amber/data_frame_selectable.rb +21 -43
data/lib/red_amber/data_frame_variable_operation.rb +133 -0
data/lib/red_amber/vector_functions.rb +54 -29
data/lib/red_amber/version.rb +1 -1
data/lib/red_amber.rb +4 -1
metadata +27 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '0308ff686bf7b49b767b7cd28ddc068e02170c00c093dcd42c7187e438e0adf3'
-  data.tar.gz: 98397e31bce1a440e951357d5d3b475814a6ecc08f21a0908c0fdf58c6189be4
+  metadata.gz: 6ceace9db54b82c03ccf00fcd1b7bf2af57d94ea4e54183dc6af1da47e21ef00
+  data.tar.gz: f30578dcec45fd5efec9219c6438fd0108a0690b1cd69b1c398dffacd38aeba1
 SHA512:
-  metadata.gz: 7ad71d8259d04535d08567bde6ca0fc419e0d9de15d1e812dbc642fb3901f1c744c69766dbf409e876212e426e309ac0032968b767df3a960a8e6eb40d4f3c19
-  data.tar.gz: eee78ae4316b007d95714d6e2920ad32518497942d9cd5adb373476321e6f9e6e8099f9c721ee8bac05df2617fb3f3c747ce92ec74b1cb84da0b0bd4664051cf
+  metadata.gz: ee26fd212d0cb0758bc4611c5b43b302fe5c1b958239b5a9ac81ee09e936bdded733a719507e24e5434c33fc5d7ece43c973dd66d51413f23cc435ea0bd7570c
+  data.tar.gz: 674f56a11ddf906f608ecf7d7c852bec654a749e9052092553d19be967072d5acec95a096fbecc60ffd4b33fad3f4322354d93fade67230078fff15b6b7398dd
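
The digests above cover the two payload members of the `.gem` archive, `metadata.gz` and `data.tar.gz`. As a minimal sketch, assuming those members and `checksums.yaml` have already been unpacked into one directory, the recorded values could be checked with Ruby's standard library as follows. The method name `checksum_mismatches` and the directory layout are hypothetical and are not part of RubyGems.

```ruby
require 'digest'
require 'yaml'

# Maps the algorithm names used in checksums.yaml to Ruby digest classes.
DIGESTS = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }.freeze

# Hypothetical helper: returns [algorithm, file name] pairs whose digest
# does not match the value recorded in checksums.yaml.
# Assumes `dir` holds checksums.yaml next to the files it lists.
def checksum_mismatches(dir)
  expected = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  expected.flat_map do |algorithm, files|
    digest = DIGESTS.fetch(algorithm)
    files
      .reject { |name, hex| digest.file(File.join(dir, name)).hexdigest == hex.to_s }
      .map { |name, _hex| [algorithm, name] }
  end
end
```

An empty result means every listed file matches its recorded digest. Any returned pair names the algorithm and the file that differ.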

data/.rubocop.yml CHANGED

@@ -55,7 +55,10 @@ Layout/LineLength:
 Metrics/AbcSize:
   Max: 23
   Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 51
+    - 'lib/red_amber/data_frame_displayable.rb' # Max: 55
+    - 'lib/red_amber/data_frame_selectable.rb' # Max: 27
+    - 'lib/red_amber/data_frame_observation_operation.rb' # Max: 29
+    - 'lib/red_amber/data_frame_variable_operation.rb' # Max: 26
 # Max: 25
 Metrics/BlockLength:
@@ -71,13 +74,15 @@ Metrics/ClassLength:
 # Max: 7
 Metrics/CyclomaticComplexity:
-  Max: 10
+  Max: 12
 # Max: 10
 Metrics/MethodLength:
   Max: 18
   Exclude:
-    - 'lib/red_amber/data_frame_output.rb' # Max: 31
+    - 'lib/red_amber/data_frame_displayable.rb' # Max: 33
+    - 'lib/red_amber/data_frame_observation_operation.rb' # Max: 21
+    - 'lib/red_amber/data_frame_variable_operation.rb' # Max: 20
 # Max: 100
 Metrics/ModuleLength:
@@ -87,7 +92,7 @@ Metrics/ModuleLength:
 # Max: 8
 Metrics/PerceivedComplexity:
-  Max: 11
+  Max: 13
 # Necessary to define is_na
 Naming/PredicateName:

data/CHANGELOG.md CHANGED

@@ -1,18 +1,70 @@
-## [0.1.4] - Unreleased
+##  - Unreleased
-- Prepare documents for the 'Transposed DataFrame Representation'
-- Feedback to Red Arrow
-- Separate documents
+- Feedback something to Red Arrow
 - `DataFrame`
-  - Introduce updating capabilities
-  - Introduce NA support
-  - Add slice method
+  - Introduce `group_by`
+  - Introduce `summarize`
+  - Introduce `summary` or `describe`
+  - Improve dataframe obs. manipulation methods to accept a float as an index (#10)
+  - More performant
 - `Vector`
-  - Add NaN support for functions
   - Support more functions
+- Document
+  - YARD support
+## [0.1.4] - 2022-05-29 (experimental)
+- Bug fixes
+  - Fix missing support for scalar argument (#1)
+  - Fix type name of boolean in DF#types to be same as Vector#type (#6, #7)
+  - Fix zero picking to return empty DataFrame (#8)
+  - Fix code at both args and a block given (#8)
+- New features and improvements
+  - `DataFrame`
+    - Refine module name `Displayable`
+    - Rename nrow/ncol methods to `size`/`n_keys` to align with TDR concept (#4)
+      - Keep `n_row`/`n_col` for compatibility
+    - Rename `ls` method to `tdr` (#4)
+      - Add limit option to `tdr`
+      - Shorten option name (#11)
+    - Introduce `pick` method to create sub DataFrame (#8)
+      - Add boolean support (#8)
+      - Refactor `pick` (#9)
+    - Introduce `drop` method to create sub DataFrame (#8)
+      - Add boolean support (#8)
+      - Refactor `drop` (#9)
+    - Add boolean array support for `[]` (#9)
+    - Add `indexes`/`indices` to use with selecting observations (#9)
+    - Introduce `slice` method to create sub DataFrame (#8)
+      - Refactor `slice` (#9)
+    - Introduce `remove` method to create sub DataFrame (#9)
+    - Introduce `rename` method to create sub DataFrame (#14)
+    - Introduce `assign` method to create sub DataFrame (#14)
+    - Improve to call block by instance_eval (#13)
+  - `Vector`
+    - Refine `find(function)`
+    - Add `min_max` method (#2)
+    - Add `std`/`sd` method (ddof=0 version: `stddev`) (#2)
+    - Add `var` method (ddof=0 version: `variance`) (#2)
+    - Add `VectorFunctions.arrow_doc(func_name)` (temporarily)
+  - Documentation
+    - Show code in README
+    - Change row/column names for **TDR** concept (#4)
+    - Add documents about **TDR** concept (#4)
+    - Add example about TDR (#4)
+    - Separate README to create DataFrame and Vector documents (#12)
+    - Add DataFrame model concept image to README (#12)
+  - GitHub site
+    - Switched to use merge on GitHub (not to push merged master) (#1)
+    - Create lifetime issue #3 to show the goal of this project (#3)
 ## [0.1.3] - 2022-05-15 (experimental)
 - Bug fixes
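
The 0.1.4 entries for `std`/`sd` and `var` set them apart from the "ddof=0" versions, `stddev` and `variance`. The plain-Ruby sketch below shows what the `ddof` (delta degrees of freedom) divisor changes. It does not call RedAmber or Arrow, and `variance_with_ddof` is a hypothetical helper. Reading the entry to mean that `var` and `std` use `ddof=1` is an assumption that the changelog does not state outright.

```ruby
# Hypothetical helper: variance of `values` computed as the sum of
# squared deviations from the mean, divided by (n - ddof).
def variance_with_ddof(values, ddof:)
  n = values.size
  mean = values.sum.to_f / n
  values.sum { |v| (v - mean)**2 } / (n - ddof)
end

data = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = variance_with_ddof(data, ddof: 0) # divides by n
sample_var     = variance_with_ddof(data, ddof: 1) # divides by n - 1

population_sd = Math.sqrt(population_var)
sample_sd     = Math.sqrt(sample_var)

puts population_var # 4.0
puts population_sd  # 2.0
```

For the same data, the `ddof: 1` results are always larger than the `ddof: 0` results because the divisor is smaller.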

data/README.md CHANGED

@@ -23,134 +23,26 @@ gem 'red_amber'
 And then execute:
-    $ bundle install
+```shell
+bundle install
+```
 Or install it yourself as:
-    $ gem install red_amber
+```shell
+gem install red_amber
+```
 ## `RedAmber::DataFrame`
-### Constructors and saving
-- [x] `new` from a columnar Hash
-  - `RedAmber::DataFrame.new(x: [1, 2, 3])`
-- [x] `new` from a schema (by Hash) and rows (by Array)
-  - `RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])`
-- [x] `new` from an Arrow::Table
-  - `RedAmber::DataFrame.new(Arrow::Table.new(x: [1, 2, 3]))`
-- [x] `new` from a Rover::DataFrame
-  - `RedAmber::DataFrame.new(Rover::DataFrame.new(x: [1, 2, 3]))`
-- [x] `load` (class method)
-     - [x] from a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
-       - `RedAmber::DataFrame.load("test/entity/with_header.csv")`
-     - [x] from a string buffer
-     - [x] from a URI
-       - `RedAmber::DataFrame.load(URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv"))`
-     - [x] from a Parquet file
-       `red-parquet` gem is required.
-  ```ruby
-    require 'parquet'
-    dataframe = RedAmber::DataFrame.load("file.parquet")
-  ```
-- [x] `save` (instance method)
-     - [x] to a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
-     - [x] to a string buffer
-     - [x] to a URI
-     - [x] to a Parquet file
-       `red-parquet` gem is required.
-  ```ruby
-    require 'parquet'
-    dataframe.save("file.parquet")
-  ```
-### Properties
-- [x] `table`
-  Reader of Arrow::Table object inside.
-- [x] `n_rows`, `nrow`, `size`, `length`
-  Returns num of rows (data size).
-- [x] `n_columns`, `ncol`, `width`
-  Returns num of columns (num of vectors).
-- [x] `shape`
-  Returns shape in an Array[n_rows, n_cols].
-- [x] `column_names`, `keys`
-  Returns num of column names by an Array.
-- [x] `types`
-  Returns types of columns by an Array of Symbols.
-- [x] `data_types`
-  Returns types of columns by an Array of `Arrow::DataType`.
-- [x] `vectors`
-  Returns an Array of Vectors.
-- [x] `to_h`
-  Returns column-oriented data in a Hash.
-- [x] `to_a`, `raw_records`
-  Returns an array of row-oriented data without header. If you need a column-oriented full array, use `.to_h.to_a`
-- [x] `schema`
-  Returns column name and data type in a Hash.
-- [x] `==`
-- [x] `empty?`
-### Output
-- [x] `to_s`
-- [ ] summary, describe
-- [x] `to_rover`
-  Returns a `Rover::DataFrame`.
-- [x] `inspect(tally_level: 5, max_element: 5)`
-  Shows some information about self in a transposed style.
+Represents a set of data in 2D-shape.
 ```ruby
 require 'red_amber'
 require 'datasets-arrow'
 penguins = Datasets::Penguins.new.to_arrow
-RedAmber::DataFrame.new(penguins)
+puts RedAmber::DataFrame.new(penguins).tdr
 # =>
 RedAmber::DataFrame : 344 x 8 Vectors
 Vectors : 5 numeric, 3 strings
@@ -165,257 +57,57 @@ Vectors : 5 numeric, 3 strings
 8 :year              uint16     3 {2007=>110, 2008=>114, 2009=>120}
 ```
-  - tally_level: max level to use tally mode
-  - max_element: max num of element to show values in each row
+### DataFrame model
+![dataframe model of RedAmber](doc/image/dataframe_model.png)
-### Selecting
-- [x] Select columns by `[]` as `[key]`, `[keys]`, `[keys[index]]`
-  - Key in a Symbol: `df[:symbol]`
-  - Key in a String: `df["string"]`
-  - Keys in an Array: `df[:symbol1, "string", :symbol2]`
-  - Keys in indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
-  - Keys in a Range:
-    A end-less Range can be used to represent keys.
+For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
 ```ruby
-hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
-df = RedAmber::DataFrame.new(hash)
-df[:b..:c, "a"]
+df = penguins.pick(:body_mass_g)
 # =>
-RedAmber::DataFrame : 3 x 3 Vectors
-Vectors : 2 numeric, 1 string
-# key type   level data_preview
-1 :b  string     3 ["A", "B", "C"]
-2 :c  double     3 [1.0, 2.0, 3.0]
-3 :a  uint8      3 [1, 2, 3]
+#<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
+Vector : 1 numeric
+# key          type  level data_preview
+1 :body_mass_g int64    95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
 ```
-- [x] Select rows by `[]` as `[index]`, `[range]`, `[array]`
-  - Select a row by index: `df[0]`
-  - Select rows by indeces in a Range: `df[1..2]`
-  - Select rows by indeces in an Array: `df[1, 2]`
-  - Mixed case: `df[2, 0..]`
-- [x] Select rows from top or bottom
-  `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
+`DataFrame#assign` can accept a block and create new variables.
-- [ ] slice
-### Updating
-- [ ] Add a new column
-- [ ] Update a single element
-- [ ] Update multiple elements
-- [ ] Update all elements
-- [ ] Update elements matching a condition
-- [ ] Clamp
-- [ ] Delete columns
-- [ ] Rename a column
-- [ ] Sort rows
-- [ ] Clear data
-### Treat na data
-- [ ] Drop na (NaN, nil)
-- [ ] Replace na with value
-- [ ] Interpolate na with convolution array
-### Combining DataFrames
-- [ ] Add rows
-- [ ] Add columns
-- [ ] Inner join
-- [ ] Left join
-### Encoding
-- [ ] One-hot encoding
+```ruby
+df.assign do
+  {:body_mass_kg => penguins[:body_mass_g] / 1000.0}
+end
+# =>
+#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
+Vectors : 2 numeric
+# key           type   level data_preview
+1 :body_mass_g  int64     95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+2 :body_mass_kg double    95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
+```
-### Iteration (not impremented)
+Other DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove` and `rename` also accept a block.
-### Filtering (not impremented)
+See [DataFrame.md](doc/DataFrame.md) for details.
 ## `RedAmber::Vector`
-### Constructor
-- [x] Create from a column in a DataFrame
-- [x] New from an Array
-### Properties
-- [x] `to_s`
-- [x] `values`, `to_a`, `entries`
-- [x] `size`, `length`, `n_rows`, `nrow`
-- [x] `type`
-- [x] `data_type`
-- [ ] `each`
-- [ ] `chunked?`
-- [ ] `n_chunks`
-- [ ] `each_chunk`
-- [x] `tally`
-- [x] `n_nils`, `n_nans`
-  - `n_nulls` is an alias of `n_nils`
-- [x] `inspect(limit: 80)`
-  - `limit` sets size limit to display long array.
-### Functions
-#### Unary aggregations: vector.func => scalar
-| Method    |Boolean|Numeric|String|Options|Remarks|
-| ----------- | --- | --- | --- | --- | --- |
-| ✓ `all`     |  ✓  |     |     | ✓ ScalarAggregate|     |
-| ✓ `any`     |  ✓  |     |     | ✓ ScalarAggregate|     |
-| ✓ `approximate_median`|  |✓|  | ✓ ScalarAggregate| alias `median`|
-| ✓ `count`   |  ✓  |  ✓  |  ✓  | ✓  Count  |     |
-| ✓ `count_distinct`| ✓ | ✓ | ✓ | ✓  Count  |alias `count_uniq`|
-|[ ]`index`   | [ ] | [ ] | [ ] |[ ] Index  |     |
-| ✓ `max`     |  ✓  |  ✓  |  ✓  | ✓ ScalarAggregate|     |
-| ✓ `mean`    |  ✓  |  ✓  |     | ✓ ScalarAggregate|     |
-| ✓ `min`     |  ✓  |  ✓  |  ✓  | ✓ ScalarAggregate|     |
-|[ ]`min_max` | [ ] | [ ] | [ ] |[ ] ScalarAggregate|     |
-|[ ]`mode`    |     | [ ] |     |[ ] Mode    |     |
-| ✓ `product` |  ✓  |  ✓  |     | ✓ ScalarAggregate|     |
-|[ ]`quantile`|     | [ ] |     |[ ] Quantile|     |
-|[ ]`stddev`  |     |  ✓  |     |[ ] Variance|     |
-| ✓ `sum`     |  ✓  |  ✓  |     | ✓ ScalarAggregate|     |
-|[ ]`tdigest` |     | [ ] |     |[ ] TDigest |     |
-|[ ]`variance`|     |  ✓  |     |[ ] Variance|     |
-Options can be used as follows.
-See the [document of C++ function](https://arrow.apache.org/docs/cpp/compute.html) for detail.
+Class `RedAmber::Vector` represents a series of data in the DataFrame.
 ```ruby
-double = RedAmber::Vector.new([1, 0/0.0, -1/0.0, 1/0.0, nil, ""])
-#=>
-#<RedAmber::Vector(:double, size=6):0x000000000000f910>
-[1.0, NaN, -Infinity, Infinity, nil, 0.0]
-double.count #=> 5
-double.count(opts: {mode: :only_valid}) #=> 5, default
-double.count(opts: {mode: :only_null}) #=> 1
-double.count(opts: {mode: :all}) #=> 6
-boolean = RedAmber::Vector.new([true, true, nil])
-#=>
-#<RedAmber::Vector(:boolean, size=3):0x000000000000f924>
-[true, true, nil]
-boolean.all #=> true
-boolean.all(opts: {skip_nulls: true}) #=> true
-boolean.all(opts: {skip_nulls: false}) #=> false
+penguins[:species]
+# =>
+#<RedAmber::Vector(:string, size=344):0x000000000000f8e8>
+["Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", ... ]
 ```
-#### Unary element-wise: vector.func => vector
-| Method    |Boolean|Numeric|String|Options|Remarks|
-| ------------ | --- | --- | --- | --- | ----- |
-| ✓ `-@`       |     |  ✓  |     |     |as `-vector`|
-| ✓ `negate`   |     |  ✓  |     |     |`-@`   |
-| ✓ `abs`      |     |  ✓  |     |     |       |
-|[ ]`acos`     |     | [ ] |     |     |       |
-|[ ]`asin`     |     | [ ] |     |     |       |
-| ✓ `atan`     |     |  ✓  |     |     |       |
-| ✓ `bit_wise_not`|  | (✓) |     |     |integer only|
-|[ ]`ceil`     |     |  ✓  |     |     |       |
-| ✓ `cos`      |     |  ✓  |     |     |       |
-|[ ]`floor`    |     |  ✓  |     |     |       |
-| ✓ `invert`   |  ✓  |     |     |     |`!`, alias `not`|
-|[ ]`ln`       |     | [ ] |     |     |       |
-|[ ]`log10`    |     | [ ] |     |     |       |
-|[ ]`log1p`    |     | [ ] |     |     |       |
-|[ ]`log2`     |     | [ ] |     |     |       |
-|[ ]`round`    |     | [ ] |     |[ ] Round|       |
-|[ ]`round_to_multiple`| | [ ] | |[ ] RoundToMultiple|       |
-| ✓ `sign`     |     |  ✓  |     |     |       |
-| ✓ `sin`      |     |  ✓  |     |     |       |
-| ✓ `tan`      |     |  ✓  |     |     |       |
-|[ ]`trunc`    |     |  ✓  |     |     |       |
-#### Binary element-wise: vector.func(vector) => vector
-| Method       |Boolean|Numeric|String|Options|Remarks|
-| ----------------- | --- | --- | --- | --- | ----- |
-| ✓ `add`           |     |  ✓  |     |     | `+`   |
-| ✓ `atan2`         |     |  ✓  |     |     |       |
-| ✓ `and_kleene`    |  ✓  |     |     |     | `&`   |
-| ✓ `and_org   `    |  ✓  |     |     |     |`and` in Red Arrow|
-| ✓ `and_not`       |  ✓  |     |     |     |       |
-| ✓ `and_not_kleene`|  ✓  |     |     |     |       |
-| ✓ `bit_wise_and`  |     | (✓) |     |     |integer only|
-| ✓ `bit_wise_or`   |     | (✓) |     |     |integer only|
-| ✓ `bit_wise_xor`  |     | (✓) |     |     |integer only|
-| ✓ `divide`        |     |  ✓  |     |     | `/`   |
-| ✓ `equal`         |  ✓  |  ✓  |  ✓  |     |`==`, alias `eq`|
-| ✓ `greater`       |  ✓  |  ✓  |  ✓  |     |`>`, alias `gt`|
-| ✓ `greater_equal` |  ✓  |  ✓  |  ✓  |     |`>=`, alias `ge`|
-| ✓ `is_finite`     |     |  ✓  |     |     |       |
-| ✓ `is_inf`        |     |  ✓  |     |     |       |
-| ✓ `is_na`         |  ✓  |  ✓  |  ✓  |     |       |
-| ✓ `is_nan`        |     |  ✓  |     |     |       |
-|[ ]`is_nil`        |  ✓  |  ✓  |  ✓  |[ ] Null|alias `is_null`|
-| ✓ `is_valid`      |  ✓  |  ✓  |  ✓  |     |       |
-| ✓ `less`          |  ✓  |  ✓  |  ✓  |     |`<`, alias `lt`|
-| ✓ `less_equal`    |  ✓  |  ✓  |  ✓  |     |`<=`, alias `le`|
-|[ ]`logb`          |     | [ ] |     |     |       |
-|[ ]`mod`           |     | [ ] |     |     | `%`   |
-| ✓ `multiply`      |     |  ✓  |     |     | `*`   |
-| ✓ `not_equal`     |  ✓  |  ✓  |  ✓  |     |`!=`, alias `ne`|
-| ✓ `or_kleene`     |  ✓  |     |     |     | `\|`  |
-| ✓ `or_org`        |  ✓  |     |     |     |`or` in Red Arrow|
-| ✓ `power`         |     |  ✓  |     |     | `**`  |
-| ✓ `subtract`      |     |  ✓  |     |     | `-`   |
-| ✓ `shift_left`    |     | (✓) |     |     |`<<`, integer only|
-| ✓ `shift_right`   |     | (✓) |     |     |`>>`, integer only|
-| ✓ `xor`           |  ✓  |     |     |     | `^`   |
-##### (Not impremented)
-- [ ] sort, sort_index
-- [ ] argmin, argmax
-- [ ] (array functions)
-- [ ] (strings functions)
-- [ ] (temporal functions)
-- [ ] (conditional functions)
-- [ ] (index functions)
-- [ ] (other functions)
-### Coerce (not impremented)
-### Updating (not impremented)
-### DSL in a block for faster calculation ?
+Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
+See [Vector.md](doc/Vector.md) for details.
+## TDR concept
+I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
 ## Development