RubyGems - red_amber - Versions diffs - 0.2.2 → 0.2.3 - Mend

red_amber 0.2.2 → 0.2.3

Files changed (34)

checksums.yaml +4 -4
data/.rubocop.yml +12 -0
data/CHANGELOG.md +114 -31
data/Gemfile +4 -2
data/README.md +41 -25
data/benchmark/basic.yml +79 -0
data/benchmark/combine.yml +63 -0
data/benchmark/drop_nil.yml +15 -3
data/benchmark/group.yml +33 -0
data/benchmark/reshape.yml +27 -0
data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml} +3 -3
data/benchmark/rover/flights.yml +23 -0
data/benchmark/rover/penguins.yml +23 -0
data/benchmark/rover/planes.yml +23 -0
data/benchmark/rover/weather.yml +23 -0
data/doc/DataFrame.md +332 -53
data/doc/Vector.md +3 -0
data/doc/image/dataframe/join.png +0 -0
data/doc/image/dataframe/set_and_bind.png +0 -0
data/doc/image/dataframe_model.png +0 -0
data/lib/red_amber/data_frame.rb +6 -5
data/lib/red_amber/data_frame_combinable.rb +283 -0
data/lib/red_amber/data_frame_displayable.rb +2 -0
data/lib/red_amber/data_frame_selectable.rb +9 -9
data/lib/red_amber/data_frame_variable_operation.rb +4 -4
data/lib/red_amber/group.rb +99 -18
data/lib/red_amber/helper.rb +1 -13
data/lib/red_amber/vector.rb +7 -0
data/lib/red_amber/vector_functions.rb +0 -8
data/lib/red_amber/vector_updatable.rb +60 -65
data/lib/red_amber/version.rb +1 -1
data/lib/red_amber.rb +1 -0
data/red_amber.gemspec +1 -1
metadata +21 -10

data/benchmark/rover/flights.yml ADDED

@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:month, :origin]
+  summary_key = :air_time
+benchmark:
+  'flights Group by Rover': |
+    rover.group(group_keys).count
+  'flights Group by RedAmber': |
+    df.group(group_keys).count

data/benchmark/rover/penguins.yml ADDED

@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Penguins.new
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:species, :island]
+  summary_key = :body_mass_g
+benchmark:
+  'penguins Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+  'penguins Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)

data/benchmark/rover/planes.yml ADDED

@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'planes')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:engines, :engine]
+  summary_key = :seats
+benchmark:
+  'planes Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+  'planes Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)

data/benchmark/rover/weather.yml ADDED

@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'weather')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:month, :origin]
+  summary_key = :temp
+benchmark:
+  'weather Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+  'weather Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)
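
The benchmark files above all time a grouped aggregation of the form `group(group_keys).mean(summary_key)`. As a rough illustration of what that computes, here is a minimal plain-Ruby sketch over an array of hashes. It is not how RedAmber or Rover implement grouping; the helper name `grouped_mean` and the sample records are hypothetical, and skipping `nil` values is an assumption made for the sketch.

```ruby
# Made-up sample records, one hash per record (row).
SAMPLE_RECORDS = [
  { species: 'Adelie', island: 'Torgersen', body_mass_g: 3750 },
  { species: 'Adelie', island: 'Torgersen', body_mass_g: 3800 },
  { species: 'Adelie', island: 'Biscoe',    body_mass_g: 3400 },
  { species: 'Gentoo', island: 'Biscoe',    body_mass_g: 5000 },
  { species: 'Gentoo', island: 'Biscoe',    body_mass_g: nil }
].freeze

# Hypothetical helper: group records by the values of `group_keys`, then
# average the non-nil values of `summary_key` within each group.
# Returns a Hash of { [group values] => mean or nil }.
def grouped_mean(records, group_keys, summary_key)
  records
    .group_by { |record| record.values_at(*group_keys) }
    .transform_values do |rows|
      values = rows.map { |row| row[summary_key] }.compact
      values.empty? ? nil : values.sum.fdiv(values.size)
    end
end

grouped_mean(SAMPLE_RECORDS, %i[species island], :body_mass_g).each do |group, mean|
  puts "#{group.join(' / ')}: #{mean}"
end
```

A columnar library can do the same work without building per-row Ruby objects, which is the kind of difference these benchmarks are set up to measure.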

data/doc/DataFrame.md CHANGED

@@ -5,7 +5,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
 - A label is attached to `Vector`. We call it `key`.
 - A `Vector` and associated `key` is grouped as a `variable`.
 - `variable`s with same vector length are aligned and arranged to be a `DataFrame`.
-- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
+  - Each `key` in a `DataFrame` must be unique.
+- The set of related data at the same position in each `Vector` of a `DataFrame` is called a `record` or `observation`.
 ![dataframe model image](doc/../image/dataframe_model.png)
@@ -94,13 +95,13 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
 ### `table`, `to_arrow`
-- Reader of Arrow::Table object inside.
+- Returns Arrow::Table object in the DataFrame.
-### `size`, `n_obs`, `n_rows`
+### `size`, `n_records`, `n_obs`, `n_rows`
-- Returns size of Vector (num of observations).
-### `n_keys`, `n_vars`, `n_cols`,
+- Returns size of Vector (num of records).
+### `n_keys`, `n_variables`, `n_vars`, `n_cols`
 - Returns num of keys (num of variables).
@@ -138,16 +139,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
 - Returns key names in an Array.
-  When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
-  ```ruby
-    # update numeric variables, another solution
-    df.assign do
-      vectors.each_with_object({}) do |vector, assigner|
-        assigner[vector.key] = vector * -1 if vector.numeric?
-      end
-    end
-  ```
+  Each key must be unique in the DataFrame.
 ### `types`
@@ -161,9 +153,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
 - Returns an Array of Vectors.
+  When we use it, Vector#key is useful to get the key in the DataFrame.
+  ```ruby
+    # update numeric variables, another solution
+    df.assign do
+      vectors.each_with_object({}) do |vector, assigner|
+        assigner[vector.key] = vector * -1 if vector.numeric?
+      end
+    end
+  ```
 ### `indices`, `indexes`
-- Returns indexes in an Array.
+- Returns indexes in a Vector.
   Accepts an option `start` as the first of indexes.
   ```ruby
@@ -171,15 +174,19 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
   df.indices
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000013ed4>
   [0, 1, 2, 3, 4]
   df.indices(1)
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000018fd8>
   [1, 2, 3, 4, 5]
   df.indices(:a)
   # =>
+  #<RedAmber::Vector(:dictionary, size=5):0x000000000001bd50>
   [:a, :b, :c, :d, :e]
   ```
@@ -275,6 +282,7 @@ penguins.to_rover
   dataset = Datasets::Penguins.new
   # (From 0.2.2) responsible to the object which has `to_arrow` method.
+  # In older versions, pass `dataset.to_arrow` in the parentheses instead.
   RedAmber::DataFrame.new(dataset).tdr
   # =>
@@ -290,10 +298,11 @@ penguins.to_rover
   6 :sex               string     3 {"male"=>168, "female"=>165, nil=>11}
   7 :year              uint16     3 {2007=>110, 2008=>114, 2009=>120}
   ```
+  Options:
   - limit: limit of variables to show. Default value is 10.
-  - tally: max level to use tally mode.
-  - elements: max num of element to show values in each observations.
+  - tally: max level to use tally mode. Default value is 5.
+  - elements: max num of element to show values in each records. Default value is 5.
 ## Selecting
@@ -303,13 +312,13 @@ penguins.to_rover
 - Keys in an Array: `df[:symbol1, "string", :symbol2]`
 - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
-  Key indeces can be used via `keys[i]` because numbers are used to select observations (rows).
+  Key indices should be used via `keys[i]` because numbers are used to select records (rows). See the next section.
 - Keys by a Range:
-  If keys are able to represent by Range, it can be included in the arguments. See a example below.
+  If the keys can be represented by a Range, the Range can be included in the arguments. See the example below.
-- You can exchange the order of variables (columns).
+- You can also exchange the order of variables (columns).
   ```ruby
   hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
@@ -325,7 +334,7 @@ penguins.to_rover
   2 C             3.0       3
   ```
-  If `#[]` represents single variable (column), it returns a Vector object.
+  If `#[]` represents a single variable (column), it returns a Vector object.
   ```ruby
   df[:a]
@@ -334,6 +343,7 @@ penguins.to_rover
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
   Or `#v` method also returns a Vector for a key.
   ```ruby
@@ -344,18 +354,19 @@ penguins.to_rover
   [1, 2, 3]
   ```
-  This may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`
+  This method is useful inside the block of a DataFrame manipulation verb: we can write `v(:a)` rather than `self[:a]` or `df[:a]`.
-### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
+### Select records (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
-- Select a obs. by index: `df[0]`
-- Select obs. by indeces in a Range: `df[1..2]`
+- Select a record by index: `df[0]`
-  An end-less or a begin-less Range can be used to represent indeces.
+- Select records by indices in an Array: `df[1, 2]`
-- Select obs. by indeces in an Array: `df[1, 2]`
+- Select records by indices in a Range: `df[1..2]`
-- You can use float indices.
+  An endless or a beginless Range can be used to represent indices.
+- You can use Float indices.
 - Mixed case: `df[2, 0..]`
@@ -374,9 +385,9 @@ penguins.to_rover
   3       3 C             3.0
   ```
-- Select obs. by a boolean Array or a boolean RedAmber::Vector at same size as self.
+- Select records by a boolean Array or a boolean RedAmber::Vector of the same size as self.
-  It returns a sub dataframe with observations at boolean is true.
+  It returns a sub DataFrame containing the records where the boolean is true.
     ```ruby
     # with the same dataframe `df` above
@@ -391,15 +402,15 @@ penguins.to_rover
     1       1 A             1.0
     ```
-### Select rows from top or from bottom
+### Select records (rows) from top or from bottom
   `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
 ## Sub DataFrame manipulations
-### `pick  ` - pick up variables by key label -
+### `pick  ` - pick up variables -
-  Pick up some columns (variables) to create a sub DataFrame.
+  Pick up some variables (columns) to create a sub DataFrame.
   ![pick method image](doc/../image/dataframe/pick.png)
@@ -491,9 +502,9 @@ penguins.to_rover
     343           49.9          16.1               213
     ```
-### `drop  ` - pick and drop -
+### `drop  ` - counterpart of pick -
-  Drop some columns (variables) to create a remainer DataFrame.
+  Drop some variables (columns) to create a remainder DataFrame.
   ![drop method image](doc/../image/dataframe/drop.png)
@@ -557,9 +568,9 @@ penguins.to_rover
   [1, 2, 3]
   ```
-### `slice  `  - to cut vertically is slice -
+### `slice  `  - slice and select records -
-  Slice and select rows (observations) to create a sub DataFrame.
+  Slice and select records (rows) to create a sub DataFrame.
   ![slice method image](doc/../image/dataframe/slice.png)
@@ -570,7 +581,7 @@ penguins.to_rover
     Negative index from the tail like Ruby's Array is also acceptable.
     ```ruby
-    # returns 5 obs. at start and 5 obs. from end
+    # returns 5 records at start and 5 records from end
     penguins.slice(0...5, -5..-1)
     # =>
@@ -665,9 +676,9 @@ penguins.to_rover
     0	1	A	  1.000000
     ```
-### `remove`
+### `remove` - counterpart of slice -
-  Slice and reject rows (observations) to create a remainer DataFrame.
+  Slice and reject records (rows) to create a remainder DataFrame.
   ![remove method image](doc/../image/dataframe/remove.png)
@@ -676,7 +687,7 @@ penguins.to_rover
     `remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer.
     ```ruby
-    # returns 6th to 339th obs.
+    # returns 6th to 339th records
     penguins.remove(0...5, -5..-1)
     # =>
@@ -699,7 +710,7 @@ penguins.to_rover
   `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
     ```ruby
-    # remove all observation contains nil
+    # remove all records containing nil
     removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
     removed
@@ -785,7 +796,7 @@ penguins.to_rover
 ### `rename`
-  Rename keys (column names) to create a updated DataFrame.
+  Rename keys (variable/column names) to create an updated DataFrame.
   ![rename method image](doc/../image/dataframe/rename.png)
@@ -820,7 +831,7 @@ penguins.to_rover
 ### `assign`
-  Assign new or updated columns (variables) and create a updated DataFrame.
+  Assign new or updated variables (columns) and create an updated DataFrame.
   - Variables with new keys will append new columns from the right.
   - Variables with existing keys will update corresponding vectors.
@@ -1009,7 +1020,7 @@ When the option `keep_key: true` used, the column `key` will be preserved.
 ### `sort`
-  `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
+  `sort` accepts parameters as sort_keys thanks to Red Arrow's feature.
     - :key, "key" or "+key" denotes ascending order
     - "-key" denotes descending order
@@ -1040,7 +1051,7 @@ When the option `keep_key: true` used, the column `key` will be preserved.
 ### `remove_nil`
-  Remove any observations containing nil.
+  Remove any records containing nil.
 ## Grouping
@@ -1210,7 +1221,7 @@ When the option `keep_key: true` used, the column `key` will be preserved.
 ### `to_long(*keep_keys)`
-  Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
+  Creates a 'long' (may be tidy) DataFrame from a 'wide' DataFrame.
   - Parameter `keep_keys` specifies the key names to keep.
@@ -1257,7 +1268,7 @@ When the option `keep_key: true` used, the column `key` will be preserved.
 ### `to_wide`
-  Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
+  Creates a 'wide' (may be messy) DataFrame from a 'long' DataFrame.
   - Option `:name` is the key of the column which will be expanded **to key names**.
     The default value is `:NAME` if it is not specified.
@@ -1282,9 +1293,277 @@ When the option `keep_key: true` used, the column `key` will be preserved.
 ## Combine
-- [ ] Combining dataframes
+### `join`
+![dataframe joining image](doc/../image/dataframe/join.png)
+  Use the specific `*_join` methods below.
+  - `other` is a DataFrame or an Arrow::Table.
+  - `join_keys` are the keys shared by self and other that are used to match records.
+  - If `join_keys` are empty, common keys in self and other are chosen (natural join).
+  - If self and other have common keys other than `join_keys`, the duplicated keys are renamed with `suffix`.
+  ```ruby
+  df = DataFrame.new(
+    KEY: %w[A B C],
+    X1: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+  other = DataFrame.new(
+    KEY: %w[A B D],
+    X2: [true, false, nil]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY      X2
+    <string> <boolean>
+  0 A        true
+  1 B        false
+  2 D        (nil)
+  ```
+#### Mutating joins
+##### `inner_join(other, join_keys = nil, suffix: '.1')`
+  Join data, leaving only the matching records.
+  ```ruby
+  df.inner_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000001e2bc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  ```
+##### `full_join(other, join_keys = nil, suffix: '.1')`
+  Join data, leaving all records.
+  ```ruby
+  df.full_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  3 D          (nil) (nil)
+  ```
+##### `left_join(other, join_keys = nil, suffix: '.1')`
+  Join matching values to self from other.
+  ```ruby
+  df.left_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  ```
+##### `right_join(other, join_keys = nil, suffix: '.1')`
+  Join matching values from self to other.
-- [ ] Join
+  ```ruby
+  df.right_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 D          (nil) (nil)
+  ```
+#### Filtering join
+##### `semi_join(other, join_keys = nil, suffix: '.1')`
+  Return records of self that have a match in other.
+  ```ruby
+  df.semi_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  ```
+##### `anti_join(other, join_keys = nil, suffix: '.1')`
+  Return records of self that do not have a match in other.
+  ```ruby
+  df.anti_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 C              3
+  ```
+## Set operations
+![dataframe set and binding image](doc/../image/dataframe/set_and_bind.png)
+  Keys in self and other must be the same in set operations.
+  ```ruby
+  df = DataFrame.new(
+    KEY1: %w[A B C],
+    KEY2: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+  other = DataFrame.new(
+    KEY1: %w[A B D],
+    KEY2: [1, 4, 5]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              4
+  2 D              5
+  ```
+##### `intersect(other)`
+  Select records appearing in both self and other.
+  ```ruby
+  df.intersect(other)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  ```
+##### `union(other)`
+  Select records appearing in self or other.
+  ```ruby
+  df.union(other)
+  #=>
+  #<RedAmber::DataFrame : 5 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+  3 B              4
+  4 D              5
+  ```
+##### `difference(other)`
+  Select records appearing in self but not in other.
+  It has an alias `setdiff`.
+  ```ruby
+  df.difference(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 B              2
+  1 C              3
+  ```
+## Binding
+### `concatenate(other)`
+  Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.
+  The alias is `concat`.
+  An array of DataFrames or Tables is also acceptable as other.
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000022cb8>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001f6d0>
+          x y
+    <uint8> <string>
+  0       3 C
+  1       4 D
+  df.concatenate(other)
+  #=>
+  #<RedAmber::DataFrame : 4 x 2 Vectors, 0x0000000000022574>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+  2       3 C
+  3       4 D
+  ```
+### `merge(other)`
+  Merge another DataFrame or Table onto the right side of self, adding its variables (columns). The number of records in other must be the same as in self.
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000009150>
+          x       y
+    <uint8> <uint8>
+  0       1       3
+  1       2       4
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000008a0c>
+    a        b
+    <string> <string>
+  0 A        C
+  1 B        D
+  df.merge(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 4 Vectors, 0x000000000000cb70>
+          x       y a        b
+    <uint8> <uint8> <string> <string>
+  0       1       3 A        C
+  1       2       4 B        D
+  ```
 ## Encoding
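
To make the join and set-operation semantics documented in this hunk concrete, the following plain-Ruby sketch reproduces them on arrays of hashes, using the same sample data as the documentation. It is only an illustration under stated assumptions: the top-level helper methods are hypothetical and are not RedAmber's API or implementation, a single join key is assumed, and keys are assumed unique within `other`.

```ruby
# One hash per record, mirroring `df` and `other` from the join examples.
JOIN_DF = [
  { KEY: 'A', X1: 1 },
  { KEY: 'B', X1: 2 },
  { KEY: 'C', X1: 3 }
].freeze

JOIN_OTHER = [
  { KEY: 'A', X2: true },
  { KEY: 'B', X2: false },
  { KEY: 'D', X2: nil }
].freeze

# Index records by the join key (assumes the key is unique in `records`).
def index_by_key(records, key)
  records.to_h { |record| [record[key], record] }
end

# Mutating join: keep only records whose key appears on both sides.
def inner_join(left, right, key)
  lookup = index_by_key(right, key)
  left.select { |l| lookup.key?(l[key]) }
      .map { |l| l.merge(lookup[l[key]]) }
end

# Mutating join: keep every left record; unmatched right columns become nil.
def left_join(left, right, key)
  lookup = index_by_key(right, key)
  right_columns = right.flat_map(&:keys).uniq - [key]
  blank = right_columns.to_h { |column| [column, nil] }
  left.map { |l| l.merge(lookup.fetch(l[key], blank)) }
end

# Filtering joins: return left records unchanged, with or without a match.
def semi_join(left, right, key)
  keys = right.map { |r| r[key] }
  left.select { |l| keys.include?(l[key]) }
end

def anti_join(left, right, key)
  keys = right.map { |r| r[key] }
  left.reject { |l| keys.include?(l[key]) }
end

# Set operations compare whole records, so both sides need the same keys.
def intersect(left, right)
  left & right
end

def union(left, right)
  left | right
end

def difference(left, right)
  left - right
end

p inner_join(JOIN_DF, JOIN_OTHER, :KEY)
p anti_join(JOIN_DF, JOIN_OTHER, :KEY)
```

In this sketch a right join is simply `left_join(other, df, key)` with the arguments swapped, and a full join would add the unmatched right-hand records to the left join result.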

data/doc/Vector.md CHANGED

@@ -24,6 +24,9 @@ Class `RedAmber::Vector` represents a series of data in the DataFrame.
   vector = Vector.new(1..3)
   # or
   vector = Vector.new(Arrow::Array.new([1, 2, 3]))
+  # or
+  require 'arrow-numo-narray'
+  vector = Vector.new(Numo::Int8[1, 2, 3])
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f514>

data/doc/image/dataframe/join.png ADDED

Binary file

data/doc/image/dataframe/set_and_bind.png ADDED

Binary file

data/doc/image/dataframe_model.png CHANGED

Binary file