RubyGems - rover-df - Versions diffs - 0.1.0 → 0.1.1 - Mend

rover-df 0.1.0 → 0.1.1

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4588d0b3b5633a3821a4c07e7102e5933edca92179836db041f2400d8be88538
-  data.tar.gz: 9b01cd2bae5fb6ba9f426fe0d347752cd30c63619b00284fb68e8f711ec38ddf
+  metadata.gz: b003d311b623fdd38fee4c6fa76129ad4bba042e8193c1872928cb05085daad3
+  data.tar.gz: d0c8c04b2a8aec3ea5b7616cbcda61f03a12c96fc8f9a0c7aa29fc898948b759
 SHA512:
-  metadata.gz: b2d35866786a7fbe17b274585419c752b08c817b2db1bf939a6c3f92a7ae2cd282d725614f96db730fd2590cbb8c24710d0fb1f713255d2c348c0fed0b874a35
-  data.tar.gz: 4bf0ba38ce2c3ef4765d702591948af18fddf142efb7e559e26cc4ab504538775a1771c839f1570230f7d101fa20bfbbeb5044f6bf567637790575ee9b95be87
+  metadata.gz: 8033d8ae9e5fb8c8e767ba68897d37745cc5d35a7a82cb2847e2b1d2c3adf8eeb406914cd47949d8f4c3f21307617ab550f435e7a5c257fe1187ed47dd943829
+  data.tar.gz: acfdca4ad081e2722c4b5269824de123d26aadab28532837d2bfc717c2ca263f73dc4335963beae654b24d4c17cebbefdcce761a946f391340a306c9ca2a8c9b

data/CHANGELOG.md CHANGED

@@ -1,3 +1,12 @@
+## 0.1.1 (2020-06-10)
+- Added methods and options for types
+- Added grouping
+- Added one-hot encoding
+- Added `sample` to data frames
+- Added `tally`, `var`, `std`, `take`, `count`, and `length` to vectors
+- Improved error message for `read_csv` with no headers
 ## 0.1.0 (2020-05-13)
 - First release

data/README.md CHANGED

@@ -2,7 +2,9 @@
 Simple, powerful data frames for Ruby
-:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray) for blazing performance
+:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)
+[![Build Status](https://travis-ci.org/ankane/rover.svg?branch=master)](https://travis-ci.org/ankane/rover)
 ## Installation
@@ -16,12 +18,22 @@ gem 'rover-df'
 A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
+Try it out for forecasting by clicking the button below:
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)
+Use the `Run` button (or `SHIFT` + `ENTER`) to run each line.
 ## Creating Data Frames
 From an array
 ```ruby
-Rover::DataFrame.new([{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}])
+Rover::DataFrame.new([
+  {a: 1, b: "one"},
+  {a: 2, b: "two"},
+  {a: 3, b: "three"}
+])
 ```
 From a hash
@@ -33,7 +45,7 @@ Rover::DataFrame.new({
 })
 ```
-From an Active Record relation
+From Active Record
 ```ruby
 Rover::DataFrame.new(User.all)
@@ -75,6 +87,8 @@ Select a column
 df[:a]
 ```
+> Note that strings and symbols are different keys, just like hashes
 Select multiple columns
 ```ruby
@@ -158,13 +172,39 @@ df[:a].min
 df[:a].max
 ```
+Count occurrences
+```ruby
+df[:a].tally
+```
 Cross tabulation
 ```ruby
 df[:a].crosstab(df[:b])
 ```
-## Updates
+## Grouping
+Group
+```ruby
+df.group(:a).count
+```
+Works with all summary statistics
+```ruby
+df.group(:a).max(:b)
+```
+Multiple groups
+```ruby
+df.group([:a, :b]).count
+```
+## Updating Data
 Add a new column
@@ -214,7 +254,7 @@ Rename a column
 df[:new_a] = df.delete(:a)
 ```
-Sort data
+Sort rows
 ```ruby
 df.sort_by! { |r| r[:a] }
@@ -258,6 +298,20 @@ Left join
 df.left_join(other_df)
 ```
+## Encoding
+One-hot encoding
+```ruby
+df.one_hot
+```
+Drop a variable in each category to avoid the dummy variable trap
+```ruby
+df.one_hot(drop: true)
+```
 ## Conversion
 Array of hashes
@@ -284,6 +338,46 @@ CSV
 df.to_csv
 ```
+## Types
+Pass column types when creating a data frame
+```ruby
+Rover::DataFrame.new(data, types: {"a" => :int, "b" => :float})
+```
+Or
+```ruby
+Rover.read_csv("data.csv", types: {"a" => :int, "b" => :float})
+```
+Supported types are:
+- boolean - `bool`
+- float - `float`, `float32`
+- integer - `int`, `int32`, `int16`, `int8`
+- unsigned integer - `uint`, `uint32`, `uint16`, `uint8`
+- object - `object`
+Get column types
+```ruby
+df.types
+```
+For a specific column
+```ruby
+df[:a].type
+```
+Change the type of a column
+```ruby
+df[:a] = df[:a].to(:int)
+```
 ## History
 View the [changelog](https://github.com/ankane/rover/blob/master/CHANGELOG.md)
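
The `drop: true` option in the Encoding section can be illustrated without the gem. The following is a minimal pure-Ruby sketch that assumes plain arrays and hashes in place of Rover's vectors and data frames; `one_hot_sketch` is a hypothetical helper, not part of rover-df.

```ruby
# Hypothetical helper for illustration; not the rover-df implementation.
# Builds one 0/1 indicator column per distinct string value.
def one_hot_sketch(values, drop: false, prefix: nil)
  raise ArgumentError, "All elements must be strings" unless values.all? { |v| v.is_a?(String) }

  categories = values.uniq
  # Dropping one category removes the column that the others already imply
  # (the dummy variable trap).
  categories.shift if drop

  categories.to_h do |category|
    ["#{prefix}#{category}", values.map { |v| v == category ? 1 : 0 }]
  end
end

colors = ["red", "blue", "red"]
full = one_hot_sketch(colors, prefix: "color_")
reduced = one_hot_sketch(colors, drop: true, prefix: "color_")
```

With `drop: true`, the first category seen is the one left out, so a row with zeros in every remaining column belongs to that category.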

data/lib/rover.rb CHANGED

@@ -3,30 +3,38 @@ require "numo/narray"
 # modules
 require "rover/data_frame"
+require "rover/group"
 require "rover/vector"
 require "rover/version"
 module Rover
   class << self
-    def read_csv(path, **options)
+    def read_csv(path, types: nil, **options)
       require "csv"
-      csv_to_df(CSV.read(path, headers: true, converters: :numeric, **options))
+      csv_to_df(CSV.read(path, **csv_options(options)), types: types)
     end
-    def parse_csv(str, **options)
+    def parse_csv(str, types: nil, **options)
       require "csv"
-      csv_to_df(CSV.parse(str, headers: true, converters: :numeric, **options))
+      csv_to_df(CSV.parse(str, **csv_options(options)), types: types)
     end
     private
-    def csv_to_df(table)
+    # TODO use date converter
+    def csv_options(options)
+      options = {headers: true, converters: :numeric}.merge(options)
+      raise ArgumentError, "Must specify headers" unless options[:headers]
+      options
+    end
+    def csv_to_df(table, types: nil)
       table.by_col!
       data = {}
       table.each do |k, v|
         data[k] = v
       end
-      DataFrame.new(data)
+      DataFrame.new(data, types: types)
     end
   end
 end
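
The new `csv_options` helper merges caller options over the defaults and rejects a falsy `headers` value. A standalone sketch of that merge, assuming nothing beyond core Ruby (the `DEFAULT_CSV_OPTIONS` constant and the `_sketch` suffix are names introduced here for illustration):

```ruby
# Standalone copy of the option-merging logic, for illustration only.
DEFAULT_CSV_OPTIONS = {headers: true, converters: :numeric}.freeze

def csv_options_sketch(options)
  # Caller-supplied options override the defaults.
  merged = DEFAULT_CSV_OPTIONS.merge(options)
  # headers: false (or nil) is rejected because columns need names.
  raise ArgumentError, "Must specify headers" unless merged[:headers]
  merged
end

defaults = csv_options_sketch({})
custom = csv_options_sketch(col_sep: ";")

begin
  csv_options_sketch(headers: false)
rescue ArgumentError => e
  error_message = e.message
end
```

Because the defaults are merged first, a caller can still replace `converters` or pass any other `CSV` option unchanged.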

data/lib/rover/data_frame.rb CHANGED

@@ -1,7 +1,10 @@
 module Rover
   class DataFrame
-    def initialize(data = {})
+    def initialize(*args)
+      data, options = process_args(args)
       @vectors = {}
+      types = options[:types] || {}
       if data.is_a?(DataFrame)
         data.vectors.each do |k, v|
@@ -11,7 +14,7 @@ module Rover
         data.to_h.each do |k, v|
           @vectors[k] =
             if v.respond_to?(:to_a)
-              Vector.new(v)
+              Vector.new(v, type: types[k])
             else
               v
             end
@@ -20,7 +23,7 @@ module Rover
         # handle scalars
         size = @vectors.values.find { |v| v.is_a?(Vector) }&.size || 1
         @vectors.each_key do |k|
-          @vectors[k] = to_vector(@vectors[k], size)
+          @vectors[k] = to_vector(@vectors[k], size: size, type: types[k])
         end
       elsif data.is_a?(Array)
         vectors = {}
@@ -35,12 +38,12 @@ module Rover
           end
         end
         vectors.each do |k, v|
-          @vectors[k] = to_vector(v)
+          @vectors[k] = to_vector(v, type: types[k])
         end
       elsif defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || (data.is_a?(Class) && data < ActiveRecord::Base))
         result = data.connection.select_all(data.all.to_sql)
         result.columns.each_with_index do |k, i|
-          @vectors[k] = to_vector(result.rows.map { |r| r[i] })
+          @vectors[k] = to_vector(result.rows.map { |r| r[i] }, type: types[k])
         end
       else
         raise ArgumentError, "Cannot cast to data frame: #{data.class.name}"
@@ -90,9 +93,13 @@ module Rover
       @vectors.dup
     end
+    def types
+      @vectors.map { |k, v| [k, v.type] }.to_h
+    end
     def []=(k, v)
       check_key(k)
-      v = to_vector(v, size)
+      v = to_vector(v, size: size)
       raise ArgumentError, "Size mismatch: expected #{size}, got #{v.size}" if @vectors.any? && v.size != size
       @vectors[k] = v
     end
@@ -170,6 +177,12 @@ module Rover
       DataFrame.new(new_vectors)
     end
+    def sample(*args, **kwargs)
+      # TODO make more efficient
+      indexes = (0...size).to_a.sample(*args, **kwargs)
+      self[indexes]
+    end
     def to_a
       a = []
       each_row do |row|
@@ -190,6 +203,25 @@ module Rover
       Numo::NArray.column_stack(vectors.values.map(&:to_numo))
     end
+    # TODO raise error when collision
+    def one_hot(drop: false)
+      df = DataFrame.new
+      vectors.each do |k, v|
+        if v.to_numo.is_a?(Numo::RObject)
+          df.merge!(v.one_hot(drop: drop, prefix: "#{k}_"))
+        else
+          df[k] = v
+        end
+      end
+      df
+    rescue ArgumentError => e
+      if e.message == "All elements must be strings"
+        # better error message
+        raise ArgumentError, "All elements must be numeric or strings"
+      end
+      raise e
+    end
     def to_csv
       require "csv"
       CSV.generate do |csv|
@@ -258,6 +290,17 @@ module Rover
       dup.sort_by!(&block)
     end
+    def group(columns)
+      Group.new(self, columns)
+    end
+    [:max, :min, :median, :mean, :percentile, :sum].each do |name|
+      define_method(name) do |column, *args|
+        check_column(column)
+        self[column].send(name, *args)
+      end
+    end
     def dup
       df = DataFrame.new
       @vectors.each do |k, v|
@@ -375,8 +418,15 @@ module Rover
       raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
     end
-    def to_vector(v, size = nil)
-      return v if v.is_a?(Vector)
+    def check_column(key)
+      raise ArgumentError, "Missing column: #{key}" unless include?(key)
+    end
+    def to_vector(v, size: nil, type: nil)
+      if v.is_a?(Vector)
+        v = v.to(type) if type && v.type != type
+        return v
+      end
       if size && !v.respond_to?(:to_a)
         v =
@@ -392,7 +442,31 @@ module Rover
           end
       end
-      Vector.new(v)
+      Vector.new(v, type: type)
+    end
+    # can't use data = {} and keyword arguments
+    # as this causes an unknown keyword error when data is passed as
+    # DataFrame.new({a: ..., b: ...})
+    #
+    # at the moment, there doesn't appear to be a way to distinguish between
+    # DataFrame.new({types: ...}) which should set data, and
+    # DataFrame.new(types: ...) which should set options
+    # https://bugs.ruby-lang.org/issues/16891
+    #
+    # there aren't currently options that should be used without data
+    # if this is ever the case, we should still require data
+    # to prevent new options from breaking existing code
+    def process_args(args)
+      data = args[0] || {}
+      options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
+      raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)" if args.size > 1
+      known_keywords = [:types]
+      unknown_keywords = options.keys - known_keywords
+      raise ArgumentError, "unknown keywords: #{unknown_keywords.join(", ")}" if unknown_keywords.any?
+      [data, options]
     end
   end
 end
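
The comment on `process_args` describes an ambiguity between a data hash and an options hash. A standalone sketch of the same splitting logic makes the three cases concrete; `process_args_sketch` and `build` are hypothetical names, and the sketch copies the argument array before popping, which is a small departure from the diff.

```ruby
KNOWN_KEYWORDS = [:types].freeze

# Splits positional arguments into [data, options], following the approach in the diff.
def process_args_sketch(args)
  args = args.dup # avoid mutating the caller's array
  data = args[0] || {}
  # A trailing hash counts as options only when something precedes it.
  options = args.size > 1 && args.last.is_a?(Hash) ? args.pop : {}
  raise ArgumentError, "wrong number of arguments (given #{args.size}, expected 0..1)" if args.size > 1

  unknown = options.keys - KNOWN_KEYWORDS
  raise ArgumentError, "unknown keywords: #{unknown.join(", ")}" if unknown.any?
  [data, options]
end

# Stand-in for DataFrame.new(*args)
def build(*args)
  process_args_sketch(args)
end

data_only = build({a: [1, 2]})
with_types = build({a: [1, 2]}, types: {a: :float})
ambiguous = build(types: {a: :float}) # a lone hash is treated as data, not options
```

The last call shows the trade-off noted in the comment: a single hash argument is always read as data, so options only take effect when data is also given.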

data/lib/rover/group.rb ADDED

@@ -0,0 +1,49 @@
+module Rover
+  class Group
+    def initialize(df, columns)
+      @df = df
+      @columns = Array(columns)
+    end
+    [:count, :max, :min, :mean, :median, :percentile, :sum].each do |name|
+      define_method(name) do |*args|
+        result = {}
+        grouped_dfs.each do |k, df|
+          result[k] = df.send(name, *args)
+        end
+        result
+      end
+    end
+    private
+    # TODO make more efficient
+    def grouped_dfs
+      # cache here so we can reuse for multiple calculations if needed
+      @grouped_dfs ||= begin
+        raise ArgumentError, "No columns given" if @columns.empty?
+        missing_keys = @columns - @df.keys
+        raise ArgumentError, "Missing keys: #{missing_keys.join(", ")}" if missing_keys.any?
+        groups = Hash.new { |hash, key| hash[key] = [] }
+        if @columns.size == 1
+          @df[@columns.first].each_with_index do |v, i|
+            groups[v] << i
+          end
+        else
+          i = 0
+          @df.each_row do |row|
+            groups[@columns.map { |c| row[c] }] << i
+            i += 1
+          end
+        end
+        result = {}
+        groups.each do |k, indexes|
+          result[k] = @df[indexes]
+        end
+        result
+      end
+    end
+  end
+end
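
The index-building step in `grouped_dfs` can be sketched on its own. This example assumes an array of hashes in place of a data frame, and `group_indexes` is a hypothetical helper rather than part of the gem.

```ruby
# Rows as an array of hashes stand in for a data frame (an assumption for illustration).
rows = [
  {a: 1, b: "x"},
  {a: 2, b: "y"},
  {a: 1, b: "x"},
  {a: 1, b: "y"}
]

# Hypothetical helper: maps each group key to the indexes of its rows.
def group_indexes(rows, columns)
  columns = Array(columns)
  raise ArgumentError, "No columns given" if columns.empty?

  # The default block creates an empty index list the first time a key is seen.
  groups = Hash.new { |hash, key| hash[key] = [] }
  rows.each_with_index do |row, i|
    # One column uses the bare value as the key; several columns use an array.
    key = columns.size == 1 ? row[columns.first] : columns.map { |c| row[c] }
    groups[key] << i
  end
  groups
end

by_a = group_indexes(rows, :a)
by_a_and_b = group_indexes(rows, [:a, :b])
counts = by_a.transform_values(&:size)
```

Collecting row indexes first, then slicing once per group, is what lets the same grouping be reused for `count`, `max`, and the other summary methods.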

data/lib/rover/vector.rb CHANGED

@@ -1,27 +1,39 @@
 module Rover
   class Vector
-    def initialize(data)
-      @data =
-        if data.is_a?(Vector)
-          data.to_numo
-        elsif data.is_a?(Numo::NArray)
-          data
-        else
-          data = data.to_a
-          if data.all? { |v| v.is_a?(Integer) }
-            Numo::Int64.cast(data)
-          elsif data.all? { |v| v.is_a?(Numeric) || v.nil? }
-            Numo::DFloat.cast(data.map { |v| v || Float::NAN })
-          elsif data.all? { |v| v == true || v == false }
-            Numo::Bit.cast(data)
-          else
-            Numo::RObject.cast(data)
-          end
-        end
+    # if a user never specifies types,
+    # the defaults are bool, float, int, and object
+    # keep these simple
+    #
+    # we could create aliases for float64, int64, uint64
+    # if so, type should still return the simple type
+    TYPE_CAST_MAPPING = {
+      bool: Numo::Bit,
+      float32: Numo::SFloat,
+      float: Numo::DFloat,
+      int8: Numo::Int8,
+      int16: Numo::Int16,
+      int32: Numo::Int32,
+      int: Numo::Int64,
+      object: Numo::RObject,
+      uint8: Numo::UInt8,
+      uint16: Numo::UInt16,
+      uint32: Numo::UInt32,
+      uint: Numo::UInt64
+    }
+    def initialize(data, type: nil)
+      @data = cast_data(data, type: type)
       raise ArgumentError, "Bad size: #{@data.shape}" unless @data.ndim == 1
     end
+    def type
+      TYPE_CAST_MAPPING.find { |_, v| @data.is_a?(v) }[0]
+    end
+    def to(type)
+      Vector.new(self, type: type)
+    end
     def to_numo
       @data
     end
@@ -35,6 +47,8 @@ module Rover
     def size
       @data.size
     end
+    alias_method :length, :size
+    alias_method :count, :size
     def uniq
       Vector.new(@data.to_a.uniq)
@@ -148,6 +162,15 @@ module Rover
       Vector.new(mapped)
     end
+    def tally
+      result = Hash.new(0)
+      @data.each do |v|
+        result[v] += 1
+      end
+      result.default = nil
+      result
+    end
     def sort
       Vector.new(@data.respond_to?(:sort) ? @data.sort : @data.to_a.sort)
     end
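
The `result.default = nil` line in `tally` is easy to overlook. A pure-Ruby sketch over a plain array (an assumption; the gem iterates a Numo array) shows its effect:

```ruby
# Illustrative re-implementation over a plain array.
def tally_sketch(values)
  result = Hash.new(0) # missing keys read as 0 while counting
  values.each { |v| result[v] += 1 }
  result.default = nil # afterwards, unknown keys read as nil instead of 0
  result
end

counts = tally_sketch(["a", "b", "a"])
missing = counts["z"]
```

Resetting the default means the returned hash behaves like an ordinary one, so a lookup for a value that never occurred does not silently report a count of zero.
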
@@ -157,7 +180,11 @@ module Rover
     end
     def each(&block)
-      to_a.each(&block)
+      @data.each(&block)
+    end
+    def each_with_index(&block)
+      @data.each_with_index(&block)
     end
     def max
@@ -176,7 +203,7 @@ module Rover
     def median
       # need to cast to get correct result
-      # TODO file bug with Numo
+      # https://github.com/ruby-numo/numo-narray/issues/165
       @data.cast_to(Numo::DFloat).median
     end
@@ -188,6 +215,16 @@ module Rover
       @data.sum
     end
+    # uses Bessel's correction for now since that's all Numo supports
+    def std
+      @data.cast_to(Numo::DFloat).stddev
+    end
+    # uses Bessel's correction for now since that's all Numo supports
+    def var
+      @data.cast_to(Numo::DFloat).var
+    end
     def all?(&block)
       @data.to_a.all?(&block)
     end
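
The comments on `std` and `var` mention Bessel's correction, which divides by n - 1 instead of n. A small pure-Ruby sketch of that formula, assuming plain arrays (`sample_variance` and `sample_std` are hypothetical helper names):

```ruby
# Sample variance with Bessel's correction: divide by (n - 1) rather than n.
def sample_variance(values)
  n = values.size
  raise ArgumentError, "need at least two values" if n < 2

  mean = values.sum.to_f / n
  values.sum { |v| (v - mean)**2 } / (n - 1)
end

# Sample standard deviation is the square root of the sample variance.
def sample_std(values)
  Math.sqrt(sample_variance(values))
end

variance = sample_variance([1, 2, 3, 4])
std = sample_std([1, 2, 3, 4])
```

For `[1, 2, 3, 4]` the squared deviations sum to 5.0, so the corrected variance is 5.0 / 3, whereas the uncorrected (population) variance would be 5.0 / 4.
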
@@ -208,6 +245,11 @@ module Rover
       Vector.new(@data[-n..-1])
     end
+    def take(n)
+      raise ArgumentError, "attempt to take negative size" if n < 0
+      first(n)
+    end
     def crosstab(other)
       index = uniq.sort
       index_pos = index.to_a.map.with_index.to_h
@@ -231,6 +273,20 @@ module Rover
       last(n)
     end
+    def one_hot(drop: false, prefix: nil)
+      raise ArgumentError, "All elements must be strings" unless all? { |vi| vi.is_a?(String) }
+      new_vectors = {}
+      # maybe sort values first
+      values = uniq.to_a
+      values.shift if drop
+      values.each do |v2|
+        # TODO use types
+        new_vectors["#{prefix}#{v2}"] = (self == v2).to_numo.cast_to(Numo::Int64)
+      end
+      DataFrame.new(new_vectors)
+    end
     # TODO add type and size?
     def inspect
       elements = first(5).to_a.map(&:inspect)
@@ -244,5 +300,57 @@ module Rover
       require "iruby"
       IRuby::HTML.table(to_a)
     end
+    private
+    def cast_data(data, type: nil)
+      numo_type = numo_type(type) if type
+      data = data.to_numo if data.is_a?(Vector)
+      if data.is_a?(Numo::NArray)
+        raise ArgumentError, "Complex types not supported yet" if data.is_a?(Numo::DComplex) || data.is_a?(Numo::SComplex)
+        if type
+          case type
+          when /int/
+            # Numo does not check these when casting
+            raise RangeError, "float NaN out of range of integer" if data.respond_to?(:isnan) && data.isnan.any?
+            raise RangeError, "float Inf out of range of integer" if data.respond_to?(:isinf) && data.isinf.any?
+            data = data.to_a.map { |v| v.nil? ? nil : v.to_i } if data.is_a?(Numo::RObject)
+          when /float/
+            data = data.to_a.map { |v| v.nil? ? Float::NAN : v.to_f } if data.is_a?(Numo::RObject)
+          end
+          data = numo_type.cast(data)
+        end
+      else
+        data = data.to_a
+        if type
+          data = numo_type.cast(data)
+        else
+          data =
+            if data.all? { |v| v.is_a?(Integer) }
+              Numo::Int64.cast(data)
+            elsif data.all? { |v| v.is_a?(Numeric) || v.nil? }
+              Numo::DFloat.cast(data.map { |v| v || Float::NAN })
+            elsif data.all? { |v| v == true || v == false }
+              Numo::Bit.cast(data)
+            else
+              Numo::RObject.cast(data)
+            end
+        end
+      end
+      data
+    end
+    def numo_type(type)
+      numo_type = TYPE_CAST_MAPPING[type]
+      raise ArgumentError, "Invalid type: #{type}" unless numo_type
+      numo_type
+    end
   end
 end
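
When no `type` is given, `cast_data` picks a default from the values themselves. The decision order can be sketched without Numo by returning the type name as a symbol; `infer_type` is a hypothetical helper, and the real code casts to the matching Numo class instead.

```ruby
# Hypothetical helper mirroring the order of checks used when no type is given.
def infer_type(values)
  if values.all? { |v| v.is_a?(Integer) }
    :int
  elsif values.all? { |v| v.is_a?(Numeric) || v.nil? }
    :float # in the gem, nil values are stored as NaN in this case
  elsif values.all? { |v| v == true || v == false }
    :bool
  else
    :object
  end
end

types = {
  ints: infer_type([1, 2, 3]),
  mixed_numbers: infer_type([1, 2.5]),
  with_nil: infer_type([1, nil]),
  flags: infer_type([true, false]),
  strings: infer_type(["a", "b"])
}
```

The order matters: integers are checked before general numerics, so a column only becomes `float` when it contains a non-integer number or a missing value.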

data/lib/rover/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Rover
-  VERSION = "0.1.0"
+  VERSION = "0.1.1"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rover-df
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-05-14 00:00:00.000000000 Z
+date: 2020-06-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: numo-narray
@@ -120,6 +120,7 @@ files:
 - lib/rover-df.rb
 - lib/rover.rb
 - lib/rover/data_frame.rb
+- lib/rover/group.rb
 - lib/rover/vector.rb
 - lib/rover/version.rb
 homepage: https://github.com/ankane/rover