RubyGems - data_frame - Versions diffs - 0.1.8 - Mend

data_frame 0.1.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (45)

data/README.rdoc +122 -0
data/VERSION.yml +4 -0
data/bin/plain_frame +22 -0
data/lib/data_frame.rb +26 -0
data/lib/data_frame/arff.rb +52 -0
data/lib/data_frame/callback_array.rb +152 -0
data/lib/data_frame/core/column_management.rb +147 -0
data/lib/data_frame/core/filter.rb +48 -0
data/lib/data_frame/core/import.rb +113 -0
data/lib/data_frame/core/pre_process.rb +69 -0
data/lib/data_frame/core/saving.rb +29 -0
data/lib/data_frame/core/training.rb +46 -0
data/lib/data_frame/data_frame.rb +115 -0
data/lib/data_frame/id3.rb +28 -0
data/lib/data_frame/kmeans.rb +10 -0
data/lib/data_frame/labels_from_uci.rb +48 -0
data/lib/data_frame/mlp.rb +18 -0
data/lib/data_frame/model.rb +22 -0
data/lib/data_frame/parameter_capture.rb +50 -0
data/lib/data_frame/sbn.rb +18 -0
data/lib/data_frame/transposable_array.rb +23 -0
data/lib/ext/array.rb +11 -0
data/lib/ext/open_struct.rb +5 -0
data/lib/ext/string.rb +5 -0
data/lib/ext/symbol.rb +5 -0
data/spec/data_frame/arff_spec.rb +48 -0
data/spec/data_frame/callback_array_spec.rb +148 -0
data/spec/data_frame/core/column_management_spec.rb +128 -0
data/spec/data_frame/core/filter_spec.rb +88 -0
data/spec/data_frame/core/import_spec.rb +41 -0
data/spec/data_frame/core/pre_process_spec.rb +103 -0
data/spec/data_frame/core/saving_spec.rb +61 -0
data/spec/data_frame/core/training_spec.rb +72 -0
data/spec/data_frame/data_frame_spec.rb +141 -0
data/spec/data_frame/id3_spec.rb +22 -0
data/spec/data_frame/model_spec.rb +36 -0
data/spec/data_frame/parameter_capture_spec.rb +32 -0
data/spec/data_frame/transposable_array_spec.rb +138 -0
data/spec/data_frame_spec.rb +29 -0
data/spec/ext/array_spec.rb +13 -0
data/spec/fixtures/basic.csv +3 -0
data/spec/fixtures/discrete_testing.csv +4 -0
data/spec/fixtures/discrete_training.csv +21 -0
data/spec/spec_helper.rb +8 -0
metadata +128 -0

data/README.rdoc ADDED

@@ -0,0 +1,122 @@
+== Data Frame
+This is a general data frame.  Load arrays and labels into it, and you will have a very powerful set of tools on your data set.
+== Usage
+  df = DataFrame.from_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv')
+  df.labels
+  # => [:x, :y, :month, :day, :ffmc, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]
+  df.dmc
+  # => [26.2, 35.4, 43.7, 33.3, 51.3, 85.3,...]
+  df.dmc.max
+  # => 291.3
+  df.dmc.min
+  # => 1.1
+  df.dmc.mean
+  # => 110.872340425532
+  df.dmc.std
+  # => 64.0464822492543
+  df = DataFrame.new(:list, :of, :things)
+  # => #<DataFrame:0x24ec6e8 @items=[], @labels=[:list, :of, :things]>
+  df.labels
+  # => [:list, :of, :things]
+  df << [1,2,3]
+  # => [[1, 2, 3]]
+  df.import([[2,3,4],[5,6,7]])
+  # => [[2, 3, 4], [5, 6, 7]]
+  df.items
+  # => [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
+  df.list
+  # => [1, 2, 5]
+  df.list.correlation(df.things)
+  # => 1.0
+  df.list
+  # => [1, 2, 5]
+  df.things
+  # => [3, 4, 7]
+There are a few important features to know:
+* DataFrame.from_csv works for a string, a filename, or a URL.
+* FasterCSV parsing parameters can be passed to DataFrame.from_csv
+* DataFrame looks for operations first on the column labels, then on the row labels, then on the items table.  So avoid labels such as :mean, :standard_deviation, or :min, because a label with one of those names is found before the operation of the same name.
+* CallbackArray allows you to set a callback any time an array is tainted or untainted (taint, shift, pop, clear, map!, and so on).  This is generally useful and will probably be copied into the Repositories gem.
+* TransposableArray is a subclass of CallbackArray, demonstrating how to use it.  It creates a very simple approach to memoization.  It caches the transpose of the table and resets it whenever it is tainted.
+To get your feet wet, you may want to play with data sets found here:
+  http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html
+== Transformations
+A lot of the work in the data frame is to transform the actual table.  You may need to drop columns, filter results, replace values in a column or create a new data frame based on the existing one.  Here's how to do that:
+  >  df = DataFrame.from_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv')
+  # => DataFrame rows: 517 labels: [:x, :y, :month, :day, :ffmc, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]
+  > df.drop!(:ffmc)
+  # => DataFrame rows: 517 labels: [:x, :y, :month, :day, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]
+  > df.drop!(:dmc, :dc, :isi, :rh)
+  # => DataFrame rows: 517 labels: [:x, :y, :month, :day, :temp, :wind, :rain, :area]
+  > df.x
+  # => [7, 7, 7, 8, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6,...]
+  > df.replace!(:x) {|e| e * 3}
+  # => DataFrame rows: 517 labels: [:x, :y, :month, :day, :temp, :wind, :rain, :area]
+  > df.x
+  # => [21, 21, 21, 24, 24, 24, 24, 24, 24, 21, 21, 21, 18, 18, 18,...]
+  > df.filter!(:open_struct) {|row| row.x == 24}
+  # => DataFrame rows: 61 labels: [:x, :y, :month, :day, :temp, :wind, :rain, :area]
+  > df.x
+  # => [24, 24, 24, 24, 24, 24, 24, 24, 24,...]
+  > new_data_frame = df.subset_from_columns(:x, :y)
+  # => DataFrame rows: 61 labels: [:x, :y]
+  > new_data_frame.items
+  # => [[24, 6], [24, 6], [24, 6], [24, 6], ...]
+Note: most of these transformations are not optimized.  I'll work with things for a while before I try to optimize this library.  However, I should say that I've used some fairly large data sets (thousands of rows) and have been fine with things so far.
+== Models
+Data Frame can now create sub-models:
+	>> df = DataFrame.from_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv')
+	=> DataFrame rows: 517 labels: [:x, :y, :month, :day, :ffmc, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]
+	>> df.model(:weekend) do |m|
+	?> m.day %w(sat sun)
+	>> end
+	=> DataFrame rows: 179 labels: [:x, :y, :month, :day, :ffmc, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]
+	>> df.models.weekend.day.uniq
+	=> ["sat", "sun"]
+	>> df.models
+	=> #<OpenStruct weekend=DataFrame rows: 179 labels: [:x, :y, :month, :day, :ffmc, :dmc, :dc, :isi, :temp, :rh, :wind, :rain, :area]>
+== Utilities
+I use data frame for a lot of things, and I've added some utilities for this gem in case you would like to as well.  For instance, here is how I take the data in a data frame and load it into a neural network:
+  # Show mlp.  Will probably need to add a row classifier for training and test data.  Also, will probably want to
+== CLI
+There are some really interesting things that have good command-line shortcuts:
+* Make
+* A
+* List
+	# Now add some demos
+== Installation
+sudo gem install davidrichards-data_frame
+=== Dependencies
+* ActiveSupport: sudo gem install activesupport
+* JustEnumerableStats: sudo gem install davidrichards-just_enumerable_stats
+* FasterCSV: sudo gem install fastercsv
+== COPYRIGHT
+Copyright (c) 2009 David Richards. See LICENSE for details.
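The README's note about lookup order (column labels are tried before anything else) can be sketched in plain Ruby. This is only an illustration under stated assumptions: `TinyFrame` is a hypothetical stand-in, not the gem's `DataFrame`, and it skips row labels entirely. It shows why a column named `:min` or `:size` would be found before the table operation of the same name.

```ruby
# Hypothetical stand-in for illustration; not the gem's DataFrame class.
class TinyFrame
  attr_reader :labels, :items

  def initialize(*labels)
    @labels = labels
    @items = []
  end

  # Append one row and return the rows collected so far.
  def <<(row)
    @items << row
  end

  # Unknown messages are tried as column labels first,
  # then handed to the rows table.
  def method_missing(name, *args, &block)
    if (index = @labels.index(name))
      @items.map { |row| row[index] }
    elsif @items.respond_to?(name)
      @items.send(name, *args, &block)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    @labels.include?(name) || @items.respond_to?(name, include_private) || super
  end
end

df = TinyFrame.new(:list, :of, :things)
df << [1, 2, 3]
df << [2, 3, 4]
df << [5, 6, 7]

df.list    # column lookup wins: [1, 2, 5]
df.things  # [3, 4, 7]
df.size    # not a label, so it falls through to the rows table: 3
```

If `:size` were one of the labels, `df.size` would return that column instead of the row count, which is the naming hazard the README warns about.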

data/VERSION.yml ADDED

@@ -0,0 +1,4 @@
+---
+:major: 0
+:minor: 1
+:patch: 8

data/bin/plain_frame ADDED

@@ -0,0 +1,22 @@
+#!/usr/bin/env ruby -wKU
+require 'yaml'
+version_hash = YAML.load_file(File.join(File.dirname(__FILE__), %w(.. VERSION.yml)))
+version = [version_hash[:major].to_s, version_hash[:minor].to_s, version_hash[:patch].to_s].join(".")
+df_file = File.join(File.dirname(__FILE__), %w(.. lib data_frame))
+irb = RUBY_PLATFORM =~ /(?:mswin|mingw)/ ? 'irb.bat' : 'irb'
+require 'optparse'
+options = { :irb => irb }
+OptionParser.new do |opt|
+  opt.banner = "Usage: plain_frame [options]"
+  opt.on("--irb=[#{irb}]", 'Invoke a different irb.') { |v| options[:irb] = v }
+  opt.parse!(ARGV)
+end
+libs =  " -r irb/completion -r #{df_file}"
+puts "Loading Data Frame version: #{version}"
+exec "#{options[:irb]} #{libs} --simple-prompt"
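The two steps this script performs before launching irb (joining the version parts and parsing `--irb`) can be tried in isolation. A minimal sketch, assuming the version hash is inlined rather than read from VERSION.yml; `parse_console_options` and the value `my_irb` are hypothetical names used only for illustration.

```ruby
require 'optparse'

# Same shape as VERSION.yml, inlined so the sketch needs no file on disk.
version_hash = { :major => 0, :minor => 1, :patch => 8 }
version = version_hash.values_at(:major, :minor, :patch).join('.')

# Hypothetical helper: parses a copy of the arguments instead of the real ARGV.
def parse_console_options(argv, default_irb = 'irb')
  options = { :irb => default_irb }
  parser = OptionParser.new do |opt|
    opt.banner = 'Usage: plain_frame [options]'
    opt.on('--irb=[NAME]', 'Invoke a different irb.') { |v| options[:irb] = v }
  end
  parser.parse!(argv.dup)
  options
end

default_options = parse_console_options([])
custom_options  = parse_console_options(['--irb=my_irb'])

puts "Loading Data Frame version: #{version}"
puts default_options[:irb]  # irb
puts custom_options[:irb]   # my_irb
```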

data/lib/data_frame.rb ADDED

@@ -0,0 +1,26 @@
+require 'rubygems'
+require 'activesupport'
+require 'just_enumerable_stats'
+require 'open-uri'
+require 'fastercsv'
+require 'ostruct'
+# Use a Dictionary if available
+begin
+  require 'facets/dictionary'
+rescue LoadError => e
+  # Do nothing
+end
+Dir.glob("#{File.dirname(__FILE__)}/ext/*.rb").each { |file| require file }
+$:.unshift(File.dirname(__FILE__))
+require 'data_frame/callback_array'
+require 'data_frame/transposable_array'
+require 'data_frame/parameter_capture'
+require 'data_frame/data_frame'
+require 'data_frame/model'
+Dir.glob("#{File.dirname(__FILE__)}/data_frame/core/*.rb").each { |file| require file }
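The optional `require` above follows a common pattern: try to load a library, ignore the failure, and branch later on whether its constant exists. A small sketch of that pattern, assuming a placeholder library name that is not installed. On Ruby 1.9 and later a plain Hash already keeps insertion order, so the Hash fallback behaves as an ordered container there.

```ruby
# 'an_optional_library_that_is_missing' is a placeholder name, assumed not to be installed.
begin
  require 'an_optional_library_that_is_missing'
  optional_loaded = true
rescue LoadError
  optional_loaded = false
end

# Same fallback shape as the gem uses: prefer Dictionary when it is defined, else a Hash.
container = defined?(Dictionary) ? Dictionary.new : {}
container[:zebra] = 1
container[:apple] = 2

container.keys  # insertion order is kept: [:zebra, :apple]
```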

data/lib/data_frame/arff.rb ADDED

@@ -0,0 +1,52 @@
+module DF #:nodoc:
+  # Turns a data frame into ARFF-formatted content.
+  module ARFF
+    # Used in arff, but generally useful.
+    def to_csv(include_header=true)
+      value = include_header ? self.labels.map{|e| e.to_s}.join(',') + "\n" : ''
+      self.items.inject(value) do |list, e|
+        list << e.map {|cell| cell.to_s}.join(',') + "\n"
+      end
+    end
+    def to_arff
+      arff_header + to_csv(false)
+    end
+    protected
+      def arff_attributes
+        container = defined?(Dictionary) ? Dictionary.new : Hash.new
+        # Fill the container label by label, then return the container itself (not the last assigned value).
+        self.labels.each { |e| container[e] = self.render_column(e).categories }
+        container
+      end
+      def arff_formatted_attributes
+        self.labels.inject('') do |str, e|
+          val = "{" + self.render_column(e).categories.map{|x| x.to_s}.join(',') + "}"
+          str << "@attribute #{e} #{val}\n"
+        end
+      end
+      def arff_relation
+        self.name ? self.name.to_underscore_sym.to_s : 'unnamed_relation'
+      end
+      def arff_header
+        %[@relation #{arff_relation}
+#{arff_formatted_attributes}
+@data
+]
+      end
+      alias :arff_items :to_csv
+  end
+end
+class DataFrame
+  include DF::ARFF
+end
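The ARFF layout this module produces (a relation line, one nominal attribute per column, then CSV rows under `@data`) can be sketched without the gem. `rows_to_arff` is a hypothetical helper, and it assumes a column's categories are simply its distinct values in order of first appearance; the gem takes them from `render_column(label).categories`, which may differ.

```ruby
# Hypothetical helper for illustration; the gem builds this from DataFrame#labels and #items.
def rows_to_arff(relation, labels, rows)
  attributes = labels.each_with_index.map do |label, i|
    # Assumption: a nominal attribute lists each distinct value once.
    categories = rows.map { |row| row[i] }.uniq
    "@attribute #{label} {#{categories.join(',')}}"
  end
  data = rows.map { |row| row.join(',') }
  (["@relation #{relation}"] + attributes + ['@data'] + data).join("\n") + "\n"
end

arff = rows_to_arff('weather',
                    [:outlook, :play],
                    [%w(sunny no), %w(rainy yes), %w(sunny yes)])
puts arff
# @relation weather
# @attribute outlook {sunny,rainy}
# @attribute play {no,yes}
# @data
# sunny,no
# rainy,yes
# sunny,yes
```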

data/lib/data_frame/callback_array.rb ADDED

@@ -0,0 +1,152 @@
+# This overloads the tainting methods in array with callbacks.  So, I
+# can block all changes to an array, or broadcast to observers after a
+# change, or limit the size of an array. It really just opens up the array to one more dimension: change.  Before and after change, stack up any activity to block or enhance the experience.  There are also callbacks on untaint.  The tainting methods actually
+class CallbackArray < Array
+  include ActiveSupport::Callbacks
+  define_callbacks :before_taint, :after_taint, :before_untaint, :after_untaint
+  def wrap_call(safe_method, *args)
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = self.send(safe_method, *args)
+      self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  protected :wrap_call
+  # Need the original taint for all tainting methods
+  alias :orig_taint :taint
+  def taint
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  # No other method needs orig_untaint, so building this in the cleanest
+  # way possible.
+  orig_untaint = instance_method(:untaint)
+  define_method(:untaint) {
+    callback_result = run_callbacks(:before_untaint)
+    if callback_result
+      val = orig_untaint.bind(self).call
+      run_callbacks(:after_untaint)
+    end
+    val
+  }
+  alias :nontainting_assign :[]=
+  def []=(index, value)
+    wrap_call(:nontainting_assign, index, value)
+  end
+  alias :nontainting_append :<<
+  def <<(value)
+    wrap_call(:nontainting_append, value)
+  end
+  alias :nontainting_delete :delete
+  def delete(value)
+    wrap_call(:nontainting_delete, value)
+  end
+  alias :nontainting_push :push
+  def push(*values) # Array#push accepts any number of values
+    wrap_call(:nontainting_push, *values)
+  end
+  alias :nontainting_pop :pop
+  def pop
+    wrap_call(:nontainting_pop)
+  end
+  alias :nontainting_shift :shift
+  def shift
+    wrap_call(:nontainting_shift)
+  end
+  alias :nontainting_unshift :unshift
+  def unshift(*values) # Array#unshift accepts any number of values
+    wrap_call(:nontainting_unshift, *values)
+  end
+  alias :nontainting_map! :map!
+  def map!(&block)
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = nontainting_map!(&block)
+      self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  alias :nontainting_sort! :sort!
+  def sort!(&block)
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = nontainting_sort!(&block)
+      self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  alias :nontainting_reverse! :reverse!
+  def reverse!
+    wrap_call(:nontainting_reverse!)
+  end
+  alias :nontainting_collect! :collect!
+  def collect!(&block)
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = nontainting_collect!(&block)
+      self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  alias :nontainting_compact! :compact!
+  def compact!
+    wrap_call(:nontainting_compact!)
+  end
+  alias :nontainting_reject! :reject!
+  def reject!(&block)
+    callback_result = run_callbacks(:before_taint)
+    if callback_result
+      result = nontainting_reject!(&block)
+      self.orig_taint
+      run_callbacks(:after_taint)
+    end
+    result
+  end
+  alias :nontainting_slice! :slice!
+  def slice!(*args)
+    wrap_call(:nontainting_slice!, *args)
+  end
+  alias :nontainting_flatten! :flatten!
+  def flatten!
+    wrap_call(:nontainting_flatten!)
+  end
+  alias :nontainting_uniq! :uniq!
+  def uniq!
+    wrap_call(:nontainting_uniq!)
+  end
+  alias :nontainting_clear :clear
+  def clear
+    wrap_call(:nontainting_clear)
+  end
+end
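The idea behind CallbackArray (run hooks before and after every mutating call, and let a before-hook veto the change) can be sketched without ActiveSupport and without taint, which newer Ruby versions no longer support. `HookedArray`, `before_change`, and `after_change` are hypothetical names for this sketch and are not part of the gem. The usage part mirrors what the README says TransposableArray does: cache the transpose and reset it on change.

```ruby
# Hypothetical sketch: hooks around mutating Array methods.
class HookedArray < Array
  MUTATORS = [:<<, :push, :pop, :shift, :unshift, :clear, :map!, :[]=].freeze

  def before_change(&hook)
    (@before_hooks ||= []) << hook
    self
  end

  def after_change(&hook)
    (@after_hooks ||= []) << hook
    self
  end

  MUTATORS.each do |name|
    define_method(name) do |*args, &block|
      # Any before-hook returning false or nil vetoes the change.
      if (@before_hooks || []).all? { |hook| hook.call(self) }
        result = super(*args, &block)
        (@after_hooks || []).each { |hook| hook.call(self) }
        result
      end
    end
  end
end

# Memoized transpose that is reset whenever the table changes.
table = HookedArray.new
transpose_cache = nil
table.after_change { transpose_cache = nil }

table << [1, 2, 3]
table << [4, 5, 6]
transpose_cache ||= table.transpose  # [[1, 4], [2, 5], [3, 6]]
table << [7, 8, 9]                   # the after-hook clears the cache

# A before-hook that always refuses blocks every change.
locked = HookedArray.new
locked.before_change { false }
locked << 1                          # vetoed, the array stays empty
```

Generating the wrappers from one list keeps block-taking methods such as `map!` on the same code path as the others, at the cost of being less explicit than one method per mutator.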

data/lib/data_frame/core/column_management.rb ADDED

@@ -0,0 +1,147 @@
+module DF #:nodoc:
+  module ColumnManagement #:nodoc:
+    def move_to_last!(orig_name)
+      raise ArgumentError, "Column not found" unless self.labels.include?(orig_name)
+      new_name = (orig_name.to_s + "_a_unique_name").to_sym
+      self.append!(new_name, self.render_column(orig_name))
+      self.drop!(orig_name)
+      self.rename!(orig_name, new_name)
+    end
+    # In the order of alias: new_name, orig_name
+    def rename!(new_name, orig_name)
+      new_name = new_name.to_underscore_sym
+      orig_name = orig_name.to_underscore_sym
+      raise ArgumentError, "Column not found" unless self.labels.include?(orig_name)
+      raise ArgumentError, "Cannot rename #{orig_name} to #{new_name}, that column already exists." if self.labels.include?(new_name)
+      i = self.labels.index(orig_name)
+      self.labels[i] = new_name
+    end
+    # Adds a unique column to the table
+    def append!(column_name, value=nil)
+      raise ArgumentError, "Can't have duplicate column names" if self.labels.include?(column_name.to_underscore_sym)
+      self.labels << column_name.to_underscore_sym
+      if value.is_a?(Array)
+        self.items.each_with_index do |item, i|
+          item << value[i]
+        end
+      else
+        self.items.each do |item|
+          item << value
+        end
+      end
+      self.columns(true)
+      # Because we are changing the sub-arrays directly, the items array doesn't know it's been changed, so taint it explicitly.
+      self.items.taint
+    end
+    def replace!(column, values=nil, &block)
+      column = validate_column(column)
+      if not values
+        values = self.send(column)
+        values.map! {|e| block.call(e)}
+      end
+      replace_column!(column, values)
+      self
+    end
+    # Replace a single column with an array of values.
+    # It is helpful to have the values the same size as the rest of the data
+    # frame.
+    def replace_column!(column, values)
+      store_range_hashes
+      column = validate_column(column)
+      index = self.labels.index(column)
+      @items.each_with_index do |item, i|
+        item[index] = values[i]
+      end
+      # Make sure we recalculate things after changing a column
+      self.items.taint
+      @columns = nil
+      self.columns
+      restore_range_hashes
+      # Return the items
+      @items
+    end
+    # Drop one or more columns
+    def drop!(*labels)
+      labels.each do |label|
+        drop_one!(label)
+      end
+      self
+    end
+    # Drop a single column
+    def drop_one!(label)
+      i = self.labels.index(label)
+      return nil unless i
+      self.items.each do |item|
+        item.delete_at(i)
+      end
+      self.labels.delete_at(i)
+      self
+    end
+    # Creates a new data frame, only with the specified columns.
+    def subset_from_columns(*cols)
+      new_labels = self.labels.inject([]) do |list, label|
+        list << label if cols.include?(label)
+        list
+      end
+      new_data_frame = DataFrame.new(*self.labels)
+      new_data_frame.import(self.items)
+      self.labels.each do |label|
+        new_data_frame.drop!(label) unless new_labels.include?(label)
+      end
+      new_data_frame
+    end
+    # Duplicates a column, the values only.  This is useful when creating a related column, such as values by category.
+    def duplicate!(column_name)
+      return false unless self.labels.include?(column_name)
+      i = 1
+      i += 1 while self.labels.include?(new_column_name(column_name, i))
+      self.append!(new_column_name(column_name, i), self.render_column(column_name).dup)
+      true
+    end
+    def new_column_name(column_name, i)
+      (column_name.to_s + i.to_s).to_sym
+    end
+    protected :new_column_name
+    protected
+      def store_range_hashes
+        @stored_range_hashes = self.labels.inject({}) do |h, label|
+          h[label] = self.render_column(label).range_hash
+          h
+        end
+        @stored_range_hashes = nil if @stored_range_hashes.all? {|k, v| v.nil?}
+      end
+      def restore_range_hashes
+        return false unless @stored_range_hashes
+        @stored_range_hashes.each do |label, range_hash|
+          self.render_column(label).set_categories(range_hash) if range_hash
+        end
+        true
+      end
+      def category_map_from_stored_range_hash(column)
+        self.render_column(column).set_categories(@stored_range_hashes[column]) if
+          @stored_range_hashes and @stored_range_hashes.keys.include?(column)
+        self.render_column(column).category_map.dup
+      end
+  end
+end
+class DataFrame
+  include DF::ColumnManagement
+end
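The column operations in this module all come down to index arithmetic on a labels array and an array of rows. A minimal sketch on plain arrays, assuming hypothetical helpers `drop_columns` and `append_column` that return new arrays instead of mutating in place (the gem's `drop!` and `append!` mutate the frame).

```ruby
# Hypothetical helpers for illustration; they copy rows instead of mutating them.
def drop_columns(labels, rows, *unwanted)
  keep = labels.each_index.reject { |i| unwanted.include?(labels[i]) }
  [labels.values_at(*keep), rows.map { |row| row.values_at(*keep) }]
end

def append_column(labels, rows, name, values)
  raise ArgumentError, "Can't have duplicate column names" if labels.include?(name)
  [labels + [name], rows.each_with_index.map { |row, i| row + [values[i]] }]
end

labels = [:x, :y, :temp]
rows   = [[7, 5, 8.2], [8, 6, 18.0]]

labels2, rows2 = drop_columns(labels, rows, :y)
# labels2 is [:x, :temp] and rows2 is [[7, 8.2], [8, 18.0]]

tripled = rows2.map { |row| row[0] * 3 }
labels3, rows3 = append_column(labels2, rows2, :x3, tripled)
# labels3 is [:x, :temp, :x3] and rows3 is [[7, 8.2, 21], [8, 18.0, 24]]
```

Returning copies avoids the cache-invalidation step that the in-place versions need after touching the sub-arrays.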