RubyGems - incsv - Versions diffs - 0.1.0 → 0.2.0 - Mend

incsv 0.1.0 → 0.2.0

Files changed (16)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 349fb4611684a002bf078566b5d115d826f4e8a9
-  data.tar.gz: 2f2551c05e9a5adf601d1fb5bbd272b956e37140
+  metadata.gz: 2a64c9c204b84b53e240994132a99746b64a3d8a
+  data.tar.gz: e302031882b7e9b9aac37108e79025c593062c31
 SHA512:
-  metadata.gz: 73ad7b8f885e8068898fa8ef8ff80947e2221a5f367c2dad9a948ffb7b7a26a35ab5c5cd47ff670ffedefaa16c47b16b364cf8f8b63ab127c387d01484fb10c3
-  data.tar.gz: 689a7012aa844f8f3b8d04a0385b1554ac226b5a92a9de4ba8d28d4398e74546ff1b68a64faaef7ef0ed34a08d55bc7ee3c00f5c449a9de71408fc7d2e2f45b9
+  metadata.gz: 22d2a9fb3bcfc0206b96af378d5b7dffdc0b79501c1eac19b9482451c691ca1e9bbdee925b22971e68bedf4cc33c64b4d2f26c46e13d089fe48f566100accb50
+  data.tar.gz: 270ca3f95c76700bc24f410ec90b39141d40fb683d32557b1fc0b72e5a99853fb22dc7fe5a81db335087e966d3c85dac8870a72325f34cbf01ad966d68f9dfcf

data/.gitignore CHANGED

@@ -7,3 +7,4 @@
 /pkg/
 /spec/reports/
 /tmp/
+*.db

data/README.md CHANGED

@@ -18,7 +18,128 @@ incsv can be installed via RubyGems:
 ## Usage
-TBC.
+### The quick version
+The following command will drop you into a [REPL][] prompt:
+	$ incsv console path/to/file.csv
+A Sequel connection to the database is stored in a variable called
+`@db`. The name of the table is based on the filename of the CSV; so, if
+your CSV file is called `products.csv`, then data will be imported into
+a database table called `products`.
+A quick example:
+	> @db[:products].select(:name).reverse_order(:price).take(5)
+	=> [{:name=>"Makeshift battery"},
+	  {:name=>"clothing iron"},
+	  {:name=>"toy alien"},
+	  {:name=>"enhanced targeting card"},
+	  {:name=>"Giddyup Buttercup"}]
+[repl]: https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
+### The less-quick version
+To use incsv, you essentially just need to point it at a CSV file. It’ll
+then take care of parsing the CSV, figuring out the nature of the data
+within it, creating a database and a table, and importing the data.
+To perform all of these steps and be given an interactive console once
+they’re done, you can use the `console` command.
+Let’s imagine we have a CSV file that contains some product information:
+	$ head -3 products.csv
+	name,date_added,price
+	"Acid",2013-03-24,£38
+	"Abraxo cleaner",2016-09-25,£21
+Here we can see that we have three columns: the product name, which is
+just a string; the date the product was added, which is an
+ISO-8601–formatted date; and the price, which is a currency value in
+pounds sterling.
+In my sample data there are 515 products (plus a header row):
+	$ wc -l products.csv
+	516
+In order to query this data, we can pass the CSV file to incsv:
+	$ incsv console products.csv
+	Found database at products.db
+	Connection is in @db
+	Primary table name is products
+	Columns: _incsv_id, name, date_added, price
+	First row:
+	_incsv_id, name, date_added, price
+	1, Acid, 2013-03-24, 0.38E2
+	Not sure what to do next? Try this:
+	@db[:products].count
+	>
+It tells us some information about the file, and about the assumptions
+it has made about the file. We can see that it’s imported the contents
+of the file into a table called `products`, and that it’s used the
+column names from the CSV to name the columns in the database table.
+It also shows us the first row, where you might have noticed that the
+price is in a slightly odd representation. That’s because incsv will
+look at what type of data seems to be stored in your CSV before
+importing it. In this case, it knows that the `date_added` column
+contains a date, and that the `price` column contains a currency value.
+In the former case, that means converting it into an actual SQL date. In
+the latter case, it means converting it to `BigDecimal` format (and
+storing it in the database as `DECIMAL(10, 2)`), so that we neither
+lose precision by storing the value as a float nor lose the ability
+to do numerical calculations by storing it as a string.
+It then suggests a query for us to run, which might generally be the
+first thing that you’d want to know about the dataset: how many rows
+are there? We can run it and see:
+	> @db[:products].count
+	=> 515
+Excellent! It’s imported every one of the products that were in the CSV.
+From this point on we can do any kind of analysis of the data that we
+like; we have all the power of SQLite and Sequel at our fingertips. For
+example, to get the number of products added each year:
+	> @db[:products].group_and_count{strftime("%Y", date_added).as(year)}.all
+	=> [{:year=>"2013", :count=>132}, {:year=>"2014", :count=>123}, {:year=>"2015", :count=>131}, {:year=>"2016", :count=>129}]
+Or to get the total value of products added today:
+	> @db[:products].select{sum(price).as(total_cost)}.where(date_added: Date.today).first
+	=> {:total_cost=>40}
+We can also do processing in Ruby, if there’s anything that’s difficult
+in pure SQL. Imagine wanting to convert the product names to
+URL-friendly “slugs”. This is pretty easy in Ruby. Let’s try it out on
+the top 10 most expensive products:
+	> @db[:products].select(:name).reverse_order(:price).limit(10).each do |product|
+	*   puts product[:name].gsub(/\s/, "-").squeeze("-").downcase.gsub(/[^a-z0-9\-]/, "")
+	* end
+	makeshift-battery
+	clothing-iron
+	toy-alien
+	enhanced-targeting-card
+	giddyup-buttercup
+	mole-rat-teeth
+	empty-teal-rounded-vase
+	pre-war-money
+	bowling-ball
+	toothbrush
+Hopefully this illustrates what you can do with incsv!
 ## Development
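
The README's argument for storing prices as `DECIMAL(10, 2)` rather than as a float can be illustrated with a minimal sketch. It uses only Ruby's `bigdecimal` library and is not taken from the gem itself; the variable names are illustrative.

```ruby
require "bigdecimal"

# Binary floats cannot represent most decimal fractions exactly,
# so a simple sum of prices can drift.
float_sum = 0.1 + 0.2
puts float_sum == 0.3                   # false

# BigDecimal keeps exact decimal digits, so the same sum compares equal.
decimal_sum = BigDecimal("0.1") + BigDecimal("0.2")
puts decimal_sum == BigDecimal("0.3")   # true
```

This is the same reason the first row in the console output shows the price as a `BigDecimal` rather than as a plain float.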

data/exe/incsv ADDED

@@ -0,0 +1,94 @@
+#!/usr/bin/env ruby
+$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+require "thor"
+require "pry"
+require "incsv"
+module InCSV
+  class Console
+    def initialize(db)
+      @db = db
+    end
+    def get_binding
+      binding
+    end
+  end
+  class CLI < Thor
+    desc "create CSV_FILE", "Creates a database file with the appropriate schema for the given CSV file, but doesn't import any data."
+    method_option :force, type: :boolean, default: false
+    def create(csv_file)
+      database = Database.new(csv_file)
+      if database.exists? && database.table_created? && !options.force?
+        $stderr.puts "Database already exists."
+        exit 41
+      end
+      database.create_table
+      puts "Database created successfully in #{database.db_path}"
+    rescue StandardError => e
+      $stderr.puts "Database failed to create."
+      $stderr.puts "#{e.message}"
+      exit 40
+    end
+    desc "import CSV_FILE", "Creates a database file with the appropriate schema for the given CSV file, and then imports the data within the file."
+    method_option :force, type: :boolean, default: false
+    def import(csv_file)
+      database = Database.new(csv_file)
+      create(csv_file)
+      database.import
+      puts "Data imported."
+      puts
+      puts "Command to query:"
+      puts "$ sqlite3 #{database.db_path}"
+    rescue StandardError => e
+      $stderr.puts "Import failed."
+      $stderr.puts "#{e.message}"
+      exit 50
+    end
+    desc "console CSV_FILE", "Opens a query console for the given CSV file, creating a database file and importing the data if necessary."
+    def console(csv_file)
+      database = Database.new(csv_file)
+      unless database.table_created? && database.imported?
+        database.create_table
+        database.import
+      end
+      console = Console.new(database.db)
+      puts "Found database at #{database.db_path}"
+      puts "Connection is in @db"
+      puts
+      puts "Primary table name is #{database.table_name}"
+      puts "Columns: #{database.db[database.table_name].columns.join(", ")}"
+      first_row = database.db[database.table_name].first
+      puts
+      puts "First row:"
+      puts first_row.keys.join(", ")
+      puts first_row.values.join(", ")
+      puts
+      puts "Not sure what to do next? Try this:"
+      puts "@db[:#{database.table_name}].count"
+      console.get_binding.pry(quiet: true, prompt: [proc { "> " }, proc { "* " }])
+    rescue StandardError => e
+      $stderr.puts "Failed to start console."
+      $stderr.puts "#{e.message}"
+      exit 60
+    end
+  end
+end
+InCSV::CLI.start

data/incsv.gemspec CHANGED

@@ -24,6 +24,7 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "rspec", "~> 3.0"
   spec.add_runtime_dependency "thor", "~> 0.19.1"
+  spec.add_runtime_dependency "pry", "~> 0.10"
   spec.add_runtime_dependency "sqlite3", "~> 1.3"
   spec.add_runtime_dependency "sequel", "~> 4.31"
 end

data/lib/incsv/column.rb ADDED

@@ -0,0 +1,25 @@
+require "bigdecimal"
+module InCSV
+  class Column
+    def initialize(name, values)
+      @name = name
+      @values = values
+    end
+    attr_reader :name
+    def type
+      Types.constants.select do |column_type|
+        column_type = Types.const_get(column_type)
+        if values.all? { |value| value.nil? || column_type.new(value).match? }
+          return column_type
+        end
+      end
+    end
+    private
+    attr_accessor :values
+  end
+end
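
`Column#type` above returns the first type whose `match?` accepts every non-nil sample value. The sketch below shows that first-match idea in isolation. The `MATCHERS` hash and `guess_type` helper are hypothetical stand-ins for the `Types::*` classes, with the regular expressions copied from `types/date.rb` and `types/currency.rb`; it assumes the types are tried in the order Date, Currency, String.

```ruby
# Hypothetical stand-ins for the Types::* classes: each entry maps a type
# name to a predicate. Order matters, and the catch-all :string comes last.
MATCHERS = {
  date:     ->(v) { v.strip.match?(/\A[0-9]{4}-[0-9]{2}-[0-9]{2}\z/) },
  currency: ->(v) { v.strip.match?(/\A(\$|£)([0-9,\.]+)\z/) },
  string:   ->(_v) { true }
}.freeze

# Returns the first type whose predicate accepts every non-nil sample value.
def guess_type(values)
  MATCHERS.each do |name, matcher|
    return name if values.all? { |v| v.nil? || matcher.call(v) }
  end
end

p guess_type(["2013-03-24", nil, "2016-09-25"])  # => :date
p guess_type(["£38", "$1,200.50"])               # => :currency
p guess_type(["Acid", "2016-09-25"])             # => :string
```

One consequence of this design, at least in the sketch: a column whose samples are all nil matches the first type tried, because nil values are skipped rather than tested.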

data/lib/incsv/column_type.rb ADDED

@@ -0,0 +1,30 @@
+module InCSV
+  class ColumnType
+    def self.name
+      self.to_s.sub(/.*::/, "").downcase.to_sym
+    end
+    def self.for_database
+      self.to_s.sub(/.*::/, "").downcase.to_sym
+    end
+    def initialize(value)
+      @value = value
+    end
+    def match?
+      false
+    end
+    def clean_value
+      self.class.clean_value(@value)
+    end
+    def self.clean_value(value)
+      value
+    end
+    private
+    attr_reader :value
+  end
+end

data/lib/incsv/database.rb ADDED

@@ -0,0 +1,89 @@
+require "sequel"
+require "pathname"
+module InCSV
+  class Database
+    def initialize(csv)
+      @csv = csv
+      @db = Sequel.sqlite(db_path)
+      # require "logger"
+      # @db.loggers << Logger.new($stdout)
+    end
+    attr_reader :db
+    def table_created?
+      @db.table_exists?(table_name)
+    end
+    def imported?
+       table_created? && @db[table_name].count > 0
+    end
+    def exists?
+      File.exist?(db_path)
+    end
+    def db_path
+      path = Pathname(csv)
+      (path.dirname + (path.basename(".csv").to_s + ".db")).to_s
+    end
+    def table_name
+      @table_name ||= begin
+        File.basename(csv, ".csv").downcase.gsub(/[^a-z_]/, "").to_sym
+      end
+    end
+    def create_table
+      @db.create_table!(table_name) do
+        primary_key :_incsv_id
+      end
+      schema.columns.each do |c|
+        @db.alter_table(table_name) do
+          add_column c.name, c.type.for_database
+        end
+      end
+    end
+    def import
+      return if imported?
+      create_table unless table_created?
+      columns      = schema.columns
+      column_names = columns.map(&:name)
+      chunks(200) do |chunk|
+        rows = chunk.map do |row|
+          row.to_hash.values.each_with_index.map do |column, n|
+            columns[n].type.clean_value(column)
+          end
+        end
+        @db[table_name].import(column_names, rows)
+      end
+    end
+    private
+    attr_reader :csv
+    def schema
+      @schema ||= Schema.new(csv)
+    end
+    def chunks(size = 200, &block)
+      data =
+        File.read(csv)
+          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
+      csv = CSV.new(data, headers: true)
+      csv.each_slice(size, &block)
+      csv.close
+    end
+  end
+end
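
The database path and table name are both derived from the CSV filename. As a rough illustration, the two expressions from `Database#db_path` and `Database#table_name` are copied below into standalone helper functions (the top-level function form and the sample paths are assumptions made for the example):

```ruby
require "pathname"

# Mirrors Database#db_path: same directory, ".csv" swapped for ".db".
def db_path(csv)
  path = Pathname(csv)
  (path.dirname + (path.basename(".csv").to_s + ".db")).to_s
end

# Mirrors Database#table_name: lowercase, keep only a-z and underscore.
def table_name(csv)
  File.basename(csv, ".csv").downcase.gsub(/[^a-z_]/, "").to_sym
end

p db_path("data/products.csv")     # => "data/products.db"
p table_name("data/products.csv")  # => :products
p table_name("Sales-2016.csv")     # => :sales
```

Note that digits and hyphens are stripped from the table name, so differently named files such as `Sales-2016.csv` and `Sales-2017.csv` would map to the same table name (though to different `.db` files).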

data/lib/incsv/schema.rb ADDED

@@ -0,0 +1,55 @@
+require "csv"
+module InCSV
+  class Schema
+    def initialize(csv)
+      @csv = csv
+    end
+    def columns
+      @columns ||= parsed_columns
+    end
+    private
+    attr_reader :csv
+    def parsed_columns
+      samples(50).map do |name, values|
+        Column.new(name, values)
+      end
+    end
+    # Returns the first `num_rows` rows of data, transposed into a hash.
+    #
+    # For example, the following CSV data:
+    #
+    # foo,bar
+    # 1,2
+    # 3,4
+    #
+    # Would become:
+    #
+    # { "foo" => [1, 3], "bar" => [2, 4] }
+    #
+    # This gives us enough data to be able to guess the type of
+    # a column.
+    def samples(num_rows)
+      data =
+        File.read(csv)
+          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
+      csv = CSV.new(data, headers: true)
+      sample_data = csv.each.take(num_rows)
+      csv.close
+      sample_data.map(&:to_a).flatten(1).each_with_object({}) do |row, data|
+        column = row[0]
+        value  = row[1]
+        data[column] ||= []
+        data[column] << value
+      end
+    end
+  end
+end
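
The transposition done by `Schema#samples` can be tried on its own with the `foo,bar` data from the comment above. This is a small sketch using Ruby's `csv` library with an inline string instead of a file; the variable names are illustrative.

```ruby
require "csv"

data = "foo,bar\n1,2\n3,4\n"

# Take the first rows, then regroup the (header, value) pairs by header.
rows = CSV.new(data, headers: true).each.take(50)
samples = rows.map(&:to_a).flatten(1).each_with_object({}) do |(column, value), acc|
  (acc[column] ||= []) << value
end

# samples == { "foo" => ["1", "3"], "bar" => ["2", "4"] }
p samples
```

Because no CSV converters are configured, the sampled values are strings (`"1"`, not `1`), which is what the type matchers then inspect.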

data/lib/incsv/types/currency.rb ADDED

@@ -0,0 +1,23 @@
+module InCSV
+  module Types
+    class Currency < ColumnType
+      MATCH_EXPRESSION = /\A(\$|£)([0-9,\.]+)\z/
+      def self.for_database
+        "DECIMAL(10,2)"
+      end
+      def match?
+        value.strip.match(MATCH_EXPRESSION)
+      end
+      def self.clean_value(value)
+        return unless value
+        value.strip.match(MATCH_EXPRESSION) do |match|
+          BigDecimal(match[2].delete(","))
+        end
+      end
+    end
+  end
+end
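
To see what `Currency.clean_value` does to individual values, the same logic can be run outside the class. The sketch below copies `MATCH_EXPRESSION` and the method body into a hypothetical top-level helper, `clean_currency`; the sample inputs are made up for illustration.

```ruby
require "bigdecimal"

MATCH_EXPRESSION = /\A(\$|£)([0-9,\.]+)\z/

# Strips the currency symbol and thousands separators; returns nil for
# nil input or for text that does not look like a currency amount.
def clean_currency(value)
  return unless value
  value.strip.match(MATCH_EXPRESSION) do |match|
    BigDecimal(match[2].delete(","))
  end
end

p clean_currency("£38") == BigDecimal("38")              # => true
p clean_currency(" $1,234.50 ") == BigDecimal("1234.5")  # => true
p clean_currency("38 GBP")                               # => nil

# BigDecimal prints in scientific notation by default; "F" gives plain digits.
puts clean_currency("£38").to_s("F")                     # prints 38.0
```

The default scientific-notation output is why the README's console example displays a price of £38 as `0.38E2`.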

data/lib/incsv/types/date.rb ADDED

@@ -0,0 +1,9 @@
+module InCSV
+  module Types
+    class Date < ColumnType
+      def match?
+        value.strip.match(/\A[0-9]{4}-[0-9]{2}-[0-9]{2}\z/)
+      end
+    end
+  end
+end

data/lib/incsv/types/string.rb ADDED

@@ -0,0 +1,9 @@
+module InCSV
+  module Types
+    class String < ColumnType
+      def match?
+        true
+      end
+    end
+  end
+end

data/lib/incsv/types.rb ADDED

@@ -0,0 +1,5 @@
+require "incsv/column_type"
+require "incsv/types/date"
+require "incsv/types/currency"
+require "incsv/types/string"

data/lib/incsv/version.rb CHANGED

@@ -1,3 +1,3 @@
 module InCSV
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

data/lib/incsv.rb CHANGED

@@ -1,5 +1,6 @@
 require "incsv/version"
-module InCSV
-  # Your code goes here...
-end
+require "incsv/schema"
+require "incsv/types"
+require "incsv/column"
+require "incsv/database"

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: incsv
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Rob Miller
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-02-17 00:00:00.000000000 Z
+date: 2016-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -66,6 +66,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.19.1
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
 - !ruby/object:Gem::Dependency
   name: sqlite3
   requirement: !ruby/object:Gem::Requirement
@@ -98,7 +112,8 @@ description: Loads a CSV file into an SQLite database automatically, dropping yo
   into a Ruby shell that allows you to explore the data within.
 email:
 - rob@bigfish.co.uk
-executables: []
+executables:
+- incsv
 extensions: []
 extra_rdoc_files: []
 files:
@@ -112,8 +127,17 @@ files:
 - Rakefile
 - bin/console
 - bin/setup
+- exe/incsv
 - incsv.gemspec
 - lib/incsv.rb
+- lib/incsv/column.rb
+- lib/incsv/column_type.rb
+- lib/incsv/database.rb
+- lib/incsv/schema.rb
+- lib/incsv/types.rb
+- lib/incsv/types/currency.rb
+- lib/incsv/types/date.rb
+- lib/incsv/types/string.rb
 - lib/incsv/version.rb
 homepage: https://github.com/robmiller/incsv
 licenses: