incsv 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +122 -1
- data/exe/incsv +94 -0
- data/incsv.gemspec +1 -0
- data/lib/incsv/column.rb +25 -0
- data/lib/incsv/column_type.rb +30 -0
- data/lib/incsv/database.rb +89 -0
- data/lib/incsv/schema.rb +55 -0
- data/lib/incsv/types/currency.rb +23 -0
- data/lib/incsv/types/date.rb +9 -0
- data/lib/incsv/types/string.rb +9 -0
- data/lib/incsv/types.rb +5 -0
- data/lib/incsv/version.rb +1 -1
- data/lib/incsv.rb +4 -3
- metadata +27 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2a64c9c204b84b53e240994132a99746b64a3d8a
+  data.tar.gz: e302031882b7e9b9aac37108e79025c593062c31
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 22d2a9fb3bcfc0206b96af378d5b7dffdc0b79501c1eac19b9482451c691ca1e9bbdee925b22971e68bedf4cc33c64b4d2f26c46e13d089fe48f566100accb50
+  data.tar.gz: 270ca3f95c76700bc24f410ec90b39141d40fb683d32557b1fc0b72e5a99853fb22dc7fe5a81db335087e966d3c85dac8870a72325f34cbf01ad966d68f9dfcf
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -18,7 +18,128 @@ incsv can be installed via RubyGems:
 
 ## Usage
 
-
+### The quick version
+
+The following command will drop you into a [REPL][] prompt:
+
+    $ incsv console path/to/file.csv
+
+A Sequel connection to the database is stored in a variable called
+`@db`. The name of the table is based on the filename of the CSV; so, if
+your CSV file is called `products.csv`, then data will be imported into
+a database table called `products`.
+
+A quick example:
+
+    > @db[:products].select(:name).reverse_order(:price).take(5)
+    => [{:name=>"Makeshift battery"},
+        {:name=>"clothing iron"},
+        {:name=>"toy alien"},
+        {:name=>"enhanced targeting card"},
+        {:name=>"Giddyup Buttercup"}]
+
+[repl]: https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
+
+### The less-quick version
+
+To use incsv, you essentially just need to point it at a CSV file. It’ll
+then take care of parsing the CSV, figuring out the nature of the data
+within it, creating a database and a table, and importing the data.
+
+To perform all of these steps and be given an interactive console once
+they’re done, you can use the `console` command.
+
+Let’s imagine we have a CSV file that contains some product information:
+
+    $ head -3 products.csv
+    name,date_added,price
+    "Acid",2013-03-24,£38
+    "Abraxo cleaner",2016-09-25,£21
+
+Here we can see that we have three columns: the product name, which is
+just a string; the date the product was added, which is an
+ISO-8601–formatted date; and the price, which is a currency value in
+pounds.
+
+In my sample data there are 515 products (plus a header row):
+
+    $ wc -l products.csv
+    516
+
+In order to query this data, we can pass the CSV file to incsv:
+
+    $ incsv console products.csv
+    Found database at products.db
+    Connection is in @db
+
+    Primary table name is products
+    Columns: _incsv_id, name, date_added, price
+
+    First row:
+    _incsv_id, name, date_added, price
+    1, Acid, 2013-03-24, 0.38E2
+
+    Not sure what to do next? Try this:
+    @db[:products].count
+    >
+
+It tells us some information about the file, and about the assumptions
+it has made about the file. We can see that it’s imported the contents
+of the file into a table called `products`, and that it’s used the
+column names from the CSV to name the columns in the database table.
+
+It also shows us the first row, where you might have noticed that the
+price is in a slightly odd representation. That’s because incsv will
+look at what type of data seems to be stored in your CSV before
+importing it. In this case, it knows that the `date_added` column
+contains a date, and that the `price` column contains a currency value.
+In the former case, that means converting it into an actual SQL date. In
+the latter case, this means converting it to `BigDecimal` format (and
+storing it in the database as `DECIMAL(10, 2)`), so that we neither
+lose any precision by storing the value as a float, nor lose the ability
+to do numerical calculations by storing it as a string.
+
+It then suggests a query for us to run, which might generally be the
+first thing that you’d want to know about the dataset: how many values
+are there? We can run it and see:
+
+    > @db[:products].count
+    => 515
+
+Excellent! It’s imported every one of the products that were in the CSV.
+
+From this point on we can do any kind of analysis of the data that we
+like; we have all the power of SQLite and Sequel at our fingertips. For
+example, to get the number of products added each year:
+
+    > @db[:products].group_and_count{strftime("%Y", date_added).as(year)}.all
+    => [{:year=>"2013", :count=>132}, {:year=>"2014", :count=>123}, {:year=>"2015", :count=>131}, {:year=>"2016", :count=>129}]
+
+Or to get the total value of products added today:
+
+    > @db[:products].select{sum(price).as(total_cost)}.where(date_added: Date.today).first
+    => {:total_cost=>40}
+
+We can also do processing in Ruby, if there’s anything that’s difficult
+in pure SQL. Imagine wanting to convert the product names to
+URL-friendly “slugs”. This is pretty easy in Ruby. Let’s try it out on
+the top 10 most expensive products:
+
+    > @db[:products].select(:name).reverse_order(:price).limit(10).each do |product|
+    *   puts product[:name].gsub(/\s/, "-").squeeze("-").downcase.gsub(/[^a-z0-9\-]/, "")
+    * end
+    makeshift-battery
+    clothing-iron
+    toy-alien
+    enhanced-targeting-card
+    giddyup-buttercup
+    mole-rat-teeth
+    empty-teal-rounded-vase
+    pre-war-money
+    bowling-ball
+    toothbrush
+
+Hopefully this illustrates what you can do with incsv!
 
 ## Development
 
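Note on the README hunk above: the slug conversion it demonstrates in the console can be tried without a database. The following is a minimal sketch of the same chain of string calls; the method name `slugify` is a hypothetical helper introduced here for illustration and is not part of incsv.

```ruby
# Hypothetical helper (not part of incsv): wraps the chain of String
# calls shown in the README's slug example.
def slugify(name)
  name
    .gsub(/\s/, "-")           # replace each whitespace character with a hyphen
    .squeeze("-")              # collapse runs of hyphens into one
    .downcase                  # lower-case everything
    .gsub(/[^a-z0-9\-]/, "")   # drop anything that is not a-z, 0-9 or a hyphen
end

names = ["Makeshift battery", "Pre-war  money", "Empty teal rounded vase!"]
slugs = names.map { |name| slugify(name) }
puts slugs
```

Because the non-alphanumeric filter runs last, punctuation is removed after the hyphens are already in place, so a name made up entirely of punctuation would produce an empty slug.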
data/exe/incsv
ADDED
@@ -0,0 +1,94 @@
+#!/usr/bin/env ruby
+
+$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+
+require "thor"
+require "pry"
+
+require "incsv"
+
+module InCSV
+  class Console
+    def initialize(db)
+      @db = db
+    end
+
+    def get_binding
+      binding
+    end
+  end
+
+  class CLI < Thor
+    desc "create CSV_FILE", "Creates a database file with the appropriate schema for the given CSV file, but doesn't import any data."
+    method_option :force, type: :boolean, default: false
+    def create(csv_file)
+      database = Database.new(csv_file)
+
+      if database.exists? && database.table_created? && !options.force?
+        $stderr.puts "Database already exists."
+        exit 41
+      end
+
+      database.create_table
+      puts "Database created successfully in #{database.db_path}"
+    rescue StandardError => e
+      $stderr.puts "Database failed to create."
+      $stderr.puts "#{e.message}"
+      exit 40
+    end
+
+    desc "import CSV_FILE", "Creates a database file with the appropriate schema for the given CSV file, and then imports the data within the file."
+
+    method_option :force, type: :boolean, default: false
+    def import(csv_file)
+      database = Database.new(csv_file)
+      create(csv_file)
+      database.import
+
+      puts "Data imported."
+      puts
+      puts "Command to query:"
+      puts "$ sqlite3 #{database.db_path}"
+    rescue StandardError => e
+      $stderr.puts "Import failed."
+      $stderr.puts "#{e.message}"
+      exit 50
+    end
+
+    desc "console CSV_FILE", "Opens a query console for the given CSV file, creating a database file and importing the data if necessary."
+    def console(csv_file)
+      database = Database.new(csv_file)
+
+      unless database.table_created? && database.imported?
+        database.create_table
+        database.import
+      end
+
+      console = Console.new(database.db)
+
+      puts "Found database at #{database.db_path}"
+      puts "Connection is in @db"
+      puts
+      puts "Primary table name is #{database.table_name}"
+      puts "Columns: #{database.db[database.table_name].columns.join(", ")}"
+
+      first_row = database.db[database.table_name].first
+      puts
+      puts "First row:"
+      puts first_row.keys.join(", ")
+      puts first_row.values.join(", ")
+
+      puts
+      puts "Not sure what to do next? Try this:"
+      puts "@db[:#{database.table_name}].count"
+
+      console.get_binding.pry(quiet: true, prompt: [proc { "> " }, proc { "* " }])
+    rescue StandardError => e
+      $stderr.puts "Failed to start console."
+      $stderr.puts "#{e.message}"
+      exit 60
+    end
+  end
+end
+
+InCSV::CLI.start
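Note on `exe/incsv`: the `Console` class exists so that the REPL session starts inside an object whose `@db` instance variable holds the connection. The sketch below illustrates that mechanism with Ruby's built-in `Binding#eval` standing in for Pry, and a plain Hash standing in for the Sequel connection; both substitutions are assumptions made so the example runs without any gems.

```ruby
# Same shape as InCSV::Console in the diff above.
class Console
  def initialize(db)
    @db = db
  end

  # A Binding captures self, so code evaluated against it sees @db.
  def get_binding
    binding
  end
end

# Stand-in for the Sequel connection (assumption: a Hash, for illustration only).
fake_db = { products: [{ name: "Acid" }] }

session = Console.new(fake_db).get_binding

# Binding#eval plays the role that `.pry` plays in the real executable.
result = session.eval("@db[:products].first[:name]")
puts result
```

This is why the README can refer to the connection simply as `@db`: anything typed at the prompt is evaluated with the `Console` instance as `self`.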
data/incsv.gemspec
CHANGED
@@ -24,6 +24,7 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "rspec", "~> 3.0"
 
   spec.add_runtime_dependency "thor", "~> 0.19.1"
+  spec.add_runtime_dependency "pry", "~> 0.10"
   spec.add_runtime_dependency "sqlite3", "~> 1.3"
   spec.add_runtime_dependency "sequel", "~> 4.31"
 end
data/lib/incsv/column.rb
ADDED
@@ -0,0 +1,25 @@
+require "bigdecimal"
+
+module InCSV
+  class Column
+    def initialize(name, values)
+      @name = name
+      @values = values
+    end
+
+    attr_reader :name
+
+    def type
+      Types.constants.select do |column_type|
+        column_type = Types.const_get(column_type)
+        if values.all? { |value| value.nil? || column_type.new(value).match? }
+          return column_type
+        end
+      end
+    end
+
+    private
+
+    attr_accessor :values
+  end
+end
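Note on `Column#type`: it returns the first type whose matcher accepts every non-nil sample value. A minimal sketch of that rule follows. The `MATCHERS` table, the date pattern, and the catch-all string matcher are assumptions for illustration, since `types/date.rb` and `types/string.rb` are not shown in this diff; only the currency pattern is copied from `types/currency.rb`.

```ruby
# Hypothetical matcher table; the order matters because the first match wins.
MATCHERS = {
  currency: ->(v) { v.strip.match?(/\A(\$|£)([0-9,\.]+)\z/) }, # pattern from currency.rb
  date:     ->(v) { v.strip.match?(/\A\d{4}-\d{2}-\d{2}\z/) }, # assumed ISO 8601 date check
  string:   ->(_v) { true }                                    # assumed catch-all fallback
}.freeze

# Returns the name of the first type that matches every non-nil value.
def guess_type(values)
  MATCHERS.each do |name, matcher|
    return name if values.all? { |v| v.nil? || matcher.call(v) }
  end
  nil
end

p guess_type(["£38", "£21", nil])
p guess_type(["2013-03-24", "2016-09-25"])
p guess_type(["Acid", "Abraxo cleaner"])
```

Nil values are skipped rather than counted as mismatches, so a column with blank cells can still be detected as a currency or date column.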
data/lib/incsv/column_type.rb
ADDED
@@ -0,0 +1,30 @@
+module InCSV
+  class ColumnType
+    def self.name
+      self.to_s.sub(/.*::/, "").downcase.to_sym
+    end
+
+    def self.for_database
+      self.to_s.sub(/.*::/, "").downcase.to_sym
+    end
+
+    def initialize(value)
+      @value = value
+    end
+
+    def match?
+      false
+    end
+
+    def clean_value
+      self.class.clean_value(@value)
+    end
+
+    def self.clean_value(value)
+      value
+    end
+
+    private
+    attr_reader :value
+  end
+end
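Note on `ColumnType.name` and `ColumnType.for_database`: both derive a symbol from the class name by stripping the namespace and lower-casing the rest. A small sketch of that derivation, using a hypothetical `Demo` namespace and a hypothetical `short_name` helper so that it does not depend on incsv being loaded:

```ruby
# Hypothetical namespace, used only to give the class a qualified name.
module Demo
  module Types
    class Currency; end
  end
end

# Same expression as in ColumnType.name: the greedy /.*::/ removes
# everything up to and including the last "::".
def short_name(klass)
  klass.to_s.sub(/.*::/, "").downcase.to_sym
end

p short_name(Demo::Types::Currency)
```

Subclasses such as `Currency` override `for_database` when the default symbol is not a suitable column type.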
data/lib/incsv/database.rb
ADDED
@@ -0,0 +1,89 @@
+require "sequel"
+
+require "pathname"
+
+module InCSV
+  class Database
+    def initialize(csv)
+      @csv = csv
+
+      @db = Sequel.sqlite(db_path)
+      # require "logger"
+      # @db.loggers << Logger.new($stdout)
+    end
+
+    attr_reader :db
+
+    def table_created?
+      @db.table_exists?(table_name)
+    end
+
+    def imported?
+      table_created? && @db[table_name].count > 0
+    end
+
+    def exists?
+      File.exist?(db_path)
+    end
+
+    def db_path
+      path = Pathname(csv)
+      (path.dirname + (path.basename(".csv").to_s + ".db")).to_s
+    end
+
+    def table_name
+      @table_name ||= begin
+        File.basename(csv, ".csv").downcase.gsub(/[^a-z_]/, "").to_sym
+      end
+    end
+
+    def create_table
+      @db.create_table!(table_name) do
+        primary_key :_incsv_id
+      end
+
+      schema.columns.each do |c|
+        @db.alter_table(table_name) do
+          add_column c.name, c.type.for_database
+        end
+      end
+    end
+
+    def import
+      return if imported?
+
+      create_table unless table_created?
+
+      columns = schema.columns
+      column_names = columns.map(&:name)
+
+      chunks(200) do |chunk|
+        rows = chunk.map do |row|
+          row.to_hash.values.each_with_index.map do |column, n|
+            columns[n].type.clean_value(column)
+          end
+        end
+
+        @db[table_name].import(column_names, rows)
+      end
+    end
+
+    private
+
+    attr_reader :csv
+
+    def schema
+      @schema ||= Schema.new(csv)
+    end
+
+    def chunks(size = 200, &block)
+      data =
+        File.read(csv)
+        .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
+
+      csv = CSV.new(data, headers: true)
+      csv.each_slice(size, &block)
+      csv.close
+    end
+  end
+end
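Note on `Database#db_path` and `Database#table_name`: both are derived purely from the CSV path, so their behaviour can be checked without SQLite. The sketch below lifts the two expressions into standalone methods; the names `db_path_for` and `table_name_for` are hypothetical and exist only for this illustration.

```ruby
require "pathname"

# Same expression as Database#db_path: swap the .csv extension for .db,
# keeping the file in the same directory.
def db_path_for(csv)
  path = Pathname(csv)
  (path.dirname + (path.basename(".csv").to_s + ".db")).to_s
end

# Same expression as Database#table_name: lower-case the base name and
# keep only the letters a-z and underscores.
def table_name_for(csv)
  File.basename(csv, ".csv").downcase.gsub(/[^a-z_]/, "").to_sym
end

puts db_path_for("data/products.csv")
p table_name_for("path/to/Products.csv")
p table_name_for("sales_2016.csv")
```

One consequence of the character filter is that digits are removed from table names, so `sales_2016.csv` and `sales_2017.csv` would map to the same table name while still using separate database files.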
data/lib/incsv/schema.rb
ADDED
@@ -0,0 +1,55 @@
+require "csv"
+
+module InCSV
+  class Schema
+    def initialize(csv)
+      @csv = csv
+    end
+
+    def columns
+      @columns ||= parsed_columns
+    end
+
+    private
+
+    attr_reader :csv
+
+    def parsed_columns
+      samples(50).map do |name, values|
+        Column.new(name, values)
+      end
+    end
+
+    # Returns the first `num_rows` rows of data, transposed into a hash.
+    #
+    # For example, the following CSV data:
+    #
+    #     foo,bar
+    #     1,2
+    #     3,4
+    #
+    # Would become:
+    #
+    #     { "foo" => [1, 3], "bar" => [2, 4] }
+    #
+    # This gives us enough data to be able to guess the type of
+    # a column.
+    def samples(num_rows)
+      data =
+        File.read(csv)
+        .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
+
+      csv = CSV.new(data, headers: true)
+      sample_data = csv.each.take(num_rows)
+      csv.close
+
+      sample_data.map(&:to_a).flatten(1).each_with_object({}) do |row, data|
+        column = row[0]
+        value = row[1]
+
+        data[column] ||= []
+        data[column] << value
+      end
+    end
+  end
+end
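Note on `Schema#samples`: the transposition described in its comment can be reproduced with the standard `csv` library alone. This is a minimal sketch that reads from a string instead of a file (an assumption made to keep it self-contained); note that the values come back as strings, since the CSV parser does not convert fields unless asked to.

```ruby
require "csv"

# Transposes the first num_rows data rows into { header => [values...] }.
def samples(data, num_rows)
  rows = CSV.parse(data, headers: true).first(num_rows)

  rows.each_with_object({}) do |row, columns|
    # CSV::Row#each yields [header, field] pairs.
    row.each do |header, value|
      (columns[header] ||= []) << value
    end
  end
end

csv_text = "foo,bar\n1,2\n3,4\n"
p samples(csv_text, 50)
```

Each resulting array is what `Column.new(name, values)` receives, which is why type detection only ever sees the first 50 rows of a file.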
data/lib/incsv/types/currency.rb
ADDED
@@ -0,0 +1,23 @@
+module InCSV
+  module Types
+    class Currency < ColumnType
+      MATCH_EXPRESSION = /\A(\$|£)([0-9,\.]+)\z/
+
+      def self.for_database
+        "DECIMAL(10,2)"
+      end
+
+      def match?
+        value.strip.match(MATCH_EXPRESSION)
+      end
+
+      def self.clean_value(value)
+        return unless value
+
+        value.strip.match(MATCH_EXPRESSION) do |match|
+          BigDecimal(match[2].delete(","))
+        end
+      end
+    end
+  end
+end
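Note on `Currency.clean_value`: it strips the currency symbol and thousands separators and returns a `BigDecimal`, which is what the README means by not losing precision to floats. The sketch below copies the pattern from the diff into a standalone method; the name `clean_currency` is hypothetical, and the float comparison is included only to illustrate the motivation.

```ruby
require "bigdecimal"

MATCH_EXPRESSION = /\A(\$|£)([0-9,\.]+)\z/ # same pattern as in currency.rb

# Returns a BigDecimal for values like "£1,234.50", or nil when the
# value is nil or does not look like a currency amount.
def clean_currency(value)
  return unless value

  value.strip.match(MATCH_EXPRESSION) do |match|
    BigDecimal(match[2].delete(","))
  end
end

p clean_currency("£1,234.50") == BigDecimal("1234.5")
p clean_currency("38").nil?

# Binary floats cannot represent 0.1 exactly; BigDecimal can.
p (0.1 + 0.2) == 0.3
p (BigDecimal("0.1") + BigDecimal("0.2")) == BigDecimal("0.3")
```

The character class in the pattern also accepts malformed amounts with more than one decimal point, which `BigDecimal()` would reject with an error, so a stricter pattern may be worth considering.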
data/lib/incsv/types.rb
ADDED
data/lib/incsv/version.rb
CHANGED
data/lib/incsv.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: incsv
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Rob Miller
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-02-
+date: 2016-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -66,6 +66,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.19.1
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
 - !ruby/object:Gem::Dependency
   name: sqlite3
   requirement: !ruby/object:Gem::Requirement
@@ -98,7 +112,8 @@ description: Loads a CSV file into an SQLite database automatically, dropping yo
   into a Ruby shell that allows you to explore the data within.
 email:
 - rob@bigfish.co.uk
-executables:
+executables:
+- incsv
 extensions: []
 extra_rdoc_files: []
 files:
@@ -112,8 +127,17 @@ files:
 - Rakefile
 - bin/console
 - bin/setup
+- exe/incsv
 - incsv.gemspec
 - lib/incsv.rb
+- lib/incsv/column.rb
+- lib/incsv/column_type.rb
+- lib/incsv/database.rb
+- lib/incsv/schema.rb
+- lib/incsv/types.rb
+- lib/incsv/types/currency.rb
+- lib/incsv/types/date.rb
+- lib/incsv/types/string.rb
 - lib/incsv/version.rb
 homepage: https://github.com/robmiller/incsv
 licenses: