RubyGems - datamancer - Versions diffs - 0.0.2 → 0.0.3 - Mend

datamancer 0.0.2 → 0.0.3


Files changed (12)

checksums.yaml +4 -4
data/README.md +156 -3
data/lib/datamancer/extract.rb +15 -4
data/lib/datamancer/transform.rb +72 -10
data/lib/datamancer/version.rb +1 -1
data/spec/data/destination.sqlite3 +0 -0
data/spec/extract_spec.rb +16 -4
data/spec/load_spec.rb +4 -2
data/spec/transform_spec.rb +39 -11
metadata +2 -6
data/spec/data/left_source.csv +0 -5
data/spec/data/right_source.csv +0 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 5a09ebee71e5c2b54c0b790ae4dc2165e4e9e45a
-  data.tar.gz: eabae34cb0d1b34ce072796f3b0edab7117325e2
+  metadata.gz: 7c61e73995d0dfa4c372c1142482044a3a55ceec
+  data.tar.gz: 71968fed5543e5653ebf74e4b1f48f7e4576d5ef
 SHA512:
-  metadata.gz: 84af02d3f38e7e861086bf268eb6670ef8528f6b70035d2bba6f3098b99dd897b29b727ae1037ba80abad412ff34ae543eec4e8c57ed71f4a74a939f35be4b15
-  data.tar.gz: 4c88628302bca9be9089ee2a6c1be2991c06201d66ce91bd88a62247b6ad45b64ff0cc90403dce9c1ebd0dcabb6fb807be58f52133394c0f57e5771473d558c2
+  metadata.gz: 4657f9b28e44d82e6ceff39cc9d59f10964255138c2542e3d828a3fa5201858a1932eedc0b55918abcfd9e469591527bfe1c42f10b88a7629b0faa10965b5615
+  data.tar.gz: 382575e0b05be66f8f23b0c549895bc82e57a38575605bebf4cf196053f03d78b8ea9e52b7a4ff9f457ac8cb4917ee076178984ca315ad5b87e5635953f1f35a

data/README.md CHANGED

@@ -2,9 +2,11 @@
 **The Datamancer** is a form of sorcerer whose spells deal with data manipulation between databases.
-Data targets (sources and destinations) can be **databases** supported by ActiveRecord and **CSV files**. Multiple targets can be present in a single ETL process.
+Sources and destinations can be **databases** supported by ActiveRecord and **CSV files**. Multiple sources and destinations can be present in a single ETL process.
-Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+To optimize, Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+It is tested to run over JRuby as well.
 ## Installation
@@ -22,7 +24,158 @@ Or install it yourself as:
 ## Usage
-*Please see the specs, for now.*
+### Datastreams
+A **datastream** is a table with columns and rows made with an array of hashes.
+Each element of the array (a hash) represents a row,
+and each pair of a hash, an attribute (a column) and its value.
+```ruby
+people = [
+  {name: 'Foo', age: 27},
+  {name: 'Bar', age: 42}
+]
+```
+### Extract
+To get a datastream from a data source use the extract() method, like this:
+```ruby
+people = extract(from: 'people.csv')
+```
+If 'from' value is a string, extract will suppose that the source is a CSV file.
+One option that can be used along this kind of source is 'separator', which value
+will be used to interpret the file (defaults to comma).
+Similarly, if 'from' attribute is a hash, it will be given to ActiveRecord to establish
+a connection with the specified database and table. In case that the table is not indicated
+there, the 'table' option can provide it.
+Next, an example using a YAML file listing databases and showing further capacities of extract().
+*databases.yml*
+```yaml
+warehouse:
+  adapter: postgresql
+  host: 192.168.0.1
+  database: Warehouse
+  username: foo
+  password: 1234
+```
+```ruby
+databases = YAML.load_file('databases.yml')
+people = extract(from: databases['main'], table: 'people') do
+  field :name
+  field :age
+end
+```
+extract() can take a block in order to manipulate data being extracted. All table or file columns
+are brung by default; to get the opposite effect use 'exclude: true'.
+Inside the block, field() is used to refer to columns. Without more arguments than columns' name
+(as above) it performs no operation, but if columns are being excluded through the aforementioned
+option, field() includes the referred columns. Both string and symbol are valid for column names.
+To summarize, extract() options are:
+* from: (hash or string)
+* table: (string)
+* exclude: (boolean)
+* separator: (string)
+And now we introduce field() options under extraction context:
+* reject_if: (value or array of values) — Rejects row if condition is meet.
+* reject_unless: (value or array of values) — Rejects row unless condition is meet.
+* map: (string or symbol) — Alias column name in database.
+* type: (class) — Casts data type on field.
+* default: (value) — Set field to default value if nil.
+* type_default: (class) — Casts data type even if field's value is nil. This way a nil
+can be converted into a zero or an empty string.
+* empty_default: (string) — Set field to default value if nil or empty.
+* strip: (boolean) — Right and left strips the value to remove extra spaces in the string.
+In addition to these options field() also takes blocks!
+All together now:
+```ruby
+people = extract(from: databases['main'], table: 'people') do
+  field 'name', map: 'Name', reject_if: nil
+  field 'age', map: 'Age', type: Integer
+end
+```
+### Transform
+Method options:
+* exclude
+* join
+* on
+* unique
+Field options: Not allowed.
+### Load
+Method options:
+* to
+* table
+* exclude
+* append
+* batch
+Field options:
+* map
+## Example
+```ruby
+require 'bundler/setup'
+require 'datamancer'
+require 'active_record'
+require 'csv'
+include Datamancer
+bases = YAML.load_file('/home/matias/proyectos/panel/bases_de_datos.yml')
+países_ISO =
+extract from: 'country-list/country/cldr/es_AR/country.csv' do
+  field :iso
+  field :nombre, map: 'name'
+end
+países_UN =
+extract from: 'countries/countries.csv', separator: ';', exclude: true do
+  field :iso, map: 'cca2'
+  field :número, map: 'ccn3', type: Integer
+end
+países =
+transform países_ISO, join: países_UN, on: :iso
+load países, to: bases['panel'], table: 'lk_com_pais', append: true do
+  field :número, map: 'id_com_pais'
+  field :iso, map: 'cd_com_pais'
+  field :nombre, map: 'ds_com_pais'
+end
+```
+## Future features
+* Batch mode
+* Error monitor
+* Control files
 ## Contributing
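
Reviewer note on the README hunk above (a sketch, not part of the gem): the `databases.yml` shown there defines only a `warehouse` entry, while both `extract()` examples look up `databases['main']`. Assuming the file is exactly as printed, that lookup yields `nil`. The snippet below inlines the same YAML as a string (the credentials are the README's own placeholders) to show the difference; the variable names are illustrative.

```ruby
require 'yaml'

# The README's databases.yml, inlined as a string for illustration only.
yaml_text = <<~YAML
  warehouse:
    adapter: postgresql
    host: 192.168.0.1
    database: Warehouse
    username: foo
    password: 1234
YAML

databases = YAML.load(yaml_text)

missing   = databases['main']       # no such key in the file as shown
warehouse = databases['warehouse']  # the only entry the file defines

puts missing.inspect       # nil
puts warehouse['adapter']  # postgresql
```

If the README's YAML is taken as authoritative, the examples would presumably need `databases['warehouse']`; if the code is authoritative, the YAML would need a `main` entry. The diff does not say which was intended.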

data/lib/datamancer/extract.rb CHANGED

@@ -43,6 +43,7 @@ module Datamancer
     define_singleton_method :field do |name, actions = {}, &block|
       actions[:type] ||= actions[:type_default]
+      actions[:default] ||= actions[:empty_default]
       actions = default_actions.merge(actions)
       mapping = actions[:map] || name.to_s
@@ -171,11 +172,21 @@ module Datamancer
     # TODO: Test this. Test to not reject nil by default.
-    if actions[:reject_if] == value ||
-      (actions[:reject_unless] != :ñil &&
-       actions[:reject_unless] != value)
+    if actions[:reject_if].is_a? Array
+      actions[:reject_if].each do |reject_value|
+        value = :reject if reject_value == value
+      end
+    else
+      value = :reject if actions[:reject_if] == value
+    end
-      value = :reject
+    if actions[:reject_unless].is_a? Array
+      actions[:reject_unless].each do |reject_value|
+        value = :reject if reject_value != value
+      end
+    else
+      value = :reject if actions[:reject_unless] != :ñil &&
+                         actions[:reject_unless] != value
     end
     value
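
Reviewer note on the rejection hunk above (a sketch under stated assumptions, not the gem's code): the new branches can be restated as a standalone method to see how the array forms behave. `apply_rejection` is a hypothetical name, and `:not_set` stands in for the gem's "option not given" sentinel; the default for `reject_if` is not visible in this diff, so treating it as unset is an assumption.

```ruby
# Hypothetical standalone restatement of the 0.0.3 rejection logic.
# :not_set is an assumed stand-in for the gem's internal sentinel.
def apply_rejection(value, reject_if: :not_set, reject_unless: :not_set)
  if reject_if.is_a?(Array)
    # Reject when the value equals any listed element.
    reject_if.each { |candidate| value = :reject if candidate == value }
  else
    value = :reject if reject_if == value
  end

  if reject_unless.is_a?(Array)
    # As written in the hunk: reject when the value differs from ANY element.
    reject_unless.each { |candidate| value = :reject if candidate != value }
  else
    value = :reject if reject_unless != :not_set && reject_unless != value
  end

  value
end

kept_by_if       = apply_rejection('z', reject_if: %w[x y])
dropped_by_if    = apply_rejection('x', reject_if: %w[x y])
kept_by_scalar   = apply_rejection(1, reject_unless: 1)
dropped_by_array = apply_rejection(1, reject_unless: [1, 2])
```

Under this restatement, `reject_unless` with an array of two or more distinct values rejects every row, because any value differs from at least one element. If the intent was an allow-list, a membership check such as `Array#include?` would be one way to express it, but the diff itself gives no indication of the intended semantics.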

data/lib/datamancer/transform.rb CHANGED

@@ -33,14 +33,40 @@ module Datamancer
     output
   end
+  def where input, attributes
+  end
+  def unique input, attribute
+    attribute = attribute.to_sym
+    output = Array.new
+    unique_values = Array.new
+    input.each do |row|
+      unless unique_values.include?(row[attribute])
+        output << row
+        unique_values << row[attribute]
+      end
+    end
+    output
+  end
   def transform input, args = {}
     if args[:join]
       raise ArgumentError unless args[:on]
+      raise ArgumentError unless input.first.keys.include?(args[:on].to_sym)
+      raise ArgumentError unless args[:join].first.keys.include?(args[:on].to_sym)
       input = join input, args[:join], args[:on]
     end
+    if args[:unique]
+      input = unique input, args[:unique]
+    end
     # TODO: Method-overriding safeguard.
     input.first.each_key do |key|
@@ -55,43 +81,79 @@ module Datamancer
       end
     end
+    define_singleton_method :row_number do
+      @row_number
+    end
+    define_singleton_method :row do
+      @supplementary_row
+    end
+    define_singleton_method :count do
+      @count += 1
+    end
+    define_singleton_method :output do
+      @output
+    end
+    define_singleton_method :switch do |slot|
+      @slot = slot
+    end
     define_singleton_method :field do |name, value = nil, *args|
       raise MissingField,
-        "Required field '#{name}' was not found" unless respond_to?(name)
+        "Required field '#{name}' was not found" unless @input_row.include?(name.to_sym)
       @output_row[name.to_sym] = if value.is_a?(Symbol)
-                                   send(name).send *args.unshift(value)
+                                   send(name.downcase).send *args.unshift(value)
                                  else
-                                   value || send(name)
+                                   value || send(name.downcase)
                                  end
     end
     define_singleton_method :del_field do |name|
       raise MissingField,
-        "Filtered field '#{name}' was not found" unless respond_to?(name)
+        "Filtered field '#{name}' was not found" unless @input_row.include?(name.to_sym)
       @output_row.delete(name.to_sym)
     end
     define_singleton_method :new_field do |name, value|
       raise ExistingField,
-        "New field '#{name}' already exists" if respond_to?(name)
+        "New field '#{name}' already exists" if @input_row.include?(name.to_sym)
       @output_row[name.to_sym] = value
     end
-    input.map do |row|
+    # TODO: Test for count.
+    @count = 0
+    # TODO: Test for slots.
+    @output = Hash.new { |h, k| h[k] = [] }
+    input.each_with_index do |row, row_number|
+      # TODO: Test for row_number.
+      # TODO: Test for (supplementary) row.
+      @row_number = row_number
       @input_row = row
+      @supplementary_row = @input_row.dup
       @output_row = args[:exclude] ? {} : @input_row.dup
       yield if block_given?
-      @output_row
+      @output[@slot] << @output_row
     end
+    @output.length == 1? @output[nil] : @output
   end
   def aggregate input
     define_singleton_method :dim do |name|
       name = name.to_sym
       @dimensions[name] = @row[name]
@@ -104,7 +166,7 @@ module Datamancer
     aggregated_input = Hash.new { |hash, key| hash[key] = Hash.new }
-    input.each do |row|
+    input.each_with_index do |row, row_number|
       @row = row
       @dimensions = {}
       @facts = {}
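
Reviewer note on the slot handling added to `transform` above (a simplified sketch, not the gem's implementation): rows are appended to `@output[@slot]`, and the method returns `@output[nil]` when the hash has exactly one key, otherwise the whole hash. The hypothetical `route_rows` below reproduces only that collection and return step; the block's return value plays the role of `switch`.

```ruby
# Hypothetical reduction of the slot logic: the block returns the slot
# for each row (standing in for calls to `switch` inside transform).
def route_rows(rows)
  output = Hash.new { |hash, key| hash[key] = [] }
  slot = nil
  rows.each do |row|
    slot = yield(row) if block_given?
    output[slot] << row
  end
  # Same return expression as the hunk, with local variables.
  output.length == 1 ? output[nil] : output
end

rows = [{ n: 1 }, { n: 2 }]

unslotted   = route_rows(rows)
by_parity   = route_rows(rows) { |row| row[:n].odd? ? :odd : :even }
empty_input = route_rows([])
single_slot = route_rows(rows) { :only }
```

In this reduction, an empty input returns an empty hash rather than an empty array, and routing every row to one named slot returns an empty array, since `output[nil]` was never filled. Whether the gem behaves the same way in those cases depends on code outside this hunk; the TODO comments in the diff note that slots are not yet tested.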

data/lib/datamancer/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Datamancer
-  VERSION = "0.0.2"
+  VERSION = '0.0.3'
 end

data/spec/data/destination.sqlite3 CHANGED

Binary file

data/spec/extract_spec.rb CHANGED

@@ -1,5 +1,16 @@
 require 'spec_helper'
+# TODO: Some wanted features
+# allow table_name to be both string or symbol
+# database.yml autoloading
+# autoincremental id
+# transform!
+# reject by array and regex
+# existent actions check
+# TODO: case insentivity
 describe Datamancer do
   context 'EXTRACT spell' do
@@ -60,10 +71,11 @@ describe Datamancer do
           }.to raise_error(ArgumentError, 'Extract requires a source, i.e. extract(from: source)')
         end
         it 'raises an exception if a required field is missing' do
-          # TODO: Better explanation for this error.
+          # TODO: Better explanation for this error. Assure that table appears
+          # even though it has been specified with :table option (outside :from).
           # source = case source_type
           # when 'CSV file' then "#{@source} file"
@@ -73,11 +85,11 @@ describe Datamancer do
           expect {
             extract(from: @source) do
-              field :surname
+              field :agE
             end
           }.to raise_error(MissingField,
-            "Required field 'surname' was not found in '#{@source}'")
+            "Required field 'agE' was not found in '#{@source}'")
         end

data/spec/load_spec.rb CHANGED

@@ -80,11 +80,13 @@ describe Datamancer do
           expect {
             load(@data, to: @destination) do
-              field :surname
+              field :agE
             end
+            # TODO: Check for table.
           }.to raise_error(MissingField,
-            "Required field 'surname' was not found in '#{@destination}'")
+            "Required field 'agE' was not found in '#{@destination}'")
         end

data/spec/transform_spec.rb CHANGED

@@ -55,20 +55,20 @@ describe Datamancer do
       expect {
         transform(@data) do
-          field :surname
+          field :agE
         end
       }.to raise_error(MissingField,
-        "Required field 'surname' was not found")
+        "Required field 'agE' was not found")
       expect {
         transform(@data) do
-          del_field :surname
+          del_field :agE
         end
       }.to raise_error(MissingField,
-        "Filtered field 'surname' was not found")
+        "Filtered field 'agE' was not found")
     end
@@ -149,11 +149,35 @@ describe Datamancer do
     end
+    it 'drops duplicated rows' do
+      duplicated_data = @data + @data
+      expect(
+        transform(duplicated_data, unique: :name)
+      ).to eq(@data)
+      expect(
+        transform(duplicated_data, unique: :name) do
+          field :name
+        end
+      ).to eq(@data)
+    end
     context 'combines records by' do
       before(:all) do
-        @left_data = extract from: $dir + '/left_source.csv'
-        @right_data = extract from: $dir + '/right_source.csv'
+        @left_data =
+        [{name: 'Foo', some_id: 1},
+         {name: 'Bar', some_id: 2},
+         {name: 'Baz', some_id: 2},
+         {name: 'Foobar', some_id: nil}]
+        @right_data =
+        [{age: 0, some_id: nil},
+         {age: 27, some_id: 1},
+         {age: 33, some_id: 1},
+         {age: 42, some_id: 2}]
       end
@@ -164,16 +188,20 @@ describe Datamancer do
         expect {
           transform(@left_data, join: @right_data)
         }.to raise_error(ArgumentError)
+        expect {
+          transform(@left_data, join: @right_data, on: 'some_ID')
+        }.to raise_error(ArgumentError)
         expect(
           transform(@left_data, join: @right_data, on: 'some_id') do
             del_field :some_id
-            new_field :namage, name.downcase + age
+            new_field :namage, name.downcase + age.to_s
           end
-        ).to eq([{name: 'Foo', age: '27', namage: 'foo27'},
-                 {name: 'Foo', age: '33', namage: 'foo33'},
-                 {name: 'Bar', age: '42', namage: 'bar42'},
-                 {name: 'Baz', age: '42', namage: 'baz42'}])
+        ).to eq([{name: 'Foo', age: 27, namage: 'foo27'},
+                 {name: 'Foo', age: 33, namage: 'foo33'},
+                 {name: 'Bar', age: 42, namage: 'bar42'},
+                 {name: 'Baz', age: 42, namage: 'baz42'}])
       end
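
Reviewer note on the 'drops duplicated rows' spec above (a sketch, not part of the gem): the `unique` method added in transform.rb keeps the first row seen for each value of the given attribute. The snippet restates that loop as a standalone method and compares it with Ruby's built-in `Array#uniq` with a block. The sample rows are borrowed from the README's datastream example, since the spec's `@data` is not shown in this diff.

```ruby
# The loop from the transform.rb hunk, restated as a standalone method.
def unique(input, attribute)
  attribute = attribute.to_sym
  output = []
  unique_values = []
  input.each do |row|
    unless unique_values.include?(row[attribute])
      output << row
      unique_values << row[attribute]
    end
  end
  output
end

# Sample datastream (an assumption; taken from the README example).
data = [{ name: 'Foo', age: 27 }, { name: 'Bar', age: 42 }]
duplicated = data + data

first_seen = unique(duplicated, :name)
with_uniq  = duplicated.uniq { |row| row[:name] }
```

For data like this the two approaches agree. They are not strictly interchangeable: the loop compares with `==` through `Array#include?`, while `uniq` compares by `hash` and `eql?`, so values such as `1` and `1.0` would be treated as duplicates by the loop but not by `uniq`.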

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: datamancer
 version: !ruby/object:Gem::Version
-  version: 0.0.2
+  version: 0.0.3
 platform: ruby
 authors:
 - Matías Battocchia
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-11-22 00:00:00.000000000 Z
+date: 2013-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -114,8 +114,6 @@ files:
 - lib/datamancer/version.rb
 - spec/data/destination.csv
 - spec/data/destination.sqlite3
-- spec/data/left_source.csv
-- spec/data/right_source.csv
 - spec/data/simple.yml
 - spec/data/source.csv
 - spec/data/source.sqlite3
@@ -151,8 +149,6 @@ summary: ''
 test_files:
 - spec/data/destination.csv
 - spec/data/destination.sqlite3
-- spec/data/left_source.csv
-- spec/data/right_source.csv
 - spec/data/simple.yml
 - spec/data/source.csv
 - spec/data/source.sqlite3

data/spec/data/left_source.csv DELETED

@@ -1,5 +0,0 @@
-name,some_id
-"Foo",1
-"Bar",2
-"Baz",2
-"Foobar",

data/spec/data/right_source.csv DELETED

@@ -1,5 +0,0 @@
-age,some_id
-0,
-27,1
-33,1
-42,2