RubyGems - csv-diff - Versions diffs - 0.3.1 → 0.3.3 - Mend

csv-diff 0.3.1 → 0.3.3

Files changed (9)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 877e2055af0b258a897c546c7d053d95f3ce3a75
+  data.tar.gz: 6c7f2c56091beff466c09d221206e3569ca2c912
+SHA512:
+  metadata.gz: 8074e3349c0017154222c243c8f995b4354a656cd913b1786b632103c5224ace09f8b75180e497f441c140ead9ae96f700fc8e918805cd880045b0fd335f1e58
+  data.tar.gz: 295ca04fb380e30b22520ea7c7f6d28fbe7eeb4324cf8c69f476c576572c1b99f1269101b2fe46aa9a2c19a5d9584a974fe0a7254e042961fc03750b1c8b9762

data/LICENSE CHANGED

@@ -1,4 +1,4 @@
-Copyright (c) 2013, Adam Gardiner
+Copyright (c) 2013-2016, Adam Gardiner
 All rights reserved.
 Redistribution and use in source and binary forms, with or without

data/README.md CHANGED

@@ -125,7 +125,7 @@ column in the data. In this case, a diff can be created simply via:
 diff = CSVDiff.new(file1, file2)
 ```
-### Specifynig Unique Row Identifiers
+### Specifying Unique Row Identifiers
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
@@ -211,6 +211,23 @@ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam'
                    ignore_fields: ['CreatedAt', 'UpdatedAt'])
 ```
+### Filtering Rows
+If you need to filter source data before running the diff process, you can use the :include
+and :exclude options to do so. Both options take a Hash as their value; the hash should have
+keys that are the field names or indexes (0-based) on which to filter, and whose values are
+regular expressions or lambdas to be applied to values of the corresponding field. Rows will
+only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
+```ruby
+# Generate a diff of Arsenal home games not refereed by Clattenburg
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
+# Generate a diff of games played over the Xmas/New Year period
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
+```
 ### Ignoring Certain Changes
 CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
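
The filtering rule described in the README hunk above (a row is diffed only if it satisfies every :include condition and no :exclude condition) can be sketched without the gem. This is a minimal standalone illustration, not the library's code: `matches?`, `keep_row?`, the sample rows, and the `holiday_period` range are all hypothetical names and made-up data.

```ruby
require 'date'

# Hypothetical helper, not part of csv-diff: applies one filter to one value.
def matches?(filter, value)
  case filter
  when String then filter == value
  when Regexp then !filter.match(value.to_s).nil?
  when Proc   then filter.call(value) ? true : false
  else raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
  end
end

# A row is kept only if it satisfies every include condition
# and does not satisfy any exclude condition.
def keep_row?(row, includes: {}, excludes: {})
  includes.all? { |field, f| matches?(f, row[field]) } &&
    excludes.none? { |field, f| matches?(f, row[field]) }
end

# Made-up sample rows, keyed by field name.
rows = [
  { 'Date' => '16/12/26', 'HomeTeam' => 'Arsenal', 'AwayTeam' => 'Team B', 'Referee' => 'A Smith' },
  { 'Date' => '16/11/06', 'HomeTeam' => 'Arsenal', 'AwayTeam' => 'Team C', 'Referee' => 'M Clattenburg' },
  { 'Date' => '17/01/01', 'HomeTeam' => 'Team D',  'AwayTeam' => 'Team C', 'Referee' => 'B Jones' }
]

# Assumed definition of the holiday period used by the lambda filter.
holiday_period = Date.new(2016, 12, 24)..Date.new(2017, 1, 2)

arsenal_home = rows.select do |r|
  keep_row?(r, includes: { 'HomeTeam' => 'Arsenal' }, excludes: { 'Referee' => /Clattenburg/ })
end

holiday_games = rows.select do |r|
  keep_row?(r, includes: { 'Date' => ->(d) { holiday_period.cover?(Date.strptime(d, '%y/%m/%d')) } })
end
```

The sketch keys rows by field name only; the gem itself converts field names to column indexes before filtering, as shown in the csv_source.rb changes further down.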

data/lib/csv-diff.rb CHANGED

File without changes

data/lib/csv-diff/algorithm.rb CHANGED

@@ -3,6 +3,55 @@ class CSVDiff
     # Implements the CSV diff algorithm.
     module Algorithm
+        # Holds the details of a single difference
+        class Diff
+            attr_accessor :diff_type
+            attr_reader :fields
+            attr_reader :row
+            attr_reader :sibling_position
+            def initialize(diff_type, fields, row_idx, pos_idx)
+                @diff_type = diff_type
+                @fields = fields
+                @row = row_idx + 1
+                self.sibling_position = pos_idx
+            end
+            def sibling_position=(pos_idx)
+                if pos_idx.is_a?(Array)
+                    pos_idx.compact!
+                    if pos_idx.first != pos_idx.last
+                        @sibling_position = pos_idx.map{ |pos| pos + 1 }
+                    else
+                        @sibling_position = pos_idx.first + 1
+                    end
+                else
+                    @sibling_position = pos_idx + 1
+                end
+            end
+            # For backwards compatibility and access to fields with differences
+            def [](key)
+                case key
+                when String
+                    @fields[key]
+                when :action
+                    a = diff_type.to_s
+                    a[0] = a[0].upcase
+                    a
+                when :row
+                    @row
+                when :sibling_position
+                    @sibling_position
+                end
+            end
+        end
         # Diffs two CSVSource structures.
         #
         # @param left [CSVSource] A CSVSource object containing the contents of
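
The new Diff class stores positions 1-based and keeps an :action string for backwards compatibility. The following standalone sketch mirrors that normalisation as read from the hunk above; `one_based_position` and `action_label` are hypothetical helper names, not methods of the gem.

```ruby
# Standalone sketch, not the gem's class. Converts 0-based positions to
# 1-based, collapsing a [from, to] pair whose ends are equal into one value.
def one_based_position(pos_idx)
  if pos_idx.is_a?(Array)
    pos = pos_idx.compact # non-mutating here; the hunk above uses compact!
    if pos.first != pos.last
      pos.map { |p| p + 1 }
    else
      pos.first + 1
    end
  else
    pos_idx + 1
  end
end

# Turns a diff type symbol such as :update into the legacy 'Update' label.
def action_label(diff_type)
  label = diff_type.to_s
  label[0] = label[0].upcase
  label
end

moved     = one_based_position([2, 5])   # genuine move between two positions
unmoved   = one_based_position([2, 2])   # same position on both sides
one_sided = one_based_position([nil, 4]) # only one side known
label     = action_label(:update)
```

A genuine move keeps both positions as an array, while a pair with equal ends (or with one missing side) collapses to a single integer.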
@@ -22,43 +71,50 @@ class CSVDiff
         #  items that exist in both +left+ and +right+.
         # @option options [Boolean] :ignore_deletes If set to true, we ignore any
         #  new items that appear only in +left+.
+        # @option options [Hash<Object,Proc>] :equality_procs A Hash mapping fields
+        #  to a 2-arg Proc that should be used to compare values in that field for
+        #  equality.
         def diff_sources(left, right, key_fields, diff_fields, options = {})
             unless left.case_sensitive? == right.case_sensitive?
                 raise ArgumentError, "Left and right must have same settings for case-sensitivity"
             end
-            case_sensitive = left.case_sensitive?
+            # Ensure key fields are not also in the diff_fields
+            diff_fields = diff_fields - key_fields
             left_index = left.index
             left_values = left.lines
             left_keys = left_values.keys
             right_index = right.index
             right_values = right.lines
             right_keys = right_values.keys
-            parent_fields = left.parent_fields.length
+            parent_field_count = left.parent_fields.length
             include_adds = !options[:ignore_adds]
             include_moves = !options[:ignore_moves]
             include_updates = !options[:ignore_updates]
             include_deletes = !options[:ignore_deletes]
-            diffs = Hash.new{ |h, k| h[k] = {} }
+            @case_sensitive = left.case_sensitive?
+            @equality_procs = options.fetch(:equality_procs, {})
+            diffs = {}
+            potential_moves = Hash.new{ |h, k| h[k] = [] }
             # First identify deletions
             if include_deletes
                 (left_keys - right_keys).each do |key|
                     # Delete
                     key_vals = key.split('~', -1)
-                    parent = key_vals[0...parent_fields].join('~')
+                    parent = key_vals[0...parent_field_count].join('~')
+                    child = key_vals[parent_field_count..-1].join('~')
                     left_parent = left_index[parent]
                     left_value = left_values[key]
-                    left_idx = left_parent.index(key)
-                    next unless left_idx
-                    id = {}
-                    id[:row] = left_keys.index(key) + 1
-                    id[:sibling_position] = left_idx + 1
-                    key_fields.each do |field_name|
-                        id[field_name] = left_value[field_name]
-                    end
-                    diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
+                    row_idx = left_keys.index(key)
+                    sib_idx = left_parent.index(key)
+                    raise "Can't locate key #{key} in parent #{parent}" unless sib_idx
+                    diffs[key] = Diff.new(:delete, left_value, row_idx, sib_idx)
+                    potential_moves[child] << key
                     #puts "Delete: #{key}"
                 end
             end
@@ -66,7 +122,7 @@ class CSVDiff
             # Now identify adds/updates
             right_keys.each_with_index do |key, right_row_id|
                 key_vals = key.split('~', -1)
-                parent = key_vals[0...parent_fields].join('~')
+                parent = key_vals[0...parent_field_count].join('~')
                 left_parent = left_index[parent]
                 right_parent = right_index[parent]
                 left_value = left_values[key]
@@ -74,13 +130,12 @@ class CSVDiff
                 left_idx = left_parent && left_parent.index(key)
                 right_idx = right_parent && right_parent.index(key)
-                id = {}
-                id[:row] = right_row_id + 1
-                id[:sibling_position] = right_idx + 1
-                key_fields.each do |field_name|
-                    id[field_name] = right_value[field_name]
-                end
                 if left_idx && right_idx
+                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
+                        id = id_fields(key_fields, right_value)
+                        diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                        #puts "Change: #{key}"
+                    end
                     if include_moves
                         left_common = left_parent & right_parent
                         right_common = right_parent & left_parent
@@ -88,19 +143,31 @@ class CSVDiff
                         right_pos = right_common.index(key)
                         if left_pos != right_pos
                             # Move
-                            diffs[key].merge!(id.merge!(:action => 'Move',
-                                              :sibling_position => [left_idx + 1, right_idx + 1]))
+                            if d = diffs[key]
+                                d.sibling_position = [left_idx, right_idx]
+                            else
+                                id = id_fields(key_fields, right_value)
+                                diffs[key] = Diff.new(:move, id, right_row_id, [left_idx, right_idx])
+                            end
                             #puts "Move #{left_idx} -> #{right_idx}: #{key}"
                         end
                     end
-                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields, case_sensitive))
-                        diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
-                        #puts "Change: #{key}"
-                    end
-                elsif include_adds && right_idx
+                elsif right_idx
                     # Add
-                    diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
-                    #puts "Add: #{key}"
+                    child = key_vals[parent_field_count..-1].join('~')
+                    if potential_moves.has_key?(child) && old_key = potential_moves[child].pop
+                        diffs.delete(old_key)
+                        if include_updates
+                            left_value = left_values[old_key]
+                            id = id_fields(right.child_fields, right_value)
+                            changes = diff_row(left_value, right_value, left.parent_fields + diff_fields)
+                            diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                            #puts "Update Parent: #{key}"
+                        end
+                    elsif include_adds
+                        diffs[key] = Diff.new(:add, right_value, right_row_id, right_idx)
+                        #puts "Add: #{key}"
+                    end
                 end
             end
@@ -116,27 +183,41 @@ class CSVDiff
         # @param right_row [Hash] The version of the CSV row from the right/to
         #   file.
         # @param fields [Array<String>] An array of field names to compare.
-        # @param case_sensitive [Boolean] Whether field comparisons should be
-        #   case sensitive or not.
         # @return [Hash<String, Array>] A Hash whose keys are the fields that
         #   contain differences, and whose values are a two-element array of
         #   [left/from, right/to] values.
-        def diff_row(left_row, right_row, fields, case_sensitive)
+        def diff_row(left_row, right_row, fields)
             diffs = {}
             fields.each do |attr|
+                eq_proc = @equality_procs[attr]
                 right_val = right_row[attr]
                 right_val = nil if right_val == ""
                 left_val = left_row[attr]
                 left_val = nil if left_val == ""
-                if (case_sensitive && left_val != right_val) ||
-                   (left_val.to_s.upcase != right_val.to_s.upcase)
+                if eq_proc
+                    diffs[attr] = [left_val, right_val] unless eq_proc.call(left_val, right_val)
+                elsif @case_sensitive
+                    diffs[attr] = [left_val, right_val] unless left_val == right_val
+                elsif (left_val.to_s.upcase != right_val.to_s.upcase)
                     diffs[attr] = [left_val, right_val]
-                    #puts "#{attr}: #{left_val} -> #{right_val}"
                 end
             end
             diffs if diffs.size > 0
         end
+        private
+        # Return a hash containing just the key field values
+        def id_fields(key_fields, fields)
+            id = {}
+            key_fields.each do |field_name|
+                id[field_name] = fields[field_name]
+            end
+            id
+        end
     end
 end
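
The reworked comparison in diff_row checks for a per-field equality proc first, then falls back to case-sensitive or case-insensitive comparison. A minimal standalone sketch of that precedence follows. `diff_row_sketch` and its keyword arguments are assumptions made for illustration; the gem keeps this state in instance variables instead.

```ruby
# Standalone sketch of the comparison order shown in the hunk above.
# Returns a Hash of field => [left, right] for differing fields, or nil.
def diff_row_sketch(left_row, right_row, fields, equality_procs: {}, case_sensitive: true)
  diffs = {}
  fields.each do |attr|
    left_val = left_row[attr]
    left_val = nil if left_val == ''
    right_val = right_row[attr]
    right_val = nil if right_val == ''

    equal =
      if (eq_proc = equality_procs[attr])
        eq_proc.call(left_val, right_val)           # custom 2-arg comparison wins
      elsif case_sensitive
        left_val == right_val
      else
        left_val.to_s.upcase == right_val.to_s.upcase
      end
    diffs[attr] = [left_val, right_val] unless equal
  end
  diffs unless diffs.empty?
end

# Hypothetical equality proc: treat '10.0' and '10' as the same price.
numeric_eq = ->(a, b) { a.to_f == b.to_f }

left  = { 'Name' => 'Widget', 'Price' => '10.0', 'Notes' => '' }
right = { 'Name' => 'widget', 'Price' => '10',   'Notes' => nil }
fields = %w[Name Price Notes]

strict  = diff_row_sketch(left, right, fields, equality_procs: { 'Price' => numeric_eq })
relaxed = diff_row_sketch(left, right, fields, equality_procs: { 'Price' => numeric_eq },
                                               case_sensitive: false)
```

With case sensitivity on, only Name differs; with it off, the rows compare equal and the result is nil, which is what lets the caller skip recording an update.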

data/lib/csv-diff/csv_diff.rb CHANGED

@@ -87,7 +87,7 @@ class CSVDiff
         raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
         @warnings = []
         @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options)
-        @key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
+        @key_fields = @left.key_fields
         diff(options)
     end
@@ -148,8 +148,6 @@ class CSVDiff
             right_fields.each_with_index do |fld, i|
                 if left_fields.include?(fld)
                     diff_fields << fld unless ignore_fields.include?(fld.upcase)
-                else
-                    @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
                 end
             end
         else

data/lib/csv-diff/csv_source.rb CHANGED

@@ -6,6 +6,7 @@ class CSVDiff
         # @return [String] the path to the source file
         attr_accessor :path
         # @return [Array<String>] The names of the fields in the source file
         attr_reader :field_names
         # @return [Array<String>] The names of the field(s) that uniquely
@@ -17,6 +18,17 @@ class CSVDiff
         # @return [Array<String>] The names of the field(s) that distinguish a
         #   child of a parent record.
         attr_reader :child_fields
+        # @return [Array<Fixnum>] The indexes of the key fields in the source
+        #   file.
+        attr_reader :key_field_indexes
+        # @return [Array<Fixnum>] The indexes of the parent fields in the source
+        #   file.
+        attr_reader :parent_field_indexes
+        # @return [Array<Fixnum>] The indexes of the child fields in the source
+        #   file.
+        attr_reader :child_field_indexes
         # @return [Boolean] True if the source has been indexed with case-
         #   sensitive keys, or false if it has been indexed using upper-case key
         #   values.
@@ -34,6 +46,12 @@ class CSVDiff
         # @return [Array<String>] An array of any warnings encountered while
         #   processing the source.
         attr_reader :warnings
+        # @return [Fixnum] A count of the lines processed from this source.
+        #   Excludes any header and duplicate records identified during indexing.
+        attr_reader :line_count
+        # @return [Fixnum] A count of the lines from this source that were skipped,
+        #   due either to duplicate keys or filter conditions.
+        attr_reader :skip_count
         # Creates a new diff source.
@@ -72,8 +90,14 @@ class CSVDiff
         # @option options [String] :child_field The name of the field(s) that
         #   uniquely identify a child of a parent.
         # @option options [Boolean] :case_sensitive If true (the default), keys
-        #  are indexed as-is; if false, the index is built in upper-case for
-        #  case-insensitive comparisons.
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
         def initialize(source, options = {})
             if source.is_a?(String)
                 require 'csv'
@@ -81,8 +105,12 @@ class CSVDiff
                 csv_options = options.fetch(:csv_options, {})
                 @path = source
                 source = CSV.open(@path, mode_string, csv_options).readlines
+            elsif !source.is_a?(Enumerable) || (source.is_a?(Enumerable) && source.size > 0 &&
+                                                !source.first.is_a?(Enumerable))
+                raise ArgumentError, "source must be a path to a file or an Enumerable<Enumerable>"
             end
-            if kf = options.fetch(:key_field, options[:key_fields])
+            if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
+               (kf = options.fetch(:key_field, options[:key_fields]))
                 @key_fields = [kf].flatten
                 @parent_fields = @key_fields[0...-1]
                 @child_fields = @key_fields[-1..-1]
@@ -114,38 +142,69 @@ class CSVDiff
         def index_source(lines, options)
             @lines = {}
             @index = Hash.new{ |h, k| h[k] = [] }
-            @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
+            if @field_names
+                index_fields(options)
+            end
             @case_sensitive = options.fetch(:case_sensitive, true)
             @trim_whitespace = options.fetch(:trim_whitespace, false)
+            @line_count = 0
+            @skip_count = 0
             line_num = 0
             lines.each do |row|
                 line_num += 1
                 next if line_num == 1 && @field_names && options[:ignore_header]
                 unless @field_names
                     @field_names = row
-                    @key_fields = find_field_indexes(@key_fields, @field_names)
+                    index_fields(options)
                     next
                 end
                 field_vals = row
                 line = {}
+                filter = false
                 @field_names.each_with_index do |field, i|
                     line[field] = field_vals[i]
                     line[field].strip! if @trim_whitespace && line[field]
+                    if @include_filter && f = @include_filter[i]
+                        filter = !check_filter(f, line[field])
+                    end
+                    if @exclude_filter && f = @exclude_filter[i]
+                        filter = check_filter(f, line[field])
+                    end
+                    break if filter
+                end
+                if filter
+                    @skip_count += 1
+                    next
                 end
-                key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
+                key_values = @key_field_indexes.map{ |kf| field_vals[kf].to_s.upcase }
                 key = key_values.join('~')
                 parent_key = key_values[0...(@parent_fields.length)].join('~')
                 parent_key.upcase! unless @case_sensitive
                 if @lines[key]
                     @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
+                    @skip_count += 1
                 else
                     @index[parent_key] << key
                     @lines[key] = line
+                    @line_count += 1
                 end
             end
         end
+        def index_fields(options)
+            @key_field_indexes = find_field_indexes(@key_fields, @field_names)
+            @parent_field_indexes = find_field_indexes(@parent_fields, @field_names)
+            @child_field_indexes = find_field_indexes(@child_fields, @field_names)
+            @key_fields = @key_field_indexes.map{ |i| @field_names[i] }
+            @parent_fields = @parent_field_indexes.map{ |i| @field_names[i] }
+            @child_fields = @child_field_indexes.map{ |i| @field_names[i] }
+            @include_filter = convert_filter(options, :include, @field_names)
+            @exclude_filter = convert_filter(options, :exclude, @field_names)
+        end
         # Converts an array of field names to an array of indexes of the fields
         # matching those names.
         def find_field_indexes(key_fields, field_names)
@@ -160,6 +219,35 @@ class CSVDiff
             end
         end
+        def convert_filter(options, key, field_names)
+            return unless hsh = options[key]
+            if !hsh.is_a?(Hash)
+                raise ArgumentError, ":#{key} option must be a Hash of field name(s)/index(es) to RegExp(s)"
+            end
+            keys = hsh.keys
+            idxs = find_field_indexes(keys, @field_names)
+            Hash[keys.each_with_index.map{ |k, i| [idxs[i], hsh[k]] }]
+        end
+        def check_filter(filter, field_val)
+            case filter
+            when String
+                if @case_sensitive
+                    filter == field_val
+                else
+                    filter.downcase == field_val.to_s.downcase
+                end
+            when Regexp
+                filter.match(field_val)
+            when Proc
+                filter.call(field_val)
+            else
+                raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
+            end
+        end
     end
 end
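
The new line_count and skip_count readers track how many rows were indexed and how many were skipped. The sketch below illustrates the duplicate-key half of that bookkeeping only (filter skips are omitted). It is a simplified standalone approximation: `index_rows` and its return shape are hypothetical, and the sample rows are made up.

```ruby
# Standalone sketch, not the gem's index_source. The first row is taken as
# the header; keys are built from upper-cased key field values joined by '~'.
def index_rows(rows, key_indexes)
  header, *data = rows
  lines = {}
  warnings = []
  line_count = 0
  skip_count = 0

  data.each_with_index do |row, i|
    line_num = i + 2 # 1-based line number, counting the header as line 1
    key = key_indexes.map { |k| row[k].to_s.upcase }.join('~')
    if lines.key?(key)
      warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
      skip_count += 1
    else
      lines[key] = header.zip(row).to_h
      line_count += 1
    end
  end

  { lines: lines, warnings: warnings, line_count: line_count, skip_count: skip_count }
end

rows = [
  %w[Parent Child Value],
  %w[A x 1],
  %w[A y 2],
  %w[a X 3] # same key as the first data row once upper-cased
]

result = index_rows(rows, [0, 1])
```

Because the first occurrence of a key wins, the later duplicate is counted as skipped and reported in the warnings rather than overwriting the indexed row.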

data/lib/csv_diff.rb CHANGED

File without changes

metadata CHANGED

@@ -1,35 +1,42 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: 0.3.1
-  prerelease:
+  version: 0.3.3
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-01-26 00:00:00.000000000 Z
+date: 2017-05-17 00:00:00.000000000 Z
 dependencies: []
-description: ! "        This library performs diffs of CSV files.\n\n        Unlike
-  a standard diff that compares line by line, and is sensitive to the\n        ordering
-  of records, CSV-Diff identifies common lines by key field(s), and\n        then
-  compares the contents of the fields in each line.\n\n        Data may be supplied
-  in the form of CSV files, or as an array of arrays. The\n        diff process provides
-  a fine level of control over what to diff, and can\n        optionally ignore certain
-  types of changes (e.g. changes in position).\n\n        CSV-Diff is particularly
-  well suited to data in parent-child format. Parent-\n        child data does not
-  lend itself well to standard text diffs, as small changes\n        in the organisation
-  of the tree at an upper level can lead to big movements\n        in the position
-  of descendant records. By instead matching records by key,\n        CSV-Diff avoids
-  this issue, while still being able to detect changes in\n        sibling order.\n\n
-  \       This gem implements the core diff algorithm, and handles the loading and\n
-  \       diffing of CSV files. It returns a CSVDiff object, that contains the details\n
-  \       of differences in object form. This is useful for projects that need diff\n
-  \       capability, but want to handle the reporting of differences themselves.
-  For\n        a pre-built diff reporting capability, see the csv-diff-report gem,
-  which\n        provides a command-line tool for generating diff reports in HTML
-  or Excel\n        format.\n"
+description: |2
+          This library performs diffs of CSV files (or table-like sources).
+          Unlike a standard diff that compares line by line, and is sensitive to the
+          ordering of records, CSV-Diff identifies common lines by key field(s), and
+          then compares the contents of the fields in each line.
+          Data may be supplied in the form of CSV files, or as an array of arrays. The
+          diff process provides a fine level of control over what to diff, and can
+          optionally ignore certain types of changes (e.g. changes in position).
+          CSV-Diff is particularly well suited to data in parent-child format. Parent-
+          child data does not lend itself well to standard text diffs, as small changes
+          in the organisation of the tree at an upper level can lead to big movements
+          in the position of descendant records. By instead matching records by key,
+          CSV-Diff avoids this issue, while still being able to detect changes in
+          sibling order.
+          This gem implements the core diff algorithm, and handles the loading and
+          diffing of CSV files (or Arrays of Arrays). It returns a CSVDiff object
+          containing the details of differences in object form. This is useful for
+          projects that need diff capability, but want to handle the reporting or
+          actioning of differences themselves.
+          For a pre-built diff reporting capability, see the csv-diff-report gem,
+          which provides a command-line tool for generating diff reports in HTML,
+          Excel, or text formats.
 email: adam.b.gardiner@gmail.com
 executables: []
 extensions: []
@@ -43,27 +50,27 @@ files:
 - lib/csv-diff.rb
 - lib/csv_diff.rb
 homepage: https://github.com/agardiner/csv-diff
-licenses: []
+licenses:
+- MIT
+metadata: {}
 post_install_message: For command-line tools and diff reports, 'gem install csv-diff-report'
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - - ! '>='
+  - - '>='
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - - ! '>='
+  - - '>='
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.21
+rubygems_version: 2.0.14.1
 signing_key:
-specification_version: 3
+specification_version: 4
 summary: CSV Diff is a library for generating diffs from data in CSV format
 test_files: []