RubyGems - csv-diff - Versions diffs - 0.2 → 0.5.0 - Mend

csv-diff 0.2 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +5 -13
data/LICENSE +1 -1
data/README.md +18 -1
data/lib/csv-diff.rb +1 -0
data/lib/csv-diff/algorithm.rb +145 -41
data/lib/csv-diff/csv_diff.rb +24 -13
data/lib/csv-diff/csv_source.rb +24 -101
data/lib/csv-diff/source.rb +275 -0
data/lib/csv-diff/xml_source.rb +142 -0
metadata +39 -19

checksums.yaml CHANGED

@@ -1,15 +1,7 @@
 ---
-!binary "U0hBMQ==":
-  metadata.gz: !binary |-
-    ZGY3NTBiZGFiNTQxMmZiZmMwOTE2MDIzYmEwNWQ3NTY0MzZmNjRkMg==
-  data.tar.gz: !binary |-
-    NzAyNTY1OTk3MTA3Y2ZhNjk2YWRmNTJkYTljNGZhZDY2YjQ1OTg2ZQ==
+SHA1:
+  metadata.gz: d1b3b8deee34344d334e740285cb1f3c99074694
+  data.tar.gz: d95158d13861cb66fd460ee430714ec3c83cd0b1
 SHA512:
-  metadata.gz: !binary |-
-    ZDRmODg2NDlhYTY4NDM1MWYxNjgyZGQ0MGRiZDNkYmI3MDkwYjFkZWVhOWYw
-    YmVkNDVhMjk1M2EyNjFkZGIxOGE0Y2MwOWQwMWRhNzhjZDk2N2RhZmEyZGRm
-    MjliNmM4Y2ZmNzY4ZTJkY2EzZWY4Mjg3NmU3ZjQxM2RkYTBjODE=
-  data.tar.gz: !binary |-
-    YTljOGZhNDY0YzdjZGYzMTA2NTM3MzIwNDg4MTcwYWEyM2IyZTc1YWYxMjFm
-    YjMwYWU1NWMzNGVkZGRkYWYyZjUwMTQ2MWZlMjdkNjQwMjIwYWUwNmNlYjM3
-    NDlmZDk5MGNlMTk4ZDhlMzFiOGUyZTIwY2EyZTY3MjUwYjc2NWY=
+  metadata.gz: 50c74d6a4093012b0ba44fef70c2d749348d6777cfb9f2cfda66c6e075423191a4c6c22019a388b9d8bd14e22ac60d539f4e3b4aa85fd87fd774a64da15858c7
+  data.tar.gz: 8fa030a54e7a97db9913b3c36a1942de1e07a6549f9ae7aa58b5b3f44d522fe11f72d44e18b6b7612d2b2dc9f106ece1fea183557c507bcf18316891ab63f230

data/LICENSE CHANGED

@@ -1,4 +1,4 @@
-Copyright (c) 2013, Adam Gardiner
+Copyright (c) 2013-2016, Adam Gardiner
 All rights reserved.
 Redistribution and use in source and binary forms, with or without

data/README.md CHANGED

@@ -125,7 +125,7 @@ column in the data. In this case, a diff can be created simply via:
 diff = CSVDiff.new(file1, file2)
 ```
-### Specifynig Unique Row Identifiers
+### Specifying Unique Row Identifiers
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
@@ -211,6 +211,23 @@ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam'
                    ignore_fields: ['CreatedAt', 'UpdatedAt'])
 ```
+### Filtering Rows
+If you need to filter source data before running the diff process, you can use the :include
+and :exclude options to do so. Both options take a Hash as their value; the hash should have
+keys that are the field names or indexes (0-based) on which to filter, and whose values are
+regular expressions or lambdas to be applied to values of the corresponding field. Rows will
+only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
+```ruby
+# Generate a diff of Arsenal home games not refereed by Clattenburg
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
+# Generate a diff of games played over the Xmas/New Year period
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
+```
 ### Ignoring Certain Changes
 CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
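
The filtering rule described in the new README section (keep a row only if it satisfies the :include conditions and none of the :exclude conditions) can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `row_selected?`, `include_filter` and `exclude_filter` are hypothetical names, and the sketch assumes that multiple :include conditions must all hold while any single :exclude condition rejects the row.

```ruby
# Hypothetical helper, not part of csv-diff. Each filter maps a field name to
# a Regexp, a lambda, or a literal value; `===` handles all three
# (Regexp#=== matches, Proc#=== calls, String#=== tests equality).
def row_selected?(row, include_filter: {}, exclude_filter: {})
  included = include_filter.all? { |field, cond| cond === row[field.to_s].to_s }
  excluded = exclude_filter.any? { |field, cond| cond === row[field.to_s].to_s }
  included && !excluded
end

rows = [
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'M Clattenburg' },
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'M Dean' },
  { 'HomeTeam' => 'Chelsea', 'Referee' => 'M Dean' }
]

# Arsenal home games not refereed by Clattenburg
selected = rows.select do |row|
  row_selected?(row,
                include_filter: { HomeTeam: /\AArsenal\z/ },
                exclude_filter: { Referee: /Clattenburg/ })
end

# A lambda condition works the same way
dean_games = rows.select do |row|
  row_selected?(row, include_filter: { Referee: ->(r) { r.end_with?('Dean') } })
end
```

Relying on `===` keeps the sketch short; the gem itself may dispatch on the condition type differently.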

data/lib/csv-diff.rb CHANGED

@@ -1,3 +1,4 @@
+require 'csv-diff/source'
 require 'csv-diff/csv_source'
 require 'csv-diff/algorithm'
 require 'csv-diff/csv_diff'

data/lib/csv-diff/algorithm.rb CHANGED

@@ -3,6 +3,55 @@ class CSVDiff
     # Implements the CSV diff algorithm.
     module Algorithm
+        # Holds the details of a single difference
+        class Diff
+            attr_accessor :diff_type
+            attr_reader :fields
+            attr_reader :row
+            attr_reader :sibling_position
+            def initialize(diff_type, fields, row_idx, pos_idx)
+                @diff_type = diff_type
+                @fields = fields
+                @row = row_idx + 1
+                self.sibling_position = pos_idx
+            end
+            def sibling_position=(pos_idx)
+                if pos_idx.is_a?(Array)
+                    pos_idx.compact!
+                    if pos_idx.first != pos_idx.last
+                        @sibling_position = pos_idx.map{ |pos| pos + 1 }
+                    else
+                        @sibling_position = pos_idx.first + 1
+                    end
+                else
+                    @sibling_position = pos_idx + 1
+                end
+            end
+            # For backwards compatibility and access to fields with differences
+            def [](key)
+                case key
+                when :action
+                    a = diff_type.to_s
+                    a[0] = a[0].upcase
+                    a
+                when :row
+                    @row
+                when :sibling_position
+                    @sibling_position
+                else
+                    @fields[key]
+                end
+            end
+        end
         # Diffs two CSVSource structures.
         #
         # @param left [CSVSource] A CSVSource object containing the contents of
@@ -13,24 +62,70 @@ class CSVDiff
         #   that uniquely identify each row.
         # @param diff_fields [Array] An array containing the names of the fields
         #   to be diff-ed.
+        # @param options [Hash] An options hash.
+        # @option options [Boolean] :ignore_adds If set to true, we ignore any
+        #  new items that appear only in +right+.
+        # @option options [Boolean] :ignore_moves If set to true, we ignore any
+        #  changes in sibling order.
+        # @option options [Boolean] :ignore_updates If set to true, we ignore any
+        #  changes to items that exist in both +left+ and +right+.
+        # @option options [Boolean] :ignore_deletes If set to true, we ignore any
+        #  items that appear only in +left+.
+        # @option options [Hash<Object,Proc>] :equality_procs A Hash mapping fields
+        #  to a 2-arg Proc that should be used to compare values in that field for
+        #  equality.
         def diff_sources(left, right, key_fields, diff_fields, options = {})
+            unless left.case_sensitive? == right.case_sensitive?
+                raise ArgumentError, "Left and right must have same settings for case-sensitivity"
+            end
+            unless left.parent_fields.length == right.parent_fields.length
+                raise ArgumentError, "Left and right must have same settings for parent/child fields"
+            end
+            # Ensure key fields are not also in the diff_fields
+            diff_fields = diff_fields - key_fields
             left_index = left.index
             left_values = left.lines
             left_keys = left_values.keys
             right_index = right.index
             right_values = right.lines
             right_keys = right_values.keys
-            parent_fields = left.parent_fields.length
+            parent_field_count = left.parent_fields.length
             include_adds = !options[:ignore_adds]
             include_moves = !options[:ignore_moves]
             include_updates = !options[:ignore_updates]
             include_deletes = !options[:ignore_deletes]
-            diffs = Hash.new{ |h, k| h[k] = {} }
+            @case_sensitive = left.case_sensitive?
+            @equality_procs = options.fetch(:equality_procs, {})
+            diffs = {}
+            potential_moves = Hash.new{ |h, k| h[k] = [] }
+            # First identify deletions
+            if include_deletes
+                (left_keys - right_keys).each do |key|
+                    # Delete
+                    key_vals = key.split('~', -1)
+                    parent = key_vals[0...parent_field_count].join('~')
+                    child = key_vals[parent_field_count..-1].join('~')
+                    left_parent = left_index[parent]
+                    left_value = left_values[key]
+                    row_idx = left_keys.index(key)
+                    sib_idx = left_parent.index(key)
+                    raise "Can't locate key #{key} in parent #{parent}" unless sib_idx
+                    diffs[key] = Diff.new(:delete, left_value, row_idx, sib_idx)
+                    potential_moves[child] << key
+                    #puts "Delete: #{key}"
+                end
+            end
+            # Now identify adds/updates
             right_keys.each_with_index do |key, right_row_id|
-                key_vals = key.split('~')
-                parent = key_vals[0...parent_fields].join('~')
+                key_vals = key.split('~', -1)
+                parent = key_vals[0...parent_field_count].join('~')
                 left_parent = left_index[parent]
                 right_parent = right_index[parent]
                 left_value = left_values[key]
@@ -38,13 +133,12 @@ class CSVDiff
                 left_idx = left_parent && left_parent.index(key)
                 right_idx = right_parent && right_parent.index(key)
-                id = {}
-                id[:row] = right_row_id + 1
-                id[:sibling_position] = right_idx + 1
-                key_fields.each do |field_name|
-                    id[field_name] = right_value[field_name]
-                end
                 if left_idx && right_idx
+                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
+                        id = id_fields(key_fields, right_value)
+                        diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                        #puts "Change: #{key}"
+                    end
                     if include_moves
                         left_common = left_parent & right_parent
                         right_common = right_parent & left_parent
@@ -52,42 +146,34 @@ class CSVDiff
                         right_pos = right_common.index(key)
                         if left_pos != right_pos
                             # Move
-                            diffs[key].merge!(id.merge!(:action => 'Move',
-                                              :sibling_position => [left_idx + 1, right_idx + 1]))
+                            if d = diffs[key]
+                                d.sibling_position = [left_idx, right_idx]
+                            else
+                                id = id_fields(key_fields, right_value)
+                                diffs[key] = Diff.new(:move, id, right_row_id, [left_idx, right_idx])
+                            end
                             #puts "Move #{left_idx} -> #{right_idx}: #{key}"
                         end
                     end
-                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
-                        diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
-                        #puts "Change: #{key}"
-                    end
-                elsif include_adds && right_idx
+                elsif right_idx
                     # Add
-                    diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
-                    #puts "Add: #{key}"
-                end
-            end
-            # Now identify deletions
-            if include_deletes
-                (left_keys - right_keys).each do |key|
-                    # Delete
-                    key_vals = key.split('~')
-                    parent = key_vals[0...parent_fields].join('~')
-                    left_parent = left_index[parent]
-                    left_value = left_values[key]
-                    left_idx = left_parent.index(key)
-                    next unless left_idx
-                    id = {}
-                    id[:row] = left_keys.index(key) + 1
-                    id[:sibling_position] = left_idx + 1
-                    key_fields.each do |field_name|
-                        id[field_name] = left_value[field_name]
+                    child = key_vals[parent_field_count..-1].join('~')
+                    if potential_moves.has_key?(child) && old_key = potential_moves[child].pop
+                        diffs.delete(old_key)
+                        if include_updates
+                            left_value = left_values[old_key]
+                            id = id_fields(right.child_fields, right_value)
+                            changes = diff_row(left_value, right_value, left.parent_fields + diff_fields)
+                            diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                            #puts "Update Parent: #{key}"
+                        end
+                    elsif include_adds
+                        diffs[key] = Diff.new(:add, right_value, right_row_id, right_idx)
+                        #puts "Add: #{key}"
                     end
-                    diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
-                    #puts "Delete: #{key}"
                 end
             end
             diffs
         end
@@ -99,24 +185,42 @@ class CSVDiff
         #   file.
         # @param right_row [Hash] The version of the CSV row from the right/to
         #   file.
+        # @param fields [Array<String>] An array of field names to compare.
         # @return [Hash<String, Array>] A Hash whose keys are the fields that
         #   contain differences, and whose values are a two-element array of
         #   [left/from, right/to] values.
         def diff_row(left_row, right_row, fields)
             diffs = {}
             fields.each do |attr|
+                eq_proc = @equality_procs[attr]
                 right_val = right_row[attr]
                 right_val = nil if right_val == ""
                 left_val = left_row[attr]
                 left_val = nil if left_val == ""
-                if left_val != right_val
+                if eq_proc
+                    diffs[attr] = [left_val, right_val] unless eq_proc.call(left_val, right_val)
+                elsif @case_sensitive
+                    diffs[attr] = [left_val, right_val] unless left_val == right_val
+                elsif (left_val.to_s.upcase != right_val.to_s.upcase)
                     diffs[attr] = [left_val, right_val]
-                    #puts "#{attr}: #{left_val} -> #{right_val}"
                 end
             end
             diffs if diffs.size > 0
         end
+        private
+        # Return a hash containing just the key field values
+        def id_fields(key_fields, fields)
+            id = {}
+            key_fields.each do |field_name|
+                id[field_name] = fields[field_name]
+            end
+            id
+        end
     end
 end
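
The field comparison rules introduced in `diff_row` (per-field equality procs first, then case-sensitive or case-insensitive comparison, with empty strings treated as nil) can be tried in isolation. The following is a standalone sketch modelled on that hunk, not the gem's code: `changed_fields` and its keyword arguments are hypothetical names chosen for illustration.

```ruby
# Standalone sketch of the comparison order shown in diff_row:
#   1. a per-field equality proc, if one is registered
#   2. plain == when case-sensitive
#   3. upper-cased string comparison otherwise
def changed_fields(left_row, right_row, fields, equality_procs: {}, case_sensitive: true)
  changes = {}
  fields.each do |field|
    left_val  = left_row[field]
    right_val = right_row[field]
    # Empty strings and nil are treated as the same "no value"
    left_val  = nil if left_val == ''
    right_val = nil if right_val == ''
    same =
      if (eq = equality_procs[field])
        eq.call(left_val, right_val)
      elsif case_sensitive
        left_val == right_val
      else
        left_val.to_s.upcase == right_val.to_s.upcase
      end
    changes[field] = [left_val, right_val] unless same
  end
  changes.empty? ? nil : changes
end

left  = { 'Name' => 'alice', 'Score' => '1.0',  'Note' => '' }
right = { 'Name' => 'ALICE', 'Score' => '1.00', 'Note' => nil }
numeric = ->(a, b) { a.to_f == b.to_f }

strict  = changed_fields(left, right, %w[Name Score Note])
relaxed = changed_fields(left, right, %w[Name Score Note],
                         equality_procs: { 'Score' => numeric },
                         case_sensitive: false)
```

With the default settings both `Name` and `Score` are reported; with a numeric equality proc and case-insensitive comparison, no differences remain and the result is nil, mirroring the `diffs if diffs.size > 0` return in the hunk.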

data/lib/csv-diff/csv_diff.rb CHANGED

@@ -28,13 +28,15 @@ class CSVDiff
     # @return [Array<String>] An array of field names that are compared in the
     #    diff process.
     attr_reader :diff_fields
-    # @return [Array<Fixnum>] An array of field indexes identifying the key
-    #    fields that uniquely identify each row.
+    # @return [Array<String>] An array of field names of the key fields that
+    #    uniquely identify each row.
     attr_reader :key_fields
     # @return [Array<String>] An array of field names for the parent field(s).
     attr_reader :parent_fields
     # @return [Array<String>] An array of field names for the child field(s).
     attr_reader :child_fields
+    # @return [Hash] The options hash used for the diff.
+    attr_reader :options
     # Generates a diff between two hierarchical tree structures, provided
@@ -79,13 +81,15 @@ class CSVDiff
     # @option options [Boolean] :ignore_deletes If true, records that appear
     #   in the left/from file but not in the right/to file are not reported.
     def initialize(left, right, options = {})
-        @left = left.is_a?(CSVSource) ? left : CSVSource.new(left, options)
+        @left = left.is_a?(Source) ? left : CSVSource.new(left, options)
+        @left.index_source if @left.lines.nil?
         raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
-        @right = right.is_a?(CSVSource) ? right : CSVSource.new(right, options)
+        @right = right.is_a?(Source) ? right : CSVSource.new(right, options)
+        @right.index_source if @right.lines.nil?
         raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
         @warnings = []
-        @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options[:ignore_fields])
-        @key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
+        @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options)
+        @key_fields = @left.key_fields
         diff(options)
     end
@@ -93,6 +97,7 @@ class CSVDiff
     # Performs a diff with the specified +options+.
     def diff(options = {})
         @summary = nil
+        @options = options
         @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
     end
@@ -134,15 +139,21 @@ class CSVDiff
     # Given two sets of field names, determines the common set of fields present
     # in both, on which members can be diffed.
-    def get_diff_fields(left_fields, right_fields, ignore_fields)
+    def get_diff_fields(left_fields, right_fields, options)
+        ignore_fields = options.fetch(:ignore_fields, [])
+        ignore_fields = [ignore_fields] unless ignore_fields.is_a?(Array)
+        ignore_fields.map! do |f|
+            (f.is_a?(Numeric) ? right_fields[f] : f).upcase
+        end
         diff_fields = []
-        right_fields.each_with_index do |fld, i|
-            if left_fields.include?(fld)
-                diff_fields << fld unless ignore_fields && (ignore_fields.include?(fld) ||
-                                                            ignore_fields.include?(i))
-            else
-                @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
+        if options[:diff_common_fields_only]
+            right_fields.each_with_index do |fld, i|
+                if left_fields.include?(fld)
+                    diff_fields << fld unless ignore_fields.include?(fld.upcase)
+                end
             end
+        else
+            diff_fields = (right_fields + left_fields).uniq.reject{ |fld| ignore_fields.include?(fld.upcase) }
         end
         diff_fields
     end
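
The reworked `get_diff_fields` changes which fields are compared: by default the union of left and right field names is used, and only with :diff_common_fields_only is it restricted to fields present in both. A small standalone sketch of that selection logic follows; `diff_field_names` and its keyword arguments are hypothetical names, and unlike the hunk above it calls `to_s` before upcasing, which is an assumption made here so that symbols also work.

```ruby
# Sketch of the field selection rules, not the gem's method.
# ignore_fields may hold field names or 0-based indexes into right_fields.
def diff_field_names(left_fields, right_fields, ignore_fields: [], common_only: false)
  ignored = Array(ignore_fields).map do |f|
    (f.is_a?(Numeric) ? right_fields[f] : f).to_s.upcase
  end
  candidates =
    if common_only
      right_fields.select { |fld| left_fields.include?(fld) }
    else
      (right_fields + left_fields).uniq
    end
  candidates.reject { |fld| ignored.include?(fld.upcase) }
end

left_fields  = %w[Id Name CreatedAt]
right_fields = %w[Id Name Email]

union_fields  = diff_field_names(left_fields, right_fields, ignore_fields: ['createdat'])
common_fields = diff_field_names(left_fields, right_fields, common_only: true)
by_index      = diff_field_names(left_fields, right_fields, ignore_fields: [2])
```

Note that ignored names are matched case-insensitively, while the common-field test uses an exact match, as in the hunk.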

data/lib/csv-diff/csv_source.rb CHANGED

@@ -2,31 +2,7 @@ class CSVDiff
     # Represents a CSV input (i.e. the left/from or right/to input) to the diff
     # process.
-    class CSVSource
-        # @return [String] the path to the source file
-        attr_accessor :path
-        # @return [Array<String>] The names of the fields in the source file
-        attr_reader :field_names
-        # @return [Array<String>] The names of the field(s) that uniquely
-        #   identify each row.
-        attr_reader :key_fields
-        # @return [Array<String>] The names of the field(s) that identify a
-        #   common parent of child records.
-        attr_reader :parent_fields
-        # @return [Array<String>] The names of the field(s) that distinguish a
-        #   child of a parent record.
-        attr_reader :child_fields
-        # @return [Hash<String,Hash>] A hash containing each line of the source,
-        #   keyed on the values of the +key_fields+.
-        attr_reader :lines
-        # @return [Hash<String,Array<String>>] A hash containing each parent key,
-        #   and an Array of the child keys it is a parent of.
-        attr_reader :index
-        # @return [Array<String>] An array of any warnings encountered while
-        #   processing the source.
-        attr_reader :warnings
+    class CSVSource < Source
         # Creates a new diff source.
         #
@@ -59,90 +35,37 @@ class CSVDiff
         #   identifies each row.
         # @option options [Array<String>] :key_fields The names of the fields
         #   that uniquely identify each row.
-        # @option options [String] :parent_field The name of the field that
-        #   identifies a parent within which sibling order should be checked.
-        # @option options [String] :child_field The name of the field that
-        #   uniquely identifies a child of a parent.
+        # @option options [String] :parent_field The name of the field(s) that
+        #   identify a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field(s) that
+        #   uniquely identify a child of a parent.
+        # @option options [Boolean] :case_sensitive If true (the default), keys
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
         def initialize(source, options = {})
+            super(options)
             if source.is_a?(String)
                 require 'csv'
                 mode_string = options[:encoding] ? "r:#{options[:encoding]}" : 'r'
                 csv_options = options.fetch(:csv_options, {})
                 @path = source
-                source = CSV.open(@path, mode_string, csv_options).readlines
-            end
-            if kf = options.fetch(:key_field, options[:key_fields])
-                @key_fields = [kf].flatten
-                @parent_fields = @key_fields[0...-1]
-                @child_fields = @key_fields[-1..-1]
-            else
-                @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
-                @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
-                @key_fields = @parent_fields + @child_fields
-            end
-            @field_names = options[:field_names]
-            @warnings = []
-            index_source(source, options)
-        end
-        # Returns the row in the CSV source corresponding to the supplied key.
-        #
-        # @param key [String] The unique key to use to lookup the row.
-        # @return [Hash] The fields for the line corresponding to +key+, or nil
-        #   if the key is not recognised.
-        def [](key)
-            @lines[key]
-        end
-        private
-        # Given an array of lines, where each line is an array of fields, indexes
-        # the array contents so that it can be looked up by key.
-        def index_source(lines, options)
-            @lines = {}
-            @index = Hash.new{ |h, k| h[k] = [] }
-            @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
-            line_num = 0
-            lines.each do |row|
-                line_num += 1
-                next if line_num == 1 && @field_names && options[:ignore_header]
-                unless @field_names
-                    @field_names = row
-                    @key_fields = find_field_indexes(@key_fields, @field_names)
-                    next
-                end
-                field_vals = row
-                line = {}
-                @field_names.each_with_index do |field, i|
-                    line[field] = field_vals[i]
-                end
-                key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
-                key = key_values.join('~')
-                parent_key = key_values[0...(@parent_fields.length)].join('~')
-                if @lines[key]
-                    @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
-                else
-                    @index[parent_key] << key
-                    @lines[key] = line
-                end
-            end
-        end
-        # Converts an array of field names to an array of indexes of the fields
-        # matching those names.
-        def find_field_indexes(key_fields, field_names)
-            key_fields.map do |field|
-                if field.is_a?(Fixnum)
-                    field
-                else
-                    field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
-                        raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
-                            field_names.join(', ')}"
+                # When you call CSV.open, it's best to pass in a block so that after it's yielded,
+                # the underlying file handle is closed. Otherwise, you risk leaking the handle.
+                @data = CSV.open(@path, mode_string, csv_options) do |csv|
+                     csv.readlines
                 end
+            elsif source.is_a?(Enumerable) && (source.size == 0 || source.first.is_a?(Enumerable))
+                @data = source
+            else
+                raise ArgumentError, "source must be a path to a file or an Enumerable<Enumerable>"
             end
+            index_source
         end
     end
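
The comment added in this hunk notes that passing a block to `CSV.open` closes the underlying file handle once the block returns. A minimal sketch using only Ruby's standard `csv` and `tempfile` libraries shows the behaviour; the file contents and variable names are made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# Write a small throwaway CSV file to read back
file = Tempfile.new(['sample', '.csv'])
file.write("id,name\n1,Alice\n2,Bob\n")
file.close

handle = nil
# Block form: CSV.open yields the CSV object, returns the block's value,
# and closes the file when the block exits (even if it raises).
rows = CSV.open(file.path, 'r') do |csv|
  handle = csv
  csv.readlines
end

closed_after_block = handle.closed?
file.unlink
```

Without the block, `CSV.open(path).readlines` leaves the handle open until it is garbage collected, which is the leak the comment warns about.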

data/lib/csv-diff/source.rb ADDED

@@ -0,0 +1,275 @@
+class CSVDiff
+    # Represents an input (i.e. the left/from or right/to input) to the diff
+    # process.
+    class Source
+        # @return [String] the path to the source file
+        attr_accessor :path
+        # @return [Array<Array>] The data for this source
+        attr_reader :data
+        # @return [Array<String>] The names of the fields in the source file
+        attr_reader :field_names
+        # @return [Array<String>] The names of the field(s) that uniquely
+        #   identify each row.
+        attr_reader :key_fields
+        # @return [Array<String>] The names of the field(s) that identify a
+        #   common parent of child records.
+        attr_reader :parent_fields
+        # @return [Array<String>] The names of the field(s) that distinguish a
+        #   child of a parent record.
+        attr_reader :child_fields
+        # @return [Array<Fixnum>] The indexes of the key fields in the source
+        #   file.
+        attr_reader :key_field_indexes
+        # @return [Array<Fixnum>] The indexes of the parent fields in the source
+        #   file.
+        attr_reader :parent_field_indexes
+        # @return [Array<Fixnum>] The indexes of the child fields in the source
+        #   file.
+        attr_reader :child_field_indexes
+        # @return [Boolean] True if the source has been indexed with case-
+        #   sensitive keys, or false if it has been indexed using upper-case key
+        #   values.
+        attr_reader :case_sensitive
+        alias_method :case_sensitive?, :case_sensitive
+        # @return [Boolean] True if leading/trailing whitespace should be stripped
+        #   from fields
+        attr_reader :trim_whitespace
+        # @return [Hash<String,Hash>] A hash containing each line of the source,
+        #   keyed on the values of the +key_fields+.
+        attr_reader :lines
+        # @return [Hash<String,Array<String>>] A hash containing each parent key,
+        #   and an Array of the child keys it is a parent of.
+        attr_reader :index
+        # @return [Array<String>] An array of any warnings encountered while
+        #   processing the source.
+        attr_reader :warnings
+        # @return [Fixnum] A count of the lines processed from this source.
+        #   Excludes any header and duplicate records identified during indexing.
+        attr_reader :line_count
+        # @return [Fixnum] A count of the lines from this source that were skipped
+        #   due to filter conditions.
+        attr_reader :skip_count
+        # @return [Fixnum] A count of the lines from this source that had the same
+        #   key value as another line.
+        attr_reader :dup_count
+        # Creates a new diff source.
+        #
+        # A diff source must contain at least one field that will be used as the
+        # key to identify the same record in a different version of this file.
+        # If not specified via one of the options, the first field is assumed to
+        # be the unique key.
+        #
+        # If multiple fields combine to form a unique key, the parent is assumed
+        # to be identified by all but the last field of the unique key. If finer
+        # control is required, use a combination of the :parent_fields and
+        # :child_fields options.
+        #
+        # All key options can be specified either by field name, or by field
+        # index (0 based).
+        #
+        # @param options [Hash] An options hash.
+        # @option options [Array<String>] :field_names The names of each of the
+        #   fields in +source+.
+        # @option options [Boolean] :ignore_header If true, and :field_names has
+        #   been specified, then the first row of the file is ignored.
+        # @option options [String] :key_field The name of the field that uniquely
+        #   identifies each row.
+        # @option options [Array<String>] :key_fields The names of the fields
+        #   that uniquely identify each row.
+        # @option options [String] :parent_field The name of the field(s) that
+        #   identify a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field(s) that
+        #   uniquely identify a child of a parent.
+        # @option options [Boolean] :case_sensitive If true (the default), keys
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
+        def initialize(options = {})
+            if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
+               (kf = options.fetch(:key_field, options[:key_fields]))
+                @key_fields = [kf].flatten
+                @parent_fields = @key_fields[0...-1]
+                @child_fields = @key_fields[-1..-1]
+            else
+                @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
+                @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
+                @key_fields = @parent_fields + @child_fields
+            end
+            @field_names = options[:field_names]
+            @case_sensitive = options.fetch(:case_sensitive, true)
+            @trim_whitespace = options.fetch(:trim_whitespace, false)
+            @ignore_header = options[:ignore_header]
+            @include = options[:include]
+            @exclude = options[:exclude]
+            @path = options.fetch(:path, 'NA') unless @path
+            @warnings = []
+        end
+        def path?
+            @path != 'NA'
+        end
+        # Returns the row in the CSV source corresponding to the supplied key.
+        #
+        # @param key [String] The unique key to use to lookup the row.
+        # @return [Hash] The fields for the line corresponding to +key+, or nil
+        #   if the key is not recognised.
+        def [](key)
+            @lines[key]
+        end
+        # Given an array of lines, where each line is an array of fields, indexes
+        # the array contents so that it can be looked up by key.
+        def index_source
+            @lines = {}
+            @index = Hash.new{ |h, k| h[k] = [] }
+            if @field_names
+                index_fields
+                include_filter = convert_filter(@include, @field_names)
+                exclude_filter = convert_filter(@exclude, @field_names)
+            end
+            @line_count = 0
+            @skip_count = 0
+            @dup_count = 0
+            line_num = 0
+            @data.each do |row|
+                line_num += 1
+                next if line_num == 1 && @field_names && @ignore_header
+                unless @field_names
+                    if row.class.name == 'CSV::Row'
+                        @field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
+                    else
+                        @field_names = row.each_with_index.map{ |f, i| f || i.to_s }
+                    end
+                    index_fields
+                    include_filter = convert_filter(@include, @field_names)
+                    exclude_filter = convert_filter(@exclude, @field_names)
+                    next
+                end
+                field_vals = row
+                line = {}
+                filter = false
+                @field_names.each_with_index do |field, i|
+                    val = field_vals[i]
+                    val = val.to_s.strip if val && @trim_whitespace
+                    line[field] = val
+                    if include_filter && f = include_filter[i]
+                        filter = !check_filter(f, line[field])
+                    end
+                    if exclude_filter && f = exclude_filter[i]
+                        filter = check_filter(f, line[field])
+                    end
+                    break if filter
+                end
+                if filter
+                    @skip_count += 1
+                    next
+                end
+                key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
+                                                          field_vals[kf].to_s :
+                                                          field_vals[kf].to_s.upcase }
+                key = key_values.join('~')
+                parent_key = key_values[0...(@parent_fields.length)].join('~')
+                if @lines[key]
+                    @warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
+                    @dup_count += 1
+                    key += "[#{@dup_count}]"
+                end
+                @index[parent_key] << key
+                @lines[key] = line
+                @line_count += 1
+            end
+        end
+        # Save the data in this Source as a CSV at +file_path+.
+        #
+        # @param file_path [String] The target path to save the data to.
+        # @param options [Hash] A set of options to pass to CSV.open to control
+        #   how the CSV is generated.
+        def save_csv(file_path, options = {})
+            require 'csv'
+            default_opts = {
+                headers: @field_names, write_headers: true
+            }
+            CSV.open(file_path, 'wb', default_opts.merge(options)) do |csv|
+                @data.each{ |rec| csv << rec }
+            end
+        end
+        private
+        def index_fields
+            @key_field_indexes = find_field_indexes(@key_fields, @field_names)
+            @parent_field_indexes = find_field_indexes(@parent_fields, @field_names)
+            @child_field_indexes = find_field_indexes(@child_fields, @field_names)
+            @key_fields = @key_field_indexes.map{ |i| @field_names[i] }
+            @parent_fields = @parent_field_indexes.map{ |i| @field_names[i] }
+            @child_fields = @child_field_indexes.map{ |i| @field_names[i] }
+        end
+        # Converts an array of field names to an array of indexes of the fields
+        # matching those names.
+        def find_field_indexes(key_fields, field_names)
+            key_fields.map do |field|
+                if field.is_a?(Integer)
+                    field
+                else
+                    field_names.index{ |field_name| field.to_s.downcase == field_name.to_s.downcase } or
+                        raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
+                            field_names.join(', ')}"
+                end
+            end
+        end
+        def convert_filter(hsh, field_names)
+            return unless hsh
+            if !hsh.is_a?(Hash)
+                raise ArgumentError, ":include/:exclude option must be a Hash of field name(s)/index(es) to RegExp(s)"
+            end
+            keys = hsh.keys
+            idxs = find_field_indexes(keys, @field_names)
+            Hash[keys.each_with_index.map{ |k, i| [idxs[i], hsh[k]] }]
+        end
+        def check_filter(filter, field_val)
+            case filter
+            when String
+                if @case_sensitive
+                    filter == field_val
+                else
+                    filter.downcase == field_val.to_s.downcase
+                end
+            when Regexp
+                filter.match(field_val)
+            when Proc
+                filter.call(field_val)
+            else
+                raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
+            end
+        end
+    end
+end
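
The key-building step in index_source above can be sketched in isolation. This is a minimal illustration, not the gem's API: build_index and its parameters are hypothetical names, and unlike the original it has no header handling, so line numbers in warnings count data rows only.

```ruby
# Hypothetical helper mirroring the key-building steps of index_source.
# rows:         Array of Arrays (one inner Array per record)
# key_indexes:  positions of the key fields, parent fields first
# parent_count: how many of the leading key fields identify the parent
def build_index(rows, key_indexes, parent_count, case_sensitive: true)
  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0

  rows.each_with_index do |row, i|
    # Key values are upper-cased when comparisons are case-insensitive
    key_values = key_indexes.map do |kf|
      val = row[kf].to_s
      case_sensitive ? val : val.upcase
    end
    key = key_values.join('~')
    parent_key = key_values[0...parent_count].join('~')

    # A repeated key is kept, but gets a numeric suffix and a warning
    if lines[key]
      warnings << "Duplicate key '#{key}' encountered at line #{i + 1}"
      dup_count += 1
      key += "[#{dup_count}]"
    end

    index[parent_key] << key
    lines[key] = row
  end

  [lines, index, warnings]
end

rows = [
  ['Sales', 'alice', '10'],
  ['Sales', 'bob',   '20'],
  ['sales', 'Alice', '30']
]
lines, index, warnings = build_index(rows, [0, 1], 1, case_sensitive: false)
```

With case_sensitive: false, the third row collides with the first, so it is stored under a suffixed key and recorded in warnings rather than overwriting the earlier row.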

data/lib/csv-diff/xml_source.rb ADDED

@@ -0,0 +1,142 @@
+require 'nokogiri'
+require 'cgi'
+class CSVDiff
+    # Convert XML content to CSV format using XPath selectors to identify the
+    # rows and field values in an XML document
+    class XMLSource < Source
+        attr_accessor :context
+        # Create a new XMLSource, identified by +path+. Normally this is a path
+        # to the XML document, but any value is fine, as it is really just a label
+        # to identify this data set.
+        #
+        # @param path [String] A label for this data set (often a path to the
+        #   XML document used as the source).
+        # @param options [Hash] An options hash.
+        # @option options [Array<String>] :field_names The names of each of the
+        #   fields in +source+.
+        # @option options [Boolean] :ignore_header If true, and :field_names has
+        #   been specified, then the first row of the file is ignored.
+        # @option options [String] :key_field The name of the field that uniquely
+        #   identifies each row.
+        # @option options [Array<String>] :key_fields The names of the fields
+        #   that uniquely identify each row.
+        # @option options [String] :parent_field The name of the field(s) that
+        #   identify a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field(s) that
+        #   uniquely identify a child of a parent.
+        # @option options [Boolean] :case_sensitive If true (the default), keys
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
+        # @option options [String] :context A context value from which fields
+        #   can be populated using a Regexp.
+        def initialize(path, options = {})
+            super(options)
+            @path = path
+            @context = options[:context]
+            @data = []
+        end
+        # Process a +source+, converting the XML into a table of data, using
+        # +rec_xpath+ to identify the nodes that correspond to each record that
+        # should appear in the output, and +field_maps+ to populate each field
+        # in each row.
+        #
+        # @param source [String|Array] may be a String containing XML content,
+        #   an Array of paths to files containing XML content, or a path to
+        #   a single file.
+        # @param rec_xpath [String] An XPath expression that selects all the
+        #   items in the XML document that are to be converted into new rows.
+        #   The returned items are not directly used to populate the fields,
+        #   but provide a context for the field XPath expressions that populate
+        #   each field's content.
+        # @param field_maps [Hash<String, String>] A map of field names to
+        #   expressions that are evaluated in the context of each row node
+        #   selected by +rec_xpath+. The field expressions are typically XPath
+        #   expressions evaluated in the context of the nodes returned by the
+        #   +rec_xpath+. Alternatively, a String that is not an XPath expression
+        #   is used as a literal value for a field, while a Regexp can also
+        #   be used to pull a value from any context specified in the +options+
+        #   hash. The Regexp should include a single grouping, as the value used
+        #   will be the result in $1 after the match is performed.
+        # @param context [String] An optional context for the XML to be processed.
+        #   The value passed here can be referenced in field map expressions
+        #   using a Regexp, with the value of the first grouping in the regex
+        #   being the value returned for the field.
+        def process(source, rec_xpath, field_maps, context = nil)
+            @field_names = field_maps.keys unless @field_names
+            case source
+            when Nokogiri::XML::Document
+                add_data(source, rec_xpath, field_maps, context || @context)
+            when /<\?xml/
+                doc = Nokogiri::XML(source)
+                add_data(doc, rec_xpath, field_maps, context || @context)
+            when Array
+                source.each{ |f| process_file(f, rec_xpath, field_maps) }
+            when String
+                process_file(source, rec_xpath, field_maps)
+            else
+                raise ArgumentError, "Unhandled source type #{source.class.name}"
+            end
+            @data
+        end
+        private
+        # Load the XML document at +file_path+ and process it into rows of data.
+        def process_file(file_path, rec_xpath, field_maps)
+            begin
+                File.open(file_path) do |f|
+                    doc = Nokogiri::XML(f)
+                    add_data(doc, rec_xpath, field_maps, @context || file_path)
+                end
+            rescue
+                STDERR.puts "An error occurred while attempting to open #{file_path}"
+                raise
+            end
+        end
+        # Locate records in +doc+ using +rec_xpath+ to identify the nodes that
+        # correspond to a new record in the data, and +field_maps+ to populate
+        # the fields in each row.
+        def add_data(doc, rec_xpath, field_maps, context)
+            doc.xpath(rec_xpath).each do |rec_node|
+                rec = []
+                field_maps.each do |field_name, expr|
+                    case expr
+                    when Regexp         # Match context against Regexp and extract first grouping
+                        if context
+                            context =~ expr
+                            rec << $1
+                        else
+                            rec << nil
+                        end
+                    when %r{[/(.@]}     # XPath expression
+                        res = rec_node.xpath(expr)
+                        rec << CGI.unescape_html(res.to_s)
+                    else                # Use expr as the value for this field
+                        rec << expr
+                    end
+                end
+                @data << rec
+            end
+        end
+    end
+end
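
The way add_data above decides how to treat each field_maps expression can be shown without Nokogiri. This is a rough sketch under stated assumptions: classify_field_expr and extract_from_context are hypothetical helpers written for illustration, not methods of XMLSource.

```ruby
# Hypothetical helper: reports which branch of add_data's case statement
# a field expression would take.
def classify_field_expr(expr)
  case expr
  when Regexp      then :context_regexp  # matched against the context value
  when %r{[/(.@]}  then :xpath           # String containing /, (, . or @
  else                  :literal         # used as the field value as-is
  end
end

# Hypothetical helper: returns the first capture group of regexp matched
# against context, or nil when there is no context or no match.
def extract_from_context(context, regexp)
  return nil unless context
  match = regexp.match(context)
  match && match[1]
end

kinds = {
  regexp:  classify_field_expr(/(\d{4})-\d{2}/),
  path:    classify_field_expr('./name'),
  attr:    classify_field_expr('@id'),
  literal: classify_field_expr('Active'),
  dotted:  classify_field_expr('v1.0')
}
year = extract_from_context('reports/2020-07.xml', /(\d{4})-\d{2}/)
```

One consequence of the character test is that a literal containing a period, such as 'v1.0', is classified as an XPath expression rather than a literal value.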

metadata CHANGED

@@ -1,27 +1,44 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: '0.2'
+  version: 0.5.0
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-08-11 00:00:00.000000000 Z
+date: 2020-07-15 00:00:00.000000000 Z
 dependencies: []
-description: ! "        This library performs diffs of CSV files.\n\n        Unlike
-  a standard diff that compares line by line, and is sensitive to the\n        ordering
-  of records, CSV-Diff identifies common lines by key field(s), and\n        then
-  compares the contents of the fields in each line.\n\n        Data may be supplied
-  in the form of CSV files, or as an array of arrays. The\n        diff process provides
-  a fine level of control over what to diff, and can\n        optionally ignore certain
-  types of changes (e.g. changes in position).\n\n        CSV-Diff is particularly
-  well suited to data in parent-child format. Parent-\n        child data does not
-  lend itself well to standard text diffs, as small changes\n        in the organisation
-  of the tree at an upper level can lead to big movements\n        in the position
-  of descendant records. By instead matching records by key,\n        CSV-Diff avoids
-  this issue, while still being able to detect changes in\n        sibling order.\n"
+description: |2
+          This library performs diffs of CSV data, or any table-like source.
+          Unlike a standard diff that compares line by line, and is sensitive to the
+          ordering of records, CSV-Diff identifies common lines by key field(s), and
+          then compares the contents of the fields in each line.
+          Data may be supplied in the form of CSV files, or as an array of arrays. The
+          diff process provides a fine level of control over what to diff, and can
+          optionally ignore certain types of changes (e.g. changes in position).
+          CSV-Diff is particularly well suited to data in parent-child format. Parent-
+          child data does not lend itself well to standard text diffs, as small changes
+          in the organisation of the tree at an upper level can lead to big movements
+          in the position of descendant records. By instead matching records by key,
+          CSV-Diff avoids this issue, while still being able to detect changes in
+          sibling order.
+          This gem implements the core diff algorithm, and handles the loading and
+          diffing of CSV files (or Arrays of Arrays). It also supports converting
+          data in XML format into tabular form, so that it can then be processed
+          like any other CSV or table-like source.  It returns a CSVDiff object
+          containing the details of differences in object form. This is useful for
+          projects that need diff capability, but want to handle the reporting or
+          actioning of differences themselves.
+          For a pre-built diff reporting capability, see the csv-diff-report gem,
+          which provides a command-line tool for generating diff reports in HTML,
+          Excel, or text formats.
 email: adam.b.gardiner@gmail.com
 executables: []
 extensions: []
@@ -33,9 +50,12 @@ files:
 - lib/csv-diff/algorithm.rb
 - lib/csv-diff/csv_diff.rb
 - lib/csv-diff/csv_source.rb
+- lib/csv-diff/source.rb
+- lib/csv-diff/xml_source.rb
 - lib/csv_diff.rb
 homepage: https://github.com/agardiner/csv-diff
-licenses: []
+licenses:
+- MIT
 metadata: {}
 post_install_message: For command-line tools and diff reports, 'gem install csv-diff-report'
 rdoc_options: []
@@ -43,18 +63,18 @@ require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - ! '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - ! '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.1
+rubygems_version: 2.5.2.3
 signing_key:
 specification_version: 4
-summary: CSV Diff is a library for generating diffs from data in CSV format
+summary: CSV Diff is a library for generating diffs from data in CSV or XML format
 test_files: []