RubyGems - csv-diff - Versions diffs - 0.3.0 → 0.6.0 - Mend

csv-diff 0.3.0 → 0.6.0

Files changed (10)

checksums.yaml +5 -13
data/LICENSE +1 -1
data/README.md +112 -26
data/lib/csv-diff.rb +1 -0
data/lib/csv-diff/algorithm.rb +133 -46
data/lib/csv-diff/csv_diff.rb +12 -10
data/lib/csv-diff/csv_source.rb +19 -106
data/lib/csv-diff/source.rb +278 -0
data/lib/csv-diff/xml_source.rb +142 -0
metadata +39 -26

checksums.yaml CHANGED

@@ -1,15 +1,7 @@
 ---
-!binary "U0hBMQ==":
-  metadata.gz: !binary |-
-    MzM3MmMzNzU5ZDE5ZTA5MGI4OGQxNTBkMTE0NWM1MzYzNDVkYTFjYQ==
-  data.tar.gz: !binary |-
-    MDQ3Yzk3ZDc4ZTZiNTMwMzc1NGMxMTU5ZTBkMzdjMTMyNzE5OTYwMg==
+SHA1:
+  metadata.gz: 9dde7ded89bb58f75505ae9237c97b8acd365c42
+  data.tar.gz: adcf17af6b67797c9b5fbab80b6c7d421cd73d6e
 SHA512:
-  metadata.gz: !binary |-
-    YTYxMDM1MjUxNDk3ZjE3YWUxZWJjNGFmMzQzMDUyNGJlOGUzMmI2MDVlNjg4
-    OWJiYmQxNTI2MjBlMmQzNGFkZTk0ZGY1Y2I0ZDBkODljYjc4NDQ1ZDY1ODky
-    ZDU4MDUwZjVjYWU0MTE5NTExMmExYTM0NTY1ZGMzYzc5NDkxYmE=
-  data.tar.gz: !binary |-
-    YzU4YWM2MmZjMjE4MjlhZDgxN2IxNmI1NmU2YjFiZTcwN2ZlNTZlYzE5MzZm
-    NGJkZDNkODUzNGNlNjA2NGZlZWIyMWFmMjljOTQyNjE3OGU4YWFmY2UyZWE2
-    YzhlZmI0YTcxMGY3N2Y2ODQ5NTM3ODcyMTUzOTJkODhjNjY0YWM=
+  metadata.gz: 1aad2d1174758488d1e984239ae38917211bf1e185d2b37350a402e145bde652b44e437774f2695bb33c2b48531637f6c9ed13fb5fbaa0b610ac9d7810ec16ed
+  data.tar.gz: 3aab02344cfa4f111433c616caa445f43d3f4515c65e8ef7175f2287996d8f6bc592e1daca5683bef5d561e21f6f23e954a1115854577cced4551582ec4fa293
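
Note on the checksums.yaml change: the file layout is unchanged, only the encoding of keys and digests differs. The 0.3.0 file stores them as Base64 `!binary` scalars, while 0.6.0 stores plain hex strings. A minimal sketch, using only core Ruby (`String#unpack1` with the `m` directive), decodes two of the removed values to show what they hold. The variable names are illustrative only.

```ruby
# Base64 values copied from the removed (0.3.0) lines of checksums.yaml.
old_key      = 'U0hBMQ=='
old_metadata = 'MzM3MmMzNzU5ZDE5ZTA5MGI4OGQxNTBkMTE0NWM1MzYzNDVkYTFjYQ=='

# The 'm' directive decodes Base64 without loading the base64 library.
decoded_key      = old_key.unpack1('m')
decoded_metadata = old_metadata.unpack1('m')

puts decoded_key              # the digest algorithm name
puts decoded_metadata         # a hex digest string
puts decoded_metadata.length  # number of hex characters
```

The digests themselves differ between the two versions, as expected for two different gem builds.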

data/LICENSE CHANGED

@@ -1,4 +1,4 @@
-Copyright (c) 2013, Adam Gardiner
+Copyright (c) 2013-2016, Adam Gardiner
 All rights reserved.
 Redistribution and use in source and binary forms, with or without

data/README.md CHANGED

@@ -1,14 +1,19 @@
 # CSV-Diff
-CSV-Diff is a small library for performing diffs of CSV data.
+CSV-Diff is a small library for performing diffs of tabular data, typically
+data loaded from CSV files.
 Unlike a standard diff that compares line by line, and is sensitive to the
 ordering of records, CSV-Diff identifies common lines by key field(s), and
 then compares the contents of the fields in each line.
-Data may be supplied in the form of CSV files, or as an array of arrays. The
-diff process provides a fine level of control over what to diff, and can
-optionally ignore certain types of changes (e.g. changes in position).
+Data may be supplied in the form of CSV files, or as an array of arrays.
+More complex usage also allows you to specify XPath expressions to extract
+tabular data from XML documents for diffing.
+The diff process provides a fine level of control over what to diff, and can
+optionally ignore certain types of changes (e.g. adds, deletes, changes in
+position etc).
 CSV-Diff is particularly well suited to data in parent-child format. Parent-
 child data does not lend itself well to standard text diffs, as small changes
@@ -21,17 +26,19 @@ sibling order.
 ## Usage
 CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
-```
-gem install csv-diff
-```
+    ```
+    gem install csv-diff
+    ```
 To compare two CSV files where the field names are in the first row of the file,
 and the first field contains the unique key for each record, simply use:
-```ruby
-require 'csv-diff'
-diff = CSVDiff.new(file1, file2)
-```
+    ```ruby
+    require 'csv-diff'
+    diff = CSVDiff.new(file1, file2)
+    ```
 The returned diff object can be queried for the differences that exist between
 the two files, e.g.:
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
 The more correct specification of this file is that column 0 contains a unique parent
 identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
-deduce that there is in fact only two changes in order - the swap in positions of A and
+deduce that there are in fact only two changes in order - the swap in positions of A and
 B below Root.
 Note: If you aren't interested in changes in the order of siblings, then you could use
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
 The simplest use case is as shown above, where the data to be diffed is in CSV files
 with the column names as the first record, and where the unique key is the first
 column in the data. In this case, a diff can be created simply via:
-```ruby
-diff = CSVDiff.new(file1, file2)
-```
-### Specifynig Unique Row Identifiers
+    ```ruby
+    diff = CSVDiff.new(file1, file2)
+    ```
+### Specifying Unique Row Identifiers
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
-necessary to uniquely identify a row. In these cases, it is necessary to indicate
-which fields are used to uniquely identify common rows across the two files. This
-can be done in several different ways.
+necessary to uniquely identify a row, while in other cases a combination of fields
+may be needed to derive a natural unique key or identifier for each row.
+In these cases, it is necessary to indicate to CSVDiff which fields are needed to
+uniquely identify common rows across the two files. This can be done in several
+different ways.
+#### :key_field(s)
-1. Using the :key_fields option with field numbers (these are 0-based):
+The first method is using the **key_fields** option (or key_field if you have only a
+single key field). Use this option when your data represents a flat structure rather
+than a parent-child hierarchy or flattened tree. You can specify key_fields using
+field numbers/column indices (0-based):
     ```ruby
     diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
     ```
-2. Using the :key_fields options with column names:
+Alternatively, you can use the :key_fields options with column names (provided CSVDiff
+knows the names of your fields, either via the **field_names** option or from headers
+in the file):
     ```ruby
-    diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
+    diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
     ```
-3. Using the :parent_fields and :child_fields with field numbers:
+#### :parent_field(s)/:child_field(s)
+The second method for identifying the unique identifiers in your file is to use the
+:parent_fields and :child_fields options. Use this option when your data represents
+a tree structure flattened to a table in parent-child form.
+Using the :parent_fields and :child_fields with field numbers:
     ```ruby
     diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
     ```
-4. Using the :parent_fields and :child_fields with column names:
+Using the :parent_fields and :child_fields with column names:
     ```ruby
     diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
     ```
-### Using Non-CSV File Sources
+### Using Non-CSV Sources
 Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
 of Arrays:
@@ -174,7 +197,53 @@ DATA2 = [
     ['A', 'A2', 'Account2']
 ]
-diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
+diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
+```
+Data can also be diffed if it is an XML source, although this requires a little
+more effort to tell CSVDiff how to transform/extract content from the XML document
+into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
+will need to install this gem to use CSVDiff with XML sources.
+The first step is to use the CSVDiff::XMLSource class to define how to convert
+your XML content to an array-of-arrays. The XMLSource class is quite flexible,
+and can be used to convert single or multiple XML sources into a single data set
+for diffing, and different documents may even have different layouts.
+The first step is to create an XMLSource object, which requires a label to
+identify the type of data it will generate:
+```ruby
+xml_source1 = CSVDiff::XMLSource.new('My Label')
+```
+Next, we pass XML documents to this source, and specify XPath expressions for each
+row and column of data to produce via the `process(rec_xpath, field_maps, options)`
+method:
+* An XPath expression is provided to select each node value in the document that
+  will represent a row. Taking an HTML table as an example of something we wanted
+  to parse, your rec_xpath value might be something like the following:
+  `'//table/tbody/tr'`. This would locate all tables in the document, and create
+  a new row of data in the XMLSource every time a `<tr>` tag was encountered.
+* A hash of field_maps is then provided to describe how to generate column values
+  for each row of data. The keys to field_maps are the names of the fields to be
+  output, while the values are the expression for how to generate values. Most
+  commonly, this will be another XPath expression that is evaluated in the context
+  of the node returned by the row XPath expression. So continuing our HTML example,
+  we might use `'./td[0]/text()'` as an expression to select the content of the
+  first `<td>` element within the `<tr>` representing the current row.
+```ruby
+xml_source1.process('//table/tbody/tr',
+                    col_A: './td[0]/text()',
+                    col_B: './td[1]/text()',
+                    col_C: './td[2]/text()')
+```
+Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
+objects as the source:
+```ruby
+diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
 ```
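
A caution on the XPath expressions in the README text of this hunk: XPath positional predicates are 1-based, so `./td[1]` selects the first `<td>` and `./td[0]` is not expected to match anything. The sketch below checks the 1-based form. It uses REXML rather than Nokogiri and does not call `CSVDiff::XMLSource` at all; both the choice of parser and the sample table are assumptions made for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample document; not taken from the gem.
xml = <<~XML
  <table>
    <tbody>
      <tr><td>A1</td><td>B1</td><td>C1</td></tr>
      <tr><td>A2</td><td>B2</td><td>C2</td></tr>
    </tbody>
  </table>
XML

doc = REXML::Document.new(xml)

# One row per <tr>; one column per 1-based <td> position.
rows = REXML::XPath.match(doc, '//table/tbody/tr').map do |tr|
  (1..3).map { |i| REXML::XPath.first(tr, "./td[#{i}]/text()").value }
end

rows.each { |row| puts row.join(',') }
```

If the same indexing applies when the expressions are evaluated through Nokogiri, the field maps in the README example would need `td[1]`, `td[2]` and `td[3]`.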
 ### Specifying Column Names
@@ -211,6 +280,23 @@ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam'
                    ignore_fields: ['CreatedAt', 'UpdatedAt'])
 ```
+### Filtering Rows
+If you need to filter source data before running the diff process, you can use the :include
+and :exclude options to do so. Both options take a Hash as their value; the hash should have
+keys that are the field names or indexes (0-based) on which to filter, and whose values are
+regular expressions or lambdas to be applied to values of the corresponding field. Rows will
+only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
+```ruby
+# Generate a diff of Arsenal home games not refereed by Clattenburg
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
+# Generate a diff of games played over the Xmas/New Year period
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
+```
 ### Ignoring Certain Changes
 CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
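
The Filtering Rows section added to the README describes the rule "keep a row only if it satisfies every :include condition and no :exclude condition", with conditions given as regular expressions or lambdas (the README example also passes a plain String). The sketch below illustrates that rule on hash rows. `row_selected?` and its keyword names are hypothetical and are not the gem's implementation, which lives in source.rb; field indexes are not handled here.

```ruby
# Hypothetical helper illustrating the documented :include/:exclude rule.
# Case equality (===) covers all three condition types:
# Regexp (match), Proc (call) and String (equality).
def row_selected?(row, includes: {}, excludes: {})
  matches = ->(field, cond) { cond === row[field.to_s].to_s }
  includes.all? { |field, cond| matches.call(field, cond) } &&
    excludes.none? { |field, cond| matches.call(field, cond) }
end

# Made-up sample rows.
rows = [
  { 'HomeTeam' => 'Arsenal', 'AwayTeam' => 'Chelsea', 'Referee' => 'M Clattenburg' },
  { 'HomeTeam' => 'Arsenal', 'AwayTeam' => 'Everton', 'Referee' => 'M Dean' },
  { 'HomeTeam' => 'Fulham',  'AwayTeam' => 'Arsenal', 'Referee' => 'M Dean' }
]

kept = rows.select do |row|
  row_selected?(row, includes: { HomeTeam: 'Arsenal' },
                     excludes: { Referee: /Clattenburg/ })
end

kept.each { |row| puts row.values.join(',') }
```

Note that the second README example relies on a `holiday_period` object that the snippet never defines; a Range of `Date` values would be one way to supply it.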

data/lib/csv-diff.rb CHANGED

@@ -1,3 +1,4 @@
+require 'csv-diff/source'
 require 'csv-diff/csv_source'
 require 'csv-diff/algorithm'
 require 'csv-diff/csv_diff'

data/lib/csv-diff/algorithm.rb CHANGED

@@ -3,6 +3,55 @@ class CSVDiff
     # Implements the CSV diff algorithm.
     module Algorithm
+        # Holds the details of a single difference
+        class Diff
+            attr_accessor :diff_type
+            attr_reader :fields
+            attr_reader :row
+            attr_reader :sibling_position
+            def initialize(diff_type, fields, row_idx, pos_idx)
+                @diff_type = diff_type
+                @fields = fields
+                @row = row_idx + 1
+                self.sibling_position = pos_idx
+            end
+            def sibling_position=(pos_idx)
+                if pos_idx.is_a?(Array)
+                    pos_idx.compact!
+                    if pos_idx.first != pos_idx.last
+                        @sibling_position = pos_idx.map{ |pos| pos + 1 }
+                    else
+                        @sibling_position = pos_idx.first + 1
+                    end
+                else
+                    @sibling_position = pos_idx + 1
+                end
+            end
+            # For backwards compatibility and access to fields with differences
+            def [](key)
+                case key
+                when :action
+                    a = diff_type.to_s
+                    a[0] = a[0].upcase
+                    a
+                when :row
+                    @row
+                when :sibling_position
+                    @sibling_position
+                else
+                    @fields[key]
+                end
+            end
+        end
         # Diffs two CSVSource structures.
         #
         # @param left [CSVSource] A CSVSource object containing the contents of
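
The new `Diff` class stores row and sibling positions 1-based, and collapses a `[from, to]` pair to a single number when both ends are the same. The sketch below restates the `sibling_position=` logic from this hunk as a standalone function so the cases are easy to see. `normalise_position` is a hypothetical name, and unlike the original (which calls `compact!`) it does not modify the array passed in.

```ruby
# Standalone restatement of Diff#sibling_position= (hypothetical helper).
def normalise_position(pos_idx)
  return pos_idx + 1 unless pos_idx.is_a?(Array)

  pos = pos_idx.compact # drop a missing left or right index
  pos.first == pos.last ? pos.first + 1 : pos.map { |p| p + 1 }
end

p normalise_position(2)        # 3
p normalise_position([0, 3])   # [1, 4]
p normalise_position([2, 2])   # 3
p normalise_position([nil, 4]) # 5

# The :action value returned by Diff#[] is the capitalised diff type,
# which keeps the 0.3.0-style strings ('Add', 'Update', ...) available.
p :update.to_s.capitalize      # "Update"
```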
@@ -22,28 +71,61 @@ class CSVDiff
         #  items that exist in both +left+ and +right+.
         # @option options [Boolean] :ignore_deletes If set to true, we ignore any
         #  new items that appear only in +left+.
+        # @option options [Hash<Object,Proc>] :equality_procs A Hash mapping fields
+        #  to a 2-arg Proc that should be used to compare values in that field for
+        #  equality.
         def diff_sources(left, right, key_fields, diff_fields, options = {})
             unless left.case_sensitive? == right.case_sensitive?
                 raise ArgumentError, "Left and right must have same settings for case-sensitivity"
             end
-            case_sensitive = left.case_sensitive?
+            unless left.parent_fields.length == right.parent_fields.length
+                raise ArgumentError, "Left and right must have same settings for parent/child fields"
+            end
+            # Ensure key fields are not also in the diff_fields
+            diff_fields = diff_fields - key_fields
             left_index = left.index
             left_values = left.lines
             left_keys = left_values.keys
             right_index = right.index
             right_values = right.lines
             right_keys = right_values.keys
-            parent_fields = left.parent_fields.length
+            parent_field_count = left.parent_fields.length
             include_adds = !options[:ignore_adds]
             include_moves = !options[:ignore_moves]
             include_updates = !options[:ignore_updates]
             include_deletes = !options[:ignore_deletes]
-            diffs = Hash.new{ |h, k| h[k] = {} }
+            @case_sensitive = left.case_sensitive?
+            @equality_procs = options.fetch(:equality_procs, {})
+            diffs = {}
+            potential_moves = Hash.new{ |h, k| h[k] = [] }
+            # First identify deletions
+            if include_deletes
+                (left_keys - right_keys).each do |key|
+                    # Delete
+                    key_vals = key.split('~', -1)
+                    parent = key_vals[0...parent_field_count].join('~')
+                    child = key_vals[parent_field_count..-1].join('~')
+                    left_parent = left_index[parent]
+                    left_value = left_values[key]
+                    row_idx = left_keys.index(key)
+                    sib_idx = left_parent.index(key)
+                    raise "Can't locate key #{key} in parent #{parent}" unless sib_idx
+                    diffs[key] = Diff.new(:delete, left_value, row_idx, sib_idx)
+                    potential_moves[child] << key
+                    #puts "Delete: #{key}"
+                end
+            end
+            # Now identify adds/updates
             right_keys.each_with_index do |key, right_row_id|
-                key_vals = key.split('~')
-                parent = key_vals[0...parent_fields].join('~')
+                key_vals = key.split('~', -1)
+                parent = key_vals[0...parent_field_count].join('~')
                 left_parent = left_index[parent]
                 right_parent = right_index[parent]
                 left_value = left_values[key]
@@ -51,13 +133,12 @@ class CSVDiff
                 left_idx = left_parent && left_parent.index(key)
                 right_idx = right_parent && right_parent.index(key)
-                id = {}
-                id[:row] = right_row_id + 1
-                id[:sibling_position] = right_idx + 1
-                key_fields.each do |field_name|
-                    id[field_name] = right_value[field_name]
-                end
                 if left_idx && right_idx
+                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
+                        id = id_fields(key_fields, right_value)
+                        diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                        #puts "Change: #{key}"
+                    end
                     if include_moves
                         left_common = left_parent & right_parent
                         right_common = right_parent & left_parent
@@ -65,42 +146,34 @@ class CSVDiff
                         right_pos = right_common.index(key)
                         if left_pos != right_pos
                             # Move
-                            diffs[key].merge!(id.merge!(:action => 'Move',
-                                              :sibling_position => [left_idx + 1, right_idx + 1]))
+                            if d = diffs[key]
+                                d.sibling_position = [left_idx, right_idx]
+                            else
+                                id = id_fields(key_fields, right_value)
+                                diffs[key] = Diff.new(:move, id, right_row_id, [left_idx, right_idx])
+                            end
                             #puts "Move #{left_idx} -> #{right_idx}: #{key}"
                         end
                     end
-                    if include_updates && (changes = diff_row(left_value, right_value, diff_fields, case_sensitive))
-                        diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
-                        #puts "Change: #{key}"
-                    end
-                elsif include_adds && right_idx
+                elsif right_idx
                     # Add
-                    diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
-                    #puts "Add: #{key}"
-                end
-            end
-            # Now identify deletions
-            if include_deletes
-                (left_keys - right_keys).each do |key|
-                    # Delete
-                    key_vals = key.split('~')
-                    parent = key_vals[0...parent_fields].join('~')
-                    left_parent = left_index[parent]
-                    left_value = left_values[key]
-                    left_idx = left_parent.index(key)
-                    next unless left_idx
-                    id = {}
-                    id[:row] = left_keys.index(key) + 1
-                    id[:sibling_position] = left_idx + 1
-                    key_fields.each do |field_name|
-                        id[field_name] = left_value[field_name]
+                    child = key_vals[parent_field_count..-1].join('~')
+                    if potential_moves.has_key?(child) && old_key = potential_moves[child].pop
+                        diffs.delete(old_key)
+                        if include_updates
+                            left_value = left_values[old_key]
+                            id = id_fields(right.child_fields, right_value)
+                            changes = diff_row(left_value, right_value, left.parent_fields + diff_fields)
+                            diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
+                            #puts "Update Parent: #{key}"
+                        end
+                    elsif include_adds
+                        diffs[key] = Diff.new(:add, right_value, right_row_id, right_idx)
+                        #puts "Add: #{key}"
                     end
-                    diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
-                    #puts "Delete: #{key}"
                 end
             end
             diffs
         end
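
Two details of the rewritten `diff_sources` are easy to miss. First, `split('~', -1)` keeps trailing empty fields, so a key whose last field is blank still yields the right number of parts. Second, deletions are now found first and remembered by child key, so that an "add" with the same child key under a different parent can be reported as an update instead of a delete plus an add. The sketch below illustrates both with made-up keys; `child_of` and the sample data are assumptions, not code from the gem.

```ruby
# 1. Effect of the -1 limit on a key whose last field is empty.
key = 'ROOT~A~'
p key.split('~')     # ["ROOT", "A"]
p key.split('~', -1) # ["ROOT", "A", ""]

# 2. Pairing a delete with an add that shares the same child key.
parent_field_count = 1
left_keys  = ['ROOT~A', 'ROOT~B']
right_keys = ['ROOT~A', 'A~B'] # B has moved from parent ROOT to parent A

child_of = ->(k) { k.split('~', -1)[parent_field_count..-1].join('~') }

potential_moves = Hash.new { |h, k| h[k] = [] }
(left_keys - right_keys).each { |k| potential_moves[child_of.call(k)] << k }

reparented = (right_keys - left_keys).map do |k|
  child = child_of.call(k)
  old_key = potential_moves.key?(child) ? potential_moves[child].pop : nil
  [old_key, k] if old_key
end.compact

p reparented # [["ROOT~B", "A~B"]]
```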
@@ -113,27 +186,41 @@ class CSVDiff
         # @param right_row [Hash] The version of the CSV row from the right/to
         #   file.
         # @param fields [Array<String>] An array of field names to compare.
-        # @param case_sensitive [Boolean] Whether field comparisons should be
-        #   case sensitive or not.
         # @return [Hash<String, Array>] A Hash whose keys are the fields that
         #   contain differences, and whose values are a two-element array of
         #   [left/from, right/to] values.
-        def diff_row(left_row, right_row, fields, case_sensitive)
+        def diff_row(left_row, right_row, fields)
             diffs = {}
             fields.each do |attr|
+                eq_proc = @equality_procs[attr]
                 right_val = right_row[attr]
                 right_val = nil if right_val == ""
                 left_val = left_row[attr]
                 left_val = nil if left_val == ""
-                if (case_sensitive && left_val != right_val) ||
-                   (left_val.to_s.upcase != right_val.to_s.upcase)
+                if eq_proc
+                    diffs[attr] = [left_val, right_val] unless eq_proc.call(left_val, right_val)
+                elsif @case_sensitive
+                    diffs[attr] = [left_val, right_val] unless left_val == right_val
+                elsif (left_val.to_s.upcase != right_val.to_s.upcase)
                     diffs[attr] = [left_val, right_val]
-                    #puts "#{attr}: #{left_val} -> #{right_val}"
                 end
             end
             diffs if diffs.size > 0
         end
+        private
+        # Return a hash containing just the key field values
+        def id_fields(key_fields, fields)
+            id = {}
+            key_fields.each do |field_name|
+                id[field_name] = fields[field_name]
+            end
+            id
+        end
     end
 end
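
The new `:equality_procs` option lets a caller override how individual fields are compared. The sketch below restates the comparison order used by the new `diff_row` (custom proc first, then case-sensitive equality, then upper-cased comparison) as a standalone method. Passing the settings as keyword arguments is an assumption made here to avoid instance state; the field names and values are made up.

```ruby
# Standalone restatement of the new diff_row comparison order.
def diff_row(left_row, right_row, fields, equality_procs: {}, case_sensitive: true)
  diffs = {}
  fields.each do |attr|
    left_val  = left_row[attr]
    right_val = right_row[attr]
    left_val  = nil if left_val == ''  # empty string and nil compare equal
    right_val = nil if right_val == ''

    eq_proc = equality_procs[attr]
    equal =
      if eq_proc
        eq_proc.call(left_val, right_val)
      elsif case_sensitive
        left_val == right_val
      else
        left_val.to_s.upcase == right_val.to_s.upcase
      end
    diffs[attr] = [left_val, right_val] unless equal
  end
  diffs unless diffs.empty?
end

left  = { 'Name' => 'Widget', 'Price' => '1.50', 'Note' => '' }
right = { 'Name' => 'widget', 'Price' => '1.5',  'Note' => nil }
procs = { 'Price' => ->(l, r) { (l.to_f - r.to_f).abs < 0.005 } }

strict  = diff_row(left, right, %w[Name Price Note], equality_procs: procs)
relaxed = diff_row(left, right, %w[Name Price Note], equality_procs: procs,
                                                    case_sensitive: false)
p strict  # {"Name"=>["Widget", "widget"]}
p relaxed # nil
```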

data/lib/csv-diff/csv_diff.rb CHANGED

@@ -81,13 +81,15 @@ class CSVDiff
     # @option options [Boolean] :ignore_deletes If true, records that appear
     #   in the left/from file but not in the right/to file are not reported.
     def initialize(left, right, options = {})
-        @left = left.is_a?(CSVSource) ? left : CSVSource.new(left, options)
+        @left = left.is_a?(Source) ? left : CSVSource.new(left, options)
+        @left.index_source if @left.lines.nil?
         raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
-        @right = right.is_a?(CSVSource) ? right : CSVSource.new(right, options)
+        @right = right.is_a?(Source) ? right : CSVSource.new(right, options)
+        @right.index_source if @right.lines.nil?
         raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
         @warnings = []
         @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options)
-        @key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
+        @key_fields = @left.key_fields
         diff(options)
     end
@@ -95,8 +97,8 @@ class CSVDiff
     # Performs a diff with the specified +options+.
     def diff(options = {})
         @summary = nil
-        @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
         @options = options
+        @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
     end
@@ -138,20 +140,20 @@ class CSVDiff
     # Given two sets of field names, determines the common set of fields present
     # in both, on which members can be diffed.
     def get_diff_fields(left_fields, right_fields, options)
-        ignore_fields = (options[:ignore_fields] || []).map do |f|
-            f.is_a?(Fixnum) ? right_fields[f] : f
+        ignore_fields = options.fetch(:ignore_fields, [])
+        ignore_fields = [ignore_fields] unless ignore_fields.is_a?(Array)
+        ignore_fields.map! do |f|
+            (f.is_a?(Numeric) ? right_fields[f] : f).upcase
         end
         diff_fields = []
         if options[:diff_common_fields_only]
             right_fields.each_with_index do |fld, i|
                 if left_fields.include?(fld)
-                    diff_fields << fld unless ignore_fields.include?(fld)
-                else
-                    @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
+                    diff_fields << fld unless ignore_fields.include?(fld.upcase)
                 end
             end
         else
-            diff_fields = (right_fields + left_fields).uniq.reject{ |fld| ignore_fields.include?(fld) }
+            diff_fields = (right_fields + left_fields).uniq.reject{ |fld| ignore_fields.include?(fld.upcase) }
         end
         diff_fields
     end
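
In `get_diff_fields`, `:ignore_fields` may now be a single value or an array, may mix names and 0-based indexes, and is matched case-insensitively. The switch from `Fixnum` to `Numeric` also matters on current Rubies, where integers are plain `Integer` and the `Fixnum` constant is no longer available. A minimal sketch of the same matching follows; `ignored_names` is a hypothetical helper, and it uses `Array(...)` with a non-mutating `map` so the caller's option value is left untouched (the code in the diff uses `map!`).

```ruby
# Hypothetical helper: resolve :ignore_fields to upper-cased field names.
def ignored_names(ignore_fields, right_fields)
  Array(ignore_fields).map do |f|
    (f.is_a?(Numeric) ? right_fields[f] : f).to_s.upcase
  end
end

right_fields = %w[Date HomeTeam CreatedAt UpdatedAt]
ignored      = ignored_names(['createdat', 3], right_fields)
diff_fields  = right_fields.reject { |fld| ignored.include?(fld.upcase) }

p ignored     # ["CREATEDAT", "UPDATEDAT"]
p diff_fields # ["Date", "HomeTeam"]
```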

data/lib/csv-diff/csv_source.rb CHANGED

@@ -2,36 +2,7 @@ class CSVDiff
     # Represents a CSV input (i.e. the left/from or right/to input) to the diff
     # process.
-    class CSVSource
-        # @return [String] the path to the source file
-        attr_accessor :path
-        # @return [Array<String>] The names of the fields in the source file
-        attr_reader :field_names
-        # @return [Array<String>] The names of the field(s) that uniquely
-        #   identify each row.
-        attr_reader :key_fields
-        # @return [Array<String>] The names of the field(s) that identify a
-        #   common parent of child records.
-        attr_reader :parent_fields
-        # @return [Array<String>] The names of the field(s) that distinguish a
-        #   child of a parent record.
-        attr_reader :child_fields
-        # @return [Boolean] True if the source has been indexed with case-
-        #   sensitive keys, or false if it has been indexed using upper-case key
-        #   values.
-        attr_reader :case_sensitive
-        alias_method :case_sensitive?, :case_sensitive
-        # @return [Hash<String,Hash>] A hash containing each line of the source,
-        #   keyed on the values of the +key_fields+.
-        attr_reader :lines
-        # @return [Hash<String,Array<String>>] A hash containing each parent key,
-        #   and an Array of the child keys it is a parent of.
-        attr_reader :index
-        # @return [Array<String>] An array of any warnings encountered while
-        #   processing the source.
-        attr_reader :warnings
+    class CSVSource < Source
         # Creates a new diff source.
         #
@@ -69,90 +40,32 @@ class CSVDiff
         # @option options [String] :child_field The name of the field(s) that
         #   uniquely identify a child of a parent.
         # @option options [Boolean] :case_sensitive If true (the default), keys
-        #  are indexed as-is; if false, the index is built in upper-case for
-        #  case-insensitive comparisons.
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
         def initialize(source, options = {})
+            super(options)
             if source.is_a?(String)
                 require 'csv'
                 mode_string = options[:encoding] ? "r:#{options[:encoding]}" : 'r'
                 csv_options = options.fetch(:csv_options, {})
                 @path = source
-                source = CSV.open(@path, mode_string, csv_options).readlines
-            end
-            if kf = options.fetch(:key_field, options[:key_fields])
-                @key_fields = [kf].flatten
-                @parent_fields = @key_fields[0...-1]
-                @child_fields = @key_fields[-1..-1]
-            else
-                @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
-                @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
-                @key_fields = @parent_fields + @child_fields
-            end
-            @field_names = options[:field_names]
-            @warnings = []
-            index_source(source, options)
-        end
-        # Returns the row in the CSV source corresponding to the supplied key.
-        #
-        # @param key [String] The unique key to use to lookup the row.
-        # @return [Hash] The fields for the line corresponding to +key+, or nil
-        #   if the key is not recognised.
-        def [](key)
-            @lines[key]
-        end
-        private
-        # Given an array of lines, where each line is an array of fields, indexes
-        # the array contents so that it can be looked up by key.
-        def index_source(lines, options)
-            @lines = {}
-            @index = Hash.new{ |h, k| h[k] = [] }
-            @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
-            @case_sensitive = options.fetch(:case_sensitive, true)
-            line_num = 0
-            lines.each do |row|
-                line_num += 1
-                next if line_num == 1 && @field_names && options[:ignore_header]
-                unless @field_names
-                    @field_names = row
-                    @key_fields = find_field_indexes(@key_fields, @field_names)
-                    next
-                end
-                field_vals = row
-                line = {}
-                @field_names.each_with_index do |field, i|
-                    line[field] = field_vals[i]
-                end
-                key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
-                key = key_values.join('~')
-                parent_key = key_values[0...(@parent_fields.length)].join('~')
-                parent_key.upcase! unless @case_sensitive
-                if @lines[key]
-                    @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
-                else
-                    @index[parent_key] << key
-                    @lines[key] = line
-                end
-            end
-        end
-        # Converts an array of field names to an array of indexes of the fields
-        # matching those names.
-        def find_field_indexes(key_fields, field_names)
-            key_fields.map do |field|
-                if field.is_a?(Fixnum)
-                    field
-                else
-                    field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
-                        raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
-                            field_names.join(', ')}"
+                # When you call CSV.open, it's best to pass in a block so that after it's yielded,
+                # the underlying file handle is closed. Otherwise, you risk leaking the handle.
+                @data = CSV.open(@path, mode_string, csv_options) do |csv|
+                     csv.readlines
                 end
+            elsif source.is_a?(Enumerable) && source.size == 0 || (source.size > 0 && source.first.is_a?(Enumerable))
+                @data = source
+            else
+                raise ArgumentError, "source must be a path to a file or an Enumerable<Enumerable>"
             end
+            index_source
         end
     end
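
One thing to watch in the new source-type check: `&&` binds more tightly than `||`, so the condition parses as `(enumerable && empty) || (size > 0 && first is enumerable)`. The second branch is reached even when the source is not an Enumerable, so a stray non-collection argument can fail with `NoMethodError` instead of the intended `ArgumentError`. The sketch below reproduces the condition and compares it with one possible regrouping; `tabular?` is a hypothetical name and is not part of the gem.

```ruby
# The condition as written in the diff, wrapped in a lambda.
old_check = lambda do |source|
  source.is_a?(Enumerable) && source.size == 0 ||
    (source.size > 0 && source.first.is_a?(Enumerable))
end

# One possible regrouping: require Enumerable before looking at contents.
def tabular?(source)
  source.is_a?(Enumerable) &&
    (source.count.zero? || source.first.is_a?(Enumerable))
end

p tabular?([])             # true
p tabular?([%w[a b]])      # true
p tabular?(%w[a b])        # false
p tabular?(42)             # false

old_error =
  begin
    old_check.call(42) # Integer#size exists, Integer#first does not
    nil
  rescue NoMethodError => e
    e.class
  end
p old_error                # NoMethodError
```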

data/lib/csv-diff/source.rb ADDED

@@ -0,0 +1,278 @@
+class CSVDiff
+    # Represents an input (i.e. the left/from or right/to input) to the diff
+    # process.
+    class Source
+        # @return [String] the path to the source file
+        attr_accessor :path
+        # @return [Array<Array>] The data for this source
+        attr_reader :data
+        # @return [Array<String>] The names of the fields in the source file
+        attr_reader :field_names
+        # @return [Array<String>] The names of the field(s) that uniquely
+        #   identify each row.
+        attr_reader :key_fields
+        # @return [Array<String>] The names of the field(s) that identify a
+        #   common parent of child records.
+        attr_reader :parent_fields
+        # @return [Array<String>] The names of the field(s) that distinguish a
+        #   child of a parent record.
+        attr_reader :child_fields
+        # @return [Array<Fixnum>] The indexes of the key fields in the source
+        #   file.
+        attr_reader :key_field_indexes
+        # @return [Array<Fixnum>] The indexes of the parent fields in the source
+        #   file.
+        attr_reader :parent_field_indexes
+        # @return [Array<Fixnum>] The indexes of the child fields in the source
+        #   file.
+        attr_reader :child_field_indexes
+        # @return [Boolean] True if the source has been indexed with case-
+        #   sensitive keys, or false if it has been indexed using upper-case key
+        #   values.
+        attr_reader :case_sensitive
+        alias_method :case_sensitive?, :case_sensitive
+        # @return [Boolean] True if leading/trailing whitespace should be stripped
+        #   from fields
+        attr_reader :trim_whitespace
+        # @return [Hash<String,Hash>] A hash containing each line of the source,
+        #   keyed on the values of the +key_fields+.
+        attr_reader :lines
+        # @return [Hash<String,Array<String>>] A hash containing each parent key,
+        #   and an Array of the child keys it is a parent of.
+        attr_reader :index
+        # @return [Array<String>] An array of any warnings encountered while
+        #   processing the source.
+        attr_reader :warnings
+        # @return [Fixnum] A count of the lines processed from this source.
+        #   Excludes any header and duplicate records identified during indexing.
+        attr_reader :line_count
+        # @return [Fixnum] A count of the lines from this source that were skipped
+        #   due to filter conditions.
+        attr_reader :skip_count
+        # @return [Fixnum] A count of the lines from this source that had the same
+        #   key value as another line.
+        attr_reader :dup_count
+        # Creates a new diff source.
+        #
+        # A diff source must contain at least one field that will be used as the
+        # key to identify the same record in a different version of this file.
+        # If not specified via one of the options, the first field is assumed to
+        # be the unique key.
+        #
+        # If multiple fields combine to form a unique key, the combined fields
+        # are considered as a single unique identifier. If your key represents
+        # data that can be represented as a tree, you can instead break your key
+        # fields into :parent_fields and :child_fields. By doing this, if a child
+        # key is deleted from one parent, and added to another, that will be
+        # reported as an update, with a change to the parent key part(s) of the
+        # record.
+        #
+        # All key options can be specified either by field name, or by field
+        # index (0 based).
+        #
+        # @param options [Hash] An options hash.
+        # @option options [Array<String>] :field_names The names of each of the
+        #   fields in +source+.
+        # @option options [Boolean] :ignore_header If true, and :field_names has
+        #   been specified, then the first row of the file is ignored.
+        # @option options [String] :key_field The name of the field that uniquely
+        #   identifies each row.
+        # @option options [Array<String>] :key_fields The names of the fields
+        #   that uniquely identify each row.
+        # @option options [String] :parent_field The name of the field(s) that
+        #   identify a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field(s) that
+        #   uniquely identify a child of a parent.
+        # @option options [Boolean] :case_sensitive If true (the default), keys
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
+        def initialize(options = {})
+            if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
+               (kf = options.fetch(:key_field, options[:key_fields]))
+                @key_fields = [kf].flatten
+                @parent_fields = []
+                @child_fields = @key_fields
+            else
+                @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
+                @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
+                @key_fields = @parent_fields + @child_fields
+            end
+            @field_names = options[:field_names]
+            @case_sensitive = options.fetch(:case_sensitive, true)
+            @trim_whitespace = options.fetch(:trim_whitespace, false)
+            @ignore_header = options[:ignore_header]
+            @include = options[:include]
+            @exclude = options[:exclude]
+            @path = options.fetch(:path, 'NA') unless @path
+            @warnings = []
+        end
+        def path?
+            @path != 'NA'
+        end
+        # Returns the row in the CSV source corresponding to the supplied key.
+        #
+        # @param key [String] The unique key to use to lookup the row.
+        # @return [Hash] The fields for the line corresponding to +key+, or nil
+        #   if the key is not recognised.
+        def [](key)
+            @lines[key]
+        end
+        # Given an array of lines, where each line is an array of fields, indexes
+        # the array contents so that it can be looked up by key.
+        def index_source
+            @lines = {}
+            @index = Hash.new{ |h, k| h[k] = [] }
+            if @field_names
+                index_fields
+                include_filter = convert_filter(@include, @field_names)
+                exclude_filter = convert_filter(@exclude, @field_names)
+            end
+            @line_count = 0
+            @skip_count = 0
+            @dup_count = 0
+            line_num = 0
+            @data.each do |row|
+                line_num += 1
+                next if line_num == 1 && @field_names && @ignore_header
+                unless @field_names
+                    if row.class.name == 'CSV::Row'
+                        @field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
+                    else
+                        @field_names = row.each_with_index.map{ |f, i| f || i.to_s }
+                    end
+                    index_fields
+                    include_filter = convert_filter(@include, @field_names)
+                    exclude_filter = convert_filter(@exclude, @field_names)
+                    next
+                end
+                field_vals = row
+                line = {}
+                filter = false
+                @field_names.each_with_index do |field, i|
+                    val = field_vals[i]
+                    val = val.to_s.strip if val && @trim_whitespace
+                    line[field] = val
+                    if include_filter && f = include_filter[i]
+                        filter = !check_filter(f, line[field])
+                    end
+                    if exclude_filter && f = exclude_filter[i]
+                        # Keep a failed :include result; only test :exclude if still unfiltered
+                        filter ||= check_filter(f, line[field])
+                    end
+                    break if filter
+                end
+                if filter
+                    @skip_count += 1
+                    next
+                end
+                key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
+                                                          field_vals[kf].to_s :
+                                                          field_vals[kf].to_s.upcase }
+                key = key_values.join('~')
+                parent_key = key_values[0...(@parent_fields.length)].join('~')
+                if @lines[key]
+                    @warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
+                    @dup_count += 1
+                    key += "[#{@dup_count}]"
+                end
+                @index[parent_key] << key
+                @lines[key] = line
+                @line_count += 1
+            end
+        end
+        # Save the data in this Source as a CSV at +file_path+.
+        #
+        # @param file_path [String] The target path to save the data to.
+        # @param options [Hash] A set of options to pass to CSV.open to control
+        #   how the CSV is generated.
+        def save_csv(file_path, options = {})
+            require 'csv'
+            default_opts = {
+                headers: @field_names, write_headers: true
+            }
+            CSV.open(file_path, 'wb', default_opts.merge(options)) do |csv|
+                @data.each{ |rec| csv << rec }
+            end
+        end
+        private
+        def index_fields
+            @key_field_indexes = find_field_indexes(@key_fields, @field_names)
+            @parent_field_indexes = find_field_indexes(@parent_fields, @field_names)
+            @child_field_indexes = find_field_indexes(@child_fields, @field_names)
+            @key_fields = @key_field_indexes.map{ |i| @field_names[i] }
+            @parent_fields = @parent_field_indexes.map{ |i| @field_names[i] }
+            @child_fields = @child_field_indexes.map{ |i| @field_names[i] }
+        end
+        # Converts an array of field names to an array of indexes of the fields
+        # matching those names.
+        def find_field_indexes(key_fields, field_names)
+            key_fields.map do |field|
+                if field.is_a?(Integer)
+                    field
+                else
+                    field_names.index{ |field_name| field.to_s.downcase == field_name.to_s.downcase } or
+                        raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
+                            field_names.join(', ')}"
+                end
+            end
+        end
+        def convert_filter(hsh, field_names)
+            return unless hsh
+            if !hsh.is_a?(Hash)
+                raise ArgumentError, ":include/:exclude option must be a Hash of field name(s)/index(es) to RegExp(s)"
+            end
+            keys = hsh.keys
+            idxs = find_field_indexes(keys, @field_names)
+            Hash[keys.each_with_index.map{ |k, i| [idxs[i], hsh[k]] }]
+        end
+        def check_filter(filter, field_val)
+            case filter
+            when String
+                if @case_sensitive
+                    filter == field_val
+                else
+                    filter.downcase == field_val.to_s.downcase
+                end
+            when Regexp
+                filter.match(field_val)
+            when Proc
+                filter.call(field_val)
+            else
+                raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
+            end
+        end
+    end
+end
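
The key-building and duplicate-handling steps of `index_source` above can be illustrated in isolation. The following is a minimal standalone sketch, not the gem's code: `build_index` and its parameters are hypothetical names, it assumes the first row is a header, and it leaves out filtering and whitespace trimming.

```ruby
# Hypothetical helper mirroring the key-building steps of Source#index_source.
# rows:         Array<Array>, with the header as the first row (assumption)
# key_indexes:  positions of the key fields, parent fields first
# parent_count: how many of the leading key fields form the parent key
def build_index(rows, key_indexes, parent_count: 0, case_sensitive: true)
  header, *data = rows
  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0
  data.each_with_index do |row, i|
    key_values = key_indexes.map do |ki|
      val = row[ki].to_s
      case_sensitive ? val : val.upcase
    end
    key = key_values.join('~')
    parent_key = key_values[0...parent_count].join('~')
    if lines.key?(key)
      # Duplicates are kept, but under a suffixed key so earlier rows survive
      dup_count += 1
      warnings << "Duplicate key '#{key}' encountered at line #{i + 2}"
      key += "[#{dup_count}]"
    end
    index[parent_key] << key
    lines[key] = header.zip(row).to_h
  end
  { lines: lines, index: index, warnings: warnings }
end
```

In this sketch, as in the 0.6.0 hunk, a duplicate row is retained under a key such as `a~x[1]` and recorded as a warning, whereas the removed 0.3.0 code warned and ignored the row.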

data/lib/csv-diff/xml_source.rb ADDED

@@ -0,0 +1,142 @@
+require 'nokogiri'
+require 'cgi'
+class CSVDiff
+    # Convert XML content to CSV format using XPath selectors to identify the
+    # rows and field values in an XML document
+    class XMLSource < Source
+        attr_accessor :context
+        # Create a new XMLSource, identified by +path+. Normally this is a path
+        # to the XML document, but any value is fine, as it is really just a label
+        # to identify this data set.
+        #
+        # @param path [String] A label for this data set (often a path to the
+        #   XML document used as the source).
+        # @param options [Hash] An options hash.
+        # @option options [Array<String>] :field_names The names of each of the
+        #   fields in +source+.
+        # @option options [Boolean] :ignore_header If true, and :field_names has
+        #   been specified, then the first row of the file is ignored.
+        # @option options [String] :key_field The name of the field that uniquely
+        #   identifies each row.
+        # @option options [Array<String>] :key_fields The names of the fields
+        #   that uniquely identify each row.
+        # @option options [String] :parent_field The name of the field(s) that
+        #   identify a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field(s) that
+        #   uniquely identify a child of a parent.
+        # @option options [Boolean] :case_sensitive If true (the default), keys
+        #   are indexed as-is; if false, the index is built in upper-case for
+        #   case-insensitive comparisons.
+        # @option options [Hash] :include A hash of field name(s) or index(es) to
+        #   regular expression(s). Only source rows whose field values satisfy the
+        #   regular expressions will be indexed and included in the diff process.
+        # @option options [Hash] :exclude A hash of field name(s) or index(es) to
+        #   regular expression(s). Source rows with a field value that satisfies
+        #   the regular expressions will be excluded from the diff process.
+        # @option options [String] :context A context value from which fields
+        #   can be populated using a Regexp.
+        def initialize(path, options = {})
+            super(options)
+            @path = path
+            @context = options[:context]
+            @data = []
+        end
+        # Process a +source+, converting the XML into a table of data, using
+        # +rec_xpath+ to identify the nodes that correspond to each record that
+        # should appear in the output, and +field_maps+ to populate each field
+        # in each row.
+        #
+        # @param source [String|Array] may be a String containing XML content,
+        #   an Array of paths to files containing XML content, or a path to
+        #   a single file.
+        # @param rec_xpath [String] An XPath expression that selects all the
+        #   items in the XML document that are to be converted into new rows.
+        #   The returned items are not directly used to populate the fields,
+        #   but provide a context for the field XPath expressions that populate
+        #   each field's content.
+        # @param field_maps [Hash<String, String>] A map of field names to
+        #   expressions that are evaluated in the context of each row node
+        #   selected by +rec_xpath+. The field expressions are typically XPath
+        #   expressions evaluated in the context of the nodes returned by the
+        #   +rec_xpath+. Alternatively, a String that is not an XPath expression
+        #   is used as a literal value for a field, while a Regexp can also
+        #   be used to pull a value from any context specified in the +options+
+        #   hash. The Regexp should include a single grouping, as the value used
+        #   will be the result in $1 after the match is performed.
+        # @param context [String] An optional context for the XML to be processed.
+        #   The value passed here can be referenced in field map expressions
+        #   using a Regexp, with the value of the first grouping in the regex
+        #   being the value returned for the field.
+        def process(source, rec_xpath, field_maps, context = nil)
+            @field_names = field_maps.keys unless @field_names
+            case source
+            when Nokogiri::XML::Document
+                add_data(source, rec_xpath, field_maps, context || @context)
+            when /<\?xml/
+                doc = Nokogiri::XML(source)
+                add_data(doc, rec_xpath, field_maps, context || @context)
+            when Array
+                source.each{ |f| process_file(f, rec_xpath, field_maps) }
+            when String
+                process_file(source, rec_xpath, field_maps)
+            else
+                raise ArgumentError, "Unhandled source type #{source.class.name}"
+            end
+            @data
+        end
+        private
+        # Load the XML document at +file_path+ and process it into rows of data.
+        def process_file(file_path, rec_xpath, field_maps)
+            begin
+                File.open(file_path) do |f|
+                    doc = Nokogiri::XML(f)
+                    add_data(doc, rec_xpath, field_maps, @context || file_path)
+                end
+            rescue
+                STDERR.puts "An error occurred while attempting to open #{file_path}"
+                raise
+            end
+        end
+        # Locate records in +doc+ using +rec_xpath+ to identify the nodes that
+        # correspond to a new record in the data, and +field_maps+ to populate
+        # the fields in each row.
+        def add_data(doc, rec_xpath, field_maps, context)
+            doc.xpath(rec_xpath).each do |rec_node|
+                rec = []
+                field_maps.each do |field_name, expr|
+                    case expr
+                    when Regexp         # Match context against Regexp and extract first grouping
+                        if context
+                            context =~ expr
+                            rec << $1
+                        else
+                            rec << nil
+                        end
+                    when %r{[/(.@]}     # XPath expression
+                        res = rec_node.xpath(expr)
+                        rec << CGI.unescape_html(res.to_s)
+                    else                # Use expr as the value for this field
+                        rec << expr
+                    end
+                end
+                @data << rec
+            end
+        end
+    end
+end
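
The way `add_data` above decides how to treat each `field_maps` entry can be shown without an XML parser. This is a minimal sketch under stated assumptions: `classify_field_expr` and `context_value` are hypothetical names that reproduce only the `case` dispatch and the context lookup, not the XPath evaluation itself.

```ruby
# Hypothetical classifier mirroring the case statement in XMLSource#add_data.
def classify_field_expr(expr)
  case expr
  when Regexp     then :context_regexp  # matched against the context string
  when %r{[/(.@]} then :xpath           # any string containing / ( . or @
  else                 :literal         # used as-is as the field value
  end
end

# Returns the first capture group of +regexp+ matched against +context+,
# or nil when there is no context or no match.
def context_value(context, regexp)
  return nil unless context
  m = regexp.match(context)
  m && m[1]
end
```

One consequence of the character-class test is that a literal string containing a `.`, such as `v1.0`, is classified as an XPath expression rather than a literal value.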

metadata CHANGED

@@ -1,34 +1,44 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.6.0
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-08-14 00:00:00.000000000 Z
+date: 2020-08-28 00:00:00.000000000 Z
 dependencies: []
-description: ! "        This library performs diffs of CSV files.\n\n        Unlike
-  a standard diff that compares line by line, and is sensitive to the\n        ordering
-  of records, CSV-Diff identifies common lines by key field(s), and\n        then
-  compares the contents of the fields in each line.\n\n        Data may be supplied
-  in the form of CSV files, or as an array of arrays. The\n        diff process provides
-  a fine level of control over what to diff, and can\n        optionally ignore certain
-  types of changes (e.g. changes in position).\n\n        CSV-Diff is particularly
-  well suited to data in parent-child format. Parent-\n        child data does not
-  lend itself well to standard text diffs, as small changes\n        in the organisation
-  of the tree at an upper level can lead to big movements\n        in the position
-  of descendant records. By instead matching records by key,\n        CSV-Diff avoids
-  this issue, while still being able to detect changes in\n        sibling order.\n\n
-  \       This gem implements the core diff algorithm, and handles the loading and\n
-  \       diffing of CSV files. It returns a CSVDiff object, that contains the details\n
-  \       of differences in object form. This is useful for projects that need diff\n
-  \       capability, but want to handle the reporting of differences themselves.
-  For\n        a pre-built diff reporting capability, see the csv-diff-report gem,
-  which\n        provides a command-line tool for generating diff reports in HTML
-  or Excel\n        format.\n"
+description: |2
+          This library performs diffs of CSV data, or any table-like source.
+          Unlike a standard diff that compares line by line, and is sensitive to the
+          ordering of records, CSV-Diff identifies common lines by key field(s), and
+          then compares the contents of the fields in each line.
+          Data may be supplied in the form of CSV files, or as an array of arrays. The
+          diff process provides a fine level of control over what to diff, and can
+          optionally ignore certain types of changes (e.g. changes in position).
+          CSV-Diff is particularly well suited to data in parent-child format. Parent-
+          child data does not lend itself well to standard text diffs, as small changes
+          in the organisation of the tree at an upper level can lead to big movements
+          in the position of descendant records. By instead matching records by key,
+          CSV-Diff avoids this issue, while still being able to detect changes in
+          sibling order.
+          This gem implements the core diff algorithm, and handles the loading and
+          diffing of CSV files (or Arrays of Arrays). It also supports converting
+          data in XML format into tabular form, so that it can then be processed
+          like any other CSV or table-like source.  It returns a CSVDiff object
+          containing the details of differences in object form. This is useful for
+          projects that need diff capability, but want to handle the reporting or
+          actioning of differences themselves.
+          For a pre-built diff reporting capability, see the csv-diff-report gem,
+          which provides a command-line tool for generating diff reports in HTML,
+          Excel, or text formats.
 email: adam.b.gardiner@gmail.com
 executables: []
 extensions: []
@@ -40,9 +50,12 @@ files:
 - lib/csv-diff/algorithm.rb
 - lib/csv-diff/csv_diff.rb
 - lib/csv-diff/csv_source.rb
+- lib/csv-diff/source.rb
+- lib/csv-diff/xml_source.rb
 - lib/csv_diff.rb
 homepage: https://github.com/agardiner/csv-diff
-licenses: []
+licenses:
+- MIT
 metadata: {}
 post_install_message: For command-line tools and diff reports, 'gem install csv-diff-report'
 rdoc_options: []
@@ -50,18 +63,18 @@ require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - ! '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - ! '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.1
+rubygems_version: 2.5.2.3
 signing_key:
 specification_version: 4
-summary: CSV Diff is a library for generating diffs from data in CSV format
+summary: CSV Diff is a library for generating diffs from data in CSV or XML format
 test_files: []