RubyGems - csv-diff - Versions diffs - 0.5.0 → 0.6.0 - Mend

csv-diff 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: d1b3b8deee34344d334e740285cb1f3c99074694
-  data.tar.gz: d95158d13861cb66fd460ee430714ec3c83cd0b1
+  metadata.gz: 9dde7ded89bb58f75505ae9237c97b8acd365c42
+  data.tar.gz: adcf17af6b67797c9b5fbab80b6c7d421cd73d6e
 SHA512:
-  metadata.gz: 50c74d6a4093012b0ba44fef70c2d749348d6777cfb9f2cfda66c6e075423191a4c6c22019a388b9d8bd14e22ac60d539f4e3b4aa85fd87fd774a64da15858c7
-  data.tar.gz: 8fa030a54e7a97db9913b3c36a1942de1e07a6549f9ae7aa58b5b3f44d522fe11f72d44e18b6b7612d2b2dc9f106ece1fea183557c507bcf18316891ab63f230
+  metadata.gz: 1aad2d1174758488d1e984239ae38917211bf1e185d2b37350a402e145bde652b44e437774f2695bb33c2b48531637f6c9ed13fb5fbaa0b610ac9d7810ec16ed
+  data.tar.gz: 3aab02344cfa4f111433c616caa445f43d3f4515c65e8ef7175f2287996d8f6bc592e1daca5683bef5d561e21f6f23e954a1115854577cced4551582ec4fa293
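
The checksums above can be checked locally. The following is a minimal sketch, assuming the `.gem` archive (a plain tar file) has already been unpacked so that `data.tar.gz` is available on disk. The helper name `checksum_matches?` and the file path in the usage comment are illustrative and are not part of csv-diff or RubyGems.

```ruby
require 'digest'

# Returns true when the file at `path` hashes to `expected_hex`.
# `algorithm` is any Digest class (Digest::SHA1, Digest::SHA512, ...).
def checksum_matches?(path, expected_hex, algorithm: Digest::SHA512)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end

# Hypothetical usage against the 0.6.0 SHA1 value listed above:
#   checksum_matches?('data.tar.gz',
#                     'adcf17af6b67797c9b5fbab80b6c7d421cd73d6e',
#                     algorithm: Digest::SHA1)
```

Comparing against both the SHA1 and SHA512 entries gives the same answer for an intact file; SHA512 is the stronger of the two.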

data/README.md CHANGED

@@ -1,14 +1,19 @@
 # CSV-Diff
-CSV-Diff is a small library for performing diffs of CSV data.
+CSV-Diff is a small library for performing diffs of tabular data, typically
+data loaded from CSV files.
 Unlike a standard diff that compares line by line, and is sensitive to the
 ordering of records, CSV-Diff identifies common lines by key field(s), and
 then compares the contents of the fields in each line.
-Data may be supplied in the form of CSV files, or as an array of arrays. The
-diff process provides a fine level of control over what to diff, and can
-optionally ignore certain types of changes (e.g. changes in position).
+Data may be supplied in the form of CSV files, or as an array of arrays.
+More complex usage also allows you to specify XPath expressions to extract
+tabular data from XML documents for diffing.
+The diff process provides a fine level of control over what to diff, and can
+optionally ignore certain types of changes (e.g. adds, deletes, changes in
+position etc).
 CSV-Diff is particularly well suited to data in parent-child format. Parent-
 child data does not lend itself well to standard text diffs, as small changes
@@ -21,17 +26,19 @@ sibling order.
 ## Usage
 CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
-```
-gem install csv-diff
-```
+    ```
+    gem install csv-diff
+    ```
 To compare two CSV files where the field names are in the first row of the file,
 and the first field contains the unique key for each record, simply use:
-```ruby
-require 'csv-diff'
-diff = CSVDiff.new(file1, file2)
-```
+    ```ruby
+    require 'csv-diff'
+    diff = CSVDiff.new(file1, file2)
+    ```
 The returned diff object can be queried for the differences that exist between
 the two files, e.g.:
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
 The more correct specification of this file is that column 0 contains a unique parent
 identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
-deduce that there is in fact only two changes in order - the swap in positions of A and
+deduce that there are in fact only two changes in order - the swap in positions of A and
 B below Root.
 Note: If you aren't interested in changes in the order of siblings, then you could use
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
 The simplest use case is as shown above, where the data to be diffed is in CSV files
 with the column names as the first record, and where the unique key is the first
 column in the data. In this case, a diff can be created simply via:
-```ruby
-diff = CSVDiff.new(file1, file2)
-```
+    ```ruby
+    diff = CSVDiff.new(file1, file2)
+    ```
 ### Specifying Unique Row Identifiers
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
-necessary to uniquely identify a row. In these cases, it is necessary to indicate
-which fields are used to uniquely identify common rows across the two files. This
-can be done in several different ways.
+necessary to uniquely identify a row, while in other cases a combination of fields
+may be needed to derive a natural unique key or identifier for each row.
+In these cases, it is necessary to indicate to CSVDiff which fields are needed to
+uniquely identify common rows across the two files. This can be done in several
+different ways.
-1. Using the :key_fields option with field numbers (these are 0-based):
+#### :key_field(s)
+The first method is using the **key_fields** option (or key_field if you have only a
+single key field). Use this option when your data represents a flat structure rather
+than a parent-child hierarchy or flattened tree. You can specify key_fields using
+field numbers/column indices (0-based):
     ```ruby
     diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
     ```
-2. Using the :key_fields options with column names:
+Alternatively, you can use the :key_fields option with column names (provided CSVDiff
+knows the names of your fields, either via the **field_names** option or from headers
+in the file):
     ```ruby
-    diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
+    diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
     ```
-3. Using the :parent_fields and :child_fields with field numbers:
+#### :parent_field(s)/:child_field(s)
+The second method for identifying the unique identifiers in your file is to use the
+:parent_fields and :child_fields options. Use these options when your data represents
+a tree structure flattened to a table in parent-child form.
+Using the :parent_fields and :child_fields with field numbers:
     ```ruby
     diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
     ```
-4. Using the :parent_fields and :child_fields with column names:
+Using the :parent_fields and :child_fields with column names:
     ```ruby
     diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
     ```
-### Using Non-CSV File Sources
+### Using Non-CSV Sources
 Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
 of Arrays:
@@ -174,7 +197,53 @@ DATA2 = [
     ['A', 'A2', 'Account2']
 ]
-diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
+diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
+```
+Data can also be diffed if it is an XML source, although this requires a little
+more effort to tell CSVDiff how to transform/extract content from the XML document
+into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
+will need to install this gem to use CSVDiff with XML sources.
+CSVDiff provides the CSVDiff::XMLSource class to define how to convert
+your XML content to an array-of-arrays. The XMLSource class is quite flexible,
+and can be used to convert single or multiple XML sources into a single data set
+for diffing, and different documents may even have different layouts.
+The first step is to create an XMLSource object, which requires a label to
+identify the type of data it will generate:
+```ruby
+xml_source1 = CSVDiff::XMLSource.new('My Label')
+```
+Next, we pass XML documents to this source, and specify XPath expressions for each
+row and column of data to produce via the `process(rec_xpath, field_maps, options)`
+method:
+* An XPath expression is provided to select each node value in the document that
+  will represent a row. Taking an HTML table as an example of something we wanted
+  to parse, your rec_xpath value might be something like the following:
+  `'//table/tbody/tr'`. This would locate all tables in the document, and create
+  a new row of data in the XMLSource every time a `<tr>` tag was encountered.
+* A hash of field_maps is then provided to describe how to generate column values
+  for each row of data. The keys to field_maps are the names of the fields to be
+  output, while the values are the expressions for how to generate values. Most
+  commonly, this will be another XPath expression that is evaluated in the context
+  of the node returned by the row XPath expression. So continuing our HTML example,
+  we might use `'./td[1]/text()'` as an expression to select the content of the
+  first `<td>` element within the `<tr>` representing the current row (XPath
+  positions are 1-based).
+```ruby
+xml_source1.process('//table/tbody/tr',
+                    col_A: './td[1]/text()',
+                    col_B: './td[2]/text()',
+                    col_C: './td[3]/text()')
+```
+Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
+objects as the source:
+```ruby
+diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
 ```
 ### Specifying Column Names
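
The XPath-based extraction described in the README hunk above can be approximated without the gem. The sketch below is not `CSVDiff::XMLSource`: it uses REXML (shipped with standard Ruby installs) instead of Nokogiri, and the helper `extract_rows` and the sample document are hypothetical. It shows a row XPath plus a hash of field maps producing an array of arrays, with XPath positions counted from 1.

```ruby
require 'rexml/document'

# Evaluate `rec_xpath` to find row nodes, then evaluate each field expression
# relative to the row node. Returns one array of strings per row.
def extract_rows(xml, rec_xpath, field_maps)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, rec_xpath).map do |row|
    field_maps.values.map do |expr|
      node = REXML::XPath.first(row, expr)
      node && node.to_s # nil when the expression selects nothing
    end
  end
end

SAMPLE_DOC = '<html><body><table><tbody>' \
             '<tr><td>A</td><td>A1</td><td>Account1</td></tr>' \
             '<tr><td>A</td><td>A2</td><td>Account2</td></tr>' \
             '</tbody></table></body></html>'

rows = extract_rows(SAMPLE_DOC, '//table/tbody/tr',
                    col_A: './td[1]/text()',
                    col_B: './td[2]/text()',
                    col_C: './td[3]/text()')
```

The resulting `rows` value has the array-of-arrays shape that CSVDiff accepts for non-CSV sources, so it could in principle be passed to `CSVDiff.new` alongside a second data set.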

data/lib/csv-diff/source.rb CHANGED

@@ -66,10 +66,13 @@ class CSVDiff
         # If not specified via one of the options, the first field is assumed to
         # be the unique key.
         #
-        # If multiple fields combine to form a unique key, the parent is assumed
-        # to be identified by all but the last field of the unique key. If finer
-        # control is required, use a combination of the :parent_fields and
-        # :child_fields options.
+        # If multiple fields combine to form a unique key, the combined fields
+        # are considered as a single unique identifier. If your key represents
+        # data that can be represented as a tree, you can instead break your key
+        # fields into :parent_fields and :child_fields. By doing this, if a child
+        # key is deleted from one parent, and added to another, that will be
+        # reported as an update, with a change to the parent key part(s) of the
+        # record.
         #
         # All key options can be specified either by field name, or by field
         # index (0 based).
@@ -100,8 +103,8 @@ class CSVDiff
             if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
                (kf = options.fetch(:key_field, options[:key_fields]))
                 @key_fields = [kf].flatten
-                @parent_fields = @key_fields[0...-1]
-                @child_fields = @key_fields[-1..-1]
+                @parent_fields = []
+                @child_fields = @key_fields
             else
                 @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
                 @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
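
The behavioural change in this hunk can be isolated in a small standalone sketch. The function `resolve_key_fields` and its `legacy:` flag are hypothetical names introduced here for comparison; the branch bodies are copied from the removed and added lines above, while the `key_fields` value in the `else` branch is an assumption, since that line is outside the hunk.

```ruby
# Mimics the key-resolution branch shown in the diff.
# legacy: true  -> 0.5.0 behaviour (last key field is the child, rest are parents)
# legacy: false -> 0.6.0 behaviour (all key fields form one flat identifier)
def resolve_key_fields(options, legacy: false)
  hierarchy_keys = [:parent_field, :parent_fields, :child_field, :child_fields]
  kf = options.fetch(:key_field, options[:key_fields])
  if (options.keys & hierarchy_keys).empty? && kf
    key_fields = [kf].flatten
    if legacy
      parent_fields = key_fields[0...-1]
      child_fields = key_fields[-1..-1]
    else
      parent_fields = []
      child_fields = key_fields
    end
  else
    parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    key_fields = parent_fields + child_fields # assumption: not shown in the hunk
  end
  { key_fields: key_fields, parent_fields: parent_fields, child_fields: child_fields }
end

old_split = resolve_key_fields({ key_fields: [1, 0] }, legacy: true)
new_split = resolve_key_fields({ key_fields: [1, 0] })
explicit  = resolve_key_fields({ parent_field: 1, child_field: 0 })
```

Under these assumptions, `explicit` reproduces the parent/child split that `key_fields: [1, 0]` produced in 0.5.0, which is consistent with the README example switching from `key_fields: [1, 0]` to `parent_field: 1, child_field: 0`.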

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-07-15 00:00:00.000000000 Z
+date: 2020-08-28 00:00:00.000000000 Z
 dependencies: []
 description: |2
           This library performs diffs of CSV data, or any table-like source.