csv-diff 0.5.0 → 0.6.0

Files changed (4)
  1. checksums.yaml +4 -4
  2. data/README.md +94 -25
  3. data/lib/csv-diff/source.rb +9 -6
  4. metadata +2 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: d1b3b8deee34344d334e740285cb1f3c99074694
4
- data.tar.gz: d95158d13861cb66fd460ee430714ec3c83cd0b1
3
+ metadata.gz: 9dde7ded89bb58f75505ae9237c97b8acd365c42
4
+ data.tar.gz: adcf17af6b67797c9b5fbab80b6c7d421cd73d6e
5
5
  SHA512:
6
- metadata.gz: 50c74d6a4093012b0ba44fef70c2d749348d6777cfb9f2cfda66c6e075423191a4c6c22019a388b9d8bd14e22ac60d539f4e3b4aa85fd87fd774a64da15858c7
7
- data.tar.gz: 8fa030a54e7a97db9913b3c36a1942de1e07a6549f9ae7aa58b5b3f44d522fe11f72d44e18b6b7612d2b2dc9f106ece1fea183557c507bcf18316891ab63f230
6
+ metadata.gz: 1aad2d1174758488d1e984239ae38917211bf1e185d2b37350a402e145bde652b44e437774f2695bb33c2b48531637f6c9ed13fb5fbaa0b610ac9d7810ec16ed
7
+ data.tar.gz: 3aab02344cfa4f111433c616caa445f43d3f4515c65e8ef7175f2287996d8f6bc592e1daca5683bef5d561e21f6f23e954a1115854577cced4551582ec4fa293
data/README.md CHANGED
@@ -1,14 +1,19 @@
1
1
  # CSV-Diff
2
2
 
3
- CSV-Diff is a small library for performing diffs of CSV data.
3
+ CSV-Diff is a small library for performing diffs of tabular data, typically
4
+ data loaded from CSV files.
4
5
 
5
6
  Unlike a standard diff that compares line by line, and is sensitive to the
6
7
  ordering of records, CSV-Diff identifies common lines by key field(s), and
7
8
  then compares the contents of the fields in each line.
8
9
 
9
- Data may be supplied in the form of CSV files, or as an array of arrays. The
10
- diff process provides a fine level of control over what to diff, and can
11
- optionally ignore certain types of changes (e.g. changes in position).
10
+ Data may be supplied in the form of CSV files, or as an array of arrays.
11
+ More complex usage also allows you to specify XPath expressions to extract
12
+ tabular data from XML documents for diffing.
13
+
14
+ The diff process provides a fine level of control over what to diff, and can
15
+ optionally ignore certain types of changes (e.g. adds, deletes, changes in
16
+ position, etc.).
12
17
 
13
18
  CSV-Diff is particularly well suited to data in parent-child format. Parent-
14
19
  child data does not lend itself well to standard text diffs, as small changes
@@ -21,17 +26,19 @@ sibling order.
21
26
  ## Usage
22
27
 
23
28
  CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
24
- ```
25
- gem install csv-diff
26
- ```
29
+
30
+ ```
31
+ gem install csv-diff
32
+ ```
27
33
 
28
34
  To compare two CSV files where the field names are in the first row of the file,
29
35
  and the first field contains the unique key for each record, simply use:
30
- ```ruby
31
- require 'csv-diff'
32
36
 
33
- diff = CSVDiff.new(file1, file2)
34
- ```
37
+ ```ruby
38
+ require 'csv-diff'
39
+
40
+ diff = CSVDiff.new(file1, file2)
41
+ ```
35
42
 
36
43
  The returned diff object can be queried for the differences that exist between
37
44
  the two files, e.g.:
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
96
103
 
97
104
  The more correct specification of this file is that column 0 contains a unique parent
98
105
  identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
99
- deduce that there is in fact only two changes in order - the swap in positions of A and
106
+ deduce that there are in fact only two changes in order - the swap in positions of A and
100
107
  B below Root.
101
108
 
102
109
  Note: If you aren't interested in changes in the order of siblings, then you could use
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
121
128
  The simplest use case is as shown above, where the data to be diffed is in CSV files
122
129
  with the column names as the first record, and where the unique key is the first
123
130
  column in the data. In this case, a diff can be created simply via:
124
- ```ruby
125
- diff = CSVDiff.new(file1, file2)
126
- ```
131
+
132
+ ```ruby
133
+ diff = CSVDiff.new(file1, file2)
134
+ ```
127
135
 
128
136
  ### Specifying Unique Row Identifiers
129
137
 
130
138
  Often however, rows are not uniquely identifiable via the first column in the file.
131
139
  In a parent-child hierarchy, for example, combinations of parent and child may be
132
- necessary to uniquely identify a row. In these cases, it is necessary to indicate
133
- which fields are used to uniquely identify common rows across the two files. This
134
- can be done in several different ways.
140
+ necessary to uniquely identify a row, while in other cases a combination of fields
141
+ may be needed to derive a natural unique key or identifier for each row.
142
+ In these cases, it is necessary to indicate to CSVDiff which fields are needed to
143
+ uniquely identify common rows across the two files. This can be done in several
144
+ different ways.
135
145
 
136
- 1. Using the :key_fields option with field numbers (these are 0-based):
146
+ #### :key_field(s)
147
+
148
+ The first method is using the **key_fields** option (or key_field if you have only a
149
+ single key field). Use this option when your data represents a flat structure rather
150
+ than a parent-child hierarchy or flattened tree. You can specify key_fields using
151
+ field numbers/column indices (0-based):
137
152
 
138
153
  ```ruby
139
154
  diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
140
155
  ```
141
156
 
142
- 2. Using the :key_fields options with column names:
157
+ Alternatively, you can use the :key_fields option with column names (provided CSVDiff
158
+ knows the names of your fields, either via the **field_names** option or from headers
159
+ in the file):
143
160
 
144
161
  ```ruby
145
- diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
162
+ diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
146
163
  ```
147
164
 
148
- 3. Using the :parent_fields and :child_fields with field numbers:
165
+ #### :parent_field(s)/:child_field(s)
166
+
167
+ The second method for identifying the unique identifiers in your file is to use the
168
+ :parent_fields and :child_fields options. Use this option when your data represents
169
+ a tree structure flattened to a table in parent-child form.
170
+
171
+ Using the :parent_fields and :child_fields with field numbers:
149
172
 
150
173
  ```ruby
151
174
  diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
152
175
  ```
153
176
 
154
- 4. Using the :parent_fields and :child_fields with column names:
177
+ Using the :parent_fields and :child_fields with column names:
155
178
 
156
179
  ```ruby
157
180
  diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
158
181
  ```
159
182
 
160
- ### Using Non-CSV File Sources
183
+ ### Using Non-CSV Sources
161
184
 
162
185
  Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
163
186
  of Arrays:
@@ -174,7 +197,53 @@ DATA2 = [
174
197
  ['A', 'A2', 'Account2']
175
198
  ]
176
199
 
177
- diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
200
+ diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
201
+ ```
202
+
203
+ Data can also be diffed if it is an XML source, although this requires a little
204
+ more effort to tell CSVDiff how to transform/extract content from the XML document
205
+ into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
206
+ will need to install this gem to use CSVDiff with XML sources.
207
+
208
+ The first step is to use the CSVDiff::XMLSource class to define how to convert
209
+ your XML content to an array-of-arrays. The XMLSource class is quite flexible,
210
+ and can be used to convert single or multiple XML sources into a single data set
211
+ for diffing, and different documents may even have different layouts.
212
+
213
+ Begin by creating an XMLSource object, which requires a label to
214
+ identify the type of data it will generate:
215
+ ```ruby
216
+ xml_source1 = CSVDiff::XMLSource.new('My Label')
217
+ ```
218
+
219
+ Next, we pass XML documents to this source, and specify XPath expressions for each
220
+ row and column of data to produce via the `process(rec_xpath, field_maps, options)`
221
+ method:
222
+
223
+ * An XPath expression is provided to select each node value in the document that
224
+ will represent a row. Taking an HTML table as an example of something we wanted
225
+ to parse, your rec_xpath value might be something like the following:
226
+ `'//table/tbody/tr'`. This would locate all tables in the document, and create
227
+ a new row of data in the XMLSource every time a `<tr>` tag was encountered.
228
+ * A hash of field_maps is then provided to describe how to generate column values
229
+ for each row of data. The keys to field_maps are the names of the fields to be
230
+ output, while the values are the expression for how to generate values. Most
231
+ commonly, this will be another XPath expression that is evaluated in the context
232
+ of the node returned by the row XPath expression. So continuing our HTML example,
233
+ we might use `'./td[1]/text()'` as an expression to select the content of the
234
+ first `<td>` element within the `<tr>` representing the current row.
235
+
236
+ ```ruby
237
+ xml_source1.process('//table/tbody/tr',
238
+ col_A: './td[1]/text()',
239
+ col_B: './td[2]/text()',
240
+ col_C: './td[3]/text()')
241
+ ```
242
+
243
+ Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
244
+ objects as the source:
245
+ ```ruby
246
+ diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
178
247
  ```
179
248
 
180
249
  ### Specifying Column Names
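As a rough illustration of the row and field mapping described in the README changes above, the sketch below turns a small XML document into an array of arrays. It uses Ruby's bundled REXML library rather than Nokogiri, and it is not the implementation of CSVDiff::XMLSource. The `extract_rows` helper and the `SAMPLE_XML` document are hypothetical names made up for this example. Note that XPath positions are 1-based, so `td[1]` selects the first cell.

```ruby
require 'rexml/document'

# Hypothetical sample document; not taken from the gem or its tests.
SAMPLE_XML = <<~XML_DOC
  <html><body><table><tbody>
    <tr><td>A1</td><td>Account1</td><td>100</td></tr>
    <tr><td>A2</td><td>Account2</td><td>200</td></tr>
  </tbody></table></body></html>
XML_DOC

# Hypothetical helper: one output row per node matched by rec_xpath, and one
# column per entry in field_maps, evaluated relative to that row node.
def extract_rows(xml, rec_xpath, field_maps)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, rec_xpath).map do |row_node|
    field_maps.values.map do |field_xpath|
      text_node = REXML::XPath.first(row_node, field_xpath)
      text_node&.value # nil when the expression matches nothing
    end
  end
end

rows = extract_rows(SAMPLE_XML, '//table/tbody/tr',
                    col_A: './td[1]/text()',
                    col_B: './td[2]/text()',
                    col_C: './td[3]/text()')
p rows
```

The resulting array of arrays has the same shape as the DATA1/DATA2 style of input that the README describes for non-CSV sources.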
data/lib/csv-diff/source.rb CHANGED
@@ -66,10 +66,13 @@ class CSVDiff
66
66
  # If not specified via one of the options, the first field is assumed to
67
67
  # be the unique key.
68
68
  #
69
- # If multiple fields combine to form a unique key, the parent is assumed
70
- # to be identified by all but the last field of the unique key. If finer
71
- # control is required, use a combination of the :parent_fields and
72
- # :child_fields options.
69
+ # If multiple fields combine to form a unique key, the combined fields
70
+ # are considered as a single unique identifier. If your key represents
71
+ # data that can be represented as a tree, you can instead break your key
72
+ # fields into :parent_fields and :child_fields. By doing this, if a child
73
+ # key is deleted from one parent, and added to another, that will be
74
+ # reported as an update, with a change to the parent key part(s) of the
75
+ # record.
73
76
  #
74
77
  # All key options can be specified either by field name, or by field
75
78
  # index (0 based).
@@ -100,8 +103,8 @@ class CSVDiff
100
103
  if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
101
104
  (kf = options.fetch(:key_field, options[:key_fields]))
102
105
  @key_fields = [kf].flatten
103
- @parent_fields = @key_fields[0...-1]
104
- @child_fields = @key_fields[-1..-1]
106
+ @parent_fields = []
107
+ @child_fields = @key_fields
105
108
  else
106
109
  @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
107
110
  @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
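To make the behavioural change in the hunk above easier to see, here is a minimal standalone sketch that mirrors the option handling as a plain function. The `resolve_key_fields` name, the `legacy:` flag, and the returned hash are hypothetical and exist only for this comparison. The way the key is combined in the `else` branch is an assumption, since that line is not shown in the diff.

```ruby
# Options that switch key handling into explicit parent/child mode.
SPLIT_OPTIONS = %i[parent_field parent_fields child_field child_fields].freeze

# Hypothetical helper mirroring the logic in the hunk above.
# legacy: true follows the removed (0.5.0) lines, legacy: false the added (0.6.0) lines.
def resolve_key_fields(options, legacy: false)
  kf = options.fetch(:key_field, options[:key_fields])
  if (options.keys & SPLIT_OPTIONS).empty? && kf
    key_fields = [kf].flatten
    if legacy
      # 0.5.0: all but the last key field were treated as the parent.
      parent_fields = key_fields[0...-1]
      child_fields = key_fields[-1..-1]
    else
      # 0.6.0: the whole key is one flat identifier with no parent part.
      parent_fields = []
      child_fields = key_fields
    end
  else
    parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    # Assumption: the full key is the parent fields followed by the child fields.
    key_fields = parent_fields + child_fields
  end
  { key: key_fields, parent: parent_fields, child: child_fields }
end

old_split = resolve_key_fields({ key_fields: [0, 1] }, legacy: true)
new_split = resolve_key_fields({ key_fields: [0, 1] })
explicit  = resolve_key_fields({ parent_field: 1, child_fields: [2, 3] })

p old_split
p new_split
p explicit
```

Under this reading, callers who relied on `key_fields: [0, 1]` being split implicitly into parent and child parts would need to pass `parent_field` and `child_field` explicitly from 0.6.0 onwards, which matches the README change from `key_fields: [1, 0]` to `parent_field: 1, child_field: 0`.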
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: csv-diff
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.0
4
+ version: 0.6.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adam Gardiner
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-07-15 00:00:00.000000000 Z
11
+ date: 2020-08-28 00:00:00.000000000 Z
12
12
  dependencies: []
13
13
  description: |2
14
14
  This library performs diffs of CSV data, or any table-like source.