csv-diff 0.5.0 → 0.6.0
- checksums.yaml +4 -4
- data/README.md +94 -25
- data/lib/csv-diff/source.rb +9 -6
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9dde7ded89bb58f75505ae9237c97b8acd365c42
+  data.tar.gz: adcf17af6b67797c9b5fbab80b6c7d421cd73d6e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1aad2d1174758488d1e984239ae38917211bf1e185d2b37350a402e145bde652b44e437774f2695bb33c2b48531637f6c9ed13fb5fbaa0b610ac9d7810ec16ed
+  data.tar.gz: 3aab02344cfa4f111433c616caa445f43d3f4515c65e8ef7175f2287996d8f6bc592e1daca5683bef5d561e21f6f23e954a1115854577cced4551582ec4fa293
data/README.md
CHANGED
@@ -1,14 +1,19 @@
 # CSV-Diff
 
-CSV-Diff is a small library for performing diffs of
+CSV-Diff is a small library for performing diffs of tabular data, typically
+data loaded from CSV files.
 
 Unlike a standard diff that compares line by line, and is sensitive to the
 ordering of records, CSV-Diff identifies common lines by key field(s), and
 then compares the contents of the fields in each line.
 
-Data may be supplied in the form of CSV files, or as an array of arrays.
-
-
+Data may be supplied in the form of CSV files, or as an array of arrays.
+More complex usage also allows you to specify XPath expressions to extract
+tabular data from XML documents for diffing.
+
+The diff process provides a fine level of control over what to diff, and can
+optionally ignore certain types of changes (e.g. adds, deletes, changes in
+position etc).
 
 CSV-Diff is particularly well suited to data in parent-child format. Parent-
 child data does not lend itself well to standard text diffs, as small changes
@@ -21,17 +26,19 @@ sibling order.
 ## Usage
 
 CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
-
-
-
+
+```
+gem install csv-diff
+```
 
 To compare two CSV files where the field names are in the first row of the file,
 and the first field contains the unique key for each record, simply use:
-```ruby
-require 'csv-diff'
 
-
-
+```ruby
+require 'csv-diff'
+
+diff = CSVDiff.new(file1, file2)
+```
 
 The returned diff object can be queried for the differences that exist between
 the two files, e.g.:
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
 
 The more correct specification of this file is that column 0 contains a unique parent
 identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
-deduce that there
+deduce that there are in fact only two changes in order - the swap in positions of A and
 B below Root.
 
 Note: If you aren't interested in changes in the order of siblings, then you could use
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
 The simplest use case is as shown above, where the data to be diffed is in CSV files
 with the column names as the first record, and where the unique key is the first
 column in the data. In this case, a diff can be created simply via:
-
-
-
+
+```ruby
+diff = CSVDiff.new(file1, file2)
+```
 
 ### Specifying Unique Row Identifiers
 
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
-necessary to uniquely identify a row
-
-
+necessary to uniquely identify a row, while in other cases a combination of fields
+may be needed to derive a natural unique key or identifier for each row.
+In these cases, it is necessary to indicate to CSVDiff which fields are needed to
+uniquely identify common rows across the two files. This can be done in several
+different ways.
 
-
+#### :key_field(s)
+
+The first method is using the **key_fields** option (or key_field if you have only a
+single key field). Use this option when your data represents a flat structure rather
+than a parent-child hierarchy or flattened tree. You can specify key_fields using
+field numbers/column indices (0-based):
 
 ```ruby
 diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
 ```
 
-
+Alternatively, you can use the :key_fields options with column names (provided CSVDiff
+knows the names of your fields, either via the **field_names** option or from headers
+in the file):
 
 ```ruby
-diff = CSVDiff.new(file1, file2, key_fields: ['
+diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
 ```
 
-
+#### :parent_field(s)/:child_field(s)
+
+The second method for identifying the unique identifiers in your file is to use the
+:parent_fields and :child_fields options. Use this option when your data represents
+a tree structure flattened to a table in parent-child form.
+
+Using the :parent_fields and :child_fields with field numbers:
 
 ```ruby
 diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
 ```
 
-
+Using the :parent_fields and :child_fields with column names:
 
 ```ruby
 diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
 ```
 
-### Using Non-CSV
+### Using Non-CSV Sources
 
 Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
 of Arrays:
@@ -174,7 +197,53 @@ DATA2 = [
     ['A', 'A2', 'Account2']
 ]
 
-diff = CSVDiff.new(DATA1, DATA2,
+diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
+```
+
+Data can also be diffed if it is an XML source, although this requires a little
+more effort to tell CSVDiff how to transform/extract content from the XML document
+into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
+will need to install this gem to use CSVDiff with XML sources.
+
+The first step is to use the CSVDiff::XMLSource class to define how to convert
+your XML content to an array-of-arrays. The XMLSource class is quite flexible,
+and can be used to convert single or multiple XML sources into a single data set
+for diffing, and different documents may even have different layouts.
+
+The first step is to create an XMLSource object, which requires a label to
+identify the type of data it will generate:
+```ruby
+xml_source1 = CSVDiff::XMLSource.new('My Label')
+```
+
+Next, we pass XML documents to this source, and specify XPath expressions for each
+row and column of data to produce via the `process(rec_xpath, field_maps, options)`
+method:
+
+* An XPath expression is provided to select each node value in the document that
+  will represent a row. Taking an HTML table as an example of something we wanted
+  to parse, your rec_xpath value might be something like the following:
+  `'//table/tbody/tr'`. This would locate all tables in the document, and create
+  a new row of data in the XMLSource every time a `<tr>` tag was encountered.
+* A hash of field_maps is then provided to describe how to generate column values
+  for each row of data. The keys to field_maps are the names of the fields to be
+  output, while the values are the expression for how to generate values. Most
+  commonly, this will be another XPath expression that is evaluated in the context
+  of the node returned by the row XPath expression. So continuing our HTML example,
+  we might use `'./td[1]/text()'` as an expression to select the content of the
+  first `<td>` element within the `<tr>` representing the current row (XPath
+  positions are 1-based).
+
+```ruby
+xml_source1.process('//table/tbody/tr',
+                    col_A: './td[1]/text()',
+                    col_B: './td[2]/text()',
+                    col_C: './td[3]/text()')
+```
+
+Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
+objects as the source:
+```ruby
+diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
 ```
 
 ### Specifying Column Names
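The key-based comparison that the README hunks above describe can be illustrated in plain Ruby. This is only a toy sketch of the idea (match rows by key, then compare fields), not CSVDiff's actual implementation; the names `index_by_key`, `toy_diff`, `LEFT_ROWS` and `RIGHT_ROWS` are hypothetical and the sample data is invented for illustration.

```ruby
# Toy illustration of a key-based diff over arrays of arrays.
# Assumption: the first row of each data set holds the field names.
# This is NOT how the csv-diff gem is implemented internally.

# Build a hash of { key values => { field name => value } } for one data set.
def index_by_key(rows, key_fields)
  header, *records = rows
  # Accept either 0-based column indices or column names for the key.
  key_idx = key_fields.map { |f| f.is_a?(Integer) ? f : header.index(f) }
  records.each_with_object({}) do |rec, idx|
    idx[rec.values_at(*key_idx)] = header.zip(rec).to_h
  end
end

# Report adds, deletes and field-level updates; row order is ignored.
def toy_diff(left, right, key_fields: [0])
  l = index_by_key(left, key_fields)
  r = index_by_key(right, key_fields)
  diffs = {}
  (r.keys - l.keys).each { |k| diffs[k] = { action: :add } }
  (l.keys - r.keys).each { |k| diffs[k] = { action: :delete } }
  (l.keys & r.keys).each do |k|
    changed = l[k].keys.select { |f| l[k][f] != r[k][f] }
    diffs[k] = { action: :update, fields: changed } unless changed.empty?
  end
  diffs
end

LEFT_ROWS = [
  ['First Name', 'Last Name', 'City'],
  ['Ann', 'Lee', 'Perth'],
  ['Bob', 'Ng', 'Sydney']
].freeze

RIGHT_ROWS = [
  ['First Name', 'Last Name', 'City'],
  ['Bob', 'Ng', 'Hobart'],
  ['Ann', 'Lee', 'Perth'],
  ['Cy', 'Roe', 'Darwin']
].freeze

p toy_diff(LEFT_ROWS, RIGHT_ROWS, key_fields: ['First Name', 'Last Name'])
```

Because rows are matched on the combined key rather than on position, the reordering of Ann and Bob is not reported; only the added row and the changed `City` field are.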
data/lib/csv-diff/source.rb
CHANGED
@@ -66,10 +66,13 @@ class CSVDiff
 # If not specified via one of the options, the first field is assumed to
 # be the unique key.
 #
-# If multiple fields combine to form a unique key, the
-#
-#
-# :child_fields
+# If multiple fields combine to form a unique key, the combined fields
+# are considered as a single unique identifier. If your key represents
+# data that can be represented as a tree, you can instead break your key
+# fields into :parent_fields and :child_fields. By doing this, if a child
+# key is deleted from one parent, and added to another, that will be
+# reported as an update, with a change to the parent key part(s) of the
+# record.
 #
 # All key options can be specified either by field name, or by field
 # index (0 based).
@@ -100,8 +103,8 @@ class CSVDiff
 if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
    (kf = options.fetch(:key_field, options[:key_fields]))
     @key_fields = [kf].flatten
-    @parent_fields =
-    @child_fields = @key_fields
+    @parent_fields = []
+    @child_fields = @key_fields
 else
     @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
     @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
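To make the effect of the 0.6.0 lines in this hunk easier to see, the branch can be lifted into a standalone method. This is a sketch only: `resolve_key_fields` is a hypothetical helper that does not exist in the gem, and deriving the combined key as parent fields followed by child fields in the `else` branch is an assumption, since that part of `source.rb` is not shown in the diff.

```ruby
# Hypothetical standalone version of the option handling shown in the hunk.
# Returns the resolved key, parent and child field lists as a hash.
def resolve_key_fields(options)
  tree_opts = [:parent_field, :parent_fields, :child_field, :child_fields]
  if (options.keys & tree_opts).empty? &&
     (kf = options.fetch(:key_field, options[:key_fields]))
    # Flat key: as of 0.6.0 there are no parent fields, and every key
    # field is treated as a child field.
    key_fields = [kf].flatten
    parent_fields = []
    child_fields = key_fields
  else
    # Tree-style key, or no key options at all (defaults to column 0).
    parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    # Assumption: the combined key is parent fields followed by child fields.
    key_fields = parent_fields + child_fields
  end
  { key_fields: key_fields, parent_fields: parent_fields, child_fields: child_fields }
end

p resolve_key_fields(key_fields: ['First Name', 'Last Name'])
p resolve_key_fields(parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
p resolve_key_fields({})
```

The first call exercises the changed lines: with only `key_fields` given, the parent list is empty and the child list equals the key list.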
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.0
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-
+date: 2020-08-28 00:00:00.000000000 Z
 dependencies: []
 description: |2
     This library performs diffs of CSV data, or any table-like source.