csv-diff 0.5.0 → 0.6.0
- checksums.yaml +4 -4
- data/README.md +94 -25
- data/lib/csv-diff/source.rb +9 -6
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9dde7ded89bb58f75505ae9237c97b8acd365c42
+  data.tar.gz: adcf17af6b67797c9b5fbab80b6c7d421cd73d6e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1aad2d1174758488d1e984239ae38917211bf1e185d2b37350a402e145bde652b44e437774f2695bb33c2b48531637f6c9ed13fb5fbaa0b610ac9d7810ec16ed
+  data.tar.gz: 3aab02344cfa4f111433c616caa445f43d3f4515c65e8ef7175f2287996d8f6bc592e1daca5683bef5d561e21f6f23e954a1115854577cced4551582ec4fa293
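As a rough illustration of how digests like those listed in checksums.yaml could be checked locally, the sketch below computes SHA1 and SHA512 hex digests for a file using Ruby's standard `digest` library. The helper name `file_checksums` is hypothetical, and a temporary file stands in for `metadata.gz` or `data.tar.gz`.

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper: returns the hex digests of the file at +path+,
# keyed the same way as the entries in checksums.yaml.
def file_checksums(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# A temporary file stands in for a real gem archive member.
sample = Tempfile.new('sample')
sample.write('abc')
sample.flush
sums = file_checksums(sample.path)
puts sums['SHA1']
```

The resulting strings would be compared against the values recorded for the corresponding archive member.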
data/README.md
CHANGED
@@ -1,14 +1,19 @@
 # CSV-Diff
 
-CSV-Diff is a small library for performing diffs of
+CSV-Diff is a small library for performing diffs of tabular data, typically
+data loaded from CSV files.
 
 Unlike a standard diff that compares line by line, and is sensitive to the
 ordering of records, CSV-Diff identifies common lines by key field(s), and
 then compares the contents of the fields in each line.
 
-Data may be supplied in the form of CSV files, or as an array of arrays.
-
-
+Data may be supplied in the form of CSV files, or as an array of arrays.
+More complex usage also allows you to specify XPath expressions to extract
+tabular data from XML documents for diffing.
+
+The diff process provides a fine level of control over what to diff, and can
+optionally ignore certain types of changes (e.g. adds, deletes, changes in
+position etc).
 
 CSV-Diff is particularly well suited to data in parent-child format. Parent-
 child data does not lend itself well to standard text diffs, as small changes
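The idea described in the README text, matching rows by key field(s) rather than by position and then comparing their contents, can be sketched in plain Ruby. This is only an illustration of the concept, not the gem's implementation; `keyed_diff` and the sample rows are made up for the example.

```ruby
# Hypothetical sketch: index each data set by its key columns, then compare.
def keyed_diff(left, right, key_cols = [0])
  index = ->(rows) { rows.to_h { |row| [row.values_at(*key_cols), row] } }
  l = index.(left)
  r = index.(right)
  {
    adds:    (r.keys - l.keys).map { |k| r[k] },
    deletes: (l.keys - r.keys).map { |k| l[k] },
    # Rows present on both sides whose field contents differ
    updates: (l.keys & r.keys).reject { |k| l[k] == r[k] }.map { |k| [l[k], r[k]] }
  }
end

old_rows = [['1', 'Alice', 'London'], ['2', 'Bob', 'Paris']]
new_rows = [['2', 'Bob', 'Berlin'], ['3', 'Carol', 'Rome'], ['1', 'Alice', 'London']]

result = keyed_diff(old_rows, new_rows)
p result[:adds]    # => [["3", "Carol", "Rome"]]
p result[:deletes] # => []
```

Row `1` changes position between the two data sets but is not reported, which is the difference from a line-by-line text diff.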
@@ -21,17 +26,19 @@ sibling order.
 ## Usage
 
 CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
-
-
-
+
+```
+gem install csv-diff
+```
 
 To compare two CSV files where the field names are in the first row of the file,
 and the first field contains the unique key for each record, simply use:
-```ruby
-require 'csv-diff'
 
-
-
+```ruby
+require 'csv-diff'
+
+diff = CSVDiff.new(file1, file2)
+```
 
 The returned diff object can be queried for the differences that exist between
 the two files, e.g.:
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
 
 The more correct specification of this file is that column 0 contains a unique parent
 identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
-deduce that there
+deduce that there are in fact only two changes in order - the swap in positions of A and
 B below Root.
 
 Note: If you aren't interested in changes in the order of siblings, then you could use
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
 The simplest use case is as shown above, where the data to be diffed is in CSV files
 with the column names as the first record, and where the unique key is the first
 column in the data. In this case, a diff can be created simply via:
-
-
-
+
+```ruby
+diff = CSVDiff.new(file1, file2)
+```
 
 ### Specifying Unique Row Identifiers
 
 Often however, rows are not uniquely identifiable via the first column in the file.
 In a parent-child hierarchy, for example, combinations of parent and child may be
-necessary to uniquely identify a row
-
-
+necessary to uniquely identify a row, while in other cases a combination of fields
+may be needed to derive a natural unique key or identifier for each row.
+In these cases, it is necessary to indicate to CSVDiff which fields are needed to
+uniquely identify common rows across the two files. This can be done in several
+different ways.
 
-
+#### :key_field(s)
+
+The first method is using the **key_fields** option (or key_field if you have only a
+single key field). Use this option when your data represents a flat structure rather
+than a parent-child hierarchy or flattened tree. You can specify key_fields using
+field numbers/column indices (0-based):
 
 ```ruby
 diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
 ```
 
-
+Alternatively, you can use the :key_fields option with column names (provided CSVDiff
+knows the names of your fields, either via the **field_names** option or from headers
+in the file):
 
 ```ruby
-diff = CSVDiff.new(file1, file2, key_fields: ['
+diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
 ```
 
-
+#### :parent_field(s)/:child_field(s)
+
+The second method for identifying the unique identifiers in your file is to use the
+:parent_fields and :child_fields options. Use this option when your data represents
+a tree structure flattened to a table in parent-child form.
+
+Using the :parent_fields and :child_fields with field numbers:
 
 ```ruby
 diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
 ```
 
-
+Using the :parent_fields and :child_fields with column names:
 
 ```ruby
 diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
 ```
 
-### Using Non-CSV
+### Using Non-CSV Sources
 
 Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
 of Arrays:
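To make the two ways of naming key fields concrete, here is a minimal sketch of how 0-based indices and column names might both be resolved to the same composite key. The helpers `key_indices` and `row_key` are hypothetical and exist only for this illustration; they are not part of the gem.

```ruby
# Hypothetical helper: map key fields given as indices or names to indices.
def key_indices(key_fields, field_names)
  Array(key_fields).map do |kf|
    next kf if kf.is_a?(Integer) # already a 0-based column index
    field_names.index(kf) || raise(ArgumentError, "Unknown key field: #{kf}")
  end
end

# Hypothetical helper: the composite key is the list of key column values.
def row_key(row, key_fields, field_names)
  row.values_at(*key_indices(key_fields, field_names))
end

header = ['First Name', 'Last Name', 'City']
row    = ['Ada', 'Lovelace', 'London']

p row_key(row, [0, 1], header)                      # => ["Ada", "Lovelace"]
p row_key(row, ['First Name', 'Last Name'], header) # => ["Ada", "Lovelace"]
```

Name-based keys only work when the field names are known, which matches the README's caveat about headers or the field_names option.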
@@ -174,7 +197,53 @@ DATA2 = [
   ['A', 'A2', 'Account2']
 ]
 
-diff = CSVDiff.new(DATA1, DATA2,
+diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
+```
+
+Data can also be diffed if it is an XML source, although this requires a little
+more effort to tell CSVDiff how to transform/extract content from the XML document
+into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
+will need to install this gem to use CSVDiff with XML sources.
+
+This is done with the CSVDiff::XMLSource class, which defines how to convert
+your XML content to an array-of-arrays. The XMLSource class is quite flexible,
+and can be used to convert single or multiple XML sources into a single data set
+for diffing, and different documents may even have different layouts.
+
+The first step is to create an XMLSource object, which requires a label to
+identify the type of data it will generate:
+```ruby
+xml_source1 = CSVDiff::XMLSource.new('My Label')
+```
+
+Next, we pass XML documents to this source, and specify XPath expressions for each
+row and column of data to produce via the `process(rec_xpath, field_maps, options)`
+method:
+
+* An XPath expression is provided to select each node value in the document that
+  will represent a row. Taking an HTML table as an example of something we wanted
+  to parse, your rec_xpath value might be something like the following:
+  `'//table/tbody/tr'`. This would locate all tables in the document, and create
+  a new row of data in the XMLSource every time a `<tr>` tag was encountered.
+* A hash of field_maps is then provided to describe how to generate column values
+  for each row of data. The keys to field_maps are the names of the fields to be
+  output, while the values are the expression for how to generate values. Most
+  commonly, this will be another XPath expression that is evaluated in the context
+  of the node returned by the row XPath expression. So continuing our HTML example,
+  we might use `'./td[1]/text()'` as an expression to select the content of the
+  first `<td>` element within the `<tr>` representing the current row.
+
+```ruby
+xml_source1.process('//table/tbody/tr',
+  col_A: './td[1]/text()',
+  col_B: './td[2]/text()',
+  col_C: './td[3]/text()')
+```
+
+Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
+objects as the source:
+```ruby
+diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
 ```
 
 ### Specifying Column Names
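The row and column extraction the README describes can be tried without the gem. The sketch below swaps in Ruby's bundled REXML library instead of Nokogiri and builds an array of arrays from a small table, using one XPath per row and one per column. It illustrates the technique only and is not the XMLSource implementation; the sample markup and variable names are invented. XPath positions are 1-based, so `td[1]` selects the first cell of a row.

```ruby
require 'rexml/document'

# Invented sample markup standing in for a real XML/XHTML document.
markup = <<~XML
  <table>
    <tbody>
      <tr><td>1</td><td>Alice</td><td>London</td></tr>
      <tr><td>2</td><td>Bob</td><td>Paris</td></tr>
    </tbody>
  </table>
XML

doc = REXML::Document.new(markup)

# Field name => XPath evaluated relative to each row node.
field_maps = {
  col_A: './td[1]/text()',
  col_B: './td[2]/text()',
  col_C: './td[3]/text()'
}

# One output row per <tr>, one value per field map entry.
rows = REXML::XPath.match(doc, '//table/tbody/tr').map do |tr|
  field_maps.values.map { |xpath| REXML::XPath.first(tr, xpath).to_s }
end

# Header row first, so the result has the array-of-arrays shape used for diffing.
table = [field_maps.keys.map(&:to_s)] + rows
p table
```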
data/lib/csv-diff/source.rb
CHANGED
@@ -66,10 +66,13 @@ class CSVDiff
 # If not specified via one of the options, the first field is assumed to
 # be the unique key.
 #
-# If multiple fields combine to form a unique key, the
-#
-#
-# :child_fields
+# If multiple fields combine to form a unique key, the combined fields
+# are considered as a single unique identifier. If your key represents
+# data that can be represented as a tree, you can instead break your key
+# fields into :parent_fields and :child_fields. By doing this, if a child
+# key is deleted from one parent, and added to another, that will be
+# reported as an update, with a change to the parent key part(s) of the
+# record.
 #
 # All key options can be specified either by field name, or by field
 # index (0 based).
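The behaviour described in that comment, a child moving between parents being reported as an update rather than a delete plus an add, can be illustrated with a small sketch. It assumes child keys are unique across the whole data set; the `reparented` helper and sample rows are hypothetical and are not taken from the gem.

```ruby
# Hypothetical sketch: match rows on the child key only, then report rows
# whose parent key part differs between the two data sets.
def reparented(left, right, parent_cols, child_cols)
  by_child = lambda do |rows|
    rows.to_h { |row| [row.values_at(*child_cols), row.values_at(*parent_cols)] }
  end
  l = by_child.(left)
  r = by_child.(right)
  (l.keys & r.keys).reject { |k| l[k] == r[k] }
                   .map { |k| { child: k, from: l[k], to: r[k] } }
end

# Columns: parent, child. A1 moves from parent A to parent B.
before_rows = [['Root', 'A'], ['Root', 'B'], ['A', 'A1']]
after_rows  = [['Root', 'A'], ['Root', 'B'], ['B', 'A1']]

moves = reparented(before_rows, after_rows, [0], [1])
puts moves.length # => 1
```

With a single combined key of parent plus child, the same change would instead look like one row removed and an unrelated row added.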
@@ -100,8 +103,8 @@ class CSVDiff
 if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
     (kf = options.fetch(:key_field, options[:key_fields]))
   @key_fields = [kf].flatten
-  @parent_fields =
-  @child_fields = @key_fields
+  @parent_fields = []
+  @child_fields = @key_fields
 else
   @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
   @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
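To see what this branch does for different option hashes, the logic can be lifted into a standalone function. This is a sketch for experimentation rather than the library's code: `resolve_key_options` is a hypothetical name, and the line that derives `key_fields` in the `else` branch is an assumption, since that part of the file is not shown in the hunk.

```ruby
# Hypothetical standalone version of the option handling in the hunk.
def resolve_key_options(options)
  tree_keys = [:parent_field, :parent_fields, :child_field, :child_fields]
  if (options.keys & tree_keys).empty? &&
     (kf = options.fetch(:key_field, options[:key_fields]))
    # Flat key: no parent part, every key field is treated as a child field.
    key_fields    = [kf].flatten
    parent_fields = []
    child_fields  = key_fields
  else
    parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child_fields  = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    # Assumption: the full key is the parent part followed by the child part.
    key_fields    = parent_fields + child_fields
  end
  { key_fields: key_fields, parent_fields: parent_fields, child_fields: child_fields }
end

flat    = resolve_key_options(key_fields: [0, 1])
tree    = resolve_key_options(parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
default = resolve_key_options({})
p flat[:parent_fields]   # => []
p default[:child_fields] # => [0]
```

With no key options at all, the first field (index 0) ends up as the sole child field, matching the documented default.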
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: csv-diff
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.0
 platform: ruby
 authors:
 - Adam Gardiner
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-
+date: 2020-08-28 00:00:00.000000000 Z
 dependencies: []
 description: |2
     This library performs diffs of CSV data, or any table-like source.