csv-diff 0.3.1 → 0.6.1
- checksums.yaml +7 -0
- data/LICENSE +1 -1
- data/README.md +112 -26
- data/lib/csv-diff.rb +1 -0
- data/lib/csv-diff/algorithm.rb +119 -35
- data/lib/csv-diff/csv_diff.rb +6 -6
- data/lib/csv-diff/csv_source.rb +19 -111
- data/lib/csv-diff/source.rb +289 -0
- data/lib/csv-diff/xml_source.rb +142 -0
- metadata +43 -32
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 87807b9af487947c60d18ac81d38b6782133dfdb
|
4
|
+
data.tar.gz: 3681d305dc566f49e7b5166fd22f5ea858f1260f
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: d4b617cae1c1e2201633ba5bdb2752aa16af6e293ac32c75b812502c11e54f8d930b965d7e86a81fd5592b1a469db2498a71b635fd563d576039a4808b6dd776
|
7
|
+
data.tar.gz: a6561a7c91e8e4cb8ef8487032fd8bd46794bc79854dca274d7134e251741395f049c2e5c8ca603e257310a961788d6c1472fdc9bf82a171b099f869c59cefcf
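The digests above cover the packaged `metadata.gz` and `data.tar.gz` archives. As a rough sketch of how such a digest could be checked locally, the following compares a file's SHA512 digest with an expected hex string. The `sha512_matches?` helper and the temporary file are hypothetical and used only for illustration; they are not part of the gem or of RubyGems.

```ruby
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper: compares a file's SHA512 digest with an expected hex string.
def sha512_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end

# Stand-in for a downloaded archive; a real check would point at data.tar.gz.
ok = Tempfile.create('archive') do |f|
  f.write('example archive contents')
  f.flush
  expected = Digest::SHA512.hexdigest('example archive contents')
  sha512_matches?(f.path, expected)
end
```

For a real gem, the expected value would be the `data.tar.gz` entry under `SHA512:` in checksums.yaml.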
|
data/LICENSE
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,19 @@
|
|
1
1
|
# CSV-Diff
|
2
2
|
|
3
|
-
CSV-Diff is a small library for performing diffs of
|
3
|
+
CSV-Diff is a small library for performing diffs of tabular data, typically
|
4
|
+
data loaded from CSV files.
|
4
5
|
|
5
6
|
Unlike a standard diff that compares line by line, and is sensitive to the
|
6
7
|
ordering of records, CSV-Diff identifies common lines by key field(s), and
|
7
8
|
then compares the contents of the fields in each line.
|
8
9
|
|
9
|
-
Data may be supplied in the form of CSV files, or as an array of arrays.
|
10
|
-
|
11
|
-
|
10
|
+
Data may be supplied in the form of CSV files, or as an array of arrays.
|
11
|
+
More complex usage also allows you to specify XPath expressions to extract
|
12
|
+
tabular data from XML documents for diffing.
|
13
|
+
|
14
|
+
The diff process provides a fine level of control over what to diff, and can
|
15
|
+
optionally ignore certain types of changes (e.g. adds, deletes, changes in
|
16
|
+
position etc).
|
12
17
|
|
13
18
|
CSV-Diff is particularly well suited to data in parent-child format. Parent-
|
14
19
|
child data does not lend itself well to standard text diffs, as small changes
|
@@ -21,17 +26,19 @@ sibling order.
|
|
21
26
|
## Usage
|
22
27
|
|
23
28
|
CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
|
24
|
-
|
25
|
-
|
26
|
-
|
29
|
+
|
30
|
+
```
|
31
|
+
gem install csv-diff
|
32
|
+
```
|
27
33
|
|
28
34
|
To compare two CSV files where the field names are in the first row of the file,
|
29
35
|
and the first field contains the unique key for each record, simply use:
|
30
|
-
```ruby
|
31
|
-
require 'csv-diff'
|
32
36
|
|
33
|
-
|
34
|
-
|
37
|
+
```ruby
|
38
|
+
require 'csv-diff'
|
39
|
+
|
40
|
+
diff = CSVDiff.new(file1, file2)
|
41
|
+
```
|
35
42
|
|
36
43
|
The returned diff object can be queried for the differences that exist between
|
37
44
|
the two files, e.g.:
|
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
|
|
96
103
|
|
97
104
|
The more correct specification of this file is that column 0 contains a unique parent
|
98
105
|
identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
|
99
|
-
deduce that there
|
106
|
+
deduce that there are in fact only two changes in order - the swap in positions of A and
|
100
107
|
B below Root.
|
101
108
|
|
102
109
|
Note: If you aren't interested in changes in the order of siblings, then you could use
|
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
|
|
121
128
|
The simplest use case is as shown above, where the data to be diffed is in CSV files
|
122
129
|
with the column names as the first record, and where the unique key is the first
|
123
130
|
column in the data. In this case, a diff can be created simply via:
|
124
|
-
```ruby
|
125
|
-
diff = CSVDiff.new(file1, file2)
|
126
|
-
```
|
127
131
|
|
128
|
-
|
132
|
+
```ruby
|
133
|
+
diff = CSVDiff.new(file1, file2)
|
134
|
+
```
|
135
|
+
|
136
|
+
### Specifying Unique Row Identifiers
|
129
137
|
|
130
138
|
Often however, rows are not uniquely identifiable via the first column in the file.
|
131
139
|
In a parent-child hierarchy, for example, combinations of parent and child may be
|
132
|
-
necessary to uniquely identify a row
|
133
|
-
|
134
|
-
|
140
|
+
necessary to uniquely identify a row, while in other cases a combination of fields
|
141
|
+
may be needed to derive a natural unique key or identifier for each row.
|
142
|
+
In these cases, it is necessary to indicate to CSVDiff which fields are needed to
|
143
|
+
uniquely identify common rows across the two files. This can be done in several
|
144
|
+
different ways.
|
145
|
+
|
146
|
+
#### :key_field(s)
|
135
147
|
|
136
|
-
|
148
|
+
The first method is using the **key_fields** option (or key_field if you have only a
|
149
|
+
single key field). Use this option when your data represents a flat structure rather
|
150
|
+
than a parent-child hierarchy or flattened tree. You can specify key_fields using
|
151
|
+
field numbers/column indices (0-based):
|
137
152
|
|
138
153
|
```ruby
|
139
154
|
diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
|
140
155
|
```
|
141
156
|
|
142
|
-
|
157
|
+
Alternatively, you can use the :key_fields option with column names (provided CSVDiff
|
158
|
+
knows the names of your fields, either via the **field_names** option or from headers
|
159
|
+
in the file):
|
143
160
|
|
144
161
|
```ruby
|
145
|
-
diff = CSVDiff.new(file1, file2, key_fields: ['
|
162
|
+
diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
|
146
163
|
```
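
To make the role of the key fields more concrete, here is a minimal sketch of how a composite key could be derived from a row. It assumes that key values are joined with `~`, the separator the gem's algorithm uses when splitting keys. The `composite_key` helper and the sample rows are hypothetical and not part of the CSVDiff API.

```ruby
# Hypothetical helper: builds a single lookup key from the key field values.
def composite_key(row, key_field_indexes, case_sensitive: true)
  values = key_field_indexes.map { |i| row[i].to_s }
  values = values.map(&:upcase) unless case_sensitive
  values.join('~')
end

rows = [
  ['Ada',  'Lovelace', 'London'],
  ['Alan', 'Turing',   'Wilmslow']
]

# Key each row on the first two columns (first name, last name).
keys = rows.map { |row| composite_key(row, [0, 1]) }
# keys == ["Ada~Lovelace", "Alan~Turing"]

# Splitting with a limit of -1 keeps trailing empty key parts.
parts = 'Ada~'.split('~', -1)
# parts == ["Ada", ""]
```

Rows from the two files that produce the same key are treated as the same record, and their remaining fields are then compared.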
|
147
164
|
|
148
|
-
|
165
|
+
#### :parent_field(s)/:child_field(s)
|
166
|
+
|
167
|
+
The second method for identifying the unique identifiers in your file is to use the
|
168
|
+
:parent_fields and :child_fields options. Use this option when your data represents
|
169
|
+
a tree structure flattened to a table in parent-child form.
|
170
|
+
|
171
|
+
Using the :parent_fields and :child_fields with field numbers:
|
149
172
|
|
150
173
|
```ruby
|
151
174
|
diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
|
152
175
|
```
|
153
176
|
|
154
|
-
|
177
|
+
Using the :parent_fields and :child_fields with column names:
|
155
178
|
|
156
179
|
```ruby
|
157
180
|
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
|
158
181
|
```
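
The parent-child options can be pictured as building an index from each parent key to the ordered list of keys beneath it. The sketch below is an illustration under that assumption, using hypothetical sample data; it mirrors the `Hash.new{ |h, k| h[k] = [] }` index seen in the gem's source but is not the gem's own code.

```ruby
# Hypothetical data: column 0 is the parent, column 1 the child.
rows = [
  ['Root', 'A'],
  ['Root', 'B'],
  ['A',    'A1'],
  ['A',    'A2']
]
parent_fields = [0]
child_fields  = [1]

# Maps each parent key to the ordered list of full keys beneath it.
index = Hash.new { |h, k| h[k] = [] }
rows.each do |row|
  parent_key = parent_fields.map { |i| row[i] }.join('~')
  key = (parent_fields + child_fields).map { |i| row[i] }.join('~')
  index[parent_key] << key
end

# The sibling position is the 0-based position of a key within its parent.
position_of_b = index['Root'].index('Root~B')
# position_of_b == 1
```

Comparing a key's position under the same parent in both files is what allows a change in sibling order to be reported separately from a change in content.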
|
159
182
|
|
160
|
-
### Using Non-CSV
|
183
|
+
### Using Non-CSV Sources
|
161
184
|
|
162
185
|
Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
|
163
186
|
of Arrays:
|
@@ -174,7 +197,53 @@ DATA2 = [
|
|
174
197
|
['A', 'A2', 'Account2']
|
175
198
|
]
|
176
199
|
|
177
|
-
diff = CSVDiff.new(DATA1, DATA2,
|
200
|
+
diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
|
201
|
+
```
|
202
|
+
|
203
|
+
Data can also be diffed if it is an XML source, although this requires a little
|
204
|
+
more effort to tell CSVDiff how to transform/extract content from the XML document
|
205
|
+
into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
|
206
|
+
will need to install this gem to use CSVDiff with XML sources.
|
207
|
+
|
208
|
+
The first step is to use the CSVDiff::XMLSource class to define how to convert
|
209
|
+
your XML content to an array-of-arrays. The XMLSource class is quite flexible,
|
210
|
+
and can be used to convert single or multiple XML sources into a single data set
|
211
|
+
for diffing, and different documents may even have different layouts.
|
212
|
+
|
213
|
+
To begin, create an XMLSource object, which requires a label to
|
214
|
+
identify the type of data it will generate:
|
215
|
+
```ruby
|
216
|
+
xml_source1 = CSVDiff::XMLSource.new('My Label')
|
217
|
+
```
|
218
|
+
|
219
|
+
Next, we pass XML documents to this source, and specify XPath expressions for each
|
220
|
+
row and column of data to produce via the `process(rec_xpath, field_maps, options)`
|
221
|
+
method:
|
222
|
+
|
223
|
+
* An XPath expression is provided to select each node value in the document that
|
224
|
+
will represent a row. Taking an HTML table as an example of something we wanted
|
225
|
+
to parse, your rec_xpath value might be something like the following:
|
226
|
+
`'//table/tbody/tr'`. This would locate all tables in the document, and create
|
227
|
+
a new row of data in the XMLSource every time a `<tr>` tag was encountered.
|
228
|
+
* A hash of field_maps is then provided to describe how to generate column values
|
229
|
+
for each row of data. The keys to field_maps are the names of the fields to be
|
230
|
+
output, while the values are the expression for how to generate values. Most
|
231
|
+
commonly, this will be another XPath expression that is evaluated in the context
|
232
|
+
of the node returned by the row XPath expression. So continuing our HTML example,
|
233
|
+
we might use `'./td[1]/text()'` (XPath positions are 1-based) as an expression to select the content of the
|
234
|
+
first `<td>` element within the `<tr>` representing the current row.
|
235
|
+
|
236
|
+
```ruby
|
237
|
+
xml_source1.process('//table/tbody/tr',
|
238
|
+
col_A: './td[1]/text()',
|
239
|
+
col_B: './td[2]/text()',
|
240
|
+
col_C: './td[3]/text()')
|
241
|
+
```
|
242
|
+
|
243
|
+
Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
|
244
|
+
objects as the source:
|
245
|
+
```ruby
|
246
|
+
diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
|
178
247
|
```
|
179
248
|
|
180
249
|
### Specifying Column Names
|
@@ -211,6 +280,23 @@ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam'
|
|
211
280
|
ignore_fields: ['CreatedAt', 'UpdatedAt'])
|
212
281
|
```
|
213
282
|
|
283
|
+
### Filtering Rows
|
284
|
+
|
285
|
+
If you need to filter source data before running the diff process, you can use the :include
|
286
|
+
and :exclude options to do so. Both options take a Hash as their value; the hash should have
|
287
|
+
keys that are the field names or indexes (0-based) on which to filter, and whose values are
|
288
|
+
regular expressions or lambdas to be applied to values of the corresponding field. Rows will
|
289
|
+
only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
|
290
|
+
```ruby
|
291
|
+
# Generate a diff of Arsenal home games not refereed by Clattenburg
|
292
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
293
|
+
include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
|
294
|
+
|
295
|
+
# Generate a diff of games played over the Xmas/New Year period
|
296
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
297
|
+
include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
|
298
|
+
```
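
The filtering rule described above can be sketched as follows. This is an illustration only: the `matches?` and `keep_row?` helpers and the sample rows are hypothetical, and treating a plain string as an exact match is an assumption rather than documented gem behaviour.

```ruby
# Hypothetical helper: true when value satisfies a Regexp, lambda, or literal condition.
def matches?(condition, value)
  case condition
  when Regexp then condition.match?(value.to_s)
  when Proc   then !!condition.call(value)
  else condition.to_s == value.to_s   # assumption: plain values compare exactly
  end
end

# A row is kept only if it meets every include condition and no exclude condition.
def keep_row?(row, includes, excludes)
  includes.all? { |field, cond| matches?(cond, row[field]) } &&
    excludes.none? { |field, cond| matches?(cond, row[field]) }
end

games = [
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'M Clattenburg' },
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'H Webb' },
  { 'HomeTeam' => 'Chelsea', 'Referee' => 'H Webb' }
]

kept = games.select do |game|
  keep_row?(game, { 'HomeTeam' => /\AArsenal\z/ }, { 'Referee' => /Clattenburg/ })
end
# kept contains only the Arsenal game refereed by H Webb
```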
|
299
|
+
|
214
300
|
### Ignoring Certain Changes
|
215
301
|
|
216
302
|
CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
|
data/lib/csv-diff.rb
CHANGED
data/lib/csv-diff/algorithm.rb
CHANGED
@@ -3,6 +3,55 @@ class CSVDiff
|
|
3
3
|
# Implements the CSV diff algorithm.
|
4
4
|
module Algorithm
|
5
5
|
|
6
|
+
# Holds the details of a single difference
|
7
|
+
class Diff
|
8
|
+
|
9
|
+
attr_accessor :diff_type
|
10
|
+
attr_reader :fields
|
11
|
+
attr_reader :row
|
12
|
+
attr_reader :sibling_position
|
13
|
+
|
14
|
+
def initialize(diff_type, fields, row_idx, pos_idx)
|
15
|
+
@diff_type = diff_type
|
16
|
+
@fields = fields
|
17
|
+
@row = row_idx + 1
|
18
|
+
self.sibling_position = pos_idx
|
19
|
+
end
|
20
|
+
|
21
|
+
|
22
|
+
def sibling_position=(pos_idx)
|
23
|
+
if pos_idx.is_a?(Array)
|
24
|
+
pos_idx.compact!
|
25
|
+
if pos_idx.first != pos_idx.last
|
26
|
+
@sibling_position = pos_idx.map{ |pos| pos + 1 }
|
27
|
+
else
|
28
|
+
@sibling_position = pos_idx.first + 1
|
29
|
+
end
|
30
|
+
else
|
31
|
+
@sibling_position = pos_idx + 1
|
32
|
+
end
|
33
|
+
end
|
34
|
+
|
35
|
+
|
36
|
+
# For backwards compatibility and access to fields with differences
|
37
|
+
def [](key)
|
38
|
+
case key
|
39
|
+
when :action
|
40
|
+
a = diff_type.to_s
|
41
|
+
a[0] = a[0].upcase
|
42
|
+
a
|
43
|
+
when :row
|
44
|
+
@row
|
45
|
+
when :sibling_position
|
46
|
+
@sibling_position
|
47
|
+
else
|
48
|
+
@fields[key] || @fields[key.to_s.intern] || @fields[key.to_s]
|
49
|
+
end
|
50
|
+
end
|
51
|
+
|
52
|
+
end
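
The position handling in `Diff#sibling_position=` can be restated as a stand-alone function to show what values it produces. The `normalize_position` name is hypothetical; the logic follows the setter above, except that it uses the non-mutating `compact`.

```ruby
# Stand-alone restatement of the normalisation logic, for illustration only.
def normalize_position(pos_idx)
  if pos_idx.is_a?(Array)
    positions = pos_idx.compact
    if positions.first != positions.last
      positions.map { |pos| pos + 1 }   # moved: [from, to], both 1-based
    else
      positions.first + 1               # same position on both sides
    end
  else
    pos_idx + 1                         # single 0-based index becomes 1-based
  end
end

normalize_position(2)         # => 3
normalize_position([0, 3])    # => [1, 4]
normalize_position([nil, 2])  # => 3
normalize_position([1, 1])    # => 2
```

A two-element result therefore signals a move, while a single integer means the row kept (or only has) one position.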
|
53
|
+
|
54
|
+
|
6
55
|
# Diffs two CSVSource structures.
|
7
56
|
#
|
8
57
|
# @param left [CSVSource] A CSVSource object containing the contents of
|
@@ -22,43 +71,53 @@ class CSVDiff
|
|
22
71
|
# items that exist in both +left+ and +right+.
|
23
72
|
# @option options [Boolean] :ignore_deletes If set to true, we ignore any
|
24
73
|
# new items that appear only in +left+.
|
74
|
+
# @option options [Hash<Object,Proc>] :equality_procs A Hash mapping fields
|
75
|
+
# to a 2-arg Proc that should be used to compare values in that field for
|
76
|
+
# equality.
|
25
77
|
def diff_sources(left, right, key_fields, diff_fields, options = {})
|
26
78
|
unless left.case_sensitive? == right.case_sensitive?
|
27
79
|
raise ArgumentError, "Left and right must have same settings for case-sensitivity"
|
28
80
|
end
|
29
|
-
|
81
|
+
unless left.parent_fields.length == right.parent_fields.length
|
82
|
+
raise ArgumentError, "Left and right must have same settings for parent/child fields"
|
83
|
+
end
|
84
|
+
|
85
|
+
# Ensure key fields are not also in the diff_fields
|
86
|
+
diff_fields = diff_fields - key_fields
|
87
|
+
|
30
88
|
left_index = left.index
|
31
89
|
left_values = left.lines
|
32
90
|
left_keys = left_values.keys
|
33
91
|
right_index = right.index
|
34
92
|
right_values = right.lines
|
35
93
|
right_keys = right_values.keys
|
36
|
-
|
94
|
+
parent_field_count = left.parent_fields.length
|
37
95
|
|
38
96
|
include_adds = !options[:ignore_adds]
|
39
97
|
include_moves = !options[:ignore_moves]
|
40
98
|
include_updates = !options[:ignore_updates]
|
41
99
|
include_deletes = !options[:ignore_deletes]
|
42
100
|
|
43
|
-
|
101
|
+
@case_sensitive = left.case_sensitive?
|
102
|
+
@equality_procs = options.fetch(:equality_procs, {})
|
103
|
+
|
104
|
+
diffs = {}
|
105
|
+
potential_moves = Hash.new{ |h, k| h[k] = [] }
|
44
106
|
|
45
107
|
# First identify deletions
|
46
108
|
if include_deletes
|
47
109
|
(left_keys - right_keys).each do |key|
|
48
110
|
# Delete
|
49
111
|
key_vals = key.split('~', -1)
|
50
|
-
parent = key_vals[0...
|
112
|
+
parent = key_vals[0...parent_field_count].join('~')
|
113
|
+
child = key_vals[parent_field_count..-1].join('~')
|
51
114
|
left_parent = left_index[parent]
|
52
115
|
left_value = left_values[key]
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
key_fields.each do |field_name|
|
59
|
-
id[field_name] = left_value[field_name]
|
60
|
-
end
|
61
|
-
diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
|
116
|
+
row_idx = left_keys.index(key)
|
117
|
+
sib_idx = left_parent.index(key)
|
118
|
+
raise "Can't locate key #{key} in parent #{parent}" unless sib_idx
|
119
|
+
diffs[key] = Diff.new(:delete, left_value, row_idx, sib_idx)
|
120
|
+
potential_moves[child] << key
|
62
121
|
#puts "Delete: #{key}"
|
63
122
|
end
|
64
123
|
end
|
@@ -66,7 +125,7 @@ class CSVDiff
|
|
66
125
|
# Now identify adds/updates
|
67
126
|
right_keys.each_with_index do |key, right_row_id|
|
68
127
|
key_vals = key.split('~', -1)
|
69
|
-
parent = key_vals[0...
|
128
|
+
parent = key_vals[0...parent_field_count].join('~')
|
70
129
|
left_parent = left_index[parent]
|
71
130
|
right_parent = right_index[parent]
|
72
131
|
left_value = left_values[key]
|
@@ -74,13 +133,12 @@ class CSVDiff
|
|
74
133
|
left_idx = left_parent && left_parent.index(key)
|
75
134
|
right_idx = right_parent && right_parent.index(key)
|
76
135
|
|
77
|
-
id = {}
|
78
|
-
id[:row] = right_row_id + 1
|
79
|
-
id[:sibling_position] = right_idx + 1
|
80
|
-
key_fields.each do |field_name|
|
81
|
-
id[field_name] = right_value[field_name]
|
82
|
-
end
|
83
136
|
if left_idx && right_idx
|
137
|
+
if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
|
138
|
+
id = id_fields(key_fields, right_value)
|
139
|
+
diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
|
140
|
+
#puts "Change: #{key}"
|
141
|
+
end
|
84
142
|
if include_moves
|
85
143
|
left_common = left_parent & right_parent
|
86
144
|
right_common = right_parent & left_parent
|
@@ -88,19 +146,31 @@ class CSVDiff
|
|
88
146
|
right_pos = right_common.index(key)
|
89
147
|
if left_pos != right_pos
|
90
148
|
# Move
|
91
|
-
diffs[key]
|
92
|
-
|
149
|
+
if d = diffs[key]
|
150
|
+
d.sibling_position = [left_idx, right_idx]
|
151
|
+
else
|
152
|
+
id = id_fields(key_fields, right_value)
|
153
|
+
diffs[key] = Diff.new(:move, id, right_row_id, [left_idx, right_idx])
|
154
|
+
end
|
93
155
|
#puts "Move #{left_idx} -> #{right_idx}: #{key}"
|
94
156
|
end
|
95
157
|
end
|
96
|
-
|
97
|
-
diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
|
98
|
-
#puts "Change: #{key}"
|
99
|
-
end
|
100
|
-
elsif include_adds && right_idx
|
158
|
+
elsif right_idx
|
101
159
|
# Add
|
102
|
-
|
103
|
-
|
160
|
+
child = key_vals[parent_field_count..-1].join('~')
|
161
|
+
if potential_moves.has_key?(child) && old_key = potential_moves[child].pop
|
162
|
+
diffs.delete(old_key)
|
163
|
+
if include_updates
|
164
|
+
left_value = left_values[old_key]
|
165
|
+
id = id_fields(right.child_fields, right_value)
|
166
|
+
changes = diff_row(left_value, right_value, left.parent_fields + diff_fields)
|
167
|
+
diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
|
168
|
+
#puts "Update Parent: #{key}"
|
169
|
+
end
|
170
|
+
elsif include_adds
|
171
|
+
diffs[key] = Diff.new(:add, right_value, right_row_id, right_idx)
|
172
|
+
#puts "Add: #{key}"
|
173
|
+
end
|
104
174
|
end
|
105
175
|
end
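
The `potential_moves` bookkeeping above lets a delete under one parent and an add under another be reported as a single update. The following is a simplified sketch of that idea using hypothetical keys and symbols in place of `Diff` objects; the real method also tracks row values, sibling positions, and the ignore options.

```ruby
left_keys  = ['P1~X', 'P1~Y']
right_keys = ['P2~X', 'P1~Y']
parent_field_count = 1

potential_moves = Hash.new { |h, k| h[k] = [] }
diffs = {}

# Keys only on the left are deletes, remembered by their child part.
(left_keys - right_keys).each do |key|
  child = key.split('~', -1)[parent_field_count..-1].join('~')
  diffs[key] = :delete
  potential_moves[child] << key
end

# Keys only on the right are adds, unless a deleted key shares the child part.
(right_keys - left_keys).each do |key|
  child = key.split('~', -1)[parent_field_count..-1].join('~')
  if potential_moves.key?(child) && (old_key = potential_moves[child].pop)
    diffs.delete(old_key)    # the delete is superseded...
    diffs[key] = :update     # ...by an update whose parent changed
  else
    diffs[key] = :add
  end
end
# diffs == { 'P2~X' => :update }
```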
|
106
176
|
|
@@ -116,27 +186,41 @@ class CSVDiff
|
|
116
186
|
# @param right_row [Hash] The version of the CSV row from the right/to
|
117
187
|
# file.
|
118
188
|
# @param fields [Array<String>] An array of field names to compare.
|
119
|
-
# @param case_sensitive [Boolean] Whether field comparisons should be
|
120
|
-
# case sensitive or not.
|
121
189
|
# @return [Hash<String, Array>] A Hash whose keys are the fields that
|
122
190
|
# contain differences, and whose values are a two-element array of
|
123
191
|
# [left/from, right/to] values.
|
124
|
-
def diff_row(left_row, right_row, fields
|
192
|
+
def diff_row(left_row, right_row, fields)
|
125
193
|
diffs = {}
|
126
194
|
fields.each do |attr|
|
195
|
+
eq_proc = @equality_procs[attr]
|
127
196
|
right_val = right_row[attr]
|
128
197
|
right_val = nil if right_val == ""
|
129
198
|
left_val = left_row[attr]
|
130
199
|
left_val = nil if left_val == ""
|
131
|
-
if
|
132
|
-
|
200
|
+
if eq_proc
|
201
|
+
diffs[attr] = [left_val, right_val] unless eq_proc.call(left_val, right_val)
|
202
|
+
elsif @case_sensitive
|
203
|
+
diffs[attr] = [left_val, right_val] unless left_val == right_val
|
204
|
+
elsif (left_val.to_s.upcase != right_val.to_s.upcase)
|
133
205
|
diffs[attr] = [left_val, right_val]
|
134
|
-
#puts "#{attr}: #{left_val} -> #{right_val}"
|
135
206
|
end
|
136
207
|
end
|
137
208
|
diffs if diffs.size > 0
|
138
209
|
end
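
To illustrate how `:equality_procs` and case sensitivity interact, here is a self-contained restatement of `diff_row`. It takes the settings as keyword arguments instead of the instance variables used above, and the sample rows and `numeric_eq` lambda are hypothetical.

```ruby
def diff_row(left_row, right_row, fields, equality_procs: {}, case_sensitive: true)
  diffs = {}
  fields.each do |attr|
    left_val  = left_row[attr]
    right_val = right_row[attr]
    left_val  = nil if left_val == ''    # empty strings are treated as nil
    right_val = nil if right_val == ''
    equal =
      if (eq_proc = equality_procs[attr])
        eq_proc.call(left_val, right_val)
      elsif case_sensitive
        left_val == right_val
      else
        left_val.to_s.upcase == right_val.to_s.upcase
      end
    diffs[attr] = [left_val, right_val] unless equal
  end
  diffs unless diffs.empty?
end

numeric_eq = ->(l, r) { l.to_f == r.to_f }
left  = { 'Name' => 'widget', 'Price' => '1.50' }
right = { 'Name' => 'Widget', 'Price' => '1.5' }

strict = diff_row(left, right, %w[Name Price], equality_procs: { 'Price' => numeric_eq })
# strict == {"Name"=>["widget", "Widget"]}
relaxed = diff_row(left, right, %w[Name Price],
                   equality_procs: { 'Price' => numeric_eq }, case_sensitive: false)
# relaxed == nil
```

A custom proc takes precedence for its field, so the case-sensitivity setting only applies to fields without one.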
|
139
210
|
|
211
|
+
|
212
|
+
private
|
213
|
+
|
214
|
+
|
215
|
+
# Return a hash containing just the key field values
|
216
|
+
def id_fields(key_fields, fields)
|
217
|
+
id = {}
|
218
|
+
key_fields.each do |field_name|
|
219
|
+
id[field_name] = fields[field_name]
|
220
|
+
end
|
221
|
+
id
|
222
|
+
end
|
223
|
+
|
140
224
|
end
|
141
225
|
|
142
226
|
end
|
data/lib/csv-diff/csv_diff.rb
CHANGED
@@ -81,13 +81,15 @@ class CSVDiff
|
|
81
81
|
# @option options [Boolean] :ignore_deletes If true, records that appear
|
82
82
|
# in the left/from file but not in the right/to file are not reported.
|
83
83
|
def initialize(left, right, options = {})
|
84
|
-
@left = left.is_a?(
|
84
|
+
@left = left.is_a?(Source) ? left : CSVSource.new(left, options)
|
85
|
+
@left.index_source if @left.lines.nil?
|
85
86
|
raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
|
86
|
-
@right = right.is_a?(
|
87
|
+
@right = right.is_a?(Source) ? right : CSVSource.new(right, options)
|
88
|
+
@right.index_source if @right.lines.nil?
|
87
89
|
raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
|
88
90
|
@warnings = []
|
89
91
|
@diff_fields = get_diff_fields(@left.field_names, @right.field_names, options)
|
90
|
-
@key_fields = @left.key_fields
|
92
|
+
@key_fields = @left.key_fields
|
91
93
|
diff(options)
|
92
94
|
end
|
93
95
|
|
@@ -141,15 +143,13 @@ class CSVDiff
|
|
141
143
|
ignore_fields = options.fetch(:ignore_fields, [])
|
142
144
|
ignore_fields = [ignore_fields] unless ignore_fields.is_a?(Array)
|
143
145
|
ignore_fields.map! do |f|
|
144
|
-
(f.is_a?(
|
146
|
+
(f.is_a?(Numeric) ? right_fields[f] : f).upcase
|
145
147
|
end
|
146
148
|
diff_fields = []
|
147
149
|
if options[:diff_common_fields_only]
|
148
150
|
right_fields.each_with_index do |fld, i|
|
149
151
|
if left_fields.include?(fld)
|
150
152
|
diff_fields << fld unless ignore_fields.include?(fld.upcase)
|
151
|
-
else
|
152
|
-
@warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
|
153
153
|
end
|
154
154
|
end
|
155
155
|
else
|
data/lib/csv-diff/csv_source.rb
CHANGED
@@ -2,39 +2,7 @@ class CSVDiff
|
|
2
2
|
|
3
3
|
# Represents a CSV input (i.e. the left/from or right/to input) to the diff
|
4
4
|
# process.
|
5
|
-
class CSVSource
|
6
|
-
|
7
|
-
# @return [String] the path to the source file
|
8
|
-
attr_accessor :path
|
9
|
-
# @return [Array<String>] The names of the fields in the source file
|
10
|
-
attr_reader :field_names
|
11
|
-
# @return [Array<String>] The names of the field(s) that uniquely
|
12
|
-
# identify each row.
|
13
|
-
attr_reader :key_fields
|
14
|
-
# @return [Array<String>] The names of the field(s) that identify a
|
15
|
-
# common parent of child records.
|
16
|
-
attr_reader :parent_fields
|
17
|
-
# @return [Array<String>] The names of the field(s) that distinguish a
|
18
|
-
# child of a parent record.
|
19
|
-
attr_reader :child_fields
|
20
|
-
# @return [Boolean] True if the source has been indexed with case-
|
21
|
-
# sensitive keys, or false if it has been indexed using upper-case key
|
22
|
-
# values.
|
23
|
-
attr_reader :case_sensitive
|
24
|
-
alias_method :case_sensitive?, :case_sensitive
|
25
|
-
# @return [Boolean] True if leading/trailing whitespace should be stripped
|
26
|
-
# from fields
|
27
|
-
attr_reader :trim_whitespace
|
28
|
-
# @return [Hash<String,Hash>] A hash containing each line of the source,
|
29
|
-
# keyed on the values of the +key_fields+.
|
30
|
-
attr_reader :lines
|
31
|
-
# @return [Hash<String,Array<String>>] A hash containing each parent key,
|
32
|
-
# and an Array of the child keys it is a parent of.
|
33
|
-
attr_reader :index
|
34
|
-
# @return [Array<String>] An array of any warnings encountered while
|
35
|
-
# processing the source.
|
36
|
-
attr_reader :warnings
|
37
|
-
|
5
|
+
class CSVSource < Source
|
38
6
|
|
39
7
|
# Creates a new diff source.
|
40
8
|
#
|
@@ -72,92 +40,32 @@ class CSVDiff
|
|
72
40
|
# @option options [String] :child_field The name of the field(s) that
|
73
41
|
# uniquely identify a child of a parent.
|
74
42
|
# @option options [Boolean] :case_sensitive If true (the default), keys
|
75
|
-
#
|
76
|
-
#
|
43
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
44
|
+
# case-insensitive comparisons.
|
45
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
46
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
47
|
+
# regular expressions will be indexed and included in the diff process.
|
48
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
49
|
+
# regular expression(s). Source rows with a field value that satisfies
|
50
|
+
# the regular expressions will be excluded from the diff process.
|
77
51
|
def initialize(source, options = {})
|
52
|
+
super(options)
|
78
53
|
if source.is_a?(String)
|
79
54
|
require 'csv'
|
80
55
|
mode_string = options[:encoding] ? "r:#{options[:encoding]}" : 'r'
|
81
56
|
csv_options = options.fetch(:csv_options, {})
|
82
57
|
@path = source
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
@parent_fields = @key_fields[0...-1]
|
88
|
-
@child_fields = @key_fields[-1..-1]
|
89
|
-
else
|
90
|
-
@parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
|
91
|
-
@child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
|
92
|
-
@key_fields = @parent_fields + @child_fields
|
93
|
-
end
|
94
|
-
@field_names = options[:field_names]
|
95
|
-
@warnings = []
|
96
|
-
index_source(source, options)
|
97
|
-
end
|
98
|
-
|
99
|
-
|
100
|
-
# Returns the row in the CSV source corresponding to the supplied key.
|
101
|
-
#
|
102
|
-
# @param key [String] The unique key to use to lookup the row.
|
103
|
-
# @return [Hash] The fields for the line corresponding to +key+, or nil
|
104
|
-
# if the key is not recognised.
|
105
|
-
def [](key)
|
106
|
-
@lines[key]
|
107
|
-
end
|
108
|
-
|
109
|
-
|
110
|
-
private
|
111
|
-
|
112
|
-
# Given an array of lines, where each line is an array of fields, indexes
|
113
|
-
# the array contents so that it can be looked up by key.
|
114
|
-
def index_source(lines, options)
|
115
|
-
@lines = {}
|
116
|
-
@index = Hash.new{ |h, k| h[k] = [] }
|
117
|
-
@key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
|
118
|
-
@case_sensitive = options.fetch(:case_sensitive, true)
|
119
|
-
@trim_whitespace = options.fetch(:trim_whitespace, false)
|
120
|
-
line_num = 0
|
121
|
-
lines.each do |row|
|
122
|
-
line_num += 1
|
123
|
-
next if line_num == 1 && @field_names && options[:ignore_header]
|
124
|
-
unless @field_names
|
125
|
-
@field_names = row
|
126
|
-
@key_fields = find_field_indexes(@key_fields, @field_names)
|
127
|
-
next
|
128
|
-
end
|
129
|
-
field_vals = row
|
130
|
-
line = {}
|
131
|
-
@field_names.each_with_index do |field, i|
|
132
|
-
line[field] = field_vals[i]
|
133
|
-
line[field].strip! if @trim_whitespace && line[field]
|
134
|
-
end
|
135
|
-
key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
|
136
|
-
key = key_values.join('~')
|
137
|
-
parent_key = key_values[0...(@parent_fields.length)].join('~')
|
138
|
-
parent_key.upcase! unless @case_sensitive
|
139
|
-
if @lines[key]
|
140
|
-
@warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
|
141
|
-
else
|
142
|
-
@index[parent_key] << key
|
143
|
-
@lines[key] = line
|
144
|
-
end
|
145
|
-
end
|
146
|
-
end
|
147
|
-
|
148
|
-
|
149
|
-
# Converts an array of field names to an array of indexes of the fields
|
150
|
-
# matching those names.
|
151
|
-
def find_field_indexes(key_fields, field_names)
|
152
|
-
key_fields.map do |field|
|
153
|
-
if field.is_a?(Fixnum)
|
154
|
-
field
|
155
|
-
else
|
156
|
-
field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
|
157
|
-
raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
|
158
|
-
field_names.join(', ')}"
|
58
|
+
# When you call CSV.open, it's best to pass in a block so that after it's yielded,
|
59
|
+
# the underlying file handle is closed. Otherwise, you risk leaking the handle.
|
60
|
+
@data = CSV.open(@path, mode_string, csv_options) do |csv|
|
61
|
+
csv.readlines
|
159
62
|
end
|
63
|
+
elsif source.is_a?(Enumerable) && (source.size == 0 || source.first.is_a?(Enumerable))
|
64
|
+
@data = source
|
65
|
+
else
|
66
|
+
raise ArgumentError, "source must be a path to a file or an Enumerable<Enumerable>"
|
160
67
|
end
|
68
|
+
index_source
|
161
69
|
end
|
162
70
|
|
163
71
|
end
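
When validating that a source is an Enumerable of Enumerables, it helps to remember that `&&` binds more tightly than `||` in Ruby, so explicit parentheses keep the type check applied to every branch. The `tabular?` helper below is a hypothetical sketch of such a check and is not part of the gem.

```ruby
# Hypothetical helper: true for an empty Enumerable or an Enumerable of Enumerables.
def tabular?(source)
  source.is_a?(Enumerable) &&
    (source.count.zero? || source.first.is_a?(Enumerable))
end

tabular?([])              # => true  (empty data set)
tabular?([['a', 'b']])    # => true  (array of arrays)
tabular?(['a', 'b'])      # => false (flat array)
tabular?(nil)             # => false (not Enumerable, and no error raised)
```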
|
@@ -0,0 +1,289 @@
|
|
1
|
+
class CSVDiff
|
2
|
+
|
3
|
+
# Represents an input (i.e. the left/from or right/to input) to the diff
|
4
|
+
# process.
|
5
|
+
class Source
|
6
|
+
|
7
|
+
# @return [String] the path to the source file
|
8
|
+
attr_accessor :path
|
9
|
+
# @return [Array<Array>] The data for this source
|
10
|
+
attr_reader :data
|
11
|
+
|
12
|
+
# @return [Array<String>] The names of the fields in the source file
|
13
|
+
attr_reader :field_names
|
14
|
+
# @return [Array<String>] The names of the field(s) that uniquely
|
15
|
+
# identify each row.
|
16
|
+
attr_reader :key_fields
|
17
|
+
# @return [Array<String>] The names of the field(s) that identify a
|
18
|
+
# common parent of child records.
|
19
|
+
attr_reader :parent_fields
|
20
|
+
# @return [Array<String>] The names of the field(s) that distinguish a
|
21
|
+
# child of a parent record.
|
22
|
+
attr_reader :child_fields
|
23
|
+
|
24
|
+
# @return [Array<Fixnum>] The indexes of the key fields in the source
|
25
|
+
# file.
|
26
|
+
attr_reader :key_field_indexes
|
27
|
+
# @return [Array<Fixnum>] The indexes of the parent fields in the source
|
28
|
+
# file.
|
29
|
+
attr_reader :parent_field_indexes
|
30
|
+
# @return [Array<Fixnum>] The indexes of the child fields in the source
|
31
|
+
# file.
|
32
|
+
attr_reader :child_field_indexes
|
33
|
+
|
34
|
+
# @return [Boolean] True if the source has been indexed with case-
|
35
|
+
# sensitive keys, or false if it has been indexed using upper-case key
|
36
|
+
# values.
|
37
|
+
attr_reader :case_sensitive
|
38
|
+
alias_method :case_sensitive?, :case_sensitive
|
39
|
+
# @return [Boolean] True if leading/trailing whitespace should be stripped
|
40
|
+
# from fields
|
41
|
+
attr_reader :trim_whitespace
|
42
|
+
# @return [Hash<String,Hash>] A hash containing each line of the source,
|
43
|
+
# keyed on the values of the +key_fields+.
|
44
|
+
attr_reader :lines
|
45
|
+
# @return [Hash<String,Array<String>>] A hash containing each parent key,
|
46
|
+
# and an Array of the child keys it is a parent of.
|
47
|
+
attr_reader :index
|
48
|
+
# @return [Array<String>] An array of any warnings encountered while
|
49
|
+
# processing the source.
|
50
|
+
attr_reader :warnings
|
51
|
+
# @return [Fixnum] A count of the lines processed from this source.
|
52
|
+
# Excludes any header and duplicate records identified during indexing.
|
53
|
+
attr_reader :line_count
|
54
|
+
# @return [Fixnum] A count of the lines from this source that were skipped
|
55
|
+
# due to filter conditions.
|
56
|
+
attr_reader :skip_count
|
57
|
+
# @return [Fixnum] A count of the lines from this source that had the same
|
58
|
+
# key value as another line.
|
59
|
+
attr_reader :dup_count
|
60
|
+
|
61
|
+
|
62
|
+
# Creates a new diff source.
|
63
|
+
#
|
64
|
+
# A diff source must contain at least one field that will be used as the
|
65
|
+
# key to identify the same record in a different version of this file.
|
66
|
+
# If not specified via one of the options, the first field is assumed to
|
67
|
+
# be the unique key.
|
68
|
+
#
|
69
|
+
# If multiple fields combine to form a unique key, the combined fields
|
70
|
+
# are considered as a single unique identifier. If your key represents
|
71
|
+
# data that can be represented as a tree, you can instead break your key
|
72
|
+
# fields into :parent_fields and :child_fields. By doing this, if a child
|
73
|
+
# key is deleted from one parent, and added to another, that will be
|
74
|
+
# reported as an update, with a change to the parent key part(s) of the
|
75
|
+
# record.
|
76
|
+
#
|
77
|
+
# All key options can be specified either by field name, or by field
|
78
|
+
# index (0 based).
|
79
|
+
#
|
80
|
+
# @param options [Hash] An options hash.
|
81
|
+
# @option options [Array<String>] :field_names The names of each of the
|
82
|
+
# fields in +source+.
|
83
|
+
# @option options [Boolean] :ignore_header If true, and :field_names has
|
84
|
+
# been specified, then the first row of the file is ignored.
|
85
|
+
# @option options [String] :key_field The name of the field that uniquely
|
86
|
+
# identifies each row.
|
87
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
88
|
+
# that uniquely identify each row.
|
89
|
+
# @option options [String] :parent_field The name of the field(s) that
|
90
|
+
# identify a parent within which sibling order should be checked.
|
91
|
+
# @option options [String] :child_field The name of the field(s) that
|
92
|
+
# uniquely identify a child of a parent.
|
93
|
+
# @option options [Boolean] :case_sensitive If true (the default), keys
|
94
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
95
|
+
# case-insensitive comparisons.
|
96
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
97
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
98
|
+
# regular expressions will be indexed and included in the diff process.
|
99
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
100
|
+
# regular expression(s). Source rows with a field value that satisfies
|
101
|
+
# the regular expressions will be excluded from the diff process.
|
102
|
+
def initialize(options = {})
|
103
|
+
if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
|
104
|
+
(kf = options.fetch(:key_field, options[:key_fields]))
|
105
|
+
@key_fields = [kf].flatten
|
106
|
+
@parent_fields = []
|
107
|
+
@child_fields = @key_fields
|
108
|
+
else
|
109
|
+
@parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
|
110
|
+
@child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
|
111
|
+
@key_fields = @parent_fields + @child_fields
|
112
|
+
end
|
113
|
+
@field_names = options[:field_names]
|
114
|
+
@case_sensitive = options.fetch(:case_sensitive, true)
|
115
|
+
@trim_whitespace = options.fetch(:trim_whitespace, false)
|
116
|
+
@ignore_header = options[:ignore_header]
|
117
|
+
@include = options[:include]
|
118
|
+
@exclude = options[:exclude]
|
119
|
+
@path = options.fetch(:path, 'NA') unless @path
|
120
|
+
@warnings = []
|
121
|
+
end
|
122
|
+
|
123
|
+
|
124
|
+
def path?
|
125
|
+
@path != 'NA'
|
126
|
+
end
|
127
|
+
|
128
|
+
|
129
|
+
# Returns the row in the CSV source corresponding to the supplied key.
|
130
|
+
#
|
131
|
+
# @param key [String] The unique key to use to lookup the row.
|
132
|
+
# @return [Hash] The fields for the line corresponding to +key+, or nil
|
133
|
+
# if the key is not recognised.
|
134
|
+
def [](key)
|
135
|
+
@lines[key]
|
136
|
+
end
|
137
|
+
|
138
|
+
|
139
|
+
# Given an array of lines, where each line is an array of fields, indexes
|
140
|
+
# the array contents so that it can be looked up by key.
|
141
|
+
def index_source
|
142
|
+
@lines = {}
|
143
|
+
@index = Hash.new{ |h, k| h[k] = [] }
|
144
|
+
if @field_names
|
145
|
+
index_fields
|
146
|
+
include_filter = convert_filter(@include, @field_names)
|
147
|
+
exclude_filter = convert_filter(@exclude, @field_names)
|
148
|
+
end
|
149
|
+
@line_count = 0
|
150
|
+
@skip_count = 0
|
151
|
+
@dup_count = 0
|
152
|
+
line_num = 0
|
153
|
+
@data.each do |row|
|
154
|
+
line_num += 1
|
155
|
+
next if line_num == 1 && @field_names && @ignore_header
|
156
|
+
unless @field_names
|
157
|
+
if row.class.name == 'CSV::Row'
|
158
|
+
@field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
|
159
|
+
else
|
160
|
+
@field_names = row.each_with_index.map{ |f, i| f || i.to_s }
|
161
|
+
end
|
162
|
+
index_fields
|
163
|
+
include_filter = convert_filter(@include, @field_names)
|
164
|
+
exclude_filter = convert_filter(@exclude, @field_names)
|
165
|
+
next
|
166
|
+
end
|
167
|
+
field_vals = row
|
168
|
+
line = {}
|
169
|
+
filter = false
|
170
|
+
@field_names.each_with_index do |field, i|
|
171
|
+
val = field_vals[i]
|
172
|
+
val = val.to_s.strip if val && @trim_whitespace
|
173
|
+
line[field] = val
|
174
|
+
if include_filter && f = include_filter[i]
|
175
|
+
filter = !check_filter(f, line[field])
|
176
|
+
end
|
177
|
+
if exclude_filter && f = exclude_filter[i]
|
178
|
+
filter = check_filter(f, line[field])
|
179
|
+
end
|
180
|
+
break if filter
|
181
|
+
end
|
182
|
+
if filter
|
183
|
+
@skip_count += 1
|
184
|
+
next
|
185
|
+
end
|
186
|
+
key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
|
187
|
+
field_vals[kf].to_s :
|
188
|
+
field_vals[kf].to_s.upcase }
|
189
|
+
key = key_values.join('~')
|
190
|
+
parent_key = key_values[0...(@parent_fields.length)].join('~')
|
191
|
+
if @lines[key]
|
192
|
+
@warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
|
193
|
+
@dup_count += 1
|
194
|
+
key += "[#{@dup_count}]"
|
195
|
+
end
|
196
|
+
@index[parent_key] << key
|
197
|
+
@lines[key] = line
|
198
|
+
@line_count += 1
|
199
|
+
end
|
200
|
+
end
|
201
|
+
|
202
|
+
|
203
|
+
# Save the data in this Source as a CSV at +file_path+.
|
204
|
+
#
|
205
|
+
# @param file_path [String] The target path to save the data to.
|
206
|
+
# @param options [Hash] A set of options to pass to CSV.open to control
|
207
|
+
# how the CSV is generated.
|
208
|
+
def save_csv(file_path, options = {})
|
209
|
+
require 'csv'
|
210
|
+
default_opts = {
|
211
|
+
headers: @field_names, write_headers: true
|
212
|
+
}
|
213
|
+
CSV.open(file_path, 'wb', default_opts.merge(options)) do |csv|
|
214
|
+
@data.each{ |rec| csv << rec }
|
215
|
+
end
|
216
|
+
end
|
217
|
+
|
218
|
+
|
219
|
+
# Convert the data in this source to Array<Hash> using the field names
|
220
|
+
# as keys for the Hash in each row.
|
221
|
+
def to_hash
|
222
|
+
@data.map do |row|
|
223
|
+
hsh = {}
|
224
|
+
@field_names.each_with_index.map{ |fld, i| hsh[fld] = row[i] }
|
225
|
+
hsh
|
226
|
+
end
|
227
|
+
end
|
228
|
+
|
229
|
+
|
230
|
+
private
|
231
|
+
|
232
|
+
|
233
|
+
def index_fields
|
234
|
+
@key_field_indexes = find_field_indexes(@key_fields, @field_names)
|
235
|
+
@parent_field_indexes = find_field_indexes(@parent_fields, @field_names)
|
236
|
+
@child_field_indexes = find_field_indexes(@child_fields, @field_names)
|
237
|
+
@key_fields = @key_field_indexes.map{ |i| @field_names[i] }
|
238
|
+
@parent_fields = @parent_field_indexes.map{ |i| @field_names[i] }
|
239
|
+
@child_fields = @child_field_indexes.map{ |i| @field_names[i] }
|
240
|
+
end
|
241
|
+
|
242
|
+
|
243
|
+
# Converts an array of field names to an array of indexes of the fields
|
244
|
+
# matching those names.
|
245
|
+
def find_field_indexes(key_fields, field_names)
|
246
|
+
key_fields.map do |field|
|
247
|
+
if field.is_a?(Integer)
|
248
|
+
field
|
249
|
+
else
|
250
|
+
field_names.index{ |field_name| field.to_s.downcase == field_name.to_s.downcase } or
|
251
|
+
raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
|
252
|
+
field_names.join(', ')}"
|
253
|
+
end
|
254
|
+
end
|
255
|
+
end
|
256
|
+
|
257
|
+
|
258
|
+
def convert_filter(hsh, field_names)
|
259
|
+
return unless hsh
|
260
|
+
if !hsh.is_a?(Hash)
|
261
|
+
raise ArgumentError, ":include/:exclude option must be a Hash of field name(s)/index(es) to RegExp(s)"
|
262
|
+
end
|
263
|
+
keys = hsh.keys
|
264
|
+
idxs = find_field_indexes(keys, @field_names)
|
265
|
+
Hash[keys.each_with_index.map{ |k, i| [idxs[i], hsh[k]] }]
|
266
|
+
end
|
267
|
+
|
268
|
+
|
269
|
+
def check_filter(filter, field_val)
|
270
|
+
case filter
|
271
|
+
when String
|
272
|
+
if @case_sensitive
|
273
|
+
filter == field_val
|
274
|
+
else
|
275
|
+
filter.downcase == field_val.to_s.downcase
|
276
|
+
end
|
277
|
+
when Regexp
|
278
|
+
filter.match(field_val)
|
279
|
+
when Proc
|
280
|
+
filter.call(field_val)
|
281
|
+
else
|
282
|
+
raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
|
283
|
+
end
|
284
|
+
end
|
285
|
+
|
286
|
+
end
|
287
|
+
|
288
|
+
end
|
289
|
+
|
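The key handling in `index_source` above can be illustrated with a small stand-alone sketch. It is not part of the gem: `build_index` and its parameters are hypothetical names chosen for this example, and it only mirrors the steps visible in the diff (joining key values with `~`, upper-casing keys when case-insensitive, suffixing duplicate keys with a counter, and grouping keys under their parent key).

```ruby
# Hypothetical helper, assumed for illustration only; it mimics the keying
# steps of Source#index_source without depending on the csv-diff gem.
def build_index(rows, key_indexes, parent_count, case_sensitive: true)
  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0
  rows.each_with_index do |row, i|
    # Key values are stringified, and upper-cased for case-insensitive keys
    key_values = key_indexes.map do |kf|
      val = row[kf].to_s
      case_sensitive ? val : val.upcase
    end
    key = key_values.join('~')
    # The leading key parts identify the parent of this record
    parent_key = key_values[0...parent_count].join('~')
    if lines[key]
      # A repeated key is kept, but under a suffixed key, with a warning
      warnings << "Duplicate key '#{key}' encountered at line #{i + 1}"
      dup_count += 1
      key += "[#{dup_count}]"
    end
    index[parent_key] << key
    lines[key] = row
  end
  { lines: lines, index: index, warnings: warnings }
end

rows = [
  ['Fruit', 'apple', 'red'],
  ['Fruit', 'Banana', 'yellow'],
  ['Veg', 'carrot', 'orange'],
  ['fruit', 'APPLE', 'green']
]
# Fields 0 and 1 form the key; the first key part is the parent
result = build_index(rows, [0, 1], 1, case_sensitive: false)
puts result[:lines].keys.inspect
puts result[:index].inspect
puts result[:warnings].inspect
```

With case-insensitive keys, the last row collides with the first, so it is stored under a suffixed key and a warning is recorded rather than the row being dropped.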
data/lib/csv-diff/xml_source.rb
ADDED
@@ -0,0 +1,142 @@
|
|
1
|
+
require 'nokogiri'
|
2
|
+
require 'cgi'
|
3
|
+
|
4
|
+
|
5
|
+
class CSVDiff
|
6
|
+
|
7
|
+
# Convert XML content to CSV format using XPath selectors to identify the
|
8
|
+
# rows and field values in an XML document
|
9
|
+
class XMLSource < Source
|
10
|
+
|
11
|
+
attr_accessor :context
|
12
|
+
|
13
|
+
# Create a new XMLSource, identified by +path+. Normally this is a path
|
14
|
+
# to the XML document, but any value is fine, as it is just a label to
|
15
|
+
# identify this data set.
|
16
|
+
#
|
17
|
+
# @param path [String] A label for this data set (often a path to the
|
18
|
+
# XML document used as the source).
|
19
|
+
# @param options [Hash] An options hash.
|
20
|
+
# @option options [Array<String>] :field_names The names of each of the
|
21
|
+
# fields in +source+.
|
22
|
+
# @option options [Boolean] :ignore_header If true, and :field_names has
|
23
|
+
# been specified, then the first row of the file is ignored.
|
24
|
+
# @option options [String] :key_field The name of the field that uniquely
|
25
|
+
# identifies each row.
|
26
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
27
|
+
# that uniquely identify each row.
|
28
|
+
# @option options [String] :parent_field The name of the field(s) that
|
29
|
+
# identify a parent within which sibling order should be checked.
|
30
|
+
# @option options [String] :child_field The name of the field(s) that
|
31
|
+
# uniquely identify a child of a parent.
|
32
|
+
# @option options [Boolean] :case_sensitive If true (the default), keys
|
33
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
34
|
+
# case-insensitive comparisons.
|
35
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
36
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
37
|
+
# regular expressions will be indexed and included in the diff process.
|
38
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
39
|
+
# regular expression(s). Source rows with a field value that satisfies
|
40
|
+
# the regular expressions will be excluded from the diff process.
|
41
|
+
# @option options [String] :context A context value from which fields
|
42
|
+
# can be populated using a Regexp.
|
43
|
+
def initialize(path, options = {})
|
44
|
+
super(options)
|
45
|
+
@path = path
|
46
|
+
@context = options[:context]
|
47
|
+
@data = []
|
48
|
+
end
|
49
|
+
|
50
|
+
|
51
|
+
# Process a +source+, converting the XML into a table of data, using
|
52
|
+
# +rec_xpath+ to identify the nodes that correspond to each record that
|
53
|
+
# should appear in the output, and +field_maps+ to populate each field
|
54
|
+
# in each row.
|
55
|
+
#
|
56
|
+
# @param source [String|Array] may be a String containing XML content,
|
57
|
+
# an Array of paths to files containing XML content, or a path to
|
58
|
+
# a single file.
|
59
|
+
# @param rec_xpath [String] An XPath expression that selects all the
|
60
|
+
# items in the XML document that are to be converted into new rows.
|
61
|
+
# The returned items are not directly used to populate the fields,
|
62
|
+
# but provide a context for the field XPath expressions that populate
|
63
|
+
# each field's content.
|
64
|
+
# @param field_maps [Hash<String, String>] A map of field names to
|
65
|
+
# expressions that are evaluated in the context of each row node
|
66
|
+
# selected by +rec_xpath+. The field expressions are typically XPath
|
67
|
+
# expressions evaluated in the context of the nodes returned by the
|
68
|
+
# +rec_xpath+. Alternatively, a String that is not an XPath expression
|
69
|
+
# is used as a literal value for a field, while a Regexp can also
|
70
|
+
# be used to pull a value from any context specified in the +options+
|
71
|
+
# hash. The Regexp should include a single grouping, as the value used
|
72
|
+
# will be the result in $1 after the match is performed.
|
73
|
+
# @param context [String] An optional context for the XML to be processed.
|
74
|
+
# The value passed here can be referenced in field map expressions
|
75
|
+
# using a Regexp, with the value of the first grouping in the regex
|
76
|
+
# being the value returned for the field.
|
77
|
+
def process(source, rec_xpath, field_maps, context = nil)
|
78
|
+
@field_names = field_maps.keys unless @field_names
|
79
|
+
case source
|
80
|
+
when Nokogiri::XML::Document
|
81
|
+
add_data(source, rec_xpath, field_maps, context || @context)
|
82
|
+
when /<\?xml/
|
83
|
+
doc = Nokogiri::XML(source)
|
84
|
+
add_data(doc, rec_xpath, field_maps, context || @context)
|
85
|
+
when Array
|
86
|
+
source.each{ |f| process_file(f, rec_xpath, field_maps) }
|
87
|
+
when String
|
88
|
+
process_file(source, rec_xpath, field_maps)
|
89
|
+
else
|
90
|
+
raise ArgumentError, "Unhandled source type #{source.class.name}"
|
91
|
+
end
|
92
|
+
@data
|
93
|
+
end
|
94
|
+
|
95
|
+
|
96
|
+
private
|
97
|
+
|
98
|
+
|
99
|
+
# Load the XML document at +file_path+ and process it into rows of data.
|
100
|
+
def process_file(file_path, rec_xpath, field_maps)
|
101
|
+
begin
|
102
|
+
File.open(file_path) do |f|
|
103
|
+
doc = Nokogiri::XML(f)
|
104
|
+
add_data(doc, rec_xpath, field_maps, @context || file_path)
|
105
|
+
end
|
106
|
+
rescue
|
107
|
+
STDERR.puts "An error occurred while attempting to open #{file_path}"
|
108
|
+
raise
|
109
|
+
end
|
110
|
+
end
|
111
|
+
|
112
|
+
|
113
|
+
# Locate records in +doc+ using +rec_xpath+ to identify the nodes that
|
114
|
+
# correspond to a new record in the data, and +field_maps+ to populate
|
115
|
+
# the fields in each row.
|
116
|
+
def add_data(doc, rec_xpath, field_maps, context)
|
117
|
+
doc.xpath(rec_xpath).each do |rec_node|
|
118
|
+
rec = []
|
119
|
+
field_maps.each do |field_name, expr|
|
120
|
+
case expr
|
121
|
+
when Regexp # Match context against Regexp and extract first grouping
|
122
|
+
if context
|
123
|
+
context =~ expr
|
124
|
+
rec << $1
|
125
|
+
else
|
126
|
+
rec << nil
|
127
|
+
end
|
128
|
+
when %r{[/(.@]} # XPath expression
|
129
|
+
res = rec_node.xpath(expr)
|
130
|
+
rec << CGI.unescape_html(res.to_s)
|
131
|
+
else # Use expr as the value for this field
|
132
|
+
rec << expr
|
133
|
+
end
|
134
|
+
end
|
135
|
+
@data << rec
|
136
|
+
end
|
137
|
+
end
|
138
|
+
|
139
|
+
end
|
140
|
+
|
141
|
+
end
|
142
|
+
|
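The way `add_data` above decides how to treat each `field_maps` expression can be sketched without an XML parser. The following is a minimal illustration, not the gem's API: `classify_field_expr` and `value_from_context` are hypothetical helpers that reuse the same `case` tests shown in the diff (a Regexp is matched against the context, a String containing `/`, `(`, `.` or `@` is treated as XPath, and anything else is a literal).

```ruby
# Hypothetical helpers, assumed for illustration only; they reproduce the
# dispatch rules from XMLSource#add_data without requiring nokogiri.
def classify_field_expr(expr)
  case expr
  when Regexp then :context_regexp   # value comes from the context string
  when %r{[/(.@]} then :xpath        # evaluated against each record node
  else :literal                      # used as-is for every row
  end
end

# Returns the first capture group of +regexp+ matched against +context+,
# or nil when there is no context or no match.
def value_from_context(context, regexp)
  return nil unless context
  match = regexp.match(context)
  match && match[1]
end

puts classify_field_expr('./name/text()')   # xpath
puts classify_field_expr('@id')             # xpath
puts classify_field_expr('Production')      # literal
puts classify_field_expr(/(\w+)\.xml$/)     # context_regexp
puts classify_field_expr('v1.0')            # xpath, because of the '.'
puts value_from_context('exports/customers.xml', /(\w+)\.xml$/)
```

One consequence of these rules is that a literal value containing any of the characters `/`, `(`, `.` or `@` (such as `v1.0`) would be evaluated as an XPath expression rather than used as a constant.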
metadata
CHANGED
@@ -1,69 +1,80 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: csv-diff
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
5
|
-
prerelease:
|
4
|
+
version: 0.6.1
|
6
5
|
platform: ruby
|
7
6
|
authors:
|
8
7
|
- Adam Gardiner
|
9
8
|
autorequire:
|
10
9
|
bindir: bin
|
11
10
|
cert_chain: []
|
12
|
-
date:
|
11
|
+
date: 2020-10-21 00:00:00.000000000 Z
|
13
12
|
dependencies: []
|
14
|
-
description:
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
13
|
+
description: |2
|
14
|
+
This library performs diffs of CSV data, or any table-like source.
|
15
|
+
|
16
|
+
Unlike a standard diff that compares line by line, and is sensitive to the
|
17
|
+
ordering of records, CSV-Diff identifies common lines by key field(s), and
|
18
|
+
then compares the contents of the fields in each line.
|
19
|
+
|
20
|
+
Data may be supplied in the form of CSV files, or as an array of arrays. The
|
21
|
+
diff process provides a fine level of control over what to diff, and can
|
22
|
+
optionally ignore certain types of changes (e.g. changes in position).
|
23
|
+
|
24
|
+
CSV-Diff is particularly well suited to data in parent-child format. Parent-
|
25
|
+
child data does not lend itself well to standard text diffs, as small changes
|
26
|
+
in the organisation of the tree at an upper level can lead to big movements
|
27
|
+
in the position of descendant records. By instead matching records by key,
|
28
|
+
CSV-Diff avoids this issue, while still being able to detect changes in
|
29
|
+
sibling order.
|
30
|
+
|
31
|
+
This gem implements the core diff algorithm, and handles the loading and
|
32
|
+
diffing of CSV files (or Arrays of Arrays). It also supports converting
|
33
|
+
data in XML format into tabular form, so that it can then be processed
|
34
|
+
like any other CSV or table-like source. It returns a CSVDiff object
|
35
|
+
containing the details of differences in object form. This is useful for
|
36
|
+
projects that need diff capability, but want to handle the reporting or
|
37
|
+
actioning of differences themselves.
|
38
|
+
|
39
|
+
For a pre-built diff reporting capability, see the csv-diff-report gem,
|
40
|
+
which provides a command-line tool for generating diff reports in HTML,
|
41
|
+
Excel, or text formats.
|
33
42
|
email: adam.b.gardiner@gmail.com
|
34
43
|
executables: []
|
35
44
|
extensions: []
|
36
45
|
extra_rdoc_files: []
|
37
46
|
files:
|
38
|
-
- README.md
|
39
47
|
- LICENSE
|
48
|
+
- README.md
|
49
|
+
- lib/csv-diff.rb
|
40
50
|
- lib/csv-diff/algorithm.rb
|
41
51
|
- lib/csv-diff/csv_diff.rb
|
42
52
|
- lib/csv-diff/csv_source.rb
|
43
|
-
- lib/csv-diff.rb
|
53
|
+
- lib/csv-diff/source.rb
|
54
|
+
- lib/csv-diff/xml_source.rb
|
44
55
|
- lib/csv_diff.rb
|
45
56
|
homepage: https://github.com/agardiner/csv-diff
|
46
|
-
licenses:
|
57
|
+
licenses:
|
58
|
+
- MIT
|
59
|
+
metadata: {}
|
47
60
|
post_install_message: For command-line tools and diff reports, 'gem install csv-diff-report'
|
48
61
|
rdoc_options: []
|
49
62
|
require_paths:
|
50
63
|
- lib
|
51
64
|
required_ruby_version: !ruby/object:Gem::Requirement
|
52
|
-
none: false
|
53
65
|
requirements:
|
54
|
-
- -
|
66
|
+
- - ">="
|
55
67
|
- !ruby/object:Gem::Version
|
56
68
|
version: '0'
|
57
69
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
58
|
-
none: false
|
59
70
|
requirements:
|
60
|
-
- -
|
71
|
+
- - ">="
|
61
72
|
- !ruby/object:Gem::Version
|
62
73
|
version: '0'
|
63
74
|
requirements: []
|
64
75
|
rubyforge_project:
|
65
|
-
rubygems_version:
|
76
|
+
rubygems_version: 2.5.2.3
|
66
77
|
signing_key:
|
67
|
-
specification_version:
|
68
|
-
summary: CSV Diff is a library for generating diffs from data in CSV format
|
78
|
+
specification_version: 4
|
79
|
+
summary: CSV Diff is a library for generating diffs from data in CSV or XML format
|
69
80
|
test_files: []
|