csv-diff 0.3.1 → 0.6.1
- checksums.yaml +7 -0
- data/LICENSE +1 -1
- data/README.md +112 -26
- data/lib/csv-diff.rb +1 -0
- data/lib/csv-diff/algorithm.rb +119 -35
- data/lib/csv-diff/csv_diff.rb +6 -6
- data/lib/csv-diff/csv_source.rb +19 -111
- data/lib/csv-diff/source.rb +289 -0
- data/lib/csv-diff/xml_source.rb +142 -0
- metadata +43 -32
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 87807b9af487947c60d18ac81d38b6782133dfdb
+  data.tar.gz: 3681d305dc566f49e7b5166fd22f5ea858f1260f
+SHA512:
+  metadata.gz: d4b617cae1c1e2201633ba5bdb2752aa16af6e293ac32c75b812502c11e54f8d930b965d7e86a81fd5592b1a469db2498a71b635fd563d576039a4808b6dd776
+  data.tar.gz: a6561a7c91e8e4cb8ef8487032fd8bd46794bc79854dca274d7134e251741395f049c2e5c8ca603e257310a961788d6c1472fdc9bf82a171b099f869c59cefcf
data/LICENSE
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,19 @@
|
|
1
1
|
# CSV-Diff
|
2
2
|
|
3
|
-
CSV-Diff is a small library for performing diffs of
|
3
|
+
CSV-Diff is a small library for performing diffs of tabular data, typically
|
4
|
+
data loaded from CSV files.
|
4
5
|
|
5
6
|
Unlike a standard diff that compares line by line, and is sensitive to the
|
6
7
|
ordering of records, CSV-Diff identifies common lines by key field(s), and
|
7
8
|
then compares the contents of the fields in each line.
|
8
9
|
|
9
|
-
Data may be supplied in the form of CSV files, or as an array of arrays.
|
10
|
-
|
11
|
-
|
10
|
+
Data may be supplied in the form of CSV files, or as an array of arrays.
|
11
|
+
More complex usage also allows you to specify XPath expressions to extract
|
12
|
+
tabular data from XML documents for diffing.
|
13
|
+
|
14
|
+
The diff process provides a fine level of control over what to diff, and can
|
15
|
+
optionally ignore certain types of changes (e.g. adds, deletes, changes in
|
16
|
+
position etc).
|
12
17
|
|
13
18
|
CSV-Diff is particularly well suited to data in parent-child format. Parent-
|
14
19
|
child data does not lend itself well to standard text diffs, as small changes
|
@@ -21,17 +26,19 @@ sibling order.
|
|
21
26
|
## Usage
|
22
27
|
|
23
28
|
CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
|
24
|
-
|
25
|
-
|
26
|
-
|
29
|
+
|
30
|
+
```
|
31
|
+
gem install csv-diff
|
32
|
+
```
|
27
33
|
|
28
34
|
To compare two CSV files where the field names are in the first row of the file,
|
29
35
|
and the first field contains the unique key for each record, simply use:
|
30
|
-
```ruby
|
31
|
-
require 'csv-diff'
|
32
36
|
|
33
|
-
|
34
|
-
|
37
|
+
```ruby
|
38
|
+
require 'csv-diff'
|
39
|
+
|
40
|
+
diff = CSVDiff.new(file1, file2)
|
41
|
+
```
|
35
42
|
|
36
43
|
The returned diff object can be queried for the differences that exist between
|
37
44
|
the two files, e.g.:
|
@@ -96,7 +103,7 @@ change in order) of all 6 rows.
|
|
96
103
|
|
97
104
|
The more correct specification of this file is that column 0 contains a unique parent
|
98
105
|
identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
|
99
|
-
deduce that there
|
106
|
+
deduce that there are in fact only two changes in order - the swap in positions of A and
|
100
107
|
B below Root.
|
101
108
|
|
102
109
|
Note: If you aren't interested in changes in the order of siblings, then you could use
|
@@ -121,43 +128,59 @@ Warnings may be raised for any of the following:
|
|
121
128
|
The simplest use case is as shown above, where the data to be diffed is in CSV files
|
122
129
|
with the column names as the first record, and where the unique key is the first
|
123
130
|
column in the data. In this case, a diff can be created simply via:
|
124
|
-
```ruby
|
125
|
-
diff = CSVDiff.new(file1, file2)
|
126
|
-
```
|
127
131
|
|
128
|
-
|
132
|
+
```ruby
|
133
|
+
diff = CSVDiff.new(file1, file2)
|
134
|
+
```
|
135
|
+
|
136
|
+
### Specifying Unique Row Identifiers
|
129
137
|
|
130
138
|
Often however, rows are not uniquely identifiable via the first column in the file.
|
131
139
|
In a parent-child hierarchy, for example, combinations of parent and child may be
|
132
|
-
necessary to uniquely identify a row
|
133
|
-
|
134
|
-
|
140
|
+
necessary to uniquely identify a row, while in other cases a combination of fields
|
141
|
+
may be needed to derive a natural unique key or identifier for each row.
|
142
|
+
In these cases, it is necessary to indicate to CSVDiff which fields are needed to
|
143
|
+
uniquely identify common rows across the two files. This can be done in several
|
144
|
+
different ways.
|
145
|
+
|
146
|
+
#### :key_field(s)
|
135
147
|
|
136
|
-
|
148
|
+
The first method is using the **key_fields** option (or key_field if you have only a
|
149
|
+
single key field). Use this option when your data represents a flat structure rather
|
150
|
+
than a parent-child hierarchy or flattened tree. You can specify key_fields using
|
151
|
+
field numbers/column indices (0-based):
|
137
152
|
|
138
153
|
```ruby
|
139
154
|
diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
|
140
155
|
```
|
141
156
|
|
142
|
-
|
157
|
+
Alternatively, you can use the :key_fields option with column names (provided CSVDiff
|
158
|
+
knows the names of your fields, either via the **field_names** option or from headers
|
159
|
+
in the file):
|
143
160
|
|
144
161
|
```ruby
|
145
|
-
diff = CSVDiff.new(file1, file2, key_fields: ['
|
162
|
+
diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
|
146
163
|
```
|
147
164
|
|
148
|
-
|
165
|
+
#### :parent_field(s)/:child_field(s)
|
166
|
+
|
167
|
+
The second method for identifying the unique identifiers in your file is to use the
|
168
|
+
:parent_fields and :child_fields options. Use these options when your data represents
|
169
|
+
a tree structure flattened to a table in parent-child form.
|
170
|
+
|
171
|
+
Using the :parent_fields and :child_fields with field numbers:
|
149
172
|
|
150
173
|
```ruby
|
151
174
|
diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
|
152
175
|
```
|
153
176
|
|
154
|
-
|
177
|
+
Using the :parent_fields and :child_fields with column names:
|
155
178
|
|
156
179
|
```ruby
|
157
180
|
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
|
158
181
|
```
|
159
182
|
|
160
|
-
### Using Non-CSV
|
183
|
+
### Using Non-CSV Sources
|
161
184
|
|
162
185
|
Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
|
163
186
|
of Arrays:
|
@@ -174,7 +197,53 @@ DATA2 = [
|
|
174
197
|
['A', 'A2', 'Account2']
|
175
198
|
]
|
176
199
|
|
177
|
-
diff = CSVDiff.new(DATA1, DATA2,
|
200
|
+
diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
|
201
|
+
```
|
202
|
+
|
203
|
+
Data can also be diffed if it is an XML source, although this requires a little
|
204
|
+
more effort to tell CSVDiff how to transform/extract content from the XML document
|
205
|
+
into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
|
206
|
+
will need to install this gem to use CSVDiff with XML sources.
|
207
|
+
|
208
|
+
To begin, use the CSVDiff::XMLSource class to define how to convert
|
209
|
+
your XML content to an array-of-arrays. The XMLSource class is quite flexible,
|
210
|
+
and can be used to convert single or multiple XML sources into a single data set
|
211
|
+
for diffing, and different documents may even have different layouts.
|
212
|
+
|
213
|
+
The first step is to create an XMLSource object, which requires a label to
|
214
|
+
identify the type of data it will generate:
|
215
|
+
```ruby
|
216
|
+
xml_source1 = CSVDiff::XMLSource.new('My Label')
|
217
|
+
```
|
218
|
+
|
219
|
+
Next, we pass XML documents to this source, and specify XPath expressions for each
|
220
|
+
row and column of data to produce via the `process(rec_xpath, field_maps, options)`
|
221
|
+
method:
|
222
|
+
|
223
|
+
* An XPath expression is provided to select each node value in the document that
|
224
|
+
will represent a row. Taking an HTML table as an example of something we wanted
|
225
|
+
to parse, your rec_xpath value might be something like the following:
|
226
|
+
`'//table/tbody/tr'`. This would locate all tables in the document, and create
|
227
|
+
a new row of data in the XMLSource every time a `<tr>` tag was encountered.
|
228
|
+
* A hash of field_maps is then provided to describe how to generate column values
|
229
|
+
for each row of data. The keys to field_maps are the names of the fields to be
|
230
|
+
output, while the values are the expression for how to generate values. Most
|
231
|
+
commonly, this will be another XPath expression that is evaluated in the context
|
232
|
+
of the node returned by the row XPath expression. So continuing our HTML example,
|
233
|
+
we might use `'./td[1]/text()'` (XPath positions are 1-based) to select the content of the
|
234
|
+
first `<td>` element within the `<tr>` representing the current row.
|
235
|
+
|
236
|
+
```ruby
|
237
|
+
xml_source1.process('//table/tbody/tr',
|
238
|
+
col_A: './td[1]/text()',
|
239
|
+
col_B: './td[2]/text()',
|
240
|
+
col_C: './td[3]/text()')
|
241
|
+
```
|
242
|
+
|
243
|
+
Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
|
244
|
+
objects as the source:
|
245
|
+
```ruby
|
246
|
+
diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
|
178
247
|
```
|
179
248
|
|
180
249
|
### Specifying Column Names
|
@@ -211,6 +280,23 @@ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam'
|
|
211
280
|
ignore_fields: ['CreatedAt', 'UpdatedAt'])
|
212
281
|
```
|
213
282
|
|
283
|
+
### Filtering Rows
|
284
|
+
|
285
|
+
If you need to filter source data before running the diff process, you can use the :include
|
286
|
+
and :exclude options to do so. Both options take a Hash as their value; the hash should have
|
287
|
+
keys that are the field names or indexes (0-based) on which to filter, and whose values are
|
288
|
+
regular expressions or lambdas to be applied to values of the corresponding field. Rows will
|
289
|
+
only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
|
290
|
+
```ruby
|
291
|
+
# Generate a diff of Arsenal home games not refereed by Clattenburg
|
292
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
293
|
+
include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
|
294
|
+
|
295
|
+
# Generate a diff of games played over the Xmas/New Year period
|
296
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
297
|
+
include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
|
298
|
+
```
|
299
|
+
|
214
300
|
### Ignoring Certain Changes
|
215
301
|
|
216
302
|
CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
|
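The row filtering described in the README changes above can be illustrated without the gem. The following is a minimal sketch in plain Ruby, assuming rows are hashes keyed by field name. The names `matches?`, `keep_row?`, `includes:` and `excludes:` are hypothetical and are not csv-diff's API, and treating a plain string condition as an exact match is an assumption. The sketch only mirrors the stated rule that a row is diffed when it satisfies every include condition and no exclude condition.

```ruby
# Hypothetical helper (not part of csv-diff): tests one field value
# against a filter condition.
def matches?(condition, value)
  case condition
  when Regexp then !(condition =~ value.to_s).nil?
  when Proc   then condition.call(value) ? true : false
  else value.to_s == condition.to_s # assumption: strings match exactly
  end
end

# Hypothetical helper: a row is kept when every include condition holds
# and no exclude condition holds.
def keep_row?(row, includes: {}, excludes: {})
  included = includes.all? { |field, cond| matches?(cond, row[field.to_s]) }
  excluded = excludes.any? { |field, cond| matches?(cond, row[field.to_s]) }
  included && !excluded
end

ROWS = [
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'M Clattenburg' },
  { 'HomeTeam' => 'Arsenal', 'Referee' => 'M Dean' },
  { 'HomeTeam' => 'Chelsea', 'Referee' => 'M Dean' }
].freeze

# Arsenal home games not refereed by Clattenburg
KEPT = ROWS.select do |row|
  keep_row?(row,
            includes: { HomeTeam: /Arsenal/ },
            excludes: { Referee: /Clattenburg/ })
end
```

Filtering before indexing keeps the diff itself unaware of the conditions: rows that fail the filter simply never appear on either side.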
data/lib/csv-diff.rb
CHANGED
data/lib/csv-diff/algorithm.rb
CHANGED
@@ -3,6 +3,55 @@ class CSVDiff
|
|
3
3
|
# Implements the CSV diff algorithm.
|
4
4
|
module Algorithm
|
5
5
|
|
6
|
+
# Holds the details of a single difference
|
7
|
+
class Diff
|
8
|
+
|
9
|
+
attr_accessor :diff_type
|
10
|
+
attr_reader :fields
|
11
|
+
attr_reader :row
|
12
|
+
attr_reader :sibling_position
|
13
|
+
|
14
|
+
def initialize(diff_type, fields, row_idx, pos_idx)
|
15
|
+
@diff_type = diff_type
|
16
|
+
@fields = fields
|
17
|
+
@row = row_idx + 1
|
18
|
+
self.sibling_position = pos_idx
|
19
|
+
end
|
20
|
+
|
21
|
+
|
22
|
+
def sibling_position=(pos_idx)
|
23
|
+
if pos_idx.is_a?(Array)
|
24
|
+
pos_idx.compact!
|
25
|
+
if pos_idx.first != pos_idx.last
|
26
|
+
@sibling_position = pos_idx.map{ |pos| pos + 1 }
|
27
|
+
else
|
28
|
+
@sibling_position = pos_idx.first + 1
|
29
|
+
end
|
30
|
+
else
|
31
|
+
@sibling_position = pos_idx + 1
|
32
|
+
end
|
33
|
+
end
|
34
|
+
|
35
|
+
|
36
|
+
# For backwards compatibility and access to fields with differences
|
37
|
+
def [](key)
|
38
|
+
case key
|
39
|
+
when :action
|
40
|
+
a = diff_type.to_s
|
41
|
+
a[0] = a[0].upcase
|
42
|
+
a
|
43
|
+
when :row
|
44
|
+
@row
|
45
|
+
when :sibling_position
|
46
|
+
@sibling_position
|
47
|
+
else
|
48
|
+
@fields[key] || @fields[key.to_s.intern] || @fields[key.to_s]
|
49
|
+
end
|
50
|
+
end
|
51
|
+
|
52
|
+
end
|
53
|
+
|
54
|
+
|
6
55
|
# Diffs two CSVSource structures.
|
7
56
|
#
|
8
57
|
# @param left [CSVSource] A CSVSource object containing the contents of
|
@@ -22,43 +71,53 @@ class CSVDiff
|
|
22
71
|
# items that exist in both +left+ and +right+.
|
23
72
|
# @option options [Boolean] :ignore_deletes If set to true, we ignore any
|
24
73
|
# new items that appear only in +left+.
|
74
|
+
# @option options [Hash<Object,Proc>] :equality_procs A Hash mapping fields
|
75
|
+
# to a 2-arg Proc that should be used to compare values in that field for
|
76
|
+
# equality.
|
25
77
|
def diff_sources(left, right, key_fields, diff_fields, options = {})
|
26
78
|
unless left.case_sensitive? == right.case_sensitive?
|
27
79
|
raise ArgumentError, "Left and right must have same settings for case-sensitivity"
|
28
80
|
end
|
29
|
-
|
81
|
+
unless left.parent_fields.length == right.parent_fields.length
|
82
|
+
raise ArgumentError, "Left and right must have same settings for parent/child fields"
|
83
|
+
end
|
84
|
+
|
85
|
+
# Ensure key fields are not also in the diff_fields
|
86
|
+
diff_fields = diff_fields - key_fields
|
87
|
+
|
30
88
|
left_index = left.index
|
31
89
|
left_values = left.lines
|
32
90
|
left_keys = left_values.keys
|
33
91
|
right_index = right.index
|
34
92
|
right_values = right.lines
|
35
93
|
right_keys = right_values.keys
|
36
|
-
|
94
|
+
parent_field_count = left.parent_fields.length
|
37
95
|
|
38
96
|
include_adds = !options[:ignore_adds]
|
39
97
|
include_moves = !options[:ignore_moves]
|
40
98
|
include_updates = !options[:ignore_updates]
|
41
99
|
include_deletes = !options[:ignore_deletes]
|
42
100
|
|
43
|
-
|
101
|
+
@case_sensitive = left.case_sensitive?
|
102
|
+
@equality_procs = options.fetch(:equality_procs, {})
|
103
|
+
|
104
|
+
diffs = {}
|
105
|
+
potential_moves = Hash.new{ |h, k| h[k] = [] }
|
44
106
|
|
45
107
|
# First identify deletions
|
46
108
|
if include_deletes
|
47
109
|
(left_keys - right_keys).each do |key|
|
48
110
|
# Delete
|
49
111
|
key_vals = key.split('~', -1)
|
50
|
-
parent = key_vals[0...
|
112
|
+
parent = key_vals[0...parent_field_count].join('~')
|
113
|
+
child = key_vals[parent_field_count..-1].join('~')
|
51
114
|
left_parent = left_index[parent]
|
52
115
|
left_value = left_values[key]
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
key_fields.each do |field_name|
|
59
|
-
id[field_name] = left_value[field_name]
|
60
|
-
end
|
61
|
-
diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
|
116
|
+
row_idx = left_keys.index(key)
|
117
|
+
sib_idx = left_parent.index(key)
|
118
|
+
raise "Can't locate key #{key} in parent #{parent}" unless sib_idx
|
119
|
+
diffs[key] = Diff.new(:delete, left_value, row_idx, sib_idx)
|
120
|
+
potential_moves[child] << key
|
62
121
|
#puts "Delete: #{key}"
|
63
122
|
end
|
64
123
|
end
|
@@ -66,7 +125,7 @@ class CSVDiff
|
|
66
125
|
# Now identify adds/updates
|
67
126
|
right_keys.each_with_index do |key, right_row_id|
|
68
127
|
key_vals = key.split('~', -1)
|
69
|
-
parent = key_vals[0...
|
128
|
+
parent = key_vals[0...parent_field_count].join('~')
|
70
129
|
left_parent = left_index[parent]
|
71
130
|
right_parent = right_index[parent]
|
72
131
|
left_value = left_values[key]
|
@@ -74,13 +133,12 @@ class CSVDiff
|
|
74
133
|
left_idx = left_parent && left_parent.index(key)
|
75
134
|
right_idx = right_parent && right_parent.index(key)
|
76
135
|
|
77
|
-
id = {}
|
78
|
-
id[:row] = right_row_id + 1
|
79
|
-
id[:sibling_position] = right_idx + 1
|
80
|
-
key_fields.each do |field_name|
|
81
|
-
id[field_name] = right_value[field_name]
|
82
|
-
end
|
83
136
|
if left_idx && right_idx
|
137
|
+
if include_updates && (changes = diff_row(left_value, right_value, diff_fields))
|
138
|
+
id = id_fields(key_fields, right_value)
|
139
|
+
diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
|
140
|
+
#puts "Change: #{key}"
|
141
|
+
end
|
84
142
|
if include_moves
|
85
143
|
left_common = left_parent & right_parent
|
86
144
|
right_common = right_parent & left_parent
|
@@ -88,19 +146,31 @@ class CSVDiff
|
|
88
146
|
right_pos = right_common.index(key)
|
89
147
|
if left_pos != right_pos
|
90
148
|
# Move
|
91
|
-
diffs[key]
|
92
|
-
|
149
|
+
if d = diffs[key]
|
150
|
+
d.sibling_position = [left_idx, right_idx]
|
151
|
+
else
|
152
|
+
id = id_fields(key_fields, right_value)
|
153
|
+
diffs[key] = Diff.new(:move, id, right_row_id, [left_idx, right_idx])
|
154
|
+
end
|
93
155
|
#puts "Move #{left_idx} -> #{right_idx}: #{key}"
|
94
156
|
end
|
95
157
|
end
|
96
|
-
|
97
|
-
diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
|
98
|
-
#puts "Change: #{key}"
|
99
|
-
end
|
100
|
-
elsif include_adds && right_idx
|
158
|
+
elsif right_idx
|
101
159
|
# Add
|
102
|
-
|
103
|
-
|
160
|
+
child = key_vals[parent_field_count..-1].join('~')
|
161
|
+
if potential_moves.has_key?(child) && old_key = potential_moves[child].pop
|
162
|
+
diffs.delete(old_key)
|
163
|
+
if include_updates
|
164
|
+
left_value = left_values[old_key]
|
165
|
+
id = id_fields(right.child_fields, right_value)
|
166
|
+
changes = diff_row(left_value, right_value, left.parent_fields + diff_fields)
|
167
|
+
diffs[key] = Diff.new(:update, id.merge!(changes), right_row_id, right_idx)
|
168
|
+
#puts "Update Parent: #{key}"
|
169
|
+
end
|
170
|
+
elsif include_adds
|
171
|
+
diffs[key] = Diff.new(:add, right_value, right_row_id, right_idx)
|
172
|
+
#puts "Add: #{key}"
|
173
|
+
end
|
104
174
|
end
|
105
175
|
end
|
106
176
|
|
@@ -116,27 +186,41 @@ class CSVDiff
|
|
116
186
|
# @param right_row [Hash] The version of the CSV row from the right/to
|
117
187
|
# file.
|
118
188
|
# @param fields [Array<String>] An array of field names to compare.
|
119
|
-
# @param case_sensitive [Boolean] Whether field comparisons should be
|
120
|
-
# case sensitive or not.
|
121
189
|
# @return [Hash<String, Array>] A Hash whose keys are the fields that
|
122
190
|
# contain differences, and whose values are a two-element array of
|
123
191
|
# [left/from, right/to] values.
|
124
|
-
def diff_row(left_row, right_row, fields
|
192
|
+
def diff_row(left_row, right_row, fields)
|
125
193
|
diffs = {}
|
126
194
|
fields.each do |attr|
|
195
|
+
eq_proc = @equality_procs[attr]
|
127
196
|
right_val = right_row[attr]
|
128
197
|
right_val = nil if right_val == ""
|
129
198
|
left_val = left_row[attr]
|
130
199
|
left_val = nil if left_val == ""
|
131
|
-
if
|
132
|
-
|
200
|
+
if eq_proc
|
201
|
+
diffs[attr] = [left_val, right_val] unless eq_proc.call(left_val, right_val)
|
202
|
+
elsif @case_sensitive
|
203
|
+
diffs[attr] = [left_val, right_val] unless left_val == right_val
|
204
|
+
elsif (left_val.to_s.upcase != right_val.to_s.upcase)
|
133
205
|
diffs[attr] = [left_val, right_val]
|
134
|
-
#puts "#{attr}: #{left_val} -> #{right_val}"
|
135
206
|
end
|
136
207
|
end
|
137
208
|
diffs if diffs.size > 0
|
138
209
|
end
|
139
210
|
|
211
|
+
|
212
|
+
private
|
213
|
+
|
214
|
+
|
215
|
+
# Return a hash containing just the key field values
|
216
|
+
def id_fields(key_fields, fields)
|
217
|
+
id = {}
|
218
|
+
key_fields.each do |field_name|
|
219
|
+
id[field_name] = fields[field_name]
|
220
|
+
end
|
221
|
+
id
|
222
|
+
end
|
223
|
+
|
140
224
|
end
|
141
225
|
|
142
226
|
end
|
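The field comparison introduced in this hunk (an optional per-field equality proc, then a case-sensitive or case-insensitive fallback) can be tried in isolation. The sketch below is a standalone approximation modelled on the `diff_row` changes shown above, not the gem's code: `compare_rows` and its keyword arguments are hypothetical names, and options are passed explicitly rather than held in instance variables.

```ruby
# Standalone sketch modelled on the diff_row logic in this diff.
# Returns a hash of field => [left, right] for differing fields, or nil.
def compare_rows(left_row, right_row, fields, equality_procs: {}, case_sensitive: true)
  changes = {}
  fields.each do |field|
    left_val  = left_row[field]
    right_val = right_row[field]
    # Empty strings are treated the same as missing values
    left_val  = nil if left_val == ''
    right_val = nil if right_val == ''
    same =
      if (eq = equality_procs[field])
        eq.call(left_val, right_val)     # custom 2-arg comparison
      elsif case_sensitive
        left_val == right_val
      else
        left_val.to_s.upcase == right_val.to_s.upcase
      end
    changes[field] = [left_val, right_val] unless same
  end
  changes.empty? ? nil : changes
end

LEFT  = { 'Name' => 'widget', 'Price' => '10.0',  'Note' => '' }.freeze
RIGHT = { 'Name' => 'Widget', 'Price' => '10.00', 'Note' => nil }.freeze

# Compare prices numerically so that '10.0' and '10.00' count as equal
NUMERIC_EQ = ->(a, b) { a.to_f == b.to_f }

CHANGES = compare_rows(LEFT, RIGHT, %w[Name Price Note],
                       equality_procs: { 'Price' => NUMERIC_EQ })
```

Here only `Name` is reported as changed; with `case_sensitive: false` the same call reports no differences at all.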
data/lib/csv-diff/csv_diff.rb
CHANGED
@@ -81,13 +81,15 @@ class CSVDiff
|
|
81
81
|
# @option options [Boolean] :ignore_deletes If true, records that appear
|
82
82
|
# in the left/from file but not in the right/to file are not reported.
|
83
83
|
def initialize(left, right, options = {})
|
84
|
-
@left = left.is_a?(
|
84
|
+
@left = left.is_a?(Source) ? left : CSVSource.new(left, options)
|
85
|
+
@left.index_source if @left.lines.nil?
|
85
86
|
raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
|
86
|
-
@right = right.is_a?(
|
87
|
+
@right = right.is_a?(Source) ? right : CSVSource.new(right, options)
|
88
|
+
@right.index_source if @right.lines.nil?
|
87
89
|
raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
|
88
90
|
@warnings = []
|
89
91
|
@diff_fields = get_diff_fields(@left.field_names, @right.field_names, options)
|
90
|
-
@key_fields = @left.key_fields
|
92
|
+
@key_fields = @left.key_fields
|
91
93
|
diff(options)
|
92
94
|
end
|
93
95
|
|
@@ -141,15 +143,13 @@ class CSVDiff
|
|
141
143
|
ignore_fields = options.fetch(:ignore_fields, [])
|
142
144
|
ignore_fields = [ignore_fields] unless ignore_fields.is_a?(Array)
|
143
145
|
ignore_fields.map! do |f|
|
144
|
-
(f.is_a?(
|
146
|
+
(f.is_a?(Numeric) ? right_fields[f] : f).upcase
|
145
147
|
end
|
146
148
|
diff_fields = []
|
147
149
|
if options[:diff_common_fields_only]
|
148
150
|
right_fields.each_with_index do |fld, i|
|
149
151
|
if left_fields.include?(fld)
|
150
152
|
diff_fields << fld unless ignore_fields.include?(fld.upcase)
|
151
|
-
else
|
152
|
-
@warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
|
153
153
|
end
|
154
154
|
end
|
155
155
|
else
|
data/lib/csv-diff/csv_source.rb
CHANGED
@@ -2,39 +2,7 @@ class CSVDiff
|
|
2
2
|
|
3
3
|
# Represents a CSV input (i.e. the left/from or right/to input) to the diff
|
4
4
|
# process.
|
5
|
-
class CSVSource
|
6
|
-
|
7
|
-
# @return [String] the path to the source file
|
8
|
-
attr_accessor :path
|
9
|
-
# @return [Array<String>] The names of the fields in the source file
|
10
|
-
attr_reader :field_names
|
11
|
-
# @return [Array<String>] The names of the field(s) that uniquely
|
12
|
-
# identify each row.
|
13
|
-
attr_reader :key_fields
|
14
|
-
# @return [Array<String>] The names of the field(s) that identify a
|
15
|
-
# common parent of child records.
|
16
|
-
attr_reader :parent_fields
|
17
|
-
# @return [Array<String>] The names of the field(s) that distinguish a
|
18
|
-
# child of a parent record.
|
19
|
-
attr_reader :child_fields
|
20
|
-
# @return [Boolean] True if the source has been indexed with case-
|
21
|
-
# sensitive keys, or false if it has been indexed using upper-case key
|
22
|
-
# values.
|
23
|
-
attr_reader :case_sensitive
|
24
|
-
alias_method :case_sensitive?, :case_sensitive
|
25
|
-
# @return [Boolean] True if leading/trailing whitespace should be stripped
|
26
|
-
# from fields
|
27
|
-
attr_reader :trim_whitespace
|
28
|
-
# @return [Hash<String,Hash>] A hash containing each line of the source,
|
29
|
-
# keyed on the values of the +key_fields+.
|
30
|
-
attr_reader :lines
|
31
|
-
# @return [Hash<String,Array<String>>] A hash containing each parent key,
|
32
|
-
# and an Array of the child keys it is a parent of.
|
33
|
-
attr_reader :index
|
34
|
-
# @return [Array<String>] An array of any warnings encountered while
|
35
|
-
# processing the source.
|
36
|
-
attr_reader :warnings
|
37
|
-
|
5
|
+
class CSVSource < Source
|
38
6
|
|
39
7
|
# Creates a new diff source.
|
40
8
|
#
|
@@ -72,92 +40,32 @@ class CSVDiff
|
|
72
40
|
# @option options [String] :child_field The name of the field(s) that
|
73
41
|
# uniquely identify a child of a parent.
|
74
42
|
# @option options [Boolean] :case_sensitive If true (the default), keys
|
75
|
-
#
|
76
|
-
#
|
43
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
44
|
+
# case-insensitive comparisons.
|
45
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
46
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
47
|
+
# regular expressions will be indexed and included in the diff process.
|
48
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
49
|
+
# regular expression(s). Source rows with a field value that satisfies
|
50
|
+
# the regular expressions will be excluded from the diff process.
|
77
51
|
def initialize(source, options = {})
|
52
|
+
super(options)
|
78
53
|
if source.is_a?(String)
|
79
54
|
require 'csv'
|
80
55
|
mode_string = options[:encoding] ? "r:#{options[:encoding]}" : 'r'
|
81
56
|
csv_options = options.fetch(:csv_options, {})
|
82
57
|
@path = source
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
@parent_fields = @key_fields[0...-1]
|
88
|
-
@child_fields = @key_fields[-1..-1]
|
89
|
-
else
|
90
|
-
@parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
|
91
|
-
@child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
|
92
|
-
@key_fields = @parent_fields + @child_fields
|
93
|
-
end
|
94
|
-
@field_names = options[:field_names]
|
95
|
-
@warnings = []
|
96
|
-
index_source(source, options)
|
97
|
-
end
|
98
|
-
|
99
|
-
|
100
|
-
# Returns the row in the CSV source corresponding to the supplied key.
|
101
|
-
#
|
102
|
-
# @param key [String] The unique key to use to lookup the row.
|
103
|
-
# @return [Hash] The fields for the line corresponding to +key+, or nil
|
104
|
-
# if the key is not recognised.
|
105
|
-
def [](key)
|
106
|
-
@lines[key]
|
107
|
-
end
|
108
|
-
|
109
|
-
|
110
|
-
private
|
111
|
-
|
112
|
-
# Given an array of lines, where each line is an array of fields, indexes
|
113
|
-
# the array contents so that it can be looked up by key.
|
114
|
-
def index_source(lines, options)
|
115
|
-
@lines = {}
|
116
|
-
@index = Hash.new{ |h, k| h[k] = [] }
|
117
|
-
@key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
|
118
|
-
@case_sensitive = options.fetch(:case_sensitive, true)
|
119
|
-
@trim_whitespace = options.fetch(:trim_whitespace, false)
|
120
|
-
line_num = 0
|
121
|
-
lines.each do |row|
|
122
|
-
line_num += 1
|
123
|
-
next if line_num == 1 && @field_names && options[:ignore_header]
|
124
|
-
unless @field_names
|
125
|
-
@field_names = row
|
126
|
-
@key_fields = find_field_indexes(@key_fields, @field_names)
|
127
|
-
next
|
128
|
-
end
|
129
|
-
field_vals = row
|
130
|
-
line = {}
|
131
|
-
@field_names.each_with_index do |field, i|
|
132
|
-
line[field] = field_vals[i]
|
133
|
-
line[field].strip! if @trim_whitespace && line[field]
|
134
|
-
end
|
135
|
-
key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
|
136
|
-
key = key_values.join('~')
|
137
|
-
parent_key = key_values[0...(@parent_fields.length)].join('~')
|
138
|
-
parent_key.upcase! unless @case_sensitive
|
139
|
-
if @lines[key]
|
140
|
-
@warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
|
141
|
-
else
|
142
|
-
@index[parent_key] << key
|
143
|
-
@lines[key] = line
|
144
|
-
end
|
145
|
-
end
|
146
|
-
end
|
147
|
-
|
148
|
-
|
149
|
-
# Converts an array of field names to an array of indexes of the fields
|
150
|
-
# matching those names.
|
151
|
-
def find_field_indexes(key_fields, field_names)
|
152
|
-
key_fields.map do |field|
|
153
|
-
if field.is_a?(Fixnum)
|
154
|
-
field
|
155
|
-
else
|
156
|
-
field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
|
157
|
-
raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
|
158
|
-
field_names.join(', ')}"
|
58
|
+
# When you call CSV.open, it's best to pass in a block so that after it's yielded,
|
59
|
+
# the underlying file handle is closed. Otherwise, you risk leaking the handle.
|
60
|
+
@data = CSV.open(@path, mode_string, csv_options) do |csv|
|
61
|
+
csv.readlines
|
159
62
|
end
|
63
|
+
elsif source.is_a?(Enumerable) && source.size == 0 || (source.size > 0 && source.first.is_a?(Enumerable))
|
64
|
+
@data = source
|
65
|
+
else
|
66
|
+
raise ArgumentError, "source must be a path to a file or an Enumerable<Enumerable>"
|
160
67
|
end
|
68
|
+
index_source
|
161
69
|
end
|
162
70
|
|
163
71
|
end
|
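The source handling in the new `CSVSource#initialize` (read a file via `CSV.open` with a block, or accept array-of-arrays data directly) can be sketched as a standalone function. This is an illustration under assumptions, not the gem's implementation: `load_rows` is a hypothetical name, and it checks every row rather than only the first. Note that `&&` binds more tightly than `||` in Ruby, so the `elsif` condition in the hunk above groups as `(a && b) || c`; the sketch avoids relying on that grouping.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: returns rows as an array of arrays from either a
# file path or data that is already Enumerable<Enumerable>.
def load_rows(source, csv_options = {})
  if source.is_a?(String)
    # The block form closes the underlying file handle when the block returns
    CSV.open(source, 'r', **csv_options) { |csv| csv.readlines }
  elsif source.is_a?(Enumerable) && source.all? { |row| row.is_a?(Enumerable) }
    source # an empty collection passes trivially
  else
    raise ArgumentError, 'source must be a path to a file or an Enumerable<Enumerable>'
  end
end

# Usage: read a small temporary CSV file, then pass data directly
FILE_ROWS = Tempfile.create(['sketch', '.csv']) do |f|
  f.write("id,name\n1,Alpha\n")
  f.flush
  load_rows(f.path)
end

ARRAY_ROWS = load_rows([%w[id name], %w[2 Beta]])
```

With this shape, an unsupported argument such as an Integer raises `ArgumentError` instead of failing later on a missing method.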
data/lib/csv-diff/source.rb
ADDED
@@ -0,0 +1,289 @@
|
|
1
|
+
class CSVDiff
|
2
|
+
|
3
|
+
# Represents an input (i.e. the left/from or right/to input) to the diff
|
4
|
+
# process.
|
5
|
+
class Source
|
6
|
+
|
7
|
+
# @return [String] the path to the source file
|
8
|
+
attr_accessor :path
|
9
|
+
# @return [Array<Array>] The data for this source
|
10
|
+
attr_reader :data
|
11
|
+
|
12
|
+
# @return [Array<String>] The names of the fields in the source file
|
13
|
+
attr_reader :field_names
|
14
|
+
# @return [Array<String>] The names of the field(s) that uniquely
|
15
|
+
# identify each row.
|
16
|
+
attr_reader :key_fields
|
17
|
+
# @return [Array<String>] The names of the field(s) that identify a
|
18
|
+
# common parent of child records.
|
19
|
+
attr_reader :parent_fields
|
20
|
+
# @return [Array<String>] The names of the field(s) that distinguish a
|
21
|
+
# child of a parent record.
|
22
|
+
attr_reader :child_fields
|
23
|
+
|
24
|
+
# @return [Array<Fixnum>] The indexes of the key fields in the source
|
25
|
+
# file.
|
26
|
+
attr_reader :key_field_indexes
|
27
|
+
# @return [Array<Fixnum>] The indexes of the parent fields in the source
|
28
|
+
# file.
|
29
|
+
attr_reader :parent_field_indexes
|
30
|
+
# @return [Array<Fixnum>] The indexes of the child fields in the source
|
31
|
+
# file.
|
32
|
+
attr_reader :child_field_indexes
|
33
|
+
|
34
|
+
# @return [Boolean] True if the source has been indexed with case-
|
35
|
+
# sensitive keys, or false if it has been indexed using upper-case key
|
36
|
+
# values.
|
37
|
+
attr_reader :case_sensitive
|
38
|
+
alias_method :case_sensitive?, :case_sensitive
|
39
|
+
# @return [Boolean] True if leading/trailing whitespace should be stripped
|
40
|
+
# from fields
|
41
|
+
attr_reader :trim_whitespace
|
42
|
+
# @return [Hash<String,Hash>] A hash containing each line of the source,
|
43
|
+
# keyed on the values of the +key_fields+.
|
44
|
+
attr_reader :lines
|
45
|
+
# @return [Hash<String,Array<String>>] A hash containing each parent key,
|
46
|
+
# and an Array of the child keys it is a parent of.
|
47
|
+
attr_reader :index
|
48
|
+
# @return [Array<String>] An array of any warnings encountered while
|
49
|
+
# processing the source.
|
50
|
+
attr_reader :warnings
|
51
|
+
# @return [Fixnum] A count of the lines processed from this source.
|
52
|
+
# Excludes any header and duplicate records identified during indexing.
|
53
|
+
attr_reader :line_count
|
54
|
+
# @return [Fixnum] A count of the lines from this source that were skipped
|
55
|
+
# due to filter conditions.
|
56
|
+
attr_reader :skip_count
|
57
|
+
# @return [Fixnum] A count of the lines from this source that had the same
|
58
|
+
# key value as another line.
|
59
|
+
attr_reader :dup_count
|
60
|
+
|
61
|
+
|
62
|
+
# Creates a new diff source.
|
63
|
+
#
|
64
|
+
# A diff source must contain at least one field that will be used as the
|
65
|
+
# key to identify the same record in a different version of this file.
|
66
|
+
# If not specified via one of the options, the first field is assumed to
|
67
|
+
# be the unique key.
|
68
|
+
#
|
69
|
+
# If multiple fields combine to form a unique key, the combined fields
|
70
|
+
# are considered as a single unique identifier. If your key represents
|
71
|
+
# data that can be represented as a tree, you can instead break your key
|
72
|
+
# fields into :parent_fields and :child_fields. By doing this, if a child
|
73
|
+
# key is deleted from one parent, and added to another, that will be
|
74
|
+
# reported as an update, with a change to the parent key part(s) of the
|
75
|
+
# record.
|
76
|
+
#
|
77
|
+
# All key options can be specified either by field name, or by field
|
78
|
+
# index (0 based).
|
79
|
+
#
|
80
|
+
# @param options [Hash] An options hash.
|
81
|
+
# @option options [Array<String>] :field_names The names of each of the
|
82
|
+
# fields in +source+.
|
83
|
+
# @option options [Boolean] :ignore_header If true, and :field_names has
|
84
|
+
# been specified, then the first row of the file is ignored.
|
85
|
+
# @option options [String] :key_field The name of the field that uniquely
|
86
|
+
# identifies each row.
|
87
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
88
|
+
# that uniquely identify each row.
|
89
|
+
# @option options [String] :parent_field The name of the field(s) that
|
90
|
+
# identify a parent within which sibling order should be checked.
|
91
|
+
# @option options [String] :child_field The name of the field(s) that
|
92
|
+
# uniquely identify a child of a parent.
|
93
|
+
# @option options [Boolean] :case_sensitive If true (the default), keys
|
94
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
95
|
+
# case-insensitive comparisons.
|
96
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
97
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
98
|
+
# regular expressions will be indexed and included in the diff process.
|
99
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
100
|
+
# regular expression(s). Source rows with a field value that satisfies
|
101
|
+
# the regular expressions will be excluded from the diff process.
|
102
|
+
def initialize(options = {})
|
103
|
+
if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
|
104
|
+
(kf = options.fetch(:key_field, options[:key_fields]))
|
105
|
+
@key_fields = [kf].flatten
|
106
|
+
@parent_fields = []
|
107
|
+
@child_fields = @key_fields
|
108
|
+
else
|
109
|
+
@parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
|
110
|
+
@child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
|
111
|
+
@key_fields = @parent_fields + @child_fields
|
112
|
+
end
|
113
|
+
@field_names = options[:field_names]
|
114
|
+
@case_sensitive = options.fetch(:case_sensitive, true)
|
115
|
+
@trim_whitespace = options.fetch(:trim_whitespace, false)
|
116
|
+
@ignore_header = options[:ignore_header]
|
117
|
+
@include = options[:include]
|
118
|
+
@exclude = options[:exclude]
|
119
|
+
@path = options.fetch(:path, 'NA') unless @path
|
120
|
+
@warnings = []
|
121
|
+
end
|
122
|
+
|
123
|
+
|
124
|
+
def path?
|
125
|
+
@path != 'NA'
|
126
|
+
end
|
127
|
+
|
128
|
+
|
129
|
+
# Returns the row in the CSV source corresponding to the supplied key.
|
130
|
+
#
|
131
|
+
# @param key [String] The unique key to use to lookup the row.
|
132
|
+
# @return [Hash] The fields for the line corresponding to +key+, or nil
|
133
|
+
# if the key is not recognised.
|
134
|
+
def [](key)
|
135
|
+
@lines[key]
|
136
|
+
end
|
137
|
+
|
138
|
+
|
139
|
+
# Given an array of lines, where each line is an array of fields, indexes
|
140
|
+
# the array contents so that it can be looked up by key.
|
141
|
+
def index_source
|
142
|
+
@lines = {}
|
143
|
+
@index = Hash.new{ |h, k| h[k] = [] }
|
144
|
+
if @field_names
|
145
|
+
index_fields
|
146
|
+
include_filter = convert_filter(@include, @field_names)
|
147
|
+
exclude_filter = convert_filter(@exclude, @field_names)
|
148
|
+
end
|
149
|
+
@line_count = 0
|
150
|
+
@skip_count = 0
|
151
|
+
@dup_count = 0
|
152
|
+
line_num = 0
|
153
|
+
@data.each do |row|
|
154
|
+
line_num += 1
|
155
|
+
next if line_num == 1 && @field_names && @ignore_header
|
156
|
+
unless @field_names
|
157
|
+
if row.class.name == 'CSV::Row'
|
158
|
+
@field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
|
159
|
+
else
|
160
|
+
@field_names = row.each_with_index.map{ |f, i| f || i.to_s }
|
161
|
+
end
|
162
|
+
index_fields
|
163
|
+
include_filter = convert_filter(@include, @field_names)
|
164
|
+
exclude_filter = convert_filter(@exclude, @field_names)
|
165
|
+
next
|
166
|
+
end
|
167
|
+
field_vals = row
|
168
|
+
line = {}
|
169
|
+
filter = false
|
170
|
+
@field_names.each_with_index do |field, i|
|
171
|
+
val = field_vals[i]
|
172
|
+
val = val.to_s.strip if val && @trim_whitespace
|
173
|
+
line[field] = val
|
174
|
+
if include_filter && f = include_filter[i]
|
175
|
+
filter = !check_filter(f, line[field])
|
176
|
+
end
|
177
|
+
if exclude_filter && f = exclude_filter[i]
|
178
|
+
filter = check_filter(f, line[field])
|
179
|
+
end
|
180
|
+
break if filter
|
181
|
+
end
|
182
|
+
if filter
|
183
|
+
@skip_count += 1
|
184
|
+
next
|
185
|
+
end
|
186
|
+
key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
|
187
|
+
field_vals[kf].to_s :
|
188
|
+
field_vals[kf].to_s.upcase }
|
189
|
+
key = key_values.join('~')
|
190
|
+
parent_key = key_values[0...(@parent_fields.length)].join('~')
|
191
|
+
if @lines[key]
|
192
|
+
@warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
|
193
|
+
@dup_count += 1
|
194
|
+
key += "[#{@dup_count}]"
|
195
|
+
end
|
196
|
+
@index[parent_key] << key
|
197
|
+
@lines[key] = line
|
198
|
+
@line_count += 1
|
199
|
+
end
|
200
|
+
end
|
201
|
+
|
202
|
+
|
203
|
+
# Save the data in this Source as a CSV at +file_path+.
|
204
|
+
#
|
205
|
+
# @param file_path [String] The target path to save the data to.
|
206
|
+
# @param options [Hash] A set of options to pass to CSV.open to control
|
207
|
+
# how the CSV is generated.
|
208
|
+
def save_csv(file_path, options = {})
|
209
|
+
require 'csv'
|
210
|
+
default_opts = {
|
211
|
+
headers: @field_names, write_headers: true
|
212
|
+
}
|
213
|
+
CSV.open(file_path, 'wb', **default_opts.merge(options)) do |csv|
|
214
|
+
@data.each{ |rec| csv << rec }
|
215
|
+
end
|
216
|
+
end
|
217
|
+
|
218
|
+
|
219
|
+
# Convert the data in this source to Array<Hash> using the field names
|
220
|
+
# as keys for the Hash in each row.
|
221
|
+
def to_hash
|
222
|
+
@data.map do |row|
|
223
|
+
hsh = {}
|
224
|
+
@field_names.each_with_index.map{ |fld, i| hsh[fld] = row[i] }
|
225
|
+
hsh
|
226
|
+
end
|
227
|
+
end
|
228
|
+
|
229
|
+
|
230
|
+
private
|
231
|
+
|
232
|
+
|
233
|
+
def index_fields
|
234
|
+
@key_field_indexes = find_field_indexes(@key_fields, @field_names)
|
235
|
+
@parent_field_indexes = find_field_indexes(@parent_fields, @field_names)
|
236
|
+
@child_field_indexes = find_field_indexes(@child_fields, @field_names)
|
237
|
+
@key_fields = @key_field_indexes.map{ |i| @field_names[i] }
|
238
|
+
@parent_fields = @parent_field_indexes.map{ |i| @field_names[i] }
|
239
|
+
@child_fields = @child_field_indexes.map{ |i| @field_names[i] }
|
240
|
+
end
|
241
|
+
|
242
|
+
|
243
|
+
# Converts an array of field names to an array of indexes of the fields
|
244
|
+
# matching those names.
|
245
|
+
def find_field_indexes(key_fields, field_names)
|
246
|
+
key_fields.map do |field|
|
247
|
+
if field.is_a?(Integer)
|
248
|
+
field
|
249
|
+
else
|
250
|
+
field_names.index{ |field_name| field.to_s.downcase == field_name.to_s.downcase } or
|
251
|
+
raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
|
252
|
+
field_names.join(', ')}"
|
253
|
+
end
|
254
|
+
end
|
255
|
+
end
|
256
|
+
|
257
|
+
|
258
|
+
def convert_filter(hsh, field_names)
|
259
|
+
return unless hsh
|
260
|
+
if !hsh.is_a?(Hash)
|
261
|
+
raise ArgumentError, ":include/:exclude option must be a Hash of field name(s)/index(es) to RegExp(s)"
|
262
|
+
end
|
263
|
+
keys = hsh.keys
|
264
|
+
idxs = find_field_indexes(keys, @field_names)
|
265
|
+
Hash[keys.each_with_index.map{ |k, i| [idxs[i], hsh[k]] }]
|
266
|
+
end
|
267
|
+
|
268
|
+
|
269
|
+
def check_filter(filter, field_val)
|
270
|
+
case filter
|
271
|
+
when String
|
272
|
+
if @case_sensitive
|
273
|
+
filter == field_val
|
274
|
+
else
|
275
|
+
filter.downcase == field_val.to_s.downcase
|
276
|
+
end
|
277
|
+
when Regexp
|
278
|
+
filter.match(field_val)
|
279
|
+
when Proc
|
280
|
+
filter.call(field_val)
|
281
|
+
else
|
282
|
+
raise ArgumentError, "Unsupported filter expression: #{filter.inspect}"
|
283
|
+
end
|
284
|
+
end
|
285
|
+
|
286
|
+
end
|
287
|
+
|
288
|
+
end
|
289
|
+
|
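The key handling in index_source above can be sketched in isolation. This is a minimal sketch and not part of the gem: `build_index` and its keyword arguments are hypothetical names, and only the behaviour visible in the diff is mirrored (key parts joined with '~', upper-casing when case-insensitive, a parent key taken from the leading key parts, and a "[n]" suffix for duplicate keys).

```ruby
# Hypothetical helper (not part of csv-diff): builds key => row and
# parent_key => [keys] structures the way index_source appears to.
def build_index(rows, field_names, parent_fields:, child_fields:, case_sensitive: true)
  key_fields = parent_fields + child_fields

  # Resolve each key field (name or 0-based index) to a column index.
  key_idxs = key_fields.map do |f|
    next f if f.is_a?(Integer)
    idx = field_names.index { |n| n.to_s.downcase == f.to_s.downcase }
    raise ArgumentError, "Could not locate field '#{f}'" unless idx
    idx
  end

  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0

  rows.each_with_index do |row, i|
    key_values = key_idxs.map { |k| case_sensitive ? row[k].to_s : row[k].to_s.upcase }
    key = key_values.join('~')
    parent_key = key_values[0...parent_fields.length].join('~')
    if lines.key?(key)
      # Duplicate keys are kept, but under a suffixed key, and a warning is recorded.
      dup_count += 1
      warnings << "Duplicate key '#{key}' encountered at row #{i + 1}"
      key += "[#{dup_count}]"
    end
    index[parent_key] << key
    lines[key] = field_names.zip(row).to_h
  end

  { lines: lines, index: index, warnings: warnings }
end

field_names = %w[Parent Child Description]
rows = [
  ['A', '1', 'first'],
  ['A', '2', 'second'],
  ['B', '1', 'third'],
  ['a', '1', 'duplicate when case-insensitive']
]

result = build_index(rows, field_names,
                     parent_fields: ['parent'], child_fields: ['Child'],
                     case_sensitive: false)
puts result[:lines].keys.inspect
puts result[:warnings]
```

With case-insensitive keys, the fourth row collides with the first, so it is stored under a suffixed key and listed under the same parent key in the index.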
data/lib/csv-diff/xml_source.rb
ADDED
@@ -0,0 +1,142 @@
|
|
1
|
+
require 'nokogiri'
|
2
|
+
require 'cgi'
|
3
|
+
|
4
|
+
|
5
|
+
class CSVDiff
|
6
|
+
|
7
|
+
# Convert XML content to CSV format using XPath selectors to identify the
|
8
|
+
# rows and field values in an XML document
|
9
|
+
class XMLSource < Source
|
10
|
+
|
11
|
+
attr_accessor :context
|
12
|
+
|
13
|
+
# Create a new XMLSource, identified by +path+. Normally this is a path
|
14
|
+
# to the XML document, but any value is fine, as it is just a label to
|
15
|
+
# identify this data set.
|
16
|
+
#
|
17
|
+
# @param path [String] A label for this data set (often a path to the
|
18
|
+
# XML document used as the source).
|
19
|
+
# @param options [Hash] An options hash.
|
20
|
+
# @option options [Array<String>] :field_names The names of each of the
|
21
|
+
# fields in +source+.
|
22
|
+
# @option options [Boolean] :ignore_header If true, and :field_names has
|
23
|
+
# been specified, then the first row of the file is ignored.
|
24
|
+
# @option options [String] :key_field The name of the field that uniquely
|
25
|
+
# identifies each row.
|
26
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
27
|
+
# that uniquely identify each row.
|
28
|
+
# @option options [String] :parent_field The name of the field(s) that
|
29
|
+
# identify a parent within which sibling order should be checked.
|
30
|
+
# @option options [String] :child_field The name of the field(s) that
|
31
|
+
# uniquely identify a child of a parent.
|
32
|
+
# @option options [Boolean] :case_sensitive If true (the default), keys
|
33
|
+
# are indexed as-is; if false, the index is built in upper-case for
|
34
|
+
# case-insensitive comparisons.
|
35
|
+
# @option options [Hash] :include A hash of field name(s) or index(es) to
|
36
|
+
# regular expression(s). Only source rows whose field values satisfy the
|
37
|
+
# regular expressions will be indexed and included in the diff process.
|
38
|
+
# @option options [Hash] :exclude A hash of field name(s) or index(es) to
|
39
|
+
# regular expression(s). Source rows with a field value that satisfies
|
40
|
+
# the regular expressions will be excluded from the diff process.
|
41
|
+
# @option options [String] :context A context value from which fields
|
42
|
+
# can be populated using a Regexp.
|
43
|
+
def initialize(path, options = {})
|
44
|
+
super(options)
|
45
|
+
@path = path
|
46
|
+
@context = options[:context]
|
47
|
+
@data = []
|
48
|
+
end
|
49
|
+
|
50
|
+
|
51
|
+
# Process a +source+, converting the XML into a table of data, using
|
52
|
+
# +rec_xpath+ to identify the nodes that correspond to each record that
|
53
|
+
# should appear in the output, and +field_maps+ to populate each field
|
54
|
+
# in each row.
|
55
|
+
#
|
56
|
+
# @param source [String|Array] may be a String containing XML content,
|
57
|
+
# an Array of paths to files containing XML content, or a path to
|
58
|
+
# a single file. A parsed Nokogiri::XML::Document is also accepted.
|
59
|
+
# @param rec_xpath [String] An XPath expression that selects all the
|
60
|
+
# items in the XML document that are to be converted into new rows.
|
61
|
+
# The returned items are not directly used to populate the fields,
|
62
|
+
# but provide a context for the field XPath expressions that populate
|
63
|
+
# each field's content.
|
64
|
+
# @param field_maps [Hash<String, String>] A map of field names to
|
65
|
+
# expressions that are evaluated in the context of each row node
|
66
|
+
# selected by +rec_xpath+. The field expressions are typically XPath
|
67
|
+
# expressions evaluated in the context of the nodes returned by the
|
68
|
+
# +rec_xpath+. Alternatively, a String that is not an XPath expression
|
69
|
+
# is used as a literal value for a field, while a Regexp can also
|
70
|
+
# be used to pull a value from any context specified in the +options+
|
71
|
+
# hash. The Regexp should include a single grouping, as the value used
|
72
|
+
# will be the result in $1 after the match is performed.
|
73
|
+
# @param context [String] An optional context for the XML to be processed.
|
74
|
+
# The value passed here can be referenced in field map expressions
|
75
|
+
# using a Regexp, with the value of the first grouping in the regex
|
76
|
+
# being the value returned for the field.
|
77
|
+
def process(source, rec_xpath, field_maps, context = nil)
|
78
|
+
@field_names = field_maps.keys unless @field_names
|
79
|
+
case source
|
80
|
+
when Nokogiri::XML::Document
|
81
|
+
add_data(source, rec_xpath, field_maps, context || @context)
|
82
|
+
when /<\?xml/
|
83
|
+
doc = Nokogiri::XML(source)
|
84
|
+
add_data(doc, rec_xpath, field_maps, context || @context)
|
85
|
+
when Array
|
86
|
+
source.each{ |f| process_file(f, rec_xpath, field_maps) }
|
87
|
+
when String
|
88
|
+
process_file(source, rec_xpath, field_maps)
|
89
|
+
else
|
90
|
+
raise ArgumentError, "Unhandled source type #{source.class.name}"
|
91
|
+
end
|
92
|
+
@data
|
93
|
+
end
|
94
|
+
|
95
|
+
|
96
|
+
private
|
97
|
+
|
98
|
+
|
99
|
+
# Load the XML document at +file_path+ and process it into rows of data.
|
100
|
+
def process_file(file_path, rec_xpath, field_maps)
|
101
|
+
begin
|
102
|
+
File.open(file_path) do |f|
|
103
|
+
doc = Nokogiri::XML(f)
|
104
|
+
add_data(doc, rec_xpath, field_maps, @context || file_path)
|
105
|
+
end
|
106
|
+
rescue
|
107
|
+
STDERR.puts "An error occurred while attempting to open #{file_path}"
|
108
|
+
raise
|
109
|
+
end
|
110
|
+
end
|
111
|
+
|
112
|
+
|
113
|
+
# Locate records in +doc+ using +rec_xpath+ to identify the nodes that
|
114
|
+
# correspond to a new record in the data, and +field_maps+ to populate
|
115
|
+
# the fields in each row.
|
116
|
+
def add_data(doc, rec_xpath, field_maps, context)
|
117
|
+
doc.xpath(rec_xpath).each do |rec_node|
|
118
|
+
rec = []
|
119
|
+
field_maps.each do |field_name, expr|
|
120
|
+
case expr
|
121
|
+
when Regexp # Match context against Regexp and extract first grouping
|
122
|
+
if context
|
123
|
+
context =~ expr
|
124
|
+
rec << $1
|
125
|
+
else
|
126
|
+
rec << nil
|
127
|
+
end
|
128
|
+
when %r{[/(.@]} # XPath expression
|
129
|
+
res = rec_node.xpath(expr)
|
130
|
+
rec << CGI.unescape_html(res.to_s)
|
131
|
+
else # Use expr as the value for this field
|
132
|
+
rec << expr
|
133
|
+
end
|
134
|
+
end
|
135
|
+
@data << rec
|
136
|
+
end
|
137
|
+
end
|
138
|
+
|
139
|
+
end
|
140
|
+
|
141
|
+
end
|
142
|
+
|
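The way add_data above chooses between a context Regexp, an XPath expression, and a literal value can be illustrated without Nokogiri. This is a minimal sketch under stated assumptions: `classify_field_expr` and `value_from_context` are hypothetical helpers, not part of the gem, and the sample field map and context string are invented for illustration. Only the dispatch rule shown in the diff is reproduced.

```ruby
# Hypothetical helper (not part of csv-diff): classifies a field map
# expression using the same tests as the case statement in add_data.
def classify_field_expr(expr)
  case expr
  when Regexp then :context_regexp   # matched against the context string
  when %r{[/(.@]} then :xpath        # any String containing / ( . or @
  else :literal                      # used as the field value as-is
  end
end

# Hypothetical helper: returns the first capture group of +regexp+ when
# matched against +context+, or nil if there is no context or no match.
def value_from_context(context, regexp)
  return nil unless context
  match = regexp.match(context)
  match && match[1]
end

# Invented sample field map and context, for illustration only.
field_maps = {
  'Id'   => '@id',
  'Name' => './name/text()',
  'Type' => 'Widget',
  'Env'  => /reports_(\w+)\.xml/
}
context = 'exports/reports_prod.xml'

classified = field_maps.transform_values { |expr| classify_field_expr(expr) }
env = value_from_context(context, field_maps['Env'])

puts classified.inspect
puts env
```

One consequence of this rule is that a literal containing any of the characters `/`, `(`, `.` or `@` (for example `v1.0`) is classified as an XPath expression rather than a literal.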
metadata
CHANGED
@@ -1,69 +1,80 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: csv-diff
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
5
|
-
prerelease:
|
4
|
+
version: 0.6.1
|
6
5
|
platform: ruby
|
7
6
|
authors:
|
8
7
|
- Adam Gardiner
|
9
8
|
autorequire:
|
10
9
|
bindir: bin
|
11
10
|
cert_chain: []
|
12
|
-
date:
|
11
|
+
date: 2020-10-21 00:00:00.000000000 Z
|
13
12
|
dependencies: []
|
14
|
-
description:
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
13
|
+
description: |2
|
14
|
+
This library performs diffs of CSV data, or any table-like source.
|
15
|
+
|
16
|
+
Unlike a standard diff that compares line by line, and is sensitive to the
|
17
|
+
ordering of records, CSV-Diff identifies common lines by key field(s), and
|
18
|
+
then compares the contents of the fields in each line.
|
19
|
+
|
20
|
+
Data may be supplied in the form of CSV files, or as an array of arrays. The
|
21
|
+
diff process provides a fine level of control over what to diff, and can
|
22
|
+
optionally ignore certain types of changes (e.g. changes in position).
|
23
|
+
|
24
|
+
CSV-Diff is particularly well suited to data in parent-child format. Parent-
|
25
|
+
child data does not lend itself well to standard text diffs, as small changes
|
26
|
+
in the organisation of the tree at an upper level can lead to big movements
|
27
|
+
in the position of descendant records. By instead matching records by key,
|
28
|
+
CSV-Diff avoids this issue, while still being able to detect changes in
|
29
|
+
sibling order.
|
30
|
+
|
31
|
+
This gem implements the core diff algorithm, and handles the loading and
|
32
|
+
diffing of CSV files (or Arrays of Arrays). It also supports converting
|
33
|
+
data in XML format into tabular form, so that it can then be processed
|
34
|
+
like any other CSV or table-like source. It returns a CSVDiff object
|
35
|
+
containing the details of differences in object form. This is useful for
|
36
|
+
projects that need diff capability, but want to handle the reporting or
|
37
|
+
actioning of differences themselves.
|
38
|
+
|
39
|
+
For a pre-built diff reporting capability, see the csv-diff-report gem,
|
40
|
+
which provides a command-line tool for generating diff reports in HTML,
|
41
|
+
Excel, or text formats.
|
33
42
|
email: adam.b.gardiner@gmail.com
|
34
43
|
executables: []
|
35
44
|
extensions: []
|
36
45
|
extra_rdoc_files: []
|
37
46
|
files:
|
38
|
-
- README.md
|
39
47
|
- LICENSE
|
48
|
+
- README.md
|
49
|
+
- lib/csv-diff.rb
|
40
50
|
- lib/csv-diff/algorithm.rb
|
41
51
|
- lib/csv-diff/csv_diff.rb
|
42
52
|
- lib/csv-diff/csv_source.rb
|
43
|
-
- lib/csv-diff.rb
|
53
|
+
- lib/csv-diff/source.rb
|
54
|
+
- lib/csv-diff/xml_source.rb
|
44
55
|
- lib/csv_diff.rb
|
45
56
|
homepage: https://github.com/agardiner/csv-diff
|
46
|
-
licenses:
|
57
|
+
licenses:
|
58
|
+
- MIT
|
59
|
+
metadata: {}
|
47
60
|
post_install_message: For command-line tools and diff reports, 'gem install csv-diff-report'
|
48
61
|
rdoc_options: []
|
49
62
|
require_paths:
|
50
63
|
- lib
|
51
64
|
required_ruby_version: !ruby/object:Gem::Requirement
|
52
|
-
none: false
|
53
65
|
requirements:
|
54
|
-
- -
|
66
|
+
- - ">="
|
55
67
|
- !ruby/object:Gem::Version
|
56
68
|
version: '0'
|
57
69
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
58
|
-
none: false
|
59
70
|
requirements:
|
60
|
-
- -
|
71
|
+
- - ">="
|
61
72
|
- !ruby/object:Gem::Version
|
62
73
|
version: '0'
|
63
74
|
requirements: []
|
64
75
|
rubyforge_project:
|
65
|
-
rubygems_version:
|
76
|
+
rubygems_version: 2.5.2.3
|
66
77
|
signing_key:
|
67
|
-
specification_version:
|
68
|
-
summary: CSV Diff is a library for generating diffs from data in CSV format
|
78
|
+
specification_version: 4
|
79
|
+
summary: CSV Diff is a library for generating diffs from data in CSV or XML format
|
69
80
|
test_files: []
|