csv-diff 0.1
- data/LICENSE +22 -0
- data/README.md +222 -0
- data/lib/csv-diff.rb +4 -0
- data/lib/csv-diff/algorithm.rb +124 -0
- data/lib/csv-diff/csv_diff.rb +142 -0
- data/lib/csv-diff/csv_source.rb +151 -0
- data/lib/csv_diff.rb +2 -0
- metadata +62 -0
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2013, Adam Gardiner
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md
ADDED
@@ -0,0 +1,222 @@
+# CSV-Diff
+
+CSV-Diff is a small library for performing diffs of CSV data.
+
+Unlike a standard diff that compares line by line, and is sensitive to the
+ordering of records, CSV-Diff identifies common lines by key field(s), and
+then compares the contents of the fields in each line.
+
+Data may be supplied in the form of CSV files, or as an array of arrays. The
+diff process provides a fine level of control over what to diff, and can
+optionally ignore certain types of changes (e.g. changes in position).
+
+CSV-Diff is particularly well suited to data in parent-child format. Parent-
+child data does not lend itself well to standard text diffs, as small changes
+in the organisation of the tree at an upper level can lead to big movements
+in the position of descendant records. By instead matching records by key,
+CSV-Diff avoids this issue, while still being able to detect changes in
+sibling order.
+
+
+## Usage
+
+CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
+```
+gem install csv-diff
+```
+
+To compare two CSV files where the field names are in the first row of the file,
+and the first field contains the unique key for each record, simply use:
+```ruby
+require 'csv-diff'
+
+diff = CSVDiff.new(file1, file2)
+```
+
+The returned diff object can be queried for the differences that exist between
+the two files, e.g.:
+```ruby
+puts diff.summary.inspect   # Summary of the adds, deletes, updates, and moves
+puts diff.adds.inspect      # Details of the additions to file2
+puts diff.deletes.inspect   # Details of the deletions from file1
+puts diff.updates.inspect   # Details of the updates from file1 to file2
+puts diff.moves.inspect     # Details of the moves from file1 to file2
+puts diff.diffs.inspect     # Details of all differences
+puts diff.warnings.inspect  # Any warnings generated during the diff process
+```
+
+## Unique Row Identifiers
+
+CSVDiff is preferable to a standard line-by-line diff when row order is
+significantly impacted by small changes. The classic example is a parent-child
+file generated by a hierarchy traversal. A simple change in position of a parent
+member near the root of the hierarchy will have a large impact on the positions
+of all descendant rows. Consider the following example:
+```
+Root
+  |- A
+  |   |- A1
+  |   |- A2
+  |
+  |- B
+      |- B1
+      |- B2
+```
+
+A hierarchy traversal of this tree into a parent-child format would generate a CSV
+as follows:
+```
+Root,A
+A,A1
+A,A2
+Root,B
+B,B1
+B,B2
+```
+
+If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV
+as follows:
+```
+Root,B
+B,B1
+B,B2
+Root,A
+A,A1
+A,A2
+```
+
+A simple diff using a diff utility would highlight this as 3 additions and 3 deletions.
+CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).
+
+In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row.
+In this example, we could use the child field alone (since each member name only appears
+once); however, this would imply a flat structure, where all rows are children of a single
+parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a
+change in order) of all 6 rows.
+
+The more correct specification of this file is that column 0 contains a unique parent
+identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
+deduce that there are in fact only two changes in order - the swap in positions of A and
+B below Root.
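The difference between the two key choices can be checked with a small standalone sketch. It does not use the gem; `count_moves`, `TREE_BEFORE` and `TREE_AFTER` are hypothetical names, and the position comparison is only modelled on the sibling check in `lib/csv-diff/algorithm.rb`.

```ruby
# Illustrative sketch only (not the gem's code): count rows whose position
# among their siblings differs between two versions of a parent-child file.
# With use_parent = false every row is treated as a child of one flat parent.
def count_moves(left_rows, right_rows, use_parent)
  build_index = lambda do |rows|
    index = Hash.new { |h, k| h[k] = [] }
    rows.each do |parent, child|
      parent_key = use_parent ? parent : ''
      key = use_parent ? "#{parent}~#{child}" : child
      index[parent_key] << key
    end
    index
  end
  left_index = build_index.call(left_rows)
  right_index = build_index.call(right_rows)

  moves = 0
  right_index.each do |parent_key, right_children|
    left_children = left_index.fetch(parent_key, [])
    # Compare positions only among the rows present in both versions.
    left_common = left_children & right_children
    right_common = right_children & left_children
    right_common.each_with_index do |key, pos|
      moves += 1 if left_common.index(key) != pos
    end
  end
  moves
end

TREE_BEFORE = [%w[Root A], %w[A A1], %w[A A2], %w[Root B], %w[B B1], %w[B B2]]
TREE_AFTER  = [%w[Root B], %w[B B1], %w[B B2], %w[Root A], %w[A A1], %w[A A2]]

puts count_moves(TREE_BEFORE, TREE_AFTER, true)   # parent + child key
puts count_moves(TREE_BEFORE, TREE_AFTER, false)  # child-only (flat) key
```

With the parent/child key only the two rows under Root change sibling position; with the flat key all six rows do.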
+
+Note: If you aren't interested in changes in the order of siblings, then you could use
+CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.
+
+## Warnings
+
+When processing and diffing files, CSVDiff may encounter problems with the data or
+the specifications it has been given. It will continue even in the face of problems,
+but will log details of the problems in a #warnings Array. The number of warnings
+will also be included in the Hash returned by the #summary method.
+
+Warnings may be raised for any of the following:
+* Missing fields: If the right/to file contains fields that are not present in the
+  left/from file, a warning is raised and the field is ignored for diff purposes.
+* Duplicate keys: If two rows are found that have the same values for the key field(s),
+  a warning is raised, and the later duplicate row is ignored.
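The duplicate-key behaviour can be sketched without the gem. The method below is modelled on `CSVSource#index_source` in this release; the name `index_rows` and its return shape are assumptions made for illustration, and unlike the gem it does not count a header row when numbering lines.

```ruby
# Sketch modelled on CSVSource#index_source; index_rows is a hypothetical
# helper, not part of the gem's API.
def index_rows(rows, key_cols)
  lines = {}
  warnings = []
  rows.each_with_index do |row, i|
    # Key values are upper-cased before joining, so 'a1' and 'A1' collide.
    key = key_cols.map { |col| row[col].to_s.upcase }.join('~')
    if lines.key?(key)
      warnings << "Duplicate key '#{key}' encountered and ignored at line #{i + 1}"
    else
      lines[key] = row
    end
  end
  [lines, warnings]
end

rows = [%w[A A1 first], %w[A a1 second], %w[A A2 third]]
lines, warnings = index_rows(rows, [0, 1])
puts lines.keys.inspect
puts warnings
```

The first row with a given key is kept and later rows with the same key only produce a warning. Because key values are upper-cased, keys that differ only in case count as duplicates.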
+
+
+## Examples
+
+The simplest use case is as shown above, where the data to be diffed is in CSV files
+with the column names as the first record, and where the unique key is the first
+column in the data. In this case, a diff can be created simply via:
+```ruby
+diff = CSVDiff.new(file1, file2)
+```
+
+### Specifying Unique Row Identifiers
+
+Often, however, rows are not uniquely identifiable via the first column in the file.
+In a parent-child hierarchy, for example, combinations of parent and child may be
+necessary to uniquely identify a row. In these cases, it is necessary to indicate
+which fields are used to uniquely identify common rows across the two files. This
+can be done in several different ways.
+
+1. Using the :key_fields option with field numbers (these are 0-based):
+
+    ```ruby
+    diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
+    ```
+
+2. Using the :key_fields option with column names:
+
+    ```ruby
+    diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
+    ```
+
+3. Using the :parent_field and :child_fields options with field numbers:
+
+    ```ruby
+    diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
+    ```
+
+4. Using the :parent_field and :child_fields options with column names:
+
+    ```ruby
+    diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
+    ```
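How these four option styles reduce to parent and child fields can be seen by lifting the option handling out of `CSVSource#initialize` (shown later in this diff) into a standalone method. `resolve_key_fields` is a hypothetical name used only for this sketch; the branching mirrors the released code.

```ruby
# Standalone copy of the option handling in CSVSource#initialize.
# resolve_key_fields is a hypothetical helper name, not part of the gem.
def resolve_key_fields(options = {})
  if (kf = options.fetch(:key_field, options[:key_fields]))
    # With :key_field(s), the last key field is the child and any
    # preceding fields identify the parent.
    key_fields = [kf].flatten
    parent_fields = key_fields[0...-1]
    child_fields = key_fields[-1..-1]
  else
    # Otherwise parent and child are taken from their own options;
    # the child defaults to the first column (index 0).
    parent_fields = [options.fetch(:parent_field, options.fetch(:parent_fields, []))].flatten
    child_fields = [options.fetch(:child_field, options.fetch(:child_fields, [0]))].flatten
    key_fields = parent_fields + child_fields
  end
  { key: key_fields, parent: parent_fields, child: child_fields }
end

p resolve_key_fields(key_fields: ['Parent', 'Child'])
p resolve_key_fields(parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
p resolve_key_fields
```

With no key options at all, the key is column 0 and there is no parent, which is the flat case described under Unique Row Identifiers.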
+
+### Using Non-CSV File Sources
+
+Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
+of Arrays:
+```ruby
+DATA1 = [
+    ['Parent', 'Child', 'Description'],
+    ['A', 'A1', 'Account 1'],
+    ['A', 'A2', 'Account 2']
+]
+
+DATA2 = [
+    ['Parent', 'Child', 'Description'],
+    ['A', 'A1', 'Account1'],
+    ['A', 'A2', 'Account2']
+]
+
+diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
+```
+
+### Specifying Column Names
+
+If your data file does not include column headers, you can specify the names of
+each column when creating the diff. The names supplied are the keys used in the
+diff results:
+
+```ruby
+DATA1 = [
+    ['A', 'A1', 'Account 1'],
+    ['A', 'A2', 'Account 2']
+]
+
+DATA2 = [
+    ['A', 'A1', 'Account1'],
+    ['A', 'A2', 'Account2']
+]
+
+diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])
+```
+
+If your data file does contain a header row, but you wish to use your own column
+names, you can specify the :field_names option and the :ignore_header option to
+ignore the first row.
+
+
+### Ignoring Fields
+
+If your data contains fields that you aren't interested in, these can be excluded
+from the diff process using the :ignore_fields option:
+```ruby
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   ignore_fields: ['CreatedAt', 'UpdatedAt'])
+```
+
+### Ignoring Certain Changes
+
+CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
+ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:
+```ruby
+diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
+                   ignore_fields: ['CreatedAt', 'UpdatedAt'],
+                   ignore_deletes: true, ignore_moves: true)
+```
data/lib/csv-diff.rb
ADDED
data/lib/csv-diff/algorithm.rb
ADDED
@@ -0,0 +1,124 @@
+class CSVDiff
+
+    # Implements the CSV diff algorithm.
+    module Algorithm
+
+        # Diffs two CSVSource structures.
+        #
+        # @param left [CSVSource] A CSVSource object containing the contents of
+        #   the left/from input.
+        # @param right [CSVSource] A CSVSource object containing the contents of
+        #   the right/to input.
+        # @param key_fields [Array] An array containing the names of the field(s)
+        #   that uniquely identify each row.
+        # @param diff_fields [Array] An array containing the names of the fields
+        #   to be diff-ed.
+        def diff_sources(left, right, key_fields, diff_fields, options = {})
+            left_index = left.index
+            left_values = left.lines
+            left_keys = left_values.keys
+            right_index = right.index
+            right_values = right.lines
+            right_keys = right_values.keys
+            parent_fields = left.parent_fields.length
+
+            include_adds = !options[:ignore_adds]
+            include_moves = !options[:ignore_moves]
+            include_updates = !options[:ignore_updates]
+            include_deletes = !options[:ignore_deletes]
+
+            diffs = Hash.new{ |h, k| h[k] = {} }
+            right_keys.each_with_index do |key, right_row_id|
+                key_vals = key.split('~')
+                parent = key_vals[0...parent_fields].join('~')
+                child = key_vals[parent_fields..-1].join('~')
+                left_parent = left_index[parent]
+                right_parent = right_index[parent]
+                left_value = left_values[key]
+                right_value = right_values[key]
+                left_idx = left_parent && left_parent.index(key)
+                right_idx = right_parent && right_parent.index(key)
+
+                id = {}
+                id[:row] = right_row_id + 1
+                id[:sibling_position] = right_idx + 1
+                key_fields.each do |field_name|
+                    id[field_name] = right_value[field_name]
+                end
+                if left_idx && right_idx
+                    if include_moves
+                        left_common = left_parent & right_parent
+                        right_common = right_parent & left_parent
+                        left_pos = left_common.index(key)
+                        right_pos = right_common.index(key)
+                        if left_pos != right_pos
+                            # Move
+                            diffs[key].merge!(id.merge!(:action => 'Move',
+                                :sibling_position => [left_idx + 1, right_idx + 1]))
+                            #puts "Move #{left_idx} -> #{right_idx}: #{key}"
+                        end
+                    end
+                    if include_updates && (changes = diff_row(left_values[key], right_values[key], diff_fields))
+                        diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
+                        #puts "Change: #{key}"
+                    end
+                elsif include_adds && right_idx
+                    # Add
+                    diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
+                    #puts "Add: #{key}"
+                end
+            end
+
+            # Now identify deletions
+            if include_deletes
+                (left_keys - right_keys).each do |key|
+                    # Delete
+                    key_vals = key.split('~')
+                    parent = key_vals[0...parent_fields].join('~')
+                    child = key_vals[parent_fields..-1].join('~')
+                    left_parent = left_index[parent]
+                    left_value = left_values[key]
+                    left_idx = left_parent.index(key)
+                    next unless left_idx
+                    id = {}
+                    id[:row] = left_keys.index(key) + 1
+                    id[:sibling_position] = left_idx + 1
+                    key_fields.each do |field_name|
+                        id[field_name] = left_value[field_name]
+                    end
+                    diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
+                    #puts "Delete: #{key}"
+                end
+            end
+            diffs
+        end
+
+
+        # Identifies the fields that are different between two versions of the
+        # same row.
+        #
+        # @param left_row [Hash] The version of the CSV row from the left/from
+        #   file.
+        # @param right_row [Hash] The version of the CSV row from the right/to
+        #   file.
+        # @return [Hash<String, Array>] A Hash whose keys are the fields that
+        #   contain differences, and whose values are a two-element array of
+        #   [left/from, right/to] values.
+        def diff_row(left_row, right_row, fields)
+            diffs = {}
+            fields.each do |attr|
+                right_val = right_row[attr]
+                right_val = nil if right_val == ""
+                left_val = left_row[attr]
+                left_val = nil if left_val == ""
+                if left_val != right_val
+                    diffs[attr] = [left_val, right_val]
+                    #puts "#{attr}: #{left_val} -> #{right_val}"
+                end
+            end
+            diffs if diffs.size > 0
+        end
+
+    end
+
+end
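The field comparison in `diff_row` can be tried on its own. The sketch below copies its logic into a top-level method so it runs without the gem; the sample rows are invented for illustration.

```ruby
# Standalone copy of the comparison logic in Algorithm#diff_row.
# Empty strings are normalised to nil before comparing, so '' and nil
# are treated as equal.
def diff_row(left_row, right_row, fields)
  diffs = {}
  fields.each do |attr|
    left_val = left_row[attr]
    left_val = nil if left_val == ''
    right_val = right_row[attr]
    right_val = nil if right_val == ''
    diffs[attr] = [left_val, right_val] if left_val != right_val
  end
  # nil (not an empty Hash) signals "no differences" to the caller.
  diffs unless diffs.empty?
end

left_row  = { 'Child' => 'A1', 'Description' => 'Account 1', 'Notes' => '' }
right_row = { 'Child' => 'A1', 'Description' => 'Account1',  'Notes' => nil }

p diff_row(left_row, right_row, ['Description', 'Notes'])
p diff_row(left_row, left_row, ['Description', 'Notes'])
```

Only `Description` is reported for the first call, as a `[left, right]` pair. The second call returns `nil`, which is why `diff_sources` can use the result directly as the condition for recording an Update.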
data/lib/csv-diff/csv_diff.rb
ADDED
@@ -0,0 +1,142 @@
+# This library performs diffs of flat file content that contains structured data
+# in fields, with rows provided in a parent-child format.
+#
+# Parent-child data does not lend itself well to standard text diffs, as small
+# changes in the organisation of the tree at an upper level (e.g. re-ordering of
+# two ancestor nodes) can lead to big movements in the position of descendant
+# records - particularly when the parent-child data is generated by a hierarchy
+# traversal.
+#
+# Additionally, simple line-based diffs can identify that a line has changed,
+# but not which field(s) in the line have changed.
+#
+# Data may be supplied in the form of CSV files, or as an array of arrays. The
+# diff process provides a fine level of control over what to diff, and
+# can optionally ignore certain types of changes (e.g. changes in order).
+class CSVDiff
+
+    # @return [CSVSource] CSVSource object containing details of the left/from
+    #   input.
+    attr_reader :left
+    alias_method :from, :left
+    # @return [CSVSource] CSVSource object containing details of the right/to
+    #   input.
+    attr_reader :right
+    alias_method :to, :right
+    # @return [Hash<String, Hash>] A hash of differences, keyed by row key
+    attr_reader :diffs
+    # @return [Array<String>] An array of field names that are compared in the
+    #   diff process.
+    attr_reader :diff_fields
+    # @return [Array<Fixnum>] An array of field indexes identifying the key
+    #   fields that uniquely identify each row.
+    attr_reader :key_fields
+    # @return [Array<String>] An array of field names for the parent field(s).
+    attr_reader :parent_fields
+    # @return [Array<String>] An array of field names for the child field(s).
+    attr_reader :child_fields
+
+
+    # Generates a diff between two hierarchical tree structures, provided
+    # as +left+ and +right+, each of which consists of an array of lines in CSV
+    # format.
+    # An array of field indexes can also be specified as +key_fields+;
+    # a minimum of one field index must be specified; the last index is the
+    # child id, and the remaining fields (if any) are the parent field(s) that
+    # uniquely qualify the child instance.
+    #
+    # @param left [Array<Array<String>>] An Array of lines, each of which is in
+    #   turn an Array containing fields.
+    # @param right [Array<Array<String>>] An Array of lines, each of which is in
+    #   turn an Array containing fields.
+    # @param options [Hash] A hash containing options.
+    # @option options [Array<String>] :field_names An Array of field names for
+    #   each field in +left+ and +right+. If not provided, the first row is
+    #   assumed to contain field names.
+    # @option options [Boolean] :ignore_header If true, the first line of each
+    #   file is ignored. This option can only be true if :field_names is
+    #   specified.
+    # @option options [Array] :ignore_fields The names of any fields to be
+    #   ignored when performing the diff.
+    # @option options [String] :key_field The name of the field that uniquely
+    #   identifies each row.
+    # @option options [Array<String>] :key_fields The names of the fields
+    #   that uniquely identify each row.
+    # @option options [String] :parent_field The name of the field that
+    #   identifies a parent within which sibling order should be checked.
+    # @option options [String] :child_field The name of the field that
+    #   uniquely identifies a child of a parent.
+    # @option options [Boolean] :ignore_adds If true, records that appear in
+    #   the right/to file but not in the left/from file are not reported.
+    # @option options [Boolean] :ignore_updates If true, records that have been
+    #   updated are not reported.
+    # @option options [Boolean] :ignore_moves If true, changes in row position
+    #   amongst sibling rows are not reported.
+    # @option options [Boolean] :ignore_deletes If true, records that appear
+    #   in the left/from file but not in the right/to file are not reported.
+    def initialize(left, right, options = {})
+        @left = CSVSource.new(left, options)
+        raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
+        @right = CSVSource.new(right, options)
+        raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
+        @warnings = []
+        @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options.fetch(:ignore_fields, []))
+        @key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
+        diff(options)
+    end
+
+
+    # Performs a diff with the specified +options+.
+    def diff(options = {})
+        @summary = nil
+        @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
+    end
+
+
+    # Returns a summary of the number of adds, deletes, moves, and updates.
+    def summary
+        unless @summary
+            @summary = Hash.new{ |h, k| h[k] = 0 }
+            @diffs.each{ |k, v| @summary[v[:action]] += 1 }
+            @summary['Warnings'] = warnings.size if warnings.size > 0
+        end
+        @summary
+    end
+
+
+    [:adds, :deletes, :updates, :moves].each do |mthd|
+        define_method mthd do
+            action = mthd.to_s.chomp('s')
+            @diffs.select{ |k, v| v[:action].downcase == action }
+        end
+    end
+
+
+    # @return [Array<String>] an array of warning messages generated during the
+    #   diff process.
+    def warnings
+        @left.warnings + @right.warnings + @warnings
+    end
+
+
+    private
+
+
+    # Given two sets of field names, determines the common set of fields present
+    # in both, on which members can be diffed.
+    def get_diff_fields(left_fields, right_fields, ignore_fields)
+        diff_fields = []
+        right_fields.each do |fld|
+            if left_fields.include?(fld)
+                diff_fields << fld unless ignore_fields.include?(fld)
+            else
+                @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
+            end
+        end
+        diff_fields
+    end
+
+
+    include Algorithm
+
+end
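A short standalone sketch of how `#summary` and the generated `#adds`/`#deletes`/`#updates`/`#moves` methods read the diffs Hash. `SAMPLE_DIFFS` is a hand-built stand-in whose keys and values are invented for illustration, and `summarise` and `select_action` are hypothetical helper names; only the counting and filtering logic is taken from the class above.

```ruby
# Hand-built stand-in for the Hash that diff_sources returns
# (row key => details, with the change type under :action).
SAMPLE_DIFFS = {
  'ROOT~A' => { :action => 'Move', :sibling_position => [1, 2] },
  'ROOT~B' => { :action => 'Move', :sibling_position => [2, 1] },
  'A~A1'   => { :action => 'Update', 'Description' => ['Account 1', 'Account1'] },
  'B~B3'   => { :action => 'Add', 'Description' => 'Account 3' }
}

# Same counting approach as CSVDiff#summary: a Hash that defaults to 0.
def summarise(diffs)
  summary = Hash.new { |h, k| h[k] = 0 }
  diffs.each { |_key, diff| summary[diff[:action]] += 1 }
  summary
end

# Same filter as the generated accessor methods: the method name minus its
# trailing 's' is compared with the lower-cased action.
def select_action(diffs, method_name)
  action = method_name.to_s.chomp('s')
  diffs.select { |_key, diff| diff[:action].downcase == action }
end

p summarise(SAMPLE_DIFFS)
p select_action(SAMPLE_DIFFS, :moves).keys
```

Each row key carries a single `:action`. Reading `diff_sources`, a row that is both moved and updated appears to end up with the action `'Update'` (the later merge overwrites `'Move'`), so it would be counted under updates rather than moves.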
data/lib/csv-diff/csv_source.rb
ADDED
@@ -0,0 +1,151 @@
+class CSVDiff
+
+    # Represents a CSV input (i.e. the left/from or right/to input) to the diff
+    # process.
+    class CSVSource
+
+        # @return [String] the path to the source file
+        attr_accessor :path
+        # @return [Array<String>] The names of the fields in the source file
+        attr_reader :field_names
+        # @return [Array<String>] The names of the field(s) that uniquely
+        #   identify each row.
+        attr_reader :key_fields
+        # @return [Array<String>] The names of the field(s) that identify a
+        #   common parent of child records.
+        attr_reader :parent_fields
+        # @return [Array<String>] The names of the field(s) that distinguish a
+        #   child of a parent record.
+        attr_reader :child_fields
+        # @return [Hash<String,Hash>] A hash containing each line of the source,
+        #   keyed on the values of the +key_fields+.
+        attr_reader :lines
+        # @return [Hash<String,Array<String>>] A hash containing each parent key,
+        #   and an Array of the child keys it is a parent of.
+        attr_reader :index
+        # @return [Array<String>] An array of any warnings encountered while
+        #   processing the source.
+        attr_reader :warnings
+
+
+        # Creates a new diff source.
+        #
+        # A diff source must contain at least one field that will be used as the
+        # key to identify the same record in a different version of this file.
+        # If not specified via one of the options, the first field is assumed to
+        # be the unique key.
+        #
+        # If multiple fields combine to form a unique key, the parent is assumed
+        # to be identified by all but the last field of the unique key. If finer
+        # control is required, use a combination of the :parent_fields and
+        # :child_fields options.
+        #
+        # All key options can be specified either by field name, or by field
+        # index (0 based).
+        #
+        # @param source [String|Array<Array>] Either a path to a CSV file, or an
+        #   Array of Arrays containing CSV data. If the :field_names option is
+        #   not specified, the first line must contain the names of the fields.
+        # @param options [Hash] An options hash.
+        # @option options [String] :mode_string The mode to use when opening the
+        #   CSV file. Defaults to 'r'.
+        # @option options [Hash] :csv_options Any options you wish to pass to
+        #   CSV.open, e.g. :col_sep.
+        # @option options [Array<String>] :field_names The names of each of the
+        #   fields in +source+.
+        # @option options [Boolean] :ignore_header If true, and :field_names has
+        #   been specified, then the first row of the file is ignored.
+        # @option options [String] :key_field The name of the field that uniquely
+        #   identifies each row.
+        # @option options [Array<String>] :key_fields The names of the fields
+        #   that uniquely identify each row.
+        # @option options [String] :parent_field The name of the field that
+        #   identifies a parent within which sibling order should be checked.
+        # @option options [String] :child_field The name of the field that
+        #   uniquely identifies a child of a parent.
+        def initialize(source, options = {})
+            if source.is_a?(String)
+                require 'csv'
+                mode_string = options.fetch(:mode_string, 'r')
+                csv_options = options.fetch(:csv_options, {})
+                @path = source
+                source = CSV.open(@path, mode_string, csv_options).readlines
+            end
+            if kf = options.fetch(:key_field, options[:key_fields])
+                @key_fields = [kf].flatten
+                @parent_fields = @key_fields[0...-1]
+                @child_fields = @key_fields[-1..-1]
+            else
+                @parent_fields = [options.fetch(:parent_field, options.fetch(:parent_fields, []))].flatten
+                @child_fields = [options.fetch(:child_field, options.fetch(:child_fields, [0]))].flatten
+                @key_fields = @parent_fields + @child_fields
+            end
+            @field_names = options[:field_names]
+            @warnings = []
+            index_source(source, options)
+        end
+
+
+        # Returns the row in the CSV source corresponding to the supplied key.
+        #
+        # @param key [String] The unique key to use to lookup the row.
+        # @return [Hash] The fields for the line corresponding to +key+, or nil
+        #   if the key is not recognised.
+        def [](key)
+            @lines[key]
+        end
+
+
+        private
+
+        # Given an array of lines, where each line is an array of fields, indexes
+        # the array contents so that it can be looked up by key.
+        def index_source(lines, options)
+            @lines = {}
+            @index = Hash.new{ |h, k| h[k] = [] }
+            @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
+            line_num = 0
+            lines.each do |row|
+                line_num += 1
+                next if line_num == 1 && @field_names && options[:ignore_header]
+                unless @field_names
+                    @field_names = row
+                    @key_fields = find_field_indexes(@key_fields, @field_names)
+                    next
+                end
+                field_vals = row
+                line = {}
+                @field_names.each_with_index do |field, i|
+                    line[field] = field_vals[i]
+                end
+                key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
+                key = key_values.join('~')
+                parent_key = key_values[0...(@parent_fields.length)].join('~')
+                if @lines[key]
+                    @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
+                else
+                    @index[parent_key] << key
+                    @lines[key] = line
+                end
+            end
+        end
+
+
+        # Converts an array of field names to an array of indexes of the fields
+        # matching those names.
+        def find_field_indexes(key_fields, field_names)
+            key_fields.map do |field|
+                if field.is_a?(Integer)
+                    field
+                else
+                    field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
+                        raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
+                            field_names.join(', ')}"
+                end
+            end
+        end
+
+    end
+
+end
+
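The two steps that turn key options into lookup keys, name-to-index resolution and key construction, can be exercised without the gem. The sketch below reuses the logic of `find_field_indexes` and of the key-building lines in `index_source`; `build_keys` is a hypothetical helper name introduced only for this example.

```ruby
# Resolve key fields given as names or 0-based indexes into indexes.
# Names are matched case-insensitively, as in CSVSource#find_field_indexes.
def find_field_indexes(key_fields, field_names)
  key_fields.map do |field|
    if field.is_a?(Integer)
      field
    else
      field_names.index { |name| field.to_s.downcase == name.downcase } ||
        raise(ArgumentError, "Could not locate field '#{field}' in source field names: #{field_names.join(', ')}")
    end
  end
end

# Build the row key and parent key the way index_source does: key values
# are upper-cased and joined with '~'; the parent key uses the leading
# parent_count values only. (build_keys is a hypothetical name.)
def build_keys(row, key_indexes, parent_count)
  key_values = key_indexes.map { |i| row[i].to_s.upcase }
  { key: key_values.join('~'), parent_key: key_values[0...parent_count].join('~') }
end

names = ['Parent', 'Child', 'Description']
indexes = find_field_indexes(['parent', 1], names)
p indexes
p build_keys(['a', 'a1', 'Account 1'], indexes, 1)
```

Since `'~'` is the separator and values are not escaped, key values that themselves contain `'~'` could in principle produce ambiguous keys; that is an observation about this sketch and the code it mirrors, not a documented limitation.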
data/lib/csv_diff.rb
ADDED
metadata
ADDED
@@ -0,0 +1,62 @@
+--- !ruby/object:Gem::Specification
+name: csv-diff
+version: !ruby/object:Gem::Version
+  version: '0.1'
+  prerelease:
+platform: ruby
+authors:
+- Adam Gardiner
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2014-05-30 00:00:00.000000000 Z
+dependencies: []
+description: ! " This library performs diffs of CSV files.\n\n Unlike
+  a standard diff that compares line by line, and is sensitive to the\n ordering
+  of records, CSV-Diff identifies common lines by key field(s), and\n then
+  compares the contents of the fields in each line.\n\n Data may be supplied
+  in the form of CSV files, or as an array of arrays. The\n diff process provides
+  a fine level of control over what to diff, and can\n optionally ignore certain
+  types of changes (e.g. changes in position).\n\n CSV-Diff is particularly
+  well suited to data in parent-child format. Parent-\n child data does not
+  lend itself well to standard text diffs, as small changes\n in the organisation
+  of the tree at an upper level can lead to big movements\n in the position
+  of descendant records. By instead matching records by key,\n CSV-Diff avoids
+  this issue, while still being able to detect changes in\n sibling order.\n"
+email: adam.b.gardiner@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- README.md
+- LICENSE
+- lib/csv-diff/algorithm.rb
+- lib/csv-diff/csv_diff.rb
+- lib/csv-diff/csv_source.rb
+- lib/csv-diff.rb
+- lib/csv_diff.rb
+homepage: https://github.com/agardiner/csv-diff
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.23
+signing_key:
+specification_version: 3
+summary: CSV Diff is a library for generating diffs from data in CSV format
+test_files: []