csv-diff 0.1
- data/LICENSE +22 -0
- data/README.md +222 -0
- data/lib/csv-diff.rb +4 -0
- data/lib/csv-diff/algorithm.rb +124 -0
- data/lib/csv-diff/csv_diff.rb +142 -0
- data/lib/csv-diff/csv_source.rb +151 -0
- data/lib/csv_diff.rb +2 -0
- metadata +62 -0
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2013, Adam Gardiner
|
2
|
+
All rights reserved.
|
3
|
+
|
4
|
+
Redistribution and use in source and binary forms, with or without
|
5
|
+
modification, are permitted provided that the following conditions are met:
|
6
|
+
|
7
|
+
* Redistributions of source code must retain the above copyright notice, this
|
8
|
+
list of conditions and the following disclaimer.
|
9
|
+
* Redistributions in binary form must reproduce the above copyright notice,
|
10
|
+
this list of conditions and the following disclaimer in the documentation
|
11
|
+
and/or other materials provided with the distribution.
|
12
|
+
|
13
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
14
|
+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
15
|
+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
16
|
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
|
17
|
+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
18
|
+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
19
|
+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
20
|
+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
21
|
+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
22
|
+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
data/README.md
ADDED
@@ -0,0 +1,222 @@
|
|
1
|
+
# CSV-Diff
|
2
|
+
|
3
|
+
CSV-Diff is a small library for performing diffs of CSV data.
|
4
|
+
|
5
|
+
Unlike a standard diff that compares line by line, and is sensitive to the
|
6
|
+
ordering of records, CSV-Diff identifies common lines by key field(s), and
|
7
|
+
then compares the contents of the fields in each line.
|
8
|
+
|
9
|
+
Data may be supplied in the form of CSV files, or as an array of arrays. The
|
10
|
+
diff process provides a fine level of control over what to diff, and can
|
11
|
+
optionally ignore certain types of changes (e.g. changes in position).
|
12
|
+
|
13
|
+
CSV-Diff is particularly well suited to data in parent-child format. Parent-
|
14
|
+
child data does not lend itself well to standard text diffs, as small changes
|
15
|
+
in the organisation of the tree at an upper level can lead to big movements
|
16
|
+
in the position of descendant records. By instead matching records by key,
|
17
|
+
CSV-Diff avoids this issue, while still being able to detect changes in
|
18
|
+
sibling order.
|
19
|
+
|
20
|
+
|
21
|
+
## Usage
|
22
|
+
|
23
|
+
CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
|
24
|
+
```
|
25
|
+
gem install csv-diff
|
26
|
+
```
|
27
|
+
|
28
|
+
To compare two CSV files where the field names are in the first row of the file,
|
29
|
+
and the first field contains the unique key for each record, simply use:
|
30
|
+
```ruby
|
31
|
+
require 'csv-diff'
|
32
|
+
|
33
|
+
diff = CSVDiff.new(file1, file2)
|
34
|
+
```
|
35
|
+
|
36
|
+
The returned diff object can be queried for the differences that exist between
|
37
|
+
the two files, e.g.:
|
38
|
+
```ruby
|
39
|
+
puts diff.summary.inspect # Summary of the adds, deletes, updates, and moves
|
40
|
+
puts diff.adds.inspect # Details of the additions to file2
|
41
|
+
puts diff.deletes.inspect # Details of the deletions from file1
|
42
|
+
puts diff.updates.inspect # Details of the updates from file1 to file2
|
43
|
+
puts diff.moves.inspect # Details of the moves from file1 to file2
|
44
|
+
puts diff.diffs.inspect # Details of all differences
|
45
|
+
puts diff.warnings.inspect # Any warnings generated during the diff process
|
46
|
+
```
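As a rough illustration of what the summary contains, the sketch below tallies a hash of differences by action. The `sample_diffs` data and the `summarise` helper are hypothetical stand-ins, not part of the gem; the sketch assumes each difference is a Hash carrying an `:action` of 'Add', 'Delete', 'Update' or 'Move'.

```ruby
# Hypothetical sample of differences, keyed by row key; not real gem output.
sample_diffs = {
  'A~A1' => { :action => 'Update', 'Description' => ['Account 1', 'Account1'] },
  'A~A3' => { :action => 'Add', 'Description' => 'Account 3' },
  'B~B1' => { :action => 'Delete', 'Description' => 'Branch 1' },
  'B~B2' => { :action => 'Update', 'Description' => ['Branch 2', 'Branch2'] }
}

# Count the differences per action (illustrative helper).
def summarise(diffs)
  summary = Hash.new(0)
  diffs.each_value { |details| summary[details[:action]] += 1 }
  summary
end

summary = summarise(sample_diffs)
puts summary.inspect
```

Actions that do not occur simply have no entry in the resulting Hash.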
|
47
|
+
|
48
|
+
## Unique Row Identifiers
|
49
|
+
|
50
|
+
CSVDiff is preferable to a standard line-by-line diff when row order is
|
51
|
+
significantly impacted by small changes. The classic example is a parent-child
|
52
|
+
file generated by a hierarchy traversal. A simple change in position of a parent
|
53
|
+
member near the root of the hierarchy will have a large impact on the positions
|
54
|
+
of all descendant rows. Consider the following example:
|
55
|
+
```
|
56
|
+
Root
|
57
|
+
|- A
|
58
|
+
| |- A1
|
59
|
+
| |- A2
|
60
|
+
|
|
61
|
+
|- B
|
62
|
+
|- B1
|
63
|
+
|- B2
|
64
|
+
```
|
65
|
+
|
66
|
+
A hierarchy traversal of this tree into a parent-child format would generate a CSV
|
67
|
+
as follows:
|
68
|
+
```
|
69
|
+
Root,A
|
70
|
+
A,A1
|
71
|
+
A,A2
|
72
|
+
Root,B
|
73
|
+
B,B1
|
74
|
+
B,B2
|
75
|
+
```
|
76
|
+
|
77
|
+
If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV
|
78
|
+
as follows:
|
79
|
+
```
|
80
|
+
Root,B
|
81
|
+
B,B1
|
82
|
+
B,B2
|
83
|
+
Root,A
|
84
|
+
A,A1
|
85
|
+
A,A2
|
86
|
+
```
|
87
|
+
|
88
|
+
A simple diff using a diff utility would highlight this as 3 additions and 3 deletions.
|
89
|
+
CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).
|
90
|
+
|
91
|
+
In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row.
|
92
|
+
In this example, we could use the child field alone (since each member name only appears
|
93
|
+
once); however, this would imply a flat structure, where all rows are children of a single
|
94
|
+
parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a
|
95
|
+
change in order) of all 6 rows.
|
96
|
+
|
97
|
+
The more correct specification of this file is that column 0 contains a unique parent
|
98
|
+
identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
|
99
|
+
deduce that there are in fact only two changes in order - the swap in positions of A and
|
100
|
+
B below Root.
|
101
|
+
|
102
|
+
Note: If you aren't interested in changes in the order of siblings, then you could use
|
103
|
+
CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.
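The idea of checking order only among siblings can be sketched in plain Ruby. This is a simplified illustration rather than the gem's implementation: `index_by_parent` and `sibling_moves` are hypothetical helpers, and each row is assumed to be a `[parent, child]` pair as in the example CSVs above.

```ruby
# Group child keys under their parent, preserving file order.
def index_by_parent(rows)
  index = Hash.new { |h, k| h[k] = [] }
  rows.each { |parent, child| index[parent] << child }
  index
end

# Report children whose position among the siblings common to both
# versions has changed (illustrative helper, not the gem's API).
def sibling_moves(left_rows, right_rows)
  left = index_by_parent(left_rows)
  right = index_by_parent(right_rows)
  moves = []
  right.each do |parent, right_children|
    left_children = left.fetch(parent, [])
    left_common = left_children & right_children
    right_common = right_children & left_children
    right_common.each_with_index do |child, right_pos|
      moves << [parent, child] if left_common.index(child) != right_pos
    end
  end
  moves
end

old_rows = [%w[Root A], %w[A A1], %w[A A2], %w[Root B], %w[B B1], %w[B B2]]
new_rows = [%w[Root B], %w[B B1], %w[B B2], %w[Root A], %w[A A1], %w[A A2]]

moves = sibling_moves(old_rows, new_rows)
puts moves.inspect
```

Only A and B under Root are reported; the children of A and B keep their relative order within their own parent, so they are not counted as moves.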
|
104
|
+
|
105
|
+
## Warnings
|
106
|
+
|
107
|
+
When processing and diffing files, CSVDiff may encounter problems with the data or
|
108
|
+
the specifications it has been given. It will continue even in the face of problems,
|
109
|
+
but will log details of the problems in a #warnings Array. The number of warnings
|
110
|
+
will also be included in the Hash returned by the #summary method.
|
111
|
+
|
112
|
+
Warnings may be raised for any of the following:
|
113
|
+
* Missing fields: If the right/to file contains fields that are not present in the
|
114
|
+
left/from file, a warning is raised and the field is ignored for diff purposes.
|
115
|
+
* Duplicate keys: If two rows are found that have the same values for the key field(s),
|
116
|
+
a warning is raised, and the duplicate values are ignored.
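A minimal sketch of duplicate-key handling, assuming keys are built by upper-casing the key field values and joining them with `~`. The `index_rows` helper and its sample data are hypothetical and only illustrate the behaviour described above.

```ruby
# Index rows by a composite key and collect warnings for duplicates.
# key_indexes are 0-based column positions (an assumption for this sketch).
def index_rows(rows, key_indexes)
  lines = {}
  warnings = []
  rows.each_with_index do |row, i|
    key = key_indexes.map { |k| row[k].to_s.upcase }.join('~')
    if lines.key?(key)
      # Keep the first row seen for this key and record a warning.
      warnings << "Duplicate key '#{key}' encountered and ignored at line #{i + 1}"
    else
      lines[key] = row
    end
  end
  [lines, warnings]
end

rows = [
  ['A', 'A1', 'Account 1'],
  ['A', 'A2', 'Account 2'],
  ['a', 'a1', 'Duplicate of the first row']
]

lines, warnings = index_rows(rows, [0, 1])
puts warnings.inspect
```

Because the key values are upper-cased in this sketch, the third row collides with the first even though its case differs.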
|
117
|
+
|
118
|
+
|
119
|
+
## Examples
|
120
|
+
|
121
|
+
The simplest use case is as shown above, where the data to be diffed is in CSV files
|
122
|
+
with the column names as the first record, and where the unique key is the first
|
123
|
+
column in the data. In this case, a diff can be created simply via:
|
124
|
+
```ruby
|
125
|
+
diff = CSVDiff.new(file1, file2)
|
126
|
+
```
|
127
|
+
|
128
|
+
### Specifying Unique Row Identifiers
|
129
|
+
|
130
|
+
Often, however, rows are not uniquely identifiable via the first column in the file.
|
131
|
+
In a parent-child hierarchy, for example, combinations of parent and child may be
|
132
|
+
necessary to uniquely identify a row. In these cases, it is necessary to indicate
|
133
|
+
which fields are used to uniquely identify common rows across the two files. This
|
134
|
+
can be done in several different ways.
|
135
|
+
|
136
|
+
1. Using the :key_fields option with field numbers (these are 0-based):
|
137
|
+
|
138
|
+
```ruby
|
139
|
+
diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
|
140
|
+
```
|
141
|
+
|
142
|
+
2. Using the :key_fields option with column names:
|
143
|
+
|
144
|
+
```ruby
|
145
|
+
diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
|
146
|
+
```
|
147
|
+
|
148
|
+
3. Using the :parent_field and :child_fields options with field numbers:
|
149
|
+
|
150
|
+
```ruby
|
151
|
+
diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
|
152
|
+
```
|
153
|
+
|
154
|
+
4. Using the :parent_field and :child_fields options with column names:
|
155
|
+
|
156
|
+
```ruby
|
157
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
|
158
|
+
```
|
159
|
+
|
160
|
+
### Using Non-CSV File Sources
|
161
|
+
|
162
|
+
Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
|
163
|
+
of Arrays:
|
164
|
+
```ruby
|
165
|
+
DATA1 = [
|
166
|
+
['Parent', 'Child', 'Description'],
|
167
|
+
['A', 'A1', 'Account 1'],
|
168
|
+
['A', 'A2', 'Account 2']
|
169
|
+
]
|
170
|
+
|
171
|
+
DATA2 = [
|
172
|
+
['Parent', 'Child', 'Description'],
|
173
|
+
['A', 'A1', 'Account1'],
|
174
|
+
['A', 'A2', 'Account2']
|
175
|
+
]
|
176
|
+
|
177
|
+
diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
|
178
|
+
```
|
179
|
+
|
180
|
+
### Specifying Column Names
|
181
|
+
|
182
|
+
If your data file does not include column headers, you can specify the names of
|
183
|
+
each column when creating the diff. The names supplied are the keys used in the
|
184
|
+
diff results:
|
185
|
+
|
186
|
+
```ruby
|
187
|
+
DATA1 = [
|
188
|
+
['A', 'A1', 'Account 1'],
|
189
|
+
['A', 'A2', 'Account 2']
|
190
|
+
]
|
191
|
+
|
192
|
+
DATA2 = [
|
193
|
+
['A', 'A1', 'Account1'],
|
194
|
+
['A', 'A2', 'Account2']
|
195
|
+
]
|
196
|
+
|
197
|
+
diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])
|
198
|
+
```
|
199
|
+
|
200
|
+
If your data file does contain a header row, but you wish to use your own column
|
201
|
+
names, you can specify the :field_names option and the :ignore_header option to
|
202
|
+
ignore the first row.
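The effect of combining supplied field names with a skipped header row can be sketched as follows. `rows_to_hashes` is a hypothetical helper written for illustration, not the gem's API, and the replacement names 'Group', 'Account' and 'Label' are made up.

```ruby
# Turn rows into hashes keyed by field name (illustrative helper).
# If field_names is nil, the first row supplies the names; if field_names
# is given and ignore_header is true, the first row is skipped.
def rows_to_hashes(rows, field_names = nil, ignore_header = false)
  rows = rows.dup
  if field_names
    rows.shift if ignore_header
  else
    field_names = rows.shift
  end
  rows.map { |row| field_names.zip(row).to_h }
end

data = [
  ['Parent', 'Child', 'Description'],
  ['A', 'A1', 'Account 1'],
  ['A', 'A2', 'Account 2']
]

records = rows_to_hashes(data, ['Group', 'Account', 'Label'], true)
puts records.inspect
```

Without the ignore flag, the header row in this sketch would be treated as an ordinary data row under the supplied names.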
|
203
|
+
|
204
|
+
|
205
|
+
### Ignoring Fields
|
206
|
+
|
207
|
+
If your data contains fields that you aren't interested in, these can be excluded
|
208
|
+
from the diff process using the :ignore_fields option:
|
209
|
+
```ruby
|
210
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
211
|
+
ignore_fields: ['CreatedAt', 'UpdatedAt'])
|
212
|
+
```
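To show what excluding fields means for a single row, here is a small sketch of a field-by-field comparison. `changed_fields` is a hypothetical helper and the 'Score' and 'Notes' columns are invented for the example; it assumes blank strings and nil are treated as the same value.

```ruby
# Compare two versions of a row, field by field, skipping ignored fields.
# Blank strings are treated as nil, so '' and nil are not reported as a change.
def changed_fields(left_row, right_row, ignore_fields = [])
  changes = {}
  (right_row.keys & left_row.keys).each do |field|
    next if ignore_fields.include?(field)
    left_val = left_row[field] == '' ? nil : left_row[field]
    right_val = right_row[field] == '' ? nil : right_row[field]
    changes[field] = [left_val, right_val] if left_val != right_val
  end
  changes
end

old_row = { 'HomeTeam' => 'Reds', 'Score' => '1', 'Notes' => '', 'UpdatedAt' => '2013-01-01' }
new_row = { 'HomeTeam' => 'Reds', 'Score' => '2', 'Notes' => nil, 'UpdatedAt' => '2013-02-01' }

changes = changed_fields(old_row, new_row, ['UpdatedAt'])
puts changes.inspect
```

Only 'Score' is reported: 'UpdatedAt' is ignored, and the blank 'Notes' value is considered equal to nil.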
|
213
|
+
|
214
|
+
### Ignoring Certain Changes
|
215
|
+
|
216
|
+
CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
|
217
|
+
ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:
|
218
|
+
```ruby
|
219
|
+
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
|
220
|
+
ignore_fields: ['CreatedAt', 'UpdatedAt'],
|
221
|
+
ignore_deletes: true, ignore_moves: true)
|
222
|
+
```
|
data/lib/csv-diff.rb
ADDED
data/lib/csv-diff/algorithm.rb
ADDED
@@ -0,0 +1,124 @@
|
|
1
|
+
class CSVDiff
|
2
|
+
|
3
|
+
# Implements the CSV diff algorithm.
|
4
|
+
module Algorithm
|
5
|
+
|
6
|
+
# Diffs two CSVSource structures.
|
7
|
+
#
|
8
|
+
# @param left [CSVSource] A CSVSource object containing the contents of
|
9
|
+
# the left/from input.
|
10
|
+
# @param right [CSVSource] A CSVSource object containing the contents of
|
11
|
+
# the right/to input.
|
12
|
+
# @param key_fields [Array] An array containing the names of the field(s)
|
13
|
+
# that uniquely identify each row.
|
14
|
+
# @param diff_fields [Array] An array containing the names of the fields
|
15
|
+
# to be diff-ed.
|
16
|
+
def diff_sources(left, right, key_fields, diff_fields, options = {})
|
17
|
+
left_index = left.index
|
18
|
+
left_values = left.lines
|
19
|
+
left_keys = left_values.keys
|
20
|
+
right_index = right.index
|
21
|
+
right_values = right.lines
|
22
|
+
right_keys = right_values.keys
|
23
|
+
parent_fields = left.parent_fields.length
|
24
|
+
|
25
|
+
include_adds = !options[:ignore_adds]
|
26
|
+
include_moves = !options[:ignore_moves]
|
27
|
+
include_updates = !options[:ignore_updates]
|
28
|
+
include_deletes = !options[:ignore_deletes]
|
29
|
+
|
30
|
+
diffs = Hash.new{ |h, k| h[k] = {} }
|
31
|
+
right_keys.each_with_index do |key, right_row_id|
|
32
|
+
key_vals = key.split('~')
|
33
|
+
parent = key_vals[0...parent_fields].join('~')
|
34
|
+
child = key_vals[parent_fields..-1].join('~')
|
35
|
+
left_parent = left_index[parent]
|
36
|
+
right_parent = right_index[parent]
|
37
|
+
left_value = left_values[key]
|
38
|
+
right_value = right_values[key]
|
39
|
+
left_idx = left_parent && left_parent.index(key)
|
40
|
+
right_idx = right_parent && right_parent.index(key)
|
41
|
+
|
42
|
+
id = {}
|
43
|
+
id[:row] = right_row_id + 1
|
44
|
+
id[:sibling_position] = right_idx + 1
|
45
|
+
key_fields.each do |field_name|
|
46
|
+
id[field_name] = right_value[field_name]
|
47
|
+
end
|
48
|
+
if left_idx && right_idx
|
49
|
+
if include_moves
|
50
|
+
left_common = left_parent & right_parent
|
51
|
+
right_common = right_parent & left_parent
|
52
|
+
left_pos = left_common.index(key)
|
53
|
+
right_pos = right_common.index(key)
|
54
|
+
if left_pos != right_pos
|
55
|
+
# Move
|
56
|
+
diffs[key].merge!(id.merge!(:action => 'Move',
|
57
|
+
:sibling_position => [left_idx + 1, right_idx + 1]))
|
58
|
+
#puts "Move #{left_idx} -> #{right_idx}: #{key}"
|
59
|
+
end
|
60
|
+
end
|
61
|
+
if include_updates && (changes = diff_row(left_values[key], right_values[key], diff_fields))
|
62
|
+
diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
|
63
|
+
#puts "Change: #{key}"
|
64
|
+
end
|
65
|
+
elsif include_adds && right_idx
|
66
|
+
# Add
|
67
|
+
diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
|
68
|
+
#puts "Add: #{key}"
|
69
|
+
end
|
70
|
+
end
|
71
|
+
|
72
|
+
# Now identify deletions
|
73
|
+
if include_deletes
|
74
|
+
(left_keys - right_keys).each do |key|
|
75
|
+
# Delete
|
76
|
+
key_vals = key.split('~')
|
77
|
+
parent = key_vals[0...parent_fields].join('~')
|
78
|
+
child = key_vals[parent_fields..-1].join('~')
|
79
|
+
left_parent = left_index[parent]
|
80
|
+
left_value = left_values[key]
|
81
|
+
left_idx = left_parent.index(key)
|
82
|
+
next unless left_idx
|
83
|
+
id = {}
|
84
|
+
id[:row] = left_keys.index(key) + 1
|
85
|
+
id[:sibling_position] = left_idx + 1
|
86
|
+
key_fields.each do |field_name|
|
87
|
+
id[field_name] = left_value[field_name]
|
88
|
+
end
|
89
|
+
diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
|
90
|
+
#puts "Delete: #{key}"
|
91
|
+
end
|
92
|
+
end
|
93
|
+
diffs
|
94
|
+
end
|
95
|
+
|
96
|
+
|
97
|
+
# Identifies the fields that are different between two versions of the
|
98
|
+
# same row.
|
99
|
+
#
|
100
|
+
# @param left_row [Hash] The version of the CSV row from the left/from
|
101
|
+
# file.
|
102
|
+
# @param right_row [Hash] The version of the CSV row from the right/to
|
103
|
+
# file.
|
104
|
+
# @return [Hash<String, Array>] A Hash whose keys are the fields that
|
105
|
+
# contain differences, and whose values are a two-element array of
|
106
|
+
# [left/from, right/to] values.
|
107
|
+
def diff_row(left_row, right_row, fields)
|
108
|
+
diffs = {}
|
109
|
+
fields.each do |attr|
|
110
|
+
right_val = right_row[attr]
|
111
|
+
right_val = nil if right_val == ""
|
112
|
+
left_val = left_row[attr]
|
113
|
+
left_val = nil if left_val == ""
|
114
|
+
if left_val != right_val
|
115
|
+
diffs[attr] = [left_val, right_val]
|
116
|
+
#puts "#{attr}: #{left_val} -> #{right_val}"
|
117
|
+
end
|
118
|
+
end
|
119
|
+
diffs if diffs.size > 0
|
120
|
+
end
|
121
|
+
|
122
|
+
end
|
123
|
+
|
124
|
+
end
|
data/lib/csv-diff/csv_diff.rb
ADDED
@@ -0,0 +1,142 @@
|
|
1
|
+
# This library performs diffs of flat file content that contains structured data
|
2
|
+
# in fields, with rows provided in a parent-child format.
|
3
|
+
#
|
4
|
+
# Parent-child data does not lend itself well to standard text diffs, as small
|
5
|
+
# changes in the organisation of the tree at an upper level (e.g. re-ordering of
|
6
|
+
# two ancestor nodes) can lead to big movements in the position of descendant
|
7
|
+
# records - particularly when the parent-child data is generated by a hierarchy
|
8
|
+
# traversal.
|
9
|
+
#
|
10
|
+
# Additionally, simple line-based diffs can identify that a line has changed,
|
11
|
+
# but not which field(s) in the line have changed.
|
12
|
+
#
|
13
|
+
# Data may be supplied in the form of CSV files, or as an array of arrays. The
|
14
|
+
# diff process provides a fine level of control over what to diff, and
|
15
|
+
# can optionally ignore certain types of changes (e.g. changes in order).
|
16
|
+
class CSVDiff
|
17
|
+
|
18
|
+
# @return [CSVSource] CSVSource object containing details of the left/from
|
19
|
+
# input.
|
20
|
+
attr_reader :left
|
21
|
+
alias_method :from, :left
|
22
|
+
# @return [CSVSource] CSVSource object containing details of the right/to
|
23
|
+
# input.
|
24
|
+
attr_reader :right
|
25
|
+
alias_method :to, :right
|
26
|
+
# @return [Array<Hash>] An array of differences
|
27
|
+
attr_reader :diffs
|
28
|
+
# @return [Array<String>] An array of field names that are compared in the
|
29
|
+
# diff process.
|
30
|
+
attr_reader :diff_fields
|
31
|
+
# @return [Array<Fixnum>] An array of field indexes identifying the key
|
32
|
+
# fields that uniquely identify each row.
|
33
|
+
attr_reader :key_fields
|
34
|
+
# @return [Array<String>] An array of field names for the parent field(s).
|
35
|
+
attr_reader :parent_fields
|
36
|
+
# @return [Array<String>] An array of field names for the child field(s).
|
37
|
+
attr_reader :child_fields
|
38
|
+
|
39
|
+
|
40
|
+
# Generates a diff between two hierarchical tree structures, provided
|
41
|
+
# as +left+ and +right+, each of which consists of an array of lines in CSV
|
42
|
+
# format.
|
43
|
+
# An array of field indexes can also be specified as +key_fields+;
|
44
|
+
# a minimum of one field index must be specified; the last index is the
|
45
|
+
# child id, and the remaining fields (if any) are the parent field(s) that
|
46
|
+
# uniquely qualify the child instance.
|
47
|
+
#
|
48
|
+
# @param left [Array<Array<String>>] An Array of lines, each of which is in
|
49
|
+
# turn an Array containing fields.
|
50
|
+
# @param right [Array<Array<String>>] An Array of lines, each of which is in
|
51
|
+
# turn an Array containing fields.
|
52
|
+
# @param options [Hash] A hash containing options.
|
53
|
+
# @option options [Array<String>] :field_names An Array of field names for
|
54
|
+
# each field in +left+ and +right+. If not provided, the first row is
|
55
|
+
# assumed to contain field names.
|
56
|
+
# @option options [Boolean] :ignore_header If true, the first line of each
|
57
|
+
# file is ignored. This option can only be true if :field_names is
|
58
|
+
# specified.
|
59
|
+
# @option options [Array] :ignore_fields The names of any fields to be
|
60
|
+
# ignored when performing the diff.
|
61
|
+
# @option options [String] :key_field The name of the field that uniquely
|
62
|
+
# identifies each row.
|
63
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
64
|
+
# that uniquely identify each row.
|
65
|
+
# @option options [String] :parent_field The name of the field that
|
66
|
+
# identifies a parent within which sibling order should be checked.
|
67
|
+
# @option options [String] :child_field The name of the field that
|
68
|
+
# uniquely identifies a child of a parent.
|
69
|
+
# @option options [Boolean] :ignore_adds If true, records that appear in
|
70
|
+
# the right/to file but not in the left/from file are not reported.
|
71
|
+
# @option options [Boolean] :ignore_updates If true, records that have been
|
72
|
+
# updated are not reported.
|
73
|
+
# @option options [Boolean] :ignore_moves If true, changes in row position
|
74
|
+
# amongst sibling rows are not reported.
|
75
|
+
# @option options [Boolean] :ignore_deletes If true, records that appear
|
76
|
+
# in the left/from file but not in the right/to file are not reported.
|
77
|
+
def initialize(left, right, options = {})
|
78
|
+
@left = CSVSource.new(left, options)
|
79
|
+
raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
|
80
|
+
@right = CSVSource.new(right, options)
|
81
|
+
raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
|
82
|
+
@warnings = []
|
83
|
+
@diff_fields = get_diff_fields(@left.field_names, @right.field_names, options.fetch(:ignore_fields, []))
|
84
|
+
@key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
|
85
|
+
diff(options)
|
86
|
+
end
|
87
|
+
|
88
|
+
|
89
|
+
# Performs a diff with the specified +options+.
|
90
|
+
def diff(options = {})
|
91
|
+
@summary = nil
|
92
|
+
@diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
|
93
|
+
end
|
94
|
+
|
95
|
+
|
96
|
+
# Returns a summary of the number of adds, deletes, moves, and updates.
|
97
|
+
def summary
|
98
|
+
unless @summary
|
99
|
+
@summary = Hash.new{ |h, k| h[k] = 0 }
|
100
|
+
@diffs.each{ |k, v| @summary[v[:action]] += 1 }
|
101
|
+
@summary['Warnings'] = warnings.size if warnings.size > 0
|
102
|
+
end
|
103
|
+
@summary
|
104
|
+
end
|
105
|
+
|
106
|
+
|
107
|
+
[:adds, :deletes, :updates, :moves].each do |mthd|
|
108
|
+
define_method mthd do
|
109
|
+
action = mthd.to_s.chomp('s')
|
110
|
+
@diffs.select{ |k, v| v[:action].downcase == action }
|
111
|
+
end
|
112
|
+
end
|
113
|
+
|
114
|
+
|
115
|
+
# @return [Array<String>] an array of warning messages generated during the
|
116
|
+
# diff process.
|
117
|
+
def warnings
|
118
|
+
@left.warnings + @right.warnings + @warnings
|
119
|
+
end
|
120
|
+
|
121
|
+
|
122
|
+
private
|
123
|
+
|
124
|
+
|
125
|
+
# Given two sets of field names, determines the common set of fields present
|
126
|
+
# in both, on which members can be diffed.
|
127
|
+
def get_diff_fields(left_fields, right_fields, ignore_fields)
|
128
|
+
diff_fields = []
|
129
|
+
right_fields.each do |fld|
|
130
|
+
if left_fields.include?(fld)
|
131
|
+
diff_fields << fld unless ignore_fields.include?(fld)
|
132
|
+
else
|
133
|
+
@warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
|
134
|
+
end
|
135
|
+
end
|
136
|
+
diff_fields
|
137
|
+
end
|
138
|
+
|
139
|
+
|
140
|
+
include Algorithm
|
141
|
+
|
142
|
+
end
|
data/lib/csv-diff/csv_source.rb
ADDED
@@ -0,0 +1,151 @@
|
|
1
|
+
class CSVDiff
|
2
|
+
|
3
|
+
# Represents a CSV input (i.e. the left/from or right/to input) to the diff
|
4
|
+
# process.
|
5
|
+
class CSVSource
|
6
|
+
|
7
|
+
# @return [String] the path to the source file
|
8
|
+
attr_accessor :path
|
9
|
+
# @return [Array<String>] The names of the fields in the source file
|
10
|
+
attr_reader :field_names
|
11
|
+
# @return [Array<String>] The names of the field(s) that uniquely
|
12
|
+
# identify each row.
|
13
|
+
attr_reader :key_fields
|
14
|
+
# @return [Array<String>] The names of the field(s) that identify a
|
15
|
+
# common parent of child records.
|
16
|
+
attr_reader :parent_fields
|
17
|
+
# @return [Array<String>] The names of the field(s) that distinguish a
|
18
|
+
# child of a parent record.
|
19
|
+
attr_reader :child_fields
|
20
|
+
# @return [Hash<String,Hash>] A hash containing each line of the source,
|
21
|
+
# keyed on the values of the +key_fields+.
|
22
|
+
attr_reader :lines
|
23
|
+
# @return [Hash<String,Array<String>>] A hash containing each parent key,
|
24
|
+
# and an Array of the child keys it is a parent of.
|
25
|
+
attr_reader :index
|
26
|
+
# @return [Array<String>] An array of any warnings encountered while
|
27
|
+
# processing the source.
|
28
|
+
attr_reader :warnings
|
29
|
+
|
30
|
+
|
31
|
+
# Creates a new diff source.
|
32
|
+
#
|
33
|
+
# A diff source must contain at least one field that will be used as the
|
34
|
+
# key to identify the same record in a different version of this file.
|
35
|
+
# If not specified via one of the options, the first field is assumed to
|
36
|
+
# be the unique key.
|
37
|
+
#
|
38
|
+
# If multiple fields combine to form a unique key, the parent is assumed
|
39
|
+
# to be identified by all but the last field of the unique key. If finer
|
40
|
+
# control is required, use a combination of the :parent_fields and
|
41
|
+
# :child_fields options.
|
42
|
+
#
|
43
|
+
# All key options can be specified either by field name, or by field
|
44
|
+
# index (0 based).
|
45
|
+
#
|
46
|
+
# @param source [String|Array<Array>] Either a path to a CSV file, or an
|
47
|
+
# Array of Arrays containing CSV data. If the :field_names option is
|
48
|
+
# not specified, the first line must contain the names of the fields.
|
49
|
+
# @param options [Hash] An options hash.
|
50
|
+
# @option options [String] :mode_string The mode to use when opening the
|
51
|
+
# CSV file. Defaults to 'r'.
|
52
|
+
# @option options [Hash] :csv_options Any options you wish to pass to
|
53
|
+
# CSV.open, e.g. :col_sep.
|
54
|
+
# @option options [Array<String>] :field_names The names of each of the
|
55
|
+
# fields in +source+.
|
56
|
+
# @option options [Boolean] :ignore_header If true, and :field_names has
|
57
|
+
# been specified, then the first row of the file is ignored.
|
58
|
+
# @option options [String] :key_field The name of the field that uniquely
|
59
|
+
# identifies each row.
|
60
|
+
# @option options [Array<String>] :key_fields The names of the fields
|
61
|
+
# that uniquely identify each row.
|
62
|
+
# @option options [String] :parent_field The name of the field that
|
63
|
+
# identifies a parent within which sibling order should be checked.
|
64
|
+
# @option options [String] :child_field The name of the field that
|
65
|
+
# uniquely identifies a child of a parent.
|
66
|
+
def initialize(source, options = {})
|
67
|
+
if source.is_a?(String)
|
68
|
+
require 'csv'
|
69
|
+
mode_string = options.fetch(:mode_string, 'r')
|
70
|
+
csv_options = options.fetch(:csv_options, {})
|
71
|
+
@path = source
|
72
|
+
source = CSV.open(@path, mode_string, csv_options).readlines
|
73
|
+
end
|
74
|
+
if kf = options.fetch(:key_field, options[:key_fields])
|
75
|
+
@key_fields = [kf].flatten
|
76
|
+
@parent_fields = @key_fields[0...-1]
|
77
|
+
@child_fields = @key_fields[-1..-1]
|
78
|
+
else
|
79
|
+
@parent_fields = [options.fetch(:parent_field, options.fetch(:parent_fields, []))].flatten
|
80
|
+
@child_fields = [options.fetch(:child_field, options.fetch(:child_fields, [0]))].flatten
|
81
|
+
@key_fields = @parent_fields + @child_fields
|
82
|
+
end
|
83
|
+
@field_names = options[:field_names]
|
84
|
+
@warnings = []
|
85
|
+
index_source(source, options)
|
86
|
+
end
|
87
|
+
|
88
|
+
|
89
|
+
# Returns the row in the CSV source corresponding to the supplied key.
|
90
|
+
#
|
91
|
+
# @param key [String] The unique key to use to lookup the row.
|
92
|
+
# @return [Hash] The fields for the line corresponding to +key+, or nil
|
93
|
+
# if the key is not recognised.
|
94
|
+
def [](key)
|
95
|
+
@lines[key]
|
96
|
+
end
|
97
|
+
|
98
|
+
|
99
|
+
private
|
100
|
+
|
101
|
+
# Given an array of lines, where each line is an array of fields, indexes
|
102
|
+
# the array contents so that it can be looked up by key.
|
103
|
+
def index_source(lines, options)
|
104
|
+
@lines = {}
|
105
|
+
@index = Hash.new{ |h, k| h[k] = [] }
|
106
|
+
@key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
|
107
|
+
line_num = 0
|
108
|
+
lines.each do |row|
|
109
|
+
line_num += 1
|
110
|
+
next if line_num == 1 && @field_names && options[:ignore_header]
|
111
|
+
unless @field_names
|
112
|
+
@field_names = row
|
113
|
+
@key_fields = find_field_indexes(@key_fields, @field_names)
|
114
|
+
next
|
115
|
+
end
|
116
|
+
field_vals = row
|
117
|
+
line = {}
|
118
|
+
@field_names.each_with_index do |field, i|
|
119
|
+
line[field] = field_vals[i]
|
120
|
+
end
|
121
|
+
key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
|
122
|
+
key = key_values.join('~')
|
123
|
+
parent_key = key_values[0...(@parent_fields.length)].join('~')
|
124
|
+
if @lines[key]
|
125
|
+
@warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
|
126
|
+
else
|
127
|
+
@index[parent_key] << key
|
128
|
+
@lines[key] = line
|
129
|
+
end
|
130
|
+
end
|
131
|
+
end
|
132
|
+
|
133
|
+
|
134
|
+
# Converts an array of field names to an array of indexes of the fields
|
135
|
+
# matching those names.
|
136
|
+
def find_field_indexes(key_fields, field_names)
|
137
|
+
key_fields.map do |field|
|
138
|
+
if field.is_a?(Fixnum)
|
139
|
+
field
|
140
|
+
else
|
141
|
+
field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
|
142
|
+
raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
|
143
|
+
field_names.join(', ')}"
|
144
|
+
end
|
145
|
+
end
|
146
|
+
end
|
147
|
+
|
148
|
+
end
|
149
|
+
|
150
|
+
end
|
151
|
+
|
data/lib/csv_diff.rb
ADDED
metadata
ADDED
@@ -0,0 +1,62 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: csv-diff
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: '0.1'
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Adam Gardiner
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2014-05-30 00:00:00.000000000 Z
|
13
|
+
dependencies: []
|
14
|
+
description: ! " This library performs diffs of CSV files.\n\n Unlike
|
15
|
+
a standard diff that compares line by line, and is sensitive to the\n ordering
|
16
|
+
of records, CSV-Diff identifies common lines by key field(s), and\n then
|
17
|
+
compares the contents of the fields in each line.\n\n Data may be supplied
|
18
|
+
in the form of CSV files, or as an array of arrays. The\n diff process provides
|
19
|
+
a fine level of control over what to diff, and can\n optionally ignore certain
|
20
|
+
types of changes (e.g. changes in position).\n\n CSV-Diff is particularly
|
21
|
+
well suited to data in parent-child format. Parent-\n child data does not
|
22
|
+
lend itself well to standard text diffs, as small changes\n in the organisation
|
23
|
+
of the tree at an upper level can lead to big movements\n in the position
|
24
|
+
of descendant records. By instead matching records by key,\n CSV-Diff avoids
|
25
|
+
this issue, while still being able to detect changes in\n sibling order.\n"
|
26
|
+
email: adam.b.gardiner@gmail.com
|
27
|
+
executables: []
|
28
|
+
extensions: []
|
29
|
+
extra_rdoc_files: []
|
30
|
+
files:
|
31
|
+
- README.md
|
32
|
+
- LICENSE
|
33
|
+
- lib/csv-diff/algorithm.rb
|
34
|
+
- lib/csv-diff/csv_diff.rb
|
35
|
+
- lib/csv-diff/csv_source.rb
|
36
|
+
- lib/csv-diff.rb
|
37
|
+
- lib/csv_diff.rb
|
38
|
+
homepage: https://github.com/agardiner/csv-diff
|
39
|
+
licenses: []
|
40
|
+
post_install_message:
|
41
|
+
rdoc_options: []
|
42
|
+
require_paths:
|
43
|
+
- lib
|
44
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
45
|
+
none: false
|
46
|
+
requirements:
|
47
|
+
- - ! '>='
|
48
|
+
- !ruby/object:Gem::Version
|
49
|
+
version: '0'
|
50
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
51
|
+
none: false
|
52
|
+
requirements:
|
53
|
+
- - ! '>='
|
54
|
+
- !ruby/object:Gem::Version
|
55
|
+
version: '0'
|
56
|
+
requirements: []
|
57
|
+
rubyforge_project:
|
58
|
+
rubygems_version: 1.8.23
|
59
|
+
signing_key:
|
60
|
+
specification_version: 3
|
61
|
+
summary: CSV Diff is a library for generating diffs from data in CSV format
|
62
|
+
test_files: []
|