csv-diff 0.1

data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013, Adam Gardiner
2
+ All rights reserved.
3
+
4
+ Redistribution and use in source and binary forms, with or without
5
+ modification, are permitted provided that the following conditions are met:
6
+
7
+ * Redistributions of source code must retain the above copyright notice, this
8
+ list of conditions and the following disclaimer.
9
+ * Redistributions in binary form must reproduce the above copyright notice,
10
+ this list of conditions and the following disclaimer in the documentation
11
+ and/or other materials provided with the distribution.
12
+
13
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
14
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
16
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
17
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
19
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
20
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
21
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
22
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,222 @@
1
+ # CSV-Diff
2
+
3
+ CSV-Diff is a small library for performing diffs of CSV data.
4
+
5
+ Unlike a standard diff that compares line by line, and is sensitive to the
6
+ ordering of records, CSV-Diff identifies common lines by key field(s), and
7
+ then compares the contents of the fields in each line.
8
+
9
+ Data may be supplied in the form of CSV files, or as an array of arrays. The
10
+ diff process provides a fine level of control over what to diff, and can
11
+ optionally ignore certain types of changes (e.g. changes in position).
12
+
13
+ CSV-Diff is particularly well suited to data in parent-child format. Parent-
14
+ child data does not lend itself well to standard text diffs, as small changes
15
+ in the organisation of the tree at an upper level can lead to big movements
16
+ in the position of descendant records. By instead matching records by key,
17
+ CSV-Diff avoids this issue, while still being able to detect changes in
18
+ sibling order.
19
+
20
+
21
+ ## Usage
22
+
23
+ CSV-Diff is supplied as a gem, and has no dependencies. To install it, run:
24
+ ```
25
+ gem install csv-diff
26
+ ```
27
+
28
+ To compare two CSV files where the field names are in the first row of the file,
29
+ and the first field contains the unique key for each record, simply use:
30
+ ```ruby
31
+ require 'csv-diff'
32
+
33
+ diff = CSVDiff.new(file1, file2)
34
+ ```
35
+
36
+ The returned diff object can be queried for the differences that exist between
37
+ the two files, e.g.:
38
+ ```ruby
39
+ puts diff.summary.inspect # Summary of the adds, deletes, updates, and moves
40
+ puts diff.adds.inspect # Details of the additions to file2
41
+ puts diff.deletes.inspect # Details of the deletions from file1
42
+ puts diff.updates.inspect # Details of the updates from file1 to file2
43
+ puts diff.moves.inspect # Details of the moves from file1 to file2
44
+ puts diff.diffs.inspect # Details of all differences
45
+ puts diff.warnings.inspect # Any warnings generated during the diff process
46
+ ```
47
+
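As a rough illustration of the shape of these results, the sketch below tallies a hand-written hash of differences by action, much as `#summary` does, and then selects one action type, much as `#adds` does. The `sample_diffs` data and the `tally_actions` and `with_action` helpers are hypothetical and not part of the gem; only the `:action` key and its 'Add', 'Update', 'Move' and 'Delete' values are taken from the library source.

```ruby
# Counts how many differences there are of each action type.
def tally_actions(diffs)
  counts = Hash.new(0)
  diffs.each_value { |diff| counts[diff[:action]] += 1 }
  counts
end

# Selects only the differences with the given action, e.g. 'Add'.
def with_action(diffs, action)
  diffs.select { |_key, diff| diff[:action] == action }
end

# Hypothetical diff results, keyed by row key. Only the :action key mirrors
# the library; the remaining content is made up for illustration.
sample_diffs = {
  'A~A1' => { action: 'Update', 'Description' => ['Account 1', 'Account1'] },
  'A~A2' => { action: 'Update', 'Description' => ['Account 2', 'Account2'] },
  'A~A3' => { action: 'Add', 'Description' => 'Account 3' },
  'B~B1' => { action: 'Delete', 'Description' => 'Old account' }
}

puts tally_actions(sample_diffs).inspect
puts with_action(sample_diffs, 'Add').keys.inspect
```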
48
+ ## Unique Row Identifiers
49
+
50
+ CSVDiff is preferable to a standard line-by-line diff when row order is
51
+ significantly impacted by small changes. The classic example is a parent-child
52
+ file generated by a hierarchy traversal. A simple change in position of a parent
53
+ member near the root of the hierarchy will have a large impact on the positions
54
+ of all descendant rows. Consider the following example:
55
+ ```
56
+ Root
57
+ |- A
58
+ | |- A1
59
+ | |- A2
60
+ |
61
+ |- B
62
+ |- B1
63
+ |- B2
64
+ ```
65
+
66
+ A hierarchy traversal of this tree into a parent-child format would generate a CSV
67
+ as follows:
68
+ ```
69
+ Root,A
70
+ A,A1
71
+ A,A2
72
+ Root,B
73
+ B,B1
74
+ B,B2
75
+ ```
76
+
77
+ If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV
78
+ as follows:
79
+ ```
80
+ Root,B
81
+ B,B1
82
+ B,B2
83
+ Root,A
84
+ A,A1
85
+ A,A2
86
+ ```
87
+
88
+ A simple diff using a diff utility would highlight this as 3 additions and 3 deletions.
89
+ CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).
90
+
91
+ In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row.
92
+ In this example, we could use the child field alone (since each member name only appears
93
+ once); however, this would imply a flat structure, where all rows are children of a single
94
+ parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a
95
+ change in order) of all 6 rows.
96
+
97
+ The more correct specification of this file is that column 0 contains a unique parent
98
+ identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
99
+ deduce that there are in fact only two changes in order - the swap in positions of A and
100
+ B below Root.
101
+
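The difference between the two key specifications can be checked with a small stand-alone sketch. It does not use the gem; `sibling_index` and `count_moves` are hypothetical helpers that only approximate the sibling-order comparison described above, under the assumption that a row counts as moved when its position among the siblings common to both files differs.

```ruby
# Groups child keys under their parent key, preserving file order.
def sibling_index(rows, parent_cols, child_cols)
  index = Hash.new { |h, k| h[k] = [] }
  rows.each do |row|
    parent = row.values_at(*parent_cols).join('~')
    child = row.values_at(*child_cols).join('~')
    index[parent] << child
  end
  index
end

# Counts rows whose position among their common siblings has changed.
def count_moves(left_rows, right_rows, parent_cols, child_cols)
  left = sibling_index(left_rows, parent_cols, child_cols)
  right = sibling_index(right_rows, parent_cols, child_cols)
  moves = 0
  right.each do |parent, right_children|
    left_children = left.fetch(parent, [])
    left_common = left_children & right_children
    right_common = right_children & left_children
    right_common.each_with_index do |child, pos|
      moves += 1 if left_common.index(child) != pos
    end
  end
  moves
end

before = [%w[Root A], %w[A A1], %w[A A2], %w[Root B], %w[B B1], %w[B B2]]
after  = [%w[Root B], %w[B B1], %w[B B2], %w[Root A], %w[A A1], %w[A A2]]

# Column 0 as parent, column 1 as child: only A and B swap below Root.
puts count_moves(before, after, [0], [1])  # prints 2
# Child column alone (flat structure): every row changes position.
puts count_moves(before, after, [], [1])   # prints 6
```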
102
+ Note: If you aren't interested in changes in the order of siblings, then you could use
103
+ CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.
104
+
105
+ ## Warnings
106
+
107
+ When processing and diffing files, CSVDiff may encounter problems with the data or
108
+ the specifications it has been given. It will continue even in the face of problems,
109
+ but will log details of the problems in a #warnings Array. The number of warnings
110
+ will also be included in the Hash returned by the #summary method.
111
+
112
+ Warnings may be raised for any of the following:
113
+ * Missing fields: If the right/to file contains fields that are not present in the
114
+ left/from file, a warning is raised and the field is ignored for diff purposes.
115
+ * Duplicate keys: If two rows are found that have the same values for the key field(s),
116
+ a warning is raised, and the second and any later rows with that key are ignored.
117
+
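Both kinds of warning can be sketched in a few lines of plain Ruby. The helpers below, `common_fields` and `index_rows`, are hypothetical and only loosely modelled on the library source; in particular, the sketch assumes that keys are compared case-insensitively and that the first row seen for a key wins.

```ruby
# Returns the fields of the right input that also exist in the left input,
# recording a warning for each field that is missing from the left.
def common_fields(left_fields, right_fields, warnings)
  right_fields.select do |field|
    present = left_fields.include?(field)
    warnings << "Field '#{field}' is missing from the left (from) file" unless present
    present
  end
end

# Indexes data rows by key, keeping the first row for each key and
# recording a warning for every later duplicate.
def index_rows(rows, key_cols)
  lines = {}
  warnings = []
  rows.each_with_index do |row, i|
    key = row.values_at(*key_cols).map { |v| v.to_s.upcase }.join('~')
    if lines.key?(key)
      warnings << "Duplicate key '#{key}' encountered and ignored at line #{i + 1}"
    else
      lines[key] = row
    end
  end
  [lines, warnings]
end

field_warnings = []
puts common_fields(%w[Parent Child], %w[Parent Child Owner], field_warnings).inspect
puts field_warnings.inspect

_lines, row_warnings = index_rows([%w[A A1], %w[a a1], %w[A A2]], [0, 1])
puts row_warnings.inspect
```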
118
+
119
+ ## Examples
120
+
121
+ The simplest use case is as shown above, where the data to be diffed is in CSV files
122
+ with the column names as the first record, and where the unique key is the first
123
+ column in the data. In this case, a diff can be created simply via:
124
+ ```ruby
125
+ diff = CSVDiff.new(file1, file2)
126
+ ```
127
+
128
+ ### Specifying Unique Row Identifiers
129
+
130
+ Often however, rows are not uniquely identifiable via the first column in the file.
131
+ In a parent-child hierarchy, for example, combinations of parent and child may be
132
+ necessary to uniquely identify a row. In these cases, it is necessary to indicate
133
+ which fields are used to uniquely identify common rows across the two files. This
134
+ can be done in several different ways.
135
+
136
+ 1. Using the :key_fields option with field numbers (these are 0-based):
137
+
138
+ ```ruby
139
+ diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
140
+ ```
141
+
142
+ 2. Using the :key_fields option with column names:
143
+
144
+ ```ruby
145
+ diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
146
+ ```
147
+
148
+ 3. Using the :parent_field and :child_fields options with field numbers:
149
+
150
+ ```ruby
151
+ diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
152
+ ```
153
+
154
+ 4. Using the :parent_field and :child_fields options with column names:
155
+
156
+ ```ruby
157
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
158
+ ```
159
+
160
+ ### Using Non-CSV File Sources
161
+
162
+ Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
163
+ of Arrays:
164
+ ```ruby
165
+ DATA1 = [
166
+ ['Parent', 'Child', 'Description'],
167
+ ['A', 'A1', 'Account 1'],
168
+ ['A', 'A2', 'Account 2']
169
+ ]
170
+
171
+ DATA2 = [
172
+ ['Parent', 'Child', 'Description'],
173
+ ['A', 'A1', 'Account1'],
174
+ ['A', 'A2', 'Account2']
175
+ ]
176
+
177
+ diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
178
+ ```
179
+
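For two rows with the same key, the field-level comparison might look roughly like the following. This is a stand-alone sketch, not the gem's API: `changed_fields` is a hypothetical helper, and treating empty strings as equivalent to nil is an assumption modelled on the library's row comparison.

```ruby
# Returns a hash of field name => [left value, right value] for every
# field whose value differs between the two versions of a row.
def changed_fields(left_row, right_row, fields)
  changes = {}
  fields.each do |field|
    left_val = left_row[field]
    left_val = nil if left_val == ''
    right_val = right_row[field]
    right_val = nil if right_val == ''
    changes[field] = [left_val, right_val] if left_val != right_val
  end
  changes
end

left  = { 'Parent' => 'A', 'Child' => 'A1', 'Description' => 'Account 1' }
right = { 'Parent' => 'A', 'Child' => 'A1', 'Description' => 'Account1' }

puts changed_fields(left, right, %w[Parent Child Description]).inspect
```

Only the Description field differs here, so the result contains a single entry holding the old and new values.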
180
+ ### Specifying Column Names
181
+
182
+ If your data file does not include column headers, you can specify the names of
183
+ each column when creating the diff. The names supplied are the keys used in the
184
+ diff results:
185
+
186
+ ```ruby
187
+ DATA1 = [
188
+ ['A', 'A1', 'Account 1'],
189
+ ['A', 'A2', 'Account 2']
190
+ ]
191
+
192
+ DATA2 = [
193
+ ['A', 'A1', 'Account1'],
194
+ ['A', 'A2', 'Account2']
195
+ ]
196
+
197
+ diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])
198
+ ```
199
+
200
+ If your data file does contain a header row, but you wish to use your own column
201
+ names, you can specify the :field_names option and the :ignore_header option to
202
+ ignore the first row.
203
+
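The interaction between supplied field names and the header row can be sketched without the gem. `rows_to_hashes` below is a hypothetical helper; its keyword arguments are named after the :field_names and :ignore_header options, but the implementation is only an assumption about how such options could behave.

```ruby
# Converts an array of rows into an array of hashes keyed by field name.
# If field names are supplied, the first row is treated as data unless
# ignore_header is true; otherwise the first row supplies the names.
def rows_to_hashes(rows, field_names: nil, ignore_header: false)
  rows = rows.dup
  if field_names
    rows.shift if ignore_header # discard the file's own header row
  else
    field_names = rows.shift    # the first row supplies the names
  end
  rows.map { |row| field_names.zip(row).to_h }
end

data = [
  ['P', 'C', 'Desc'],
  ['A', 'A1', 'Account 1']
]

# Replace the file's own header with custom names.
puts rows_to_hashes(data, field_names: %w[Parent Child Description],
                          ignore_header: true).inspect
```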
204
+
205
+ ### Ignoring Fields
206
+
207
+ If your data contains fields that you aren't interested in, these can be excluded
208
+ from the diff process using the :ignore_fields option:
209
+ ```ruby
210
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
211
+ ignore_fields: ['CreatedAt', 'UpdatedAt'])
212
+ ```
213
+
214
+ ### Ignoring Certain Changes
215
+
216
+ CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
217
+ ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:
218
+ ```ruby
219
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
220
+ ignore_fields: ['CreatedAt', 'UpdatedAt'],
221
+ ignore_deletes: true, ignore_moves: true)
222
+ ```
@@ -0,0 +1,4 @@
1
+ require 'csv-diff/csv_source'
2
+ require 'csv-diff/algorithm'
3
+ require 'csv-diff/csv_diff'
4
+
@@ -0,0 +1,124 @@
1
+ class CSVDiff
2
+
3
+ # Implements the CSV diff algorithm.
4
+ module Algorithm
5
+
6
+ # Diffs two CSVSource structures.
7
+ #
8
+ # @param left [CSVSource] A CSVSource object containing the contents of
9
+ # the left/from input.
10
+ # @param right [CSVSource] A CSVSource object containing the contents of
11
+ # the right/to input.
12
+ # @param key_fields [Array] An array containing the names of the field(s)
13
+ # that uniquely identify each row.
14
+ # @param diff_fields [Array] An array containing the names of the fields
15
+ # to be diff-ed.
16
+ def diff_sources(left, right, key_fields, diff_fields, options = {})
17
+ left_index = left.index
18
+ left_values = left.lines
19
+ left_keys = left_values.keys
20
+ right_index = right.index
21
+ right_values = right.lines
22
+ right_keys = right_values.keys
23
+ parent_fields = left.parent_fields.length
24
+
25
+ include_adds = !options[:ignore_adds]
26
+ include_moves = !options[:ignore_moves]
27
+ include_updates = !options[:ignore_updates]
28
+ include_deletes = !options[:ignore_deletes]
29
+
30
+ diffs = Hash.new{ |h, k| h[k] = {} }
31
+ right_keys.each_with_index do |key, right_row_id|
32
+ key_vals = key.split('~')
33
+ parent = key_vals[0...parent_fields].join('~')
34
+ child = key_vals[parent_fields..-1].join('~')
35
+ left_parent = left_index[parent]
36
+ right_parent = right_index[parent]
37
+ left_value = left_values[key]
38
+ right_value = right_values[key]
39
+ left_idx = left_parent && left_parent.index(key)
40
+ right_idx = right_parent && right_parent.index(key)
41
+
42
+ id = {}
43
+ id[:row] = right_row_id + 1
44
+ id[:sibling_position] = right_idx + 1
45
+ key_fields.each do |field_name|
46
+ id[field_name] = right_value[field_name]
47
+ end
48
+ if left_idx && right_idx
49
+ if include_moves
50
+ left_common = left_parent & right_parent
51
+ right_common = right_parent & left_parent
52
+ left_pos = left_common.index(key)
53
+ right_pos = right_common.index(key)
54
+ if left_pos != right_pos
55
+ # Move
56
+ diffs[key].merge!(id.merge!(:action => 'Move',
57
+ :sibling_position => [left_idx + 1, right_idx + 1]))
58
+ #puts "Move #{left_idx} -> #{right_idx}: #{key}"
59
+ end
60
+ end
61
+ if include_updates && (changes = diff_row(left_values[key], right_values[key], diff_fields))
62
+ diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
63
+ #puts "Change: #{key}"
64
+ end
65
+ elsif include_adds && right_idx
66
+ # Add
67
+ diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
68
+ #puts "Add: #{key}"
69
+ end
70
+ end
71
+
72
+ # Now identify deletions
73
+ if include_deletes
74
+ (left_keys - right_keys).each do |key|
75
+ # Delete
76
+ key_vals = key.split('~')
77
+ parent = key_vals[0...parent_fields].join('~')
78
+ child = key_vals[parent_fields..-1].join('~')
79
+ left_parent = left_index[parent]
80
+ left_value = left_values[key]
81
+ left_idx = left_parent.index(key)
82
+ next unless left_idx
83
+ id = {}
84
+ id[:row] = left_keys.index(key) + 1
85
+ id[:sibling_position] = left_idx + 1
86
+ key_fields.each do |field_name|
87
+ id[field_name] = left_value[field_name]
88
+ end
89
+ diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
90
+ #puts "Delete: #{key}"
91
+ end
92
+ end
93
+ diffs
94
+ end
95
+
96
+
97
+ # Identifies the fields that are different between two versions of the
98
+ # same row.
99
+ #
100
+ # @param left_row [Hash] The version of the CSV row from the left/from
101
+ # file.
102
+ # @param right_row [Hash] The version of the CSV row from the right/to
103
+ # file.
104
+ # @return [Hash<String, Array>] A Hash whose keys are the fields that
105
+ # contain differences, and whose values are a two-element array of
106
+ # [left/from, right/to] values.
107
+ def diff_row(left_row, right_row, fields)
108
+ diffs = {}
109
+ fields.each do |attr|
110
+ right_val = right_row[attr]
111
+ right_val = nil if right_val == ""
112
+ left_val = left_row[attr]
113
+ left_val = nil if left_val == ""
114
+ if left_val != right_val
115
+ diffs[attr] = [left_val, right_val]
116
+ #puts "#{attr}: #{left_val} -> #{right_val}"
117
+ end
118
+ end
119
+ diffs if diffs.size > 0
120
+ end
121
+
122
+ end
123
+
124
+ end
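The composite keys used above are plain strings: key values are upper-cased and joined with '~' when a source is indexed, and split on '~' again here to separate the parent part from the child part. The sketch below reproduces that round trip with two hypothetical helpers, `make_key` and `split_key`, which are not part of the gem. It also suggests one consequence of this design: a key value that itself contains '~' would be split at the wrong point.

```ruby
# Builds a composite key from an array of key values.
def make_key(values)
  values.map { |v| v.to_s.upcase }.join('~')
end

# Splits a composite key into [parent key, child key], given the number
# of parent fields.
def split_key(key, parent_count)
  parts = key.split('~')
  [parts[0...parent_count].join('~'), parts[parent_count..-1].join('~')]
end

key = make_key(%w[Root a1])
puts key                      # prints ROOT~A1
puts split_key(key, 1).inspect

# A '~' inside a value shifts the boundary between parent and child.
puts split_key(make_key(['x~y', 'z']), 1).inspect
```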
@@ -0,0 +1,142 @@
1
+ # This library performs diffs of flat file content that contains structured data
2
+ # in fields, with rows provided in a parent-child format.
3
+ #
4
+ # Parent-child data does not lend itself well to standard text diffs, as small
5
+ # changes in the organisation of the tree at an upper level (e.g. re-ordering of
6
+ # two ancestor nodes) can lead to big movements in the position of descendant
7
+ # records - particularly when the parent-child data is generated by a hierarchy
8
+ # traversal.
9
+ #
10
+ # Additionally, simple line-based diffs can identify that a line has changed,
11
+ # but not which field(s) in the line have changed.
12
+ #
13
+ # Data may be supplied in the form of CSV files, or as an array of arrays. The
14
+ # diff process provides a fine level of control over what to diff, and
15
+ # can optionally ignore certain types of changes (e.g. changes in order).
16
+ class CSVDiff
17
+
18
+ # @return [CSVSource] CSVSource object containing details of the left/from
19
+ # input.
20
+ attr_reader :left
21
+ alias_method :from, :left
22
+ # @return [CSVSource] CSVSource object containing details of the right/to
23
+ # input.
24
+ attr_reader :right
25
+ alias_method :to, :right
26
+ # @return [Hash<String,Hash>] A hash of differences, keyed on row key
27
+ attr_reader :diffs
28
+ # @return [Array<String>] An array of field names that are compared in the
29
+ # diff process.
30
+ attr_reader :diff_fields
31
+ # @return [Array<String>] An array of the names of the key
32
+ # fields that uniquely identify each row.
33
+ attr_reader :key_fields
34
+ # @return [Array<String>] An array of field names for the parent field(s).
35
+ attr_reader :parent_fields
36
+ # @return [Array<String>] An array of field names for the child field(s).
37
+ attr_reader :child_fields
38
+
39
+
40
+ # Generates a diff between two hierarchical tree structures, provided
41
+ # as +left+ and +right+, each of which consists of an array of lines in CSV
42
+ # format.
43
+ # An array of field indexes can also be specified as +key_fields+;
44
+ # a minimum of one field index must be specified; the last index is the
45
+ # child id, and the remaining fields (if any) are the parent field(s) that
46
+ # uniquely qualify the child instance.
47
+ #
48
+ # @param left [Array<Array<String>>] An Array of lines, each of which is in
49
+ # turn an Array containing fields.
50
+ # @param right [Array<Array<String>>] An Array of lines, each of which is in
51
+ # turn an Array containing fields.
52
+ # @param options [Hash] A hash containing options.
53
+ # @option options [Array<String>] :field_names An Array of field names for
54
+ # each field in +left+ and +right+. If not provided, the first row is
55
+ # assumed to contain field names.
56
+ # @option options [Boolean] :ignore_header If true, the first line of each
57
+ # file is ignored. This option can only be true if :field_names is
58
+ # specified.
59
+ # @option options [Array] :ignore_fields The names of any fields to be
60
+ # ignored when performing the diff.
61
+ # @option options [String] :key_field The name of the field that uniquely
62
+ # identifies each row.
63
+ # @option options [Array<String>] :key_fields The names of the fields
64
+ # that uniquely identify each row.
65
+ # @option options [String] :parent_field The name of the field that
66
+ # identifies a parent within which sibling order should be checked.
67
+ # @option options [String] :child_field The name of the field that
68
+ # uniquely identifies a child of a parent.
69
+ # @option options [Boolean] :ignore_adds If true, records that appear in
70
+ # the right/to file but not in the left/from file are not reported.
71
+ # @option options [Boolean] :ignore_updates If true, records that have been
72
+ # updated are not reported.
73
+ # @option options [Boolean] :ignore_moves If true, changes in row position
74
+ # amongst sibling rows are not reported.
75
+ # @option options [Boolean] :ignore_deletes If true, records that appear
76
+ # in the left/from file but not in the right/to file are not reported.
77
+ def initialize(left, right, options = {})
78
+ @left = CSVSource.new(left, options)
79
+ raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
80
+ @right = CSVSource.new(right, options)
81
+ raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
82
+ @warnings = []
83
+ @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options.fetch(:ignore_fields, []))
84
+ @key_fields = @left.key_fields.map{ |kf| @diff_fields[kf] }
85
+ diff(options)
86
+ end
87
+
88
+
89
+ # Performs a diff with the specified +options+.
90
+ def diff(options = {})
91
+ @summary = nil
92
+ @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
93
+ end
94
+
95
+
96
+ # Returns a summary of the number of adds, deletes, moves, and updates.
97
+ def summary
98
+ unless @summary
99
+ @summary = Hash.new{ |h, k| h[k] = 0 }
100
+ @diffs.each{ |k, v| @summary[v[:action]] += 1 }
101
+ @summary['Warnings'] = warnings.size if warnings.size > 0
102
+ end
103
+ @summary
104
+ end
105
+
106
+
107
+ [:adds, :deletes, :updates, :moves].each do |mthd|
108
+ define_method mthd do
109
+ action = mthd.to_s.chomp('s')
110
+ @diffs.select{ |k, v| v[:action].downcase == action }
111
+ end
112
+ end
113
+
114
+
115
+ # @return [Array<String>] an array of warning messages generated during the
116
+ # diff process.
117
+ def warnings
118
+ @left.warnings + @right.warnings + @warnings
119
+ end
120
+
121
+
122
+ private
123
+
124
+
125
+ # Given two sets of field names, determines the common set of fields present
126
+ # in both, on which members can be diffed.
127
+ def get_diff_fields(left_fields, right_fields, ignore_fields)
128
+ diff_fields = []
129
+ right_fields.each do |fld|
130
+ if left_fields.include?(fld)
131
+ diff_fields << fld unless ignore_fields.include?(fld)
132
+ else
133
+ @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
134
+ end
135
+ end
136
+ diff_fields
137
+ end
138
+
139
+
140
+ include Algorithm
141
+
142
+ end
@@ -0,0 +1,151 @@
1
+ class CSVDiff
2
+
3
+ # Represents a CSV input (i.e. the left/from or right/to input) to the diff
4
+ # process.
5
+ class CSVSource
6
+
7
+ # @return [String] the path to the source file
8
+ attr_accessor :path
9
+ # @return [Array<String>] The names of the fields in the source file
10
+ attr_reader :field_names
11
+ # @return [Array<String>] The names of the field(s) that uniquely
12
+ # identify each row.
13
+ attr_reader :key_fields
14
+ # @return [Array<String>] The names of the field(s) that identify a
15
+ # common parent of child records.
16
+ attr_reader :parent_fields
17
+ # @return [Array<String>] The names of the field(s) that distinguish a
18
+ # child of a parent record.
19
+ attr_reader :child_fields
20
+ # @return [Hash<String,Hash>] A hash containing each line of the source,
21
+ # keyed on the values of the +key_fields+.
22
+ attr_reader :lines
23
+ # @return [Hash<String,Array<String>>] A hash containing each parent key,
24
+ # and an Array of the child keys it is a parent of.
25
+ attr_reader :index
26
+ # @return [Array<String>] An array of any warnings encountered while
27
+ # processing the source.
28
+ attr_reader :warnings
29
+
30
+
31
+ # Creates a new diff source.
32
+ #
33
+ # A diff source must contain at least one field that will be used as the
34
+ # key to identify the same record in a different version of this file.
35
+ # If not specified via one of the options, the first field is assumed to
36
+ # be the unique key.
37
+ #
38
+ # If multiple fields combine to form a unique key, the parent is assumed
39
+ # to be identified by all but the last field of the unique key. If finer
40
+ # control is required, use a combination of the :parent_fields and
41
+ # :child_fields options.
42
+ #
43
+ # All key options can be specified either by field name, or by field
44
+ # index (0 based).
45
+ #
46
+ # @param source [String|Array<Array>] Either a path to a CSV file, or an
47
+ # Array of Arrays containing CSV data. If the :field_names option is
48
+ # not specified, the first line must contain the names of the fields.
49
+ # @param options [Hash] An options hash.
50
+ # @option options [String] :mode_string The mode to use when opening the
51
+ # CSV file. Defaults to 'r'.
52
+ # @option options [Hash] :csv_options Any options you wish to pass to
53
+ # CSV.open, e.g. :col_sep.
54
+ # @option options [Array<String>] :field_names The names of each of the
55
+ # fields in +source+.
56
+ # @option options [Boolean] :ignore_header If true, and :field_names has
57
+ # been specified, then the first row of the file is ignored.
58
+ # @option options [String] :key_field The name of the field that uniquely
59
+ # identifies each row.
60
+ # @option options [Array<String>] :key_fields The names of the fields
61
+ # that uniquely identify each row.
62
+ # @option options [String] :parent_field The name of the field that
63
+ # identifies a parent within which sibling order should be checked.
64
+ # @option options [String] :child_field The name of the field that
65
+ # uniquely identifies a child of a parent.
66
+ def initialize(source, options = {})
67
+ if source.is_a?(String)
68
+ require 'csv'
69
+ mode_string = options.fetch(:mode_string, 'r')
70
+ csv_options = options.fetch(:csv_options, {})
71
+ @path = source
72
+ source = CSV.open(@path, mode_string, **csv_options).readlines
73
+ end
74
+ if kf = options.fetch(:key_field, options[:key_fields])
75
+ @key_fields = [kf].flatten
76
+ @parent_fields = @key_fields[0...-1]
77
+ @child_fields = @key_fields[-1..-1]
78
+ else
79
+ @parent_fields = [options.fetch(:parent_field, options.fetch(:parent_fields, []))].flatten
80
+ @child_fields = [options.fetch(:child_field, options.fetch(:child_fields, [0]))].flatten
81
+ @key_fields = @parent_fields + @child_fields
82
+ end
83
+ @field_names = options[:field_names]
84
+ @warnings = []
85
+ index_source(source, options)
86
+ end
87
+
88
+
89
+ # Returns the row in the CSV source corresponding to the supplied key.
90
+ #
91
+ # @param key [String] The unique key to use to lookup the row.
92
+ # @return [Hash] The fields for the line corresponding to +key+, or nil
93
+ # if the key is not recognised.
94
+ def [](key)
95
+ @lines[key]
96
+ end
97
+
98
+
99
+ private
100
+
101
+ # Given an array of lines, where each line is an array of fields, indexes
102
+ # the array contents so that it can be looked up by key.
103
+ def index_source(lines, options)
104
+ @lines = {}
105
+ @index = Hash.new{ |h, k| h[k] = [] }
106
+ @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
107
+ line_num = 0
108
+ lines.each do |row|
109
+ line_num += 1
110
+ next if line_num == 1 && @field_names && options[:ignore_header]
111
+ unless @field_names
112
+ @field_names = row
113
+ @key_fields = find_field_indexes(@key_fields, @field_names)
114
+ next
115
+ end
116
+ field_vals = row
117
+ line = {}
118
+ @field_names.each_with_index do |field, i|
119
+ line[field] = field_vals[i]
120
+ end
121
+ key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
122
+ key = key_values.join('~')
123
+ parent_key = key_values[0...(@parent_fields.length)].join('~')
124
+ if @lines[key]
125
+ @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
126
+ else
127
+ @index[parent_key] << key
128
+ @lines[key] = line
129
+ end
130
+ end
131
+ end
132
+
133
+
134
+ # Converts an array of field names to an array of indexes of the fields
135
+ # matching those names.
136
+ def find_field_indexes(key_fields, field_names)
137
+ key_fields.map do |field|
138
+ if field.is_a?(Integer)
139
+ field
140
+ else
141
+ field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
142
+ raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
143
+ field_names.join(', ')}"
144
+ end
145
+ end
146
+ end
147
+
148
+ end
149
+
150
+ end
151
+
@@ -0,0 +1,2 @@
1
+ require 'csv-diff'
2
+
metadata ADDED
@@ -0,0 +1,62 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: csv-diff
3
+ version: !ruby/object:Gem::Version
4
+ version: '0.1'
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Adam Gardiner
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2014-05-30 00:00:00.000000000 Z
13
+ dependencies: []
14
+ description: ! " This library performs diffs of CSV files.\n\n Unlike
15
+ a standard diff that compares line by line, and is sensitive to the\n ordering
16
+ of records, CSV-Diff identifies common lines by key field(s), and\n then
17
+ compares the contents of the fields in each line.\n\n Data may be supplied
18
+ in the form of CSV files, or as an array of arrays. The\n diff process provides
19
+ a fine level of control over what to diff, and can\n optionally ignore certain
20
+ types of changes (e.g. changes in position).\n\n CSV-Diff is particularly
21
+ well suited to data in parent-child format. Parent-\n child data does not
22
+ lend itself well to standard text diffs, as small changes\n in the organisation
23
+ of the tree at an upper level can lead to big movements\n in the position
24
+ of descendant records. By instead matching records by key,\n CSV-Diff avoids
25
+ this issue, while still being able to detect changes in\n sibling order.\n"
26
+ email: adam.b.gardiner@gmail.com
27
+ executables: []
28
+ extensions: []
29
+ extra_rdoc_files: []
30
+ files:
31
+ - README.md
32
+ - LICENSE
33
+ - lib/csv-diff/algorithm.rb
34
+ - lib/csv-diff/csv_diff.rb
35
+ - lib/csv-diff/csv_source.rb
36
+ - lib/csv-diff.rb
37
+ - lib/csv_diff.rb
38
+ homepage: https://github.com/agardiner/csv-diff
39
+ licenses: []
40
+ post_install_message:
41
+ rdoc_options: []
42
+ require_paths:
43
+ - lib
44
+ required_ruby_version: !ruby/object:Gem::Requirement
45
+ none: false
46
+ requirements:
47
+ - - ! '>='
48
+ - !ruby/object:Gem::Version
49
+ version: '0'
50
+ required_rubygems_version: !ruby/object:Gem::Requirement
51
+ none: false
52
+ requirements:
53
+ - - ! '>='
54
+ - !ruby/object:Gem::Version
55
+ version: '0'
56
+ requirements: []
57
+ rubyforge_project:
58
+ rubygems_version: 1.8.23
59
+ signing_key:
60
+ specification_version: 3
61
+ summary: CSV Diff is a library for generating diffs from data in CSV format
62
+ test_files: []