csv-diff 0.1

data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013, Adam Gardiner
2
+ All rights reserved.
3
+
4
+ Redistribution and use in source and binary forms, with or without
5
+ modification, are permitted provided that the following conditions are met:
6
+
7
+ * Redistributions of source code must retain the above copyright notice, this
8
+ list of conditions and the following disclaimer.
9
+ * Redistributions in binary form must reproduce the above copyright notice,
10
+ this list of conditions and the following disclaimer in the documentation
11
+ and/or other materials provided with the distribution.
12
+
13
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
14
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
16
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
17
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
19
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
20
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
21
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
22
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,222 @@
1
+ # CSV-Diff
2
+
3
+ CSV-Diff is a small library for performing diffs of CSV data.
4
+
5
+ Unlike a standard diff that compares line by line, and is sensitive to the
6
+ ordering of records, CSV-Diff identifies common lines by key field(s), and
7
+ then compares the contents of the fields in each line.
8
+
9
+ Data may be supplied in the form of CSV files, or as an array of arrays. The
10
+ diff process provides a fine level of control over what to diff, and can
11
+ optionally ignore certain types of changes (e.g. changes in position).
12
+
13
+ CSV-Diff is particularly well suited to data in parent-child format. Parent-
14
+ child data does not lend itself well to standard text diffs, as small changes
15
+ in the organisation of the tree at an upper level can lead to big movements
16
+ in the position of descendant records. By instead matching records by key,
17
+ CSV-Diff avoids this issue, while still being able to detect changes in
18
+ sibling order.
19
+
20
+
21
+ ## Usage
22
+
23
+ CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
24
+ ```
25
+ gem install csv-diff
26
+ ```
27
+
28
+ To compare two CSV files where the field names are in the first row of the file,
29
+ and the first field contains the unique key for each record, simply use:
30
+ ```ruby
31
+ require 'csv-diff'
32
+
33
+ diff = CSVDiff.new(file1, file2)
34
+ ```
35
+
36
+ The returned diff object can be queried for the differences that exist between
37
+ the two files, e.g.:
38
+ ```ruby
39
+ puts diff.summary.inspect # Summary of the adds, deletes, updates, and moves
40
+ puts diff.adds.inspect # Details of the additions to file2
41
+ puts diff.deletes.inspect # Details of the deletions from file1
42
+ puts diff.updates.inspect # Details of the updates from file1 to file2
43
+ puts diff.moves.inspect # Details of the moves from file1 to file2
44
+ puts diff.diffs.inspect # Details of all differences
45
+ puts diff.warnings.inspect # Any warnings generated during the diff process
46
+ ```
47
+
48
+ ## Unique Row Identifiers
49
+
50
+ CSVDiff is preferable to a standard line-by-line diff when row order is
51
+ significantly impacted by small changes. The classic example is a parent-child
52
+ file generated by a hierarchy traversal. A simple change in position of a parent
53
+ member near the root of the hierarchy will have a large impact on the positions
54
+ of all descendant rows. Consider the following example:
55
+ ```
56
+ Root
57
+ |- A
58
+ | |- A1
59
+ | |- A2
60
+ |
61
+ |- B
62
+ |- B1
63
+ |- B2
64
+ ```
65
+
66
+ A hierarchy traversal of this tree into a parent-child format would generate a CSV
67
+ as follows:
68
+ ```
69
+ Root,A
70
+ A,A1
71
+ A,A2
72
+ Root,B
73
+ B,B1
74
+ B,B2
75
+ ```
76
+
77
+ If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV
78
+ as follows:
79
+ ```
80
+ Root,B
81
+ B,B1
82
+ B,B2
83
+ Root,A
84
+ A,A1
85
+ A,A2
86
+ ```
87
+
88
+ A simple diff using a diff utility would highlight this as 3 additions and 3 deletions.
89
+ CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).
90
+
91
+ In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row.
92
+ In this example, we could use the child field alone (since each member name only appears
93
+ once); however, this would imply a flat structure, where all rows are children of a single
94
+ parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a
95
+ change in order) of all 6 rows.
96
+
97
+ The more correct specification of this file is that column 0 contains a unique parent
98
+ identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
99
+ deduce that there are in fact only two changes in order - the swap in positions of A and
100
+ B below Root.
101
+
102
+ Note: If you aren't interested in changes in the order of siblings, then you could use
103
+ CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.
104
+
105
+ ## Warnings
106
+
107
+ When processing and diffing files, CSVDiff may encounter problems with the data or
108
+ the specifications it has been given. It will continue even in the face of problems,
109
+ but will log details of the problems in a #warnings Array. The number of warnings
110
+ will also be included in the Hash returned by the #summary method.
111
+
112
+ Warnings may be raised for any of the following:
113
+ * Missing fields: If the right/to file contains fields that are not present in the
114
+ left/from file, a warning is raised and the field is ignored for diff purposes.
115
+ * Duplicate keys: If two rows are found that have the same values for the key field(s),
116
+ a warning is raised, and the duplicate values are ignored.
117
+
118
+
119
+ ## Examples
120
+
121
+ The simplest use case is as shown above, where the data to be diffed is in CSV files
122
+ with the column names as the first record, and where the unique key is the first
123
+ column in the data. In this case, a diff can be created simply via:
124
+ ```ruby
125
+ diff = CSVDiff.new(file1, file2)
126
+ ```
127
+
128
+ ### Specifying Unique Row Identifiers
129
+
130
+ Often, however, rows are not uniquely identifiable via the first column in the file.
131
+ In a parent-child hierarchy, for example, combinations of parent and child may be
132
+ necessary to uniquely identify a row. In these cases, it is necessary to indicate
133
+ which fields are used to uniquely identify common rows across the two files. This
134
+ can be done in several different ways.
135
+
136
+ 1. Using the :key_fields option with field numbers (these are 0-based):
137
+
138
+ ```ruby
139
+ diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
140
+ ```
141
+
142
+ 2. Using the :key_fields option with column names:
143
+
144
+ ```ruby
145
+ diff = CSVDiff.new(file1, file2, key_fields: ['Parent', 'Child'])
146
+ ```
147
+
148
+ 3. Using the :parent_field and :child_fields options with field numbers:
149
+
150
+ ```ruby
151
+ diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
152
+ ```
153
+
154
+ 4. Using the :parent_field and :child_fields options with column names:
155
+
156
+ ```ruby
157
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
158
+ ```
159
+
160
+ ### Using Non-CSV File Sources
161
+
162
+ Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
163
+ of Arrays:
164
+ ```ruby
165
+ DATA1 = [
166
+ ['Parent', 'Child', 'Description'],
167
+ ['A', 'A1', 'Account 1'],
168
+ ['A', 'A2', 'Account 2']
169
+ ]
170
+
171
+ DATA2 = [
172
+ ['Parent', 'Child', 'Description'],
173
+ ['A', 'A1', 'Account1'],
174
+ ['A', 'A2', 'Account2']
175
+ ]
176
+
177
+ diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0])
178
+ ```
179
+
180
+ ### Specifying Column Names
181
+
182
+ If your data file does not include column headers, you can specify the names of
183
+ each column when creating the diff. The names supplied are the keys used in the
184
+ diff results:
185
+
186
+ ```ruby
187
+ DATA1 = [
188
+ ['A', 'A1', 'Account 1'],
189
+ ['A', 'A2', 'Account 2']
190
+ ]
191
+
192
+ DATA2 = [
193
+ ['A', 'A1', 'Account1'],
194
+ ['A', 'A2', 'Account2']
195
+ ]
196
+
197
+ diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])
198
+ ```
199
+
200
+ If your data file does contain a header row, but you wish to use your own column
201
+ names, you can specify the :field_names option and the :ignore_header option to
202
+ ignore the first row.
203
+
204
+
205
+ ### Ignoring Fields
206
+
207
+ If your data contains fields that you aren't interested in, these can be excluded
208
+ from the diff process using the :ignore_fields option:
209
+ ```ruby
210
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
211
+ ignore_fields: ['CreatedAt', 'UpdatedAt'])
212
+ ```
213
+
214
+ ### Ignoring Certain Changes
215
+
216
+ CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
217
+ ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:
218
+ ```ruby
219
+ diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
220
+ ignore_fields: ['CreatedAt', 'UpdatedAt'],
221
+ ignore_deletes: true, ignore_moves: true)
222
+ ```
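The README's A/B swap example can be reproduced without the gem. The following is a minimal sketch of the idea only, assuming two-column parent/child rows; `siblings_by_parent` and `sibling_moves` are hypothetical names used for illustration and are not part of csv-diff. Rows are grouped by parent, and a child counts as moved when its position among the siblings common to both versions differs.

```ruby
# Hypothetical helpers for illustration; they are not part of csv-diff.

# Groups child names by parent, preserving row order.
def siblings_by_parent(rows)
  rows.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(parent, child), index|
    index[parent] << child
  end
end

# Returns the [parent, child] pairs whose position relative to the siblings
# present in both versions has changed.
def sibling_moves(left_rows, right_rows)
  left = siblings_by_parent(left_rows)
  right = siblings_by_parent(right_rows)
  right.flat_map do |parent, right_children|
    left_children = left.fetch(parent, [])
    left_common = left_children & right_children
    right_common = right_children & left_children
    right_common.each_with_index
                .reject { |child, pos| left_common.index(child) == pos }
                .map { |child, _pos| [parent, child] }
  end
end

before = [%w[Root A], %w[A A1], %w[A A2], %w[Root B], %w[B B1], %w[B B2]]
after  = [%w[Root B], %w[B B1], %w[B B2], %w[Root A], %w[A A1], %w[A A2]]

p sibling_moves(before, after) # prints [["Root", "B"], ["Root", "A"]]
```

Only the two children of Root are reported, because the order of A1/A2 and B1/B2 within their own parents is unchanged. A line-based diff of the same input would instead report every shifted row.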
@@ -0,0 +1,4 @@
1
+ require 'csv-diff/csv_source'
2
+ require 'csv-diff/algorithm'
3
+ require 'csv-diff/csv_diff'
4
+
@@ -0,0 +1,124 @@
1
+ class CSVDiff
2
+
3
+ # Implements the CSV diff algorithm.
4
+ module Algorithm
5
+
6
+ # Diffs two CSVSource structures.
7
+ #
8
+ # @param left [CSVSource] A CSVSource object containing the contents of
9
+ # the left/from input.
10
+ # @param right [CSVSource] A CSVSource object containing the contents of
11
+ # the right/to input.
12
+ # @param key_fields [Array] An array containing the names of the field(s)
13
+ # that uniquely identify each row.
14
+ # @param diff_fields [Array] An array containing the names of the fields
15
+ # to be diff-ed.
16
+ def diff_sources(left, right, key_fields, diff_fields, options = {})
17
+ left_index = left.index
18
+ left_values = left.lines
19
+ left_keys = left_values.keys
20
+ right_index = right.index
21
+ right_values = right.lines
22
+ right_keys = right_values.keys
23
+ parent_fields = left.parent_fields.length
24
+
25
+ include_adds = !options[:ignore_adds]
26
+ include_moves = !options[:ignore_moves]
27
+ include_updates = !options[:ignore_updates]
28
+ include_deletes = !options[:ignore_deletes]
29
+
30
+ diffs = Hash.new{ |h, k| h[k] = {} }
31
+ right_keys.each_with_index do |key, right_row_id|
32
+ key_vals = key.split('~')
33
+ parent = key_vals[0...parent_fields].join('~')
34
+ child = key_vals[parent_fields..-1].join('~')
35
+ left_parent = left_index[parent]
36
+ right_parent = right_index[parent]
37
+ left_value = left_values[key]
38
+ right_value = right_values[key]
39
+ left_idx = left_parent && left_parent.index(key)
40
+ right_idx = right_parent && right_parent.index(key)
41
+
42
+ id = {}
43
+ id[:row] = right_row_id + 1
44
+ id[:sibling_position] = right_idx + 1
45
+ key_fields.each do |field_name|
46
+ id[field_name] = right_value[field_name]
47
+ end
48
+ if left_idx && right_idx
49
+ if include_moves
50
+ left_common = left_parent & right_parent
51
+ right_common = right_parent & left_parent
52
+ left_pos = left_common.index(key)
53
+ right_pos = right_common.index(key)
54
+ if left_pos != right_pos
55
+ # Move
56
+ diffs[key].merge!(id.merge!(:action => 'Move',
57
+ :sibling_position => [left_idx + 1, right_idx + 1]))
58
+ #puts "Move #{left_idx} -> #{right_idx}: #{key}"
59
+ end
60
+ end
61
+ if include_updates && (changes = diff_row(left_values[key], right_values[key], diff_fields))
62
+ diffs[key].merge!(id.merge(changes.merge(:action => 'Update')))
63
+ #puts "Change: #{key}"
64
+ end
65
+ elsif include_adds && right_idx
66
+ # Add
67
+ diffs[key].merge!(id.merge(right_values[key].merge(:action => 'Add')))
68
+ #puts "Add: #{key}"
69
+ end
70
+ end
71
+
72
+ # Now identify deletions
73
+ if include_deletes
74
+ (left_keys - right_keys).each do |key|
75
+ # Delete
76
+ key_vals = key.split('~')
77
+ parent = key_vals[0...parent_fields].join('~')
78
+ child = key_vals[parent_fields..-1].join('~')
79
+ left_parent = left_index[parent]
80
+ left_value = left_values[key]
81
+ left_idx = left_parent.index(key)
82
+ next unless left_idx
83
+ id = {}
84
+ id[:row] = left_keys.index(key) + 1
85
+ id[:sibling_position] = left_idx + 1
86
+ key_fields.each do |field_name|
87
+ id[field_name] = left_value[field_name]
88
+ end
89
+ diffs[key].merge!(id.merge(left_values[key].merge(:action => 'Delete')))
90
+ #puts "Delete: #{key}"
91
+ end
92
+ end
93
+ diffs
94
+ end
95
+
96
+
97
+ # Identifies the fields that are different between two versions of the
98
+ # same row.
99
+ #
100
+ # @param left_row [Hash] The version of the CSV row from the left/from
101
+ # file.
102
+ # @param right_row [Hash] The version of the CSV row from the right/to
103
+ # file.
104
+ # @return [Hash<String, Array>] A Hash whose keys are the fields that
105
+ # contain differences, and whose values are a two-element array of
106
+ # [left/from, right/to] values.
107
+ def diff_row(left_row, right_row, fields)
108
+ diffs = {}
109
+ fields.each do |attr|
110
+ right_val = right_row[attr]
111
+ right_val = nil if right_val == ""
112
+ left_val = left_row[attr]
113
+ left_val = nil if left_val == ""
114
+ if left_val != right_val
115
+ diffs[attr] = [left_val, right_val]
116
+ #puts "#{attr}: #{left_val} -> #{right_val}"
117
+ end
118
+ end
119
+ diffs if diffs.size > 0
120
+ end
121
+
122
+ end
123
+
124
+ end
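One subtlety in `diff_sources` is worth spelling out. Reading the hunk above, a row that is both moved and updated appears to be merged into the same `diffs[key]` entry twice, first with `:action => 'Move'` and then with `:action => 'Update'`. The sketch below replays those two merge steps on plain hashes; `merged_entry` is a hypothetical helper and the sample values are made up for illustration.

```ruby
# Hypothetical helper that replays the two merge steps on plain hashes.
def merged_entry(id, move, changes)
  entry = {}
  # Step 1: a move is recorded first; merge! also mutates id itself.
  entry.merge!(id.merge!(move)) if move
  # Step 2: an update is merged on top, replacing the :action value.
  entry.merge!(id.merge(changes.merge(action: 'Update'))) if changes
  entry
end

id = { row: 2, sibling_position: 2 }
move = { action: 'Move', sibling_position: [1, 2] }
changes = { 'Description' => ['Account 1', 'Account1'] }

p merged_entry(id, move, changes)
```

Under these assumptions the resulting entry carries `action: 'Update'` while keeping the `[from, to]` pair in `:sibling_position`. If that reading is right, such a row would be counted under 'Update' only in the summary, and would not be returned by `moves`.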
@@ -0,0 +1,142 @@
1
+ # This library performs diffs of flat file content that contains structured data
2
+ # in fields, with rows provided in a parent-child format.
3
+ #
4
+ # Parent-child data does not lend itself well to standard text diffs, as small
5
+ # changes in the organisation of the tree at an upper level (e.g. re-ordering of
6
+ # two ancestor nodes) can lead to big movements in the position of descendant
7
+ # records - particularly when the parent-child data is generated by a hierarchy
8
+ # traversal.
9
+ #
10
+ # Additionally, simple line-based diffs can identify that a line has changed,
11
+ # but not which field(s) in the line have changed.
12
+ #
13
+ # Data may be supplied in the form of CSV files, or as an array of arrays. The
14
+ # diff process provides a fine level of control over what to diff, and
15
+ # can optionally ignore certain types of changes (e.g. changes in order).
16
+ class CSVDiff
17
+
18
+ # @return [CSVSource] CSVSource object containing details of the left/from
19
+ # input.
20
+ attr_reader :left
21
+ alias_method :from, :left
22
+ # @return [CSVSource] CSVSource object containing details of the right/to
23
+ # input.
24
+ attr_reader :right
25
+ alias_method :to, :right
26
+ # @return [Hash<String, Hash>] A hash of differences, keyed by row key
27
+ attr_reader :diffs
28
+ # @return [Array<String>] An array of field names that are compared in the
29
+ # diff process.
30
+ attr_reader :diff_fields
31
+ # @return [Array<String>] An array of field names identifying the key
32
+ # fields that uniquely identify each row.
33
+ attr_reader :key_fields
34
+ # @return [Array<String>] An array of field names for the parent field(s).
35
+ attr_reader :parent_fields
36
+ # @return [Array<String>] An array of field names for the child field(s).
37
+ attr_reader :child_fields
38
+
39
+
40
+ # Generates a diff between two hierarchical tree structures, provided
41
+ # as +left+ and +right+, each of which consists of an array of lines in CSV
42
+ # format.
43
+ # An array of field indexes can also be specified as +key_fields+;
44
+ # a minimum of one field index must be specified; the last index is the
45
+ # child id, and the remaining fields (if any) are the parent field(s) that
46
+ # uniquely qualify the child instance.
47
+ #
48
+ # @param left [Array<Array<String>>] An Array of lines, each of which is in
49
+ # turn an Array containing fields.
50
+ # @param right [Array<Array<String>>] An Array of lines, each of which is in
51
+ # turn an Array containing fields.
52
+ # @param options [Hash] A hash containing options.
53
+ # @option options [Array<String>] :field_names An Array of field names for
54
+ # each field in +left+ and +right+. If not provided, the first row is
55
+ # assumed to contain field names.
56
+ # @option options [Boolean] :ignore_header If true, the first line of each
57
+ # file is ignored. This option can only be true if :field_names is
58
+ # specified.
59
+ # @option options [Array] :ignore_fields The names of any fields to be
60
+ # ignored when performing the diff.
61
+ # @option options [String] :key_field The name of the field that uniquely
62
+ # identifies each row.
63
+ # @option options [Array<String>] :key_fields The names of the fields
64
+ # that uniquely identify each row.
65
+ # @option options [String] :parent_field The name of the field that
66
+ # identifies a parent within which sibling order should be checked.
67
+ # @option options [String] :child_field The name of the field that
68
+ # uniquely identifies a child of a parent.
69
+ # @option options [Boolean] :ignore_adds If true, records that appear in
70
+ # the right/to file but not in the left/from file are not reported.
71
+ # @option options [Boolean] :ignore_updates If true, records that have been
72
+ # updated are not reported.
73
+ # @option options [Boolean] :ignore_moves If true, changes in row position
74
+ # amongst sibling rows are not reported.
75
+ # @option options [Boolean] :ignore_deletes If true, records that appear
76
+ # in the left/from file but not in the right/to file are not reported.
77
+ def initialize(left, right, options = {})
78
+ @left = CSVSource.new(left, options)
79
+ raise "No field names found in left (from) source" unless @left.field_names && @left.field_names.size > 0
80
+ @right = CSVSource.new(right, options)
81
+ raise "No field names found in right (to) source" unless @right.field_names && @right.field_names.size > 0
82
+ @warnings = []
83
+ @diff_fields = get_diff_fields(@left.field_names, @right.field_names, options.fetch(:ignore_fields, []))
84
+ @key_fields = @left.key_fields.map{ |kf| @left.field_names[kf] }
85
+ diff(options)
86
+ end
87
+
88
+
89
+ # Performs a diff with the specified +options+.
90
+ def diff(options = {})
91
+ @summary = nil
92
+ @diffs = diff_sources(@left, @right, @key_fields, @diff_fields, options)
93
+ end
94
+
95
+
96
+ # Returns a summary of the number of adds, deletes, moves, and updates.
97
+ def summary
98
+ unless @summary
99
+ @summary = Hash.new{ |h, k| h[k] = 0 }
100
+ @diffs.each{ |k, v| @summary[v[:action]] += 1 }
101
+ @summary['Warnings'] = warnings.size if warnings.size > 0
102
+ end
103
+ @summary
104
+ end
105
+
106
+
107
+ [:adds, :deletes, :updates, :moves].each do |mthd|
108
+ define_method mthd do
109
+ action = mthd.to_s.chomp('s')
110
+ @diffs.select{ |k, v| v[:action].downcase == action }
111
+ end
112
+ end
113
+
114
+
115
+ # @return [Array<String>] an array of warning messages generated during the
116
+ # diff process.
117
+ def warnings
118
+ @left.warnings + @right.warnings + @warnings
119
+ end
120
+
121
+
122
+ private
123
+
124
+
125
+ # Given two sets of field names, determines the common set of fields present
126
+ # in both, on which members can be diffed.
127
+ def get_diff_fields(left_fields, right_fields, ignore_fields)
128
+ diff_fields = []
129
+ right_fields.each do |fld|
130
+ if left_fields.include?(fld)
131
+ diff_fields << fld unless ignore_fields.include?(fld)
132
+ else
133
+ @warnings << "Field '#{fld}' is missing from the left (from) file, and won't be diffed"
134
+ end
135
+ end
136
+ diff_fields
137
+ end
138
+
139
+
140
+ include Algorithm
141
+
142
+ end
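The `summary` method and the generated `adds`/`deletes`/`updates`/`moves` accessors both work off the `:action` value stored in each diff entry. The sketch below shows the same tallying and filtering on a hand-written diffs hash; `summarize` and `select_action` are hypothetical stand-ins, and the sample entries are invented for illustration rather than produced by the gem.

```ruby
# Hypothetical stand-ins for CSVDiff#summary and the generated accessors.

# Counts diff entries per action, starting each counter at zero.
def summarize(diffs)
  diffs.each_with_object(Hash.new(0)) do |(_key, diff), counts|
    counts[diff[:action]] += 1
  end
end

# Selects the entries for one accessor name, e.g. :adds or :moves.
def select_action(diffs, accessor)
  action = accessor.to_s.chomp('s') # :adds -> "add", :moves -> "move"
  diffs.select { |_key, diff| diff[:action].downcase == action }
end

diffs = {
  'A~A1' => { action: 'Update', 'Description' => ['Account 1', 'Account1'] },
  'A~A3' => { action: 'Add', 'Description' => 'Account 3' },
  'B~B1' => { action: 'Delete', 'Description' => 'Budget 1' }
}

p summarize(diffs)
p select_action(diffs, :adds).keys
```

Deriving the action name from the accessor name keeps the four accessors in one loop, at the cost of relying on each action's plural being formed by appending a single "s".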
@@ -0,0 +1,151 @@
1
+ class CSVDiff
2
+
3
+ # Represents a CSV input (i.e. the left/from or right/to input) to the diff
4
+ # process.
5
+ class CSVSource
6
+
7
+ # @return [String] the path to the source file
8
+ attr_accessor :path
9
+ # @return [Array<String>] The names of the fields in the source file
10
+ attr_reader :field_names
11
+ # @return [Array<String>] The names of the field(s) that uniquely
12
+ # identify each row.
13
+ attr_reader :key_fields
14
+ # @return [Array<String>] The names of the field(s) that identify a
15
+ # common parent of child records.
16
+ attr_reader :parent_fields
17
+ # @return [Array<String>] The names of the field(s) that distinguish a
18
+ # child of a parent record.
19
+ attr_reader :child_fields
20
+ # @return [Hash<String,Hash>] A hash containing each line of the source,
21
+ # keyed on the values of the +key_fields+.
22
+ attr_reader :lines
23
+ # @return [Hash<String,Array<String>>] A hash containing each parent key,
24
+ # and an Array of the child keys it is a parent of.
25
+ attr_reader :index
26
+ # @return [Array<String>] An array of any warnings encountered while
27
+ # processing the source.
28
+ attr_reader :warnings
29
+
30
+
31
+ # Creates a new diff source.
32
+ #
33
+ # A diff source must contain at least one field that will be used as the
34
+ # key to identify the same record in a different version of this file.
35
+ # If not specified via one of the options, the first field is assumed to
36
+ # be the unique key.
37
+ #
38
+ # If multiple fields combine to form a unique key, the parent is assumed
39
+ # to be identified by all but the last field of the unique key. If finer
40
+ # control is required, use a combination of the :parent_fields and
41
+ # :child_fields options.
42
+ #
43
+ # All key options can be specified either by field name, or by field
44
+ # index (0 based).
45
+ #
46
+ # @param source [String|Array<Array>] Either a path to a CSV file, or an
47
+ # Array of Arrays containing CSV data. If the :field_names option is
48
+ # not specified, the first line must contain the names of the fields.
49
+ # @param options [Hash] An options hash.
50
+ # @option options [String] :mode_string The mode to use when opening the
51
+ # CSV file. Defaults to 'r'.
52
+ # @option options [Hash] :csv_options Any options you wish to pass to
53
+ # CSV.open, e.g. :col_sep.
54
+ # @option options [Array<String>] :field_names The names of each of the
55
+ # fields in +source+.
56
+ # @option options [Boolean] :ignore_header If true, and :field_names has
57
+ # been specified, then the first row of the file is ignored.
58
+ # @option options [String] :key_field The name of the field that uniquely
59
+ # identifies each row.
60
+ # @option options [Array<String>] :key_fields The names of the fields
61
+ # that uniquely identify each row.
62
+ # @option options [String] :parent_field The name of the field that
63
+ # identifies a parent within which sibling order should be checked.
64
+ # @option options [String] :child_field The name of the field that
65
+ # uniquely identifies a child of a parent.
66
+ def initialize(source, options = {})
67
+ if source.is_a?(String)
68
+ require 'csv'
69
+ mode_string = options.fetch(:mode_string, 'r')
70
+ csv_options = options.fetch(:csv_options, {})
71
+ @path = source
72
+ source = CSV.open(@path, mode_string, **csv_options) { |csv| csv.readlines }
73
+ end
74
+ if kf = options.fetch(:key_field, options[:key_fields])
75
+ @key_fields = [kf].flatten
76
+ @parent_fields = @key_fields[0...-1]
77
+ @child_fields = @key_fields[-1..-1]
78
+ else
79
+ @parent_fields = [options.fetch(:parent_field, options.fetch(:parent_fields, []))].flatten
80
+ @child_fields = [options.fetch(:child_field, options.fetch(:child_fields, [0]))].flatten
81
+ @key_fields = @parent_fields + @child_fields
82
+ end
83
+ @field_names = options[:field_names]
84
+ @warnings = []
85
+ index_source(source, options)
86
+ end
87
+
88
+
89
+ # Returns the row in the CSV source corresponding to the supplied key.
90
+ #
91
+ # @param key [String] The unique key to use to lookup the row.
92
+ # @return [Hash] The fields for the line corresponding to +key+, or nil
93
+ # if the key is not recognised.
94
+ def [](key)
95
+ @lines[key]
96
+ end
97
+
98
+
99
+ private
100
+
101
+ # Given an array of lines, where each line is an array of fields, indexes
102
+ # the array contents so that it can be looked up by key.
103
+ def index_source(lines, options)
104
+ @lines = {}
105
+ @index = Hash.new{ |h, k| h[k] = [] }
106
+ @key_fields = find_field_indexes(@key_fields, @field_names) if @field_names
107
+ line_num = 0
108
+ lines.each do |row|
109
+ line_num += 1
110
+ next if line_num == 1 && @field_names && options[:ignore_header]
111
+ unless @field_names
112
+ @field_names = row
113
+ @key_fields = find_field_indexes(@key_fields, @field_names)
114
+ next
115
+ end
116
+ field_vals = row
117
+ line = {}
118
+ @field_names.each_with_index do |field, i|
119
+ line[field] = field_vals[i]
120
+ end
121
+ key_values = @key_fields.map{ |kf| field_vals[kf].to_s.upcase }
122
+ key = key_values.join('~')
123
+ parent_key = key_values[0...(@parent_fields.length)].join('~')
124
+ if @lines[key]
125
+ @warnings << "Duplicate key '#{key}' encountered and ignored at line #{line_num}"
126
+ else
127
+ @index[parent_key] << key
128
+ @lines[key] = line
129
+ end
130
+ end
131
+ end
132
+
133
+
134
+ # Converts an array of field names to an array of indexes of the fields
135
+ # matching those names.
136
+ def find_field_indexes(key_fields, field_names)
137
+ key_fields.map do |field|
138
+ if field.is_a?(Integer)
139
+ field
140
+ else
141
+ field_names.index{ |field_name| field.to_s.downcase == field_name.downcase } or
142
+ raise ArgumentError, "Could not locate field '#{field}' in source field names: #{
143
+ field_names.join(', ')}"
144
+ end
145
+ end
146
+ end
147
+
148
+ end
149
+
150
+ end
151
+
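The way `CSVSource` turns key options into row keys can be sketched in isolation. This is an approximation of the logic in the hunk above, not the library's code: `resolve_indexes` and `row_keys` are hypothetical names. Field references are resolved to 0-based indexes (names are matched case-insensitively), key values are upper-cased and joined with `~`, and the parent key is the leading part of the row key.

```ruby
# Hypothetical helpers approximating how CSVSource builds its keys.

# Resolves a mix of field names and 0-based indexes to indexes.
def resolve_indexes(key_fields, field_names)
  key_fields.map do |field|
    next field if field.is_a?(Integer)
    field_names.index { |name| name.to_s.downcase == field.to_s.downcase } or
      raise ArgumentError, "Could not locate field '#{field}'"
  end
end

# Returns [row_key, parent_key] for one row, given the key indexes and the
# number of leading key fields that identify the parent.
def row_keys(row, key_indexes, parent_count)
  values = key_indexes.map { |i| row[i].to_s.upcase }
  [values.join('~'), values[0...parent_count].join('~')]
end

field_names = %w[Parent Child Description]
indexes = resolve_indexes(['parent', 1], field_names)

p indexes
p row_keys(['a', 'a1', 'Account 1'], indexes, 1)
```

Because keys are upper-cased, rows whose key values differ only in case are treated as the same row. Since `diff_sources` later splits the key on `~`, key values that themselves contain `~` would presumably be split at the wrong place, so this scheme implicitly assumes that character does not occur in key fields.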
@@ -0,0 +1,2 @@
1
+ require 'csv-diff'
2
+
metadata ADDED
@@ -0,0 +1,62 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: csv-diff
3
+ version: !ruby/object:Gem::Version
4
+ version: '0.1'
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Adam Gardiner
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2014-05-30 00:00:00.000000000 Z
13
+ dependencies: []
14
+ description: ! " This library performs diffs of CSV files.\n\n Unlike
15
+ a standard diff that compares line by line, and is sensitive to the\n ordering
16
+ of records, CSV-Diff identifies common lines by key field(s), and\n then
17
+ compares the contents of the fields in each line.\n\n Data may be supplied
18
+ in the form of CSV files, or as an array of arrays. The\n diff process provides
19
+ a fine level of control over what to diff, and can\n optionally ignore certain
20
+ types of changes (e.g. changes in position).\n\n CSV-Diff is particularly
21
+ well suited to data in parent-child format. Parent-\n child data does not
22
+ lend itself well to standard text diffs, as small changes\n in the organisation
23
+ of the tree at an upper level can lead to big movements\n in the position
24
+ of descendant records. By instead matching records by key,\n CSV-Diff avoids
25
+ this issue, while still being able to detect changes in\n sibling order.\n"
26
+ email: adam.b.gardiner@gmail.com
27
+ executables: []
28
+ extensions: []
29
+ extra_rdoc_files: []
30
+ files:
31
+ - README.md
32
+ - LICENSE
33
+ - lib/csv-diff/algorithm.rb
34
+ - lib/csv-diff/csv_diff.rb
35
+ - lib/csv-diff/csv_source.rb
36
+ - lib/csv-diff.rb
37
+ - lib/csv_diff.rb
38
+ homepage: https://github.com/agardiner/csv-diff
39
+ licenses: []
40
+ post_install_message:
41
+ rdoc_options: []
42
+ require_paths:
43
+ - lib
44
+ required_ruby_version: !ruby/object:Gem::Requirement
45
+ none: false
46
+ requirements:
47
+ - - ! '>='
48
+ - !ruby/object:Gem::Version
49
+ version: '0'
50
+ required_rubygems_version: !ruby/object:Gem::Requirement
51
+ none: false
52
+ requirements:
53
+ - - ! '>='
54
+ - !ruby/object:Gem::Version
55
+ version: '0'
56
+ requirements: []
57
+ rubyforge_project:
58
+ rubygems_version: 1.8.23
59
+ signing_key:
60
+ specification_version: 3
61
+ summary: CSV Diff is a library for generating diffs from data in CSV format
62
+ test_files: []