idata 0.1.33 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: efc9c2bc3a3ee9b79d83c531abbcbc185360dda4
4
- data.tar.gz: 0fbca044baf3f5a454f5e55e3ff3b40fd8b480f8
3
+ metadata.gz: 26fd9bafa90c2af8eef61231c2b070c6a395bbcb
4
+ data.tar.gz: b1952edd955ea4eeb72d31f8c640f053d5515a17
5
5
  SHA512:
6
- metadata.gz: 59e4d3862a80d105fcc6bfa7973dd18c766fa93a921f4f3c9784398ed71187d2f259d05d9375d15e5dc723bdfe9e037bac861dcd677bd7a70a06b4a058cc3152
7
- data.tar.gz: a685200d759507941fc6800615cf3fc7191e055baecbf56bf4cc8c2f9bce929fc30cd18df76797d75b46a0c2ffa35dcb7cfc1c3a84800f455967cfaea1313693
6
+ metadata.gz: 79b8f98bad03bcfccea93cf2cb074271391def428288784059842435e4e6334928f1d5806c50707816a61621e38e4761cae761443c634bcf2ba0baa5187c3e61
7
+ data.tar.gz: 2513beafcacb28bc7d745e93ab6dd09e0f3b246c25bcee2d81b5bee92b66778b26feab7f43887790a284dfbc8e6906c3860d71082f530a78f1ff82e598085df7
data/README.md CHANGED
@@ -1,5 +1,5 @@
1
1
  # Overview
2
- We provide some useful utilities for validating data in a PostgreSQL data table
2
+ We provide some utilities for validating data in a PostgreSQL data table.
3
3
  These utilities can be used as simple terminal commands and can be installed by:
4
4
 
5
5
  gem install idata
@@ -13,7 +13,7 @@ idata comes along with the following commands:
13
13
  * imerge
14
14
  * isanitize
15
15
 
16
- Run a command with --help switch for the details
16
+ Run a command with the `--help` switch for details.
17
17
 
18
18
  Prerequisites:
19
19
  * PostgreSQL 9.0 or above
@@ -23,34 +23,34 @@ Prerequisites:
23
23
  # Usage
24
24
  Suppose we have an `items` table, and we want to validate its records against certain criteria like:
25
25
 
26
- * `item_id` must not be null
27
- * `item_title` must not be null
28
- * The composite `[item_id, item_title]` must be unique
29
- * One `item_id` corresponds to only ONE `item_title` (in other words, there must not be two items with different titles but with the same `item_id`)
26
+ * `vendor_code` must not be null
27
+ * `vendor_name` must not be null
28
+ * The composite `[vendor_code, vendor_name]` must be unique
29
+ * One `vendor_code` corresponds to only ONE `vendor_name` (in other words, there must not be two items with different `vendor_name` but with the same `vendor_code`)
30
30
  and vice-versa
31
31
  * `vendor_code` must reference the `code` column in the `vendors` table
32
32
 
33
33
  Then the validation command could be:
34
34
  ```
35
- ivalidate --host=localhost --user=postgres --database=mydb --table=items --log-to=validation_errors \
36
- --not-null="vendor_id" \
35
+ ivalidate --host=localhost --user=postgres --database=mydb --table=items \
36
+ --log-to=validation_errors \
37
+ --not-null="vendor_code" \
37
38
  --not-null="vendor_name" \
38
- --unique="vendor_id,vendor_name" \
39
- --consistent-by="vendor_id|vendor_name" \
40
- --consistent-by="vendor_id|vendor_name" \
39
+ --unique="vendor_code,vendor_name" \
40
+ --consistent-by="vendor_code|vendor_name" \
41
+ --consistent-by="vendor_name|vendor_code" \
41
42
  --cross-reference="vendor_code|vendors.code"
42
43
  ```
43
44
  Validation results for every single record are logged to an additional column named `validation_errors`
44
- of the `items` table, as specified by the `--log-to` switch
45
-
46
- As you can see, most common checks can be performed using the supported switches:
45
+ of the `items` table, as specified by the `--log-to` switch. As you can see, most common checks can be performed using the supported switches:
47
46
  ```
48
47
  --not-null
49
48
  --unique
50
49
  --consistent-by
51
50
  --cross-reference
52
51
  ```
53
- For more generic check, we support some other switches:
52
+ # Custom Validation
53
+ For more customized checks, we support some other switches.
54
54
 
55
55
  The `--match="field/pattern/"` switch tells the program to check if value of a `field` matches the provided `pattern` (which is a regular expression).
56
56
  For example:
@@ -61,22 +61,20 @@ For example:
61
61
  # Check if value of status is either 'A' or 'I' (any other value is not allowed)
62
62
  ivalidate --match="status/^(A|I)$/"
63
63
  ```
64
- In case you need even more customized validation other than the supported ones (match, unique, not-null, cross-reference...)
65
- then `--query` switch may be helpful. For example:
64
+ In case you need even more customized validation other than the supported ones (`match`, `unique`, `not-null`, `cross-reference`...)
65
+ then the `--query` switch may come in handy. For example:
66
66
  ```
67
- ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
67
+ ivalidate --query="start_date >= string_to_date('01/02/2014') -- invalid date"
68
68
  ```
69
69
  You can also use `--rquery` which is the reversed counterpart of `--query`
70
- For example, the following two checks are equivalent:
70
+ For example, the following two checks are equivalent; both mark any record whose `start_date` is earlier than `01/02/2014` as "invalid date":
71
71
  ```
72
- ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
73
- ivalidate --rquery="string_to_date(start_date) < '01/02/2014' -- invalid date"
72
+ ivalidate --query="start_date >= string_to_date('01/02/2014') -- invalid date"
73
+ ivalidate --rquery="start_date < string_to_date('01/02/2014') -- invalid date"
74
74
  ```
75
- (mark any record whose `start_date < '01/02/2014'` as "invalid date")
76
75
 
77
76
  Note: run `ivalidate --help` to see the full list of supported switches
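The relationship between `--query` and `--rquery` described above can be illustrated with a minimal Ruby sketch. It does not call `ivalidate`; the in-memory rows, the helper names `flag_with_query` and `flag_with_rquery`, and the reading of `01/02/2014` as January 2, 2014 are all assumptions made for illustration.

```ruby
require 'date'

# Hypothetical in-memory rows standing in for the `items` table.
ROWS = [
  { id: 1, start_date: Date.new(2013, 12, 31) },
  { id: 2, start_date: Date.new(2014, 1, 2) },
  { id: 3, start_date: Date.new(2014, 6, 15) }
].freeze

# Assumption: '01/02/2014' is read as month/day/year.
CUTOFF = Date.new(2014, 1, 2)

# --query style: the condition describes VALID rows, so rows failing it are flagged.
def flag_with_query(rows)
  rows.reject { |row| yield(row) }.map { |row| row[:id] }
end

# --rquery style: the condition describes INVALID rows, so rows matching it are flagged.
def flag_with_rquery(rows)
  rows.select { |row| yield(row) }.map { |row| row[:id] }
end

by_query  = flag_with_query(ROWS)  { |r| r[:start_date] >= CUTOFF }
by_rquery = flag_with_rquery(ROWS) { |r| r[:start_date] <  CUTOFF }

puts by_query.inspect   # ids flagged as "invalid date" by the --query form
puts by_rquery.inspect  # ids flagged as "invalid date" by the --rquery form
```

Both forms flag the same rows because each condition is the logical negation of the other; which one to use is mostly a matter of which reads more naturally for the check at hand.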
78
77
 
79
-
80
78
  # Put it all together
81
79
  You can put several `ivalidate` commands (for several data tables) in one single bash/sh file.
82
80
  Besides `ivalidate`, we also support some other utilities to:
@@ -84,6 +82,6 @@ Besides `ivalidate`, we also support some other utilities to:
84
82
  + Modify data tables
85
83
  + Generate summary reports
86
84
 
87
- For a full example, see our `sample.sh`
85
+ See our `sample.sh` for a comprehensive example
88
86
 
89
87
 
data/README2.md ADDED
@@ -0,0 +1,37 @@
1
+ ### Overview
2
+ A validation criteria file has the following general structure:
3
+ ```yaml
4
+ table:
5
+   - field: field1, field2, etc.
6
+     validations:
7
+       - rule:
8
+         code:
9
+         error:
10
+         impact:
11
+         solution:
12
+ ```
13
+
14
+ ##### Explanation:
15
+ + A `table` has one or more `field` entries
16
+ + A `field` has a `validations` section containing one or more rules
17
+ + Each rule contains `rule` (required) and the corresponding `code`, `error`, `solution`, `impact`, `priority` (optional)
18
+ + Rule: written in the format covered below
19
+ + The other fields are free text on a single line; use `\n` as the line break
20
+ + The value of `field` can be one field name or a set of several field names, separated by commas
21
+
22
+ ## Writing rules
23
+ Supported rules include:
24
+
25
+ | Rule | Description | Example |
26
+ | ---- | ----------- | ------- |
27
+ | `not null` | The value of the corresponding field must not be empty | |
28
+ | `unique` | The value of the corresponding field must be unique within the table | |
29
+ | `matches "/regexp/"` | The field value must satisfy the format defined by `regexp` | |
30
+ | `not matches "/regexp/"` | Reverse counterpart of `matches` | |
31
+ | `consistent by "ref"` | The value of the corresponding field must be consistent with `ref` | |
32
+ | `cross references "table.field"` | The field value must reference another field, `table.field` | |
33
+ | `custom query "query"` | Use a custom SQL `query` (for complex business cases that cannot be expressed with the other rules) | |
34
+ | `reverse query "query"` | Reverse counterpart of `custom query` | |
35
+
36
+ ## Others
37
+ TBD
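To make the file layout above concrete, here is a minimal Ruby sketch that loads a small config of that shape and splits each `rule` string into a type and its argument. The rule-name pattern and the quote-stripping step are copied from `ivalidate2` further down in this diff; the sample table, fields, and error text are hypothetical, and raising `ArgumentError` on an unknown rule is a choice made for this sketch rather than the script's behavior.

```ruby
require 'yaml'

# Hypothetical config following the structure described in README2.md.
CONFIG = <<~CONFIG_YAML
  items:
    - field: vendor_code
      validations:
        - rule: not null
          error: vendor_code is missing
        - rule: cross references "vendors.code"
    - field: vendor_code, vendor_name
      validations:
        - rule: unique
CONFIG_YAML

# Same pattern as SUPPORTED_RULES_REGEXP in ivalidate2.
RULE_PREFIX = /^\s*(not null|cross references|matches|not matches|custom query|reverse query|unique|consistent by)\s*/

# Split "cross references \"vendors.code\"" into ["cross references", "vendors.code"].
def parse_rule(rule)
  type = rule[RULE_PREFIX]
  raise ArgumentError, "Invalid rule: #{rule}" if type.nil?

  args = rule.sub(RULE_PREFIX, '').gsub(/(^\s*["']|["']\s*$)/, '')
  [type.strip, args]
end

# Flatten table -> fields -> validations into one list of rule hashes.
rules = YAML.safe_load(CONFIG).flat_map do |table, fields|
  fields.flat_map do |field|
    field['validations'].map do |validation|
      type, args = parse_rule(validation['rule'])
      { 'table' => table, 'field' => field['field'], 'type' => type, 'args' => args }
    end
  end
end

rules.each { |rule| puts rule.inspect }
```

Each resulting hash carries the table, the (possibly comma-separated) field list, the rule type, and the argument with its surrounding quotes removed.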
data/bin/ivalidate2 ADDED
@@ -0,0 +1,426 @@
1
+ #!/usr/bin/env ruby
2
+ # DATA VALIDATOR
3
+ #
4
+ # @author Nghi Pham
5
+ # @date April 2014
6
+ #
7
+ # Data validation includes:
8
+ # * Uniqueness
9
+ # * Integrity (cross reference)
10
+ # * Data type: numeric, text, enum, etc.
11
+ # * Data format: text size, text values, enum, inclusion, exclusion, etc.
12
+ #
13
+ # Issue ruby load.rb --help for guideline/examples
14
+ #
15
+ require 'optparse'
16
+ require 'active_record'
17
+ require 'rubygems'
18
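+ require 'logger'
+ require 'yaml'        # YAML.load_file is used by IData::Validator
+ require 'digest/sha1' # Digest::SHA1 is used to derive rule codes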
19
+
20
+ $options = {}
21
+ parser = OptionParser.new("", 24) do |opts|
22
+ opts.banner = "\nProgram: Data Validator\nAuthor: MCKI\n\n"
23
+
24
+ opts.on("-c", "--config FILE", "Configuration file") do |v|
25
+ $options[:config] = v
26
+ end
27
+
28
+ opts.on("-h", "--host HOST", "PostgreSQL host") do |v|
29
+ $options[:host] = v
30
+ end
31
+
32
+ opts.on("-d", "--database DATABASE", "PostgreSQL database") do |v|
33
+ $options[:database] = v
34
+ end
35
+
36
+ opts.on("-u", "--username USER", "PostgreSQL username") do |v|
37
+ $options[:username] = v
38
+ end
39
+
40
+ opts.on("-p", "--password PASSWORD", "PostgreSQL password") do |v|
41
+ $options[:password] = v
42
+ end
43
+
44
+ opts.on("-l", "--listen PORT", "PostgreSQL listen port (default to 5432)") do |v|
45
+ $options[:listen] = v
46
+ end
47
+
48
+ opts.on_tail('--help', 'Displays this help') do
49
+ puts opts, "", help
50
+ exit
51
+ end
52
+ end
53
+
54
+ def help
55
+ return ''
56
+ end
57
+
58
+ begin
59
+ parser.parse!
60
+ rescue SystemExit => ex
61
+ exit
62
+ end
63
+
64
+ # Load parameters from ENVIRONMENT if exist
65
+ $options[:host] ||= ENV['HOST']
66
+ $options[:username] ||= ENV['USERNAME']
67
+ $options[:password] ||= ENV['PASSWORD']
68
+ $options[:listen] ||= ENV['LISTEN']
69
+ $options[:database] ||= ENV['DATABASE']
70
+
71
+ # validate parameters
72
+ if $options[:config].nil?
73
+ puts "\nPlease specify config file: -c\n\n"
74
+ exit
75
+ end
76
+
77
+ if $options[:host].nil?
78
+ puts "\nPlease specify host name: -h\n\n"
79
+ exit
80
+ end
81
+
82
+ if $options[:database].nil?
83
+ puts "\nPlease specify PostgreSQL database name: -d\n\n"
84
+ exit
85
+ end
86
+
87
+ if $options[:username].nil?
88
+ puts "\nPlease specify PostgreSQL username: -u\n\n"
89
+ exit
90
+ end
91
+
92
+ # Default value
93
+ $options[:listen] ||= 5432
94
+
95
+ # Database connection
96
+ ActiveRecord::Base.establish_connection(
97
+ 'adapter' => 'postgresql',
98
+ 'host' => $options[:host],
99
+ 'database' => $options[:database],
100
+ 'username' => $options[:username],
101
+ 'password' => $options[:password],
102
+ 'port' => $options[:listen],
103
+ 'timeout' => 15000
104
+ )
105
+
106
+ class String
107
+ def not_null_sql
108
+ a = self.split(/\s*,\s*/)
109
+ sql = a.map{|s|
110
+ "#{s} IS NOT NULL AND length(trim(#{s}::text)) <> 0"
111
+ }.join(" AND ")
112
+
113
+ "(#{sql})"
114
+ end
115
+
116
+ def lower
117
+ a = self.split(/\s*,\s*/)
118
+ sql = a.map{|s|
119
+ "lower(#{s})"
120
+ }.join(",")
121
+
122
+ sql
123
+ end
124
+
125
+ def black; "\033[30m#{self}\033[0m" end
126
+ def red; "\033[31m#{self}\033[0m" end
127
+ def green; "\033[32m#{self}\033[0m" end
128
+ def brown; "\033[33m#{self}\033[0m" end
129
+ def blue; "\033[34m#{self}\033[0m" end
130
+ def magenta; "\033[35m#{self}\033[0m" end
131
+ def cyan; "\033[36m#{self}\033[0m" end
132
+ def gray; "\033[37m#{self}\033[0m" end
133
+ end
134
+
135
+ class Logger
136
+ alias_method :_old_info, :info
137
+ alias_method :_old_warn, :warn
138
+ alias_method :_old_error, :error
139
+
140
+
141
+ def error(msg)
142
+ _old_error(msg.red)
143
+ end
144
+
145
+ def info(msg)
146
+ _old_info(msg.green)
147
+ end
148
+
149
+ def warn(msg)
150
+ _old_warn(msg.brown)
151
+ end
152
+ end
153
+
154
+ module IData
155
+ class Validator
156
+ SUPPORTED_RULES_REGEXP = /^\s*(not null|cross references|matches|not matches|custom query|reverse query|unique|consistent by)\s*/
157
+ DEFAULT_ERROR_FIELD = 'validation_errors'
158
+ META_TABLE = 'validation_meta'
159
+ SUMMARY_TABLE = 'summary'
160
+
161
+ def initialize(file)
162
+ @config = YAML.load_file(file)
163
+ @logger = Logger.new(STDOUT)
164
+ @rules = []
165
+ @logger.formatter = proc do |severity, datetime, progname, msg|
166
+ "#{severity}: #{datetime} - #{msg}\n"
167
+ end
168
+ @config.each do |table, fields|
169
+ unless table_exists?(table)
170
+ @logger.warn "Table #{table} does not exist!"
171
+ end
172
+ @logger.info "Validating table #{table}"
173
+ fields.each do |field|
174
+ field['validations'].each do |rule|
175
+ p rule
176
+ type, args = parse_rule(rule['rule'])
177
+ options = rule.merge('table' => table, 'field' => field['field'], 'type' => type, 'args' => args)
178
+ options['code'] = Digest::SHA1.hexdigest([table, field['field'], rule['rule']].join(""))
179
+ @rules << options
180
+ end
181
+ end
182
+ end
183
+ end
184
+
185
+ def validate!
186
+ # reset the meta table
187
+ execute("DROP TABLE IF EXISTS #{META_TABLE}")
188
+
189
+ # validate
190
+ @logger.info "Validation started!"
191
+ @rules.each {|r| validate(r) }
192
+
193
+ # create meta table
194
+ create_table_from_array(@rules, table_name: META_TABLE, drop_table: true, extra_fields: ['impact', 'solution', 'count', 'percentage'])
195
+
196
+ # Done!
197
+ @logger.info "Validation done!"
198
+ end
199
+
200
+ def create_table_from_array(entries, options = {})
201
+ raise "Please specify :table_name" unless options[:table_name]
202
+
203
+ extra_fields = options[:extra_fields] || []
204
+ columns = entries.inject([]) {|x, i| x += i.keys } + extra_fields
205
+ columns.uniq!
206
+
207
+ if options[:drop_table]
208
+ execute "DROP TABLE IF EXISTS #{options[:table_name]}"
209
+ end
210
+
211
+ execute "CREATE TABLE IF NOT EXISTS #{options[:table_name]} ( #{columns.map{|c| quote_col_name(c.to_s) + ' VARCHAR' }.join(', ')} )"
212
+
213
+ insert_sql = entries.map { |r|
214
+ "INSERT INTO #{options[:table_name]}(#{columns.map{|c| quote_col_name(c.to_s)}.join(', ')}) VALUES(#{ columns.map{|c| quote(r[c])}.join(',') });"
215
+ }.join("")
216
+
217
+ execute insert_sql
218
+ end
219
+
220
+ def validate(options)
221
+ unless table_exists?(options['table'])
222
+ return
223
+ end
224
+
225
+ add_error_field(options)
226
+ case options['type']
227
+
228
+ when 'not null'
229
+ validate_not_null(options)
230
+ when 'custom query'
231
+ validate_custom_query(options)
232
+ when 'reverse query'
233
+ validate_reverse_query(options)
234
+ when 'custom query reversed'
235
+ validate_custom_query(options)
236
+ when 'matches'
237
+ validate_match(options)
238
+ when 'cross references'
239
+ validate_cross_reference(options)
240
+ when 'consistent by'
241
+ validate_consistent_by(options)
242
+ when 'unique'
243
+ validate_unique(options)
244
+ else
245
+ raise "Rule not recognized"
246
+ end
247
+ rescue StandardError => ex
248
+ @logger.warn ex.message.split(/[\n]/).first.strip
249
+ end
250
+
251
+ def report!
252
+ sql = @rules.map {|r|
253
+ "(SELECT unnest(string_to_array(#{DEFAULT_ERROR_FIELD}, ' || ')) as code, count(*), round((count(*) * 100)::numeric / (SELECT count(*) FROM #{r['table']}), 2)::varchar || '%' AS percentage FROM #{r['table']} GROUP BY code)"
254
+ }
255
+
256
+ execute("
257
+ UPDATE #{META_TABLE} meta
258
+ SET count = stat.count,
259
+ percentage = stat.percentage
260
+ FROM (#{sql.join(" UNION ")}) stat
261
+ WHERE meta.code = stat.code"
262
+ )
263
+ end
264
+
265
+ private
266
+ def add_error_field(options)
267
+ error_field = options['log_to'] || DEFAULT_ERROR_FIELD
268
+ execute("ALTER TABLE #{options['table']} ADD COLUMN #{error_field} VARCHAR DEFAULT '';")
269
+ rescue StandardError
270
+ # Most likely the column already exists; ignore. @todo: check for the column explicitly
271
+ end
272
+
273
+ def parse_rule(rule)
274
+ # @todo
275
+ type = rule[SUPPORTED_RULES_REGEXP]
276
+ if type.nil?
277
+ @logger.error "Invalid rule: #{rule}"
278
+ exit(1)
279
+ end
280
+
281
+ type.strip!
282
+ args = rule.gsub(SUPPORTED_RULES_REGEXP, '').gsub(/(^\s*["']|["']\s*$)/, "")
283
+ return type, args
284
+ end
285
+
286
+ def validate_not_null(options)
287
+ @logger.info "Validating data presence: #{options['table']}.[#{options['field']}]"
288
+ options['error'] ||= "[#{options['field']}] is null"
289
+ execute <<-eos
290
+ #{ update_sql(options) }
291
+ WHERE #{options['field']} IS NULL OR length(trim(#{options['field']})) = 0;
292
+ eos
293
+ end
294
+
295
+ def validate_custom_query(options)
296
+ @logger.info "Validating with custom query: #{options['args'][0..50]}#{(options['args'].size > 50) ? '...' : ''}"
297
+ options['error'] ||= "Unknown"
298
+ execute <<-eos
299
+ #{ update_sql(options) }
300
+ WHERE NOT (#{options['args']})
301
+ eos
302
+ end
303
+
304
+ def validate_reverse_query(options)
305
+ @logger.info "Validating with reverse query: #{options['args'][0..50]}#{(options['args'].size > 50) ? '...' : ''}"
306
+ options['error'] ||= "Unknown"
307
+ execute <<-eos
308
+ #{ update_sql(options) }
309
+ WHERE (#{options['args']})
310
+ eos
311
+ end
312
+
313
+ def validate_consistent_by(options)
314
+ @logger.info "Validating integrity: #{options['table']}.[#{options['field']}] #{options['rule']}"
315
+ options['error'] ||= "Same [#{options['field']}] but different [#{options['args']}]"
316
+
317
+ f1_case = f1 = options['field']
318
+ f2_case = f2 = options['args']
319
+
320
+ if options['case_insensitive']
321
+ f1_case = f1_case.lower
322
+ f2_case = f2_case.lower
323
+ end
324
+
325
+ execute <<-eos
326
+ #{ update_sql(options) }
327
+ WHERE id IN (
328
+ SELECT unnest(array_agg(id)) FROM #{options['table']}
329
+ WHERE #{f1.not_null_sql} AND #{f2.not_null_sql}
330
+ GROUP BY #{f2_case}
331
+ HAVING COUNT(distinct #{f1_case}) > 1
332
+ );
333
+ eos
334
+ end
335
+
336
+ def validate_unique(options)
337
+ @logger.info "Validating uniqueness: #{options['table']}.[#{options['field']}]"
338
+ options['error'] ||= "[#{options['field']}] is not unique"
339
+
340
+ if options['case_insensitive']
341
+ f_lower = options['field'].lower
342
+ else
343
+ f_lower = options['field']
344
+ end
345
+
346
+ execute <<-eos
347
+ #{ update_sql(options) }
348
+ WHERE id IN (
349
+ SELECT unnest(array_agg(id)) FROM #{options['table']} GROUP BY #{f_lower}
350
+ HAVING count(*) > 1
351
+ ) AND #{options['field'].not_null_sql};
352
+ eos
353
+ end
354
+
355
+ def validate_cross_reference(options)
356
+ @logger.info "Validating reference: #{options['table']}.[#{options['field']}] #{options['rule']}"
357
+
358
+ options['error'] ||= "[#{options['field']}] does not reference [#{options['args']}]"
359
+
360
+ field = options['field']
361
+ ref_table, ref_field = options['args'].split(/[\.]/)
362
+
363
+ if options['args'].split(/[\.]/).size != 2
364
+ raise "Invalid rule"
365
+ exit(0)
366
+ end
367
+
368
+ if options['case_insensitive']
369
+ join_condition = "on lower(origin.#{field}) = lower(target.#{ref_field})"
370
+ else
371
+ join_condition = "on origin.#{field}::text = target.#{ref_field}::text"
372
+ end
373
+
374
+ # @todo: poor performance here, think of a better SQL!!!
375
+ execute <<-eos
376
+ #{ update_sql(options) }
377
+ WHERE #{field} IN (
378
+ SELECT origin.#{field} from #{options['table']} origin LEFT JOIN #{ref_table} target
379
+ #{join_condition}
380
+ where target.#{ref_field} is null
381
+ ) AND #{field} IS NOT NULL AND length(trim(#{field})) <> 0;
382
+ eos
383
+ end
384
+
385
+ def validate_match(options)
386
+ @logger.info "Validating regexp: #{options['table']}.[#{options['field']}] #{options['rule']}"
387
+ options['error'] ||= "[#{options['field']}] does not match #{options['args']}"
388
+ execute <<-eos
389
+ #{ update_sql(options) }
390
+ WHERE #{options['field']} IS NOT NULL AND length(trim(#{options['field']})) <> 0 AND #{options['field']} !~ '#{options['args']}';
391
+ eos
392
+ end
393
+
394
+ def update_sql(options)
395
+ log_to = options['log_to'] || DEFAULT_ERROR_FIELD
396
+ sql = "UPDATE #{options['table']} SET #{log_to} = array_to_string(string_to_array(#{log_to}, ' || ') || string_to_array(#{quote(options['code'])}, ' || '), ' || ')"
397
+ end
398
+
399
+ def execute(sql)
400
+ ActiveRecord::Base.connection.execute(sql)
401
+ end
402
+
403
+ def quote(str = "")
404
+ ActiveRecord::Base.connection.quote(str)
405
+ end
406
+
407
+ def quote_col_name(str = "")
408
+ ActiveRecord::Base.connection.quote_column_name(str)
409
+ end
410
+
411
+ def table_exists?(table)
412
+ results = execute "SELECT * FROM pg_tables WHERE schemaname='public' AND tablename = #{quote(table)};"
413
+ return !results.first.nil?
414
+ end
415
+
416
+ def drop_table(table_name)
417
+ execute "DROP TABLE IF EXISTS #{table_name}"
418
+ end
419
+ end
420
+ end
421
+
422
+
423
+ x = IData::Validator.new $options[:config]
424
+ x.validate!
425
+ x.report!
426
+
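The way `ivalidate2` records failures can be easier to follow outside SQL. The sketch below is a plain-Ruby analogue, under stated assumptions, of two pieces of the script above: the rule code (a SHA1 of table, field, and rule text, as in `initialize`) and the append step in `update_sql`, which splits the logged value on `' || '`, adds the new code, and joins it back. The helper names `rule_code` and `append_code` and the sample table and field are hypothetical.

```ruby
require 'digest/sha1'

SEPARATOR = ' || '

# Derive a rule code the same way ivalidate2 does: SHA1 of table + field + rule text.
def rule_code(table, field, rule)
  Digest::SHA1.hexdigest([table, field, rule].join)
end

# Ruby analogue of the SQL in update_sql: split, append the new code, re-join.
def append_code(logged, code)
  (logged.to_s.split(SEPARATOR) + [code]).join(SEPARATOR)
end

not_null = rule_code('items', 'vendor_code', 'not null')
unique   = rule_code('items', 'vendor_code', 'unique')

# A row that fails both checks accumulates both codes in its error column.
log = ''
log = append_code(log, not_null)
log = append_code(log, unique)

puts log
puts log.split(SEPARATOR).length
```

Because the codes are joined with a fixed separator, `report!` can split the column back into individual codes and count occurrences per rule.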