idata 0.1.33 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: efc9c2bc3a3ee9b79d83c531abbcbc185360dda4
4
- data.tar.gz: 0fbca044baf3f5a454f5e55e3ff3b40fd8b480f8
3
+ metadata.gz: 26fd9bafa90c2af8eef61231c2b070c6a395bbcb
4
+ data.tar.gz: b1952edd955ea4eeb72d31f8c640f053d5515a17
5
5
  SHA512:
6
- metadata.gz: 59e4d3862a80d105fcc6bfa7973dd18c766fa93a921f4f3c9784398ed71187d2f259d05d9375d15e5dc723bdfe9e037bac861dcd677bd7a70a06b4a058cc3152
7
- data.tar.gz: a685200d759507941fc6800615cf3fc7191e055baecbf56bf4cc8c2f9bce929fc30cd18df76797d75b46a0c2ffa35dcb7cfc1c3a84800f455967cfaea1313693
6
+ metadata.gz: 79b8f98bad03bcfccea93cf2cb074271391def428288784059842435e4e6334928f1d5806c50707816a61621e38e4761cae761443c634bcf2ba0baa5187c3e61
7
+ data.tar.gz: 2513beafcacb28bc7d745e93ab6dd09e0f3b246c25bcee2d81b5bee92b66778b26feab7f43887790a284dfbc8e6906c3860d71082f530a78f1ff82e598085df7
data/README.md CHANGED
@@ -1,5 +1,5 @@
1
1
  # Overview
2
- We provide some useful utilities for validating data in a PostgreSQL data table
2
+ We provide some utilities for validating data in a PostgreSQL data table.
3
3
  These utilities can be used as simple terminal commands and can be installed by:
4
4
 
5
5
  gem install idata
@@ -13,7 +13,7 @@ idata comes along with the following commands:
13
13
  * imerge
14
14
  * isanitize
15
15
 
16
- Run a command with --help switch for the details
16
+ Run a command with the `--help` switch for details
17
17
 
18
18
  Prequisites:
19
19
  * PostgreSQL 9.0 or above
@@ -23,34 +23,34 @@ Prequisites:
23
23
  # Usage
24
24
  Suppose we have an `items` table, and we want to validate its records against certain criteria like:
25
25
 
26
- * `item_id` must not be null
27
- * `item_title` must not be null
28
- * The composite `[item_id, item_title]` must be unique
29
- * One `item_id` corresponds to only ONE `item_title` (in other words, there must not be two items with different titles but with the same `item_id`)
26
+ * `vendor_code` must not be null
27
+ * `vendor_name` must not be null
28
+ * The composite `[vendor_code, vendor_name]` must be unique
29
+ * One `vendor_code` corresponds to only ONE `vendor_name` (in other words, there must not be two items with different `vendor_name` but with the same `vendor_code`)
30
30
  and vice-versa
31
31
  * `vendor_code` must reference the `code` column in the `vendors` table
32
32
 
33
33
  Then the validation command could be:
34
34
  ```
35
- ivalidate --host=localhost --user=postgres --database=mydb --table=items --log-to=validation_errors \
36
- --not-null="vendor_id" \
35
+ ivalidate --host=localhost --user=postgres --database=mydb --table=items \
36
+ --log-to=validation_errors \
37
+ --not-null="vendor_code" \
37
38
  --not-null="vendor_name" \
38
- --unique="vendor_id,vendor_name" \
39
- --consistent-by="vendor_id|vendor_name" \
40
- --consistent-by="vendor_id|vendor_name" \
39
+ --unique="vendor_code,vendor_name" \
40
+ --consistent-by="vendor_code|vendor_name" \
41
+ --consistent-by="vendor_name|vendor_code" \
41
42
  --cross-reference="vendor_code|vendors.code"
42
43
  ```
43
44
  Validation results for every single record are logged to an additional column named `validation_errors`
44
- of the `items` table, as specified by the `--log-to` switch
45
-
46
- As you can see, most common checks can be performed using the supported switches:
45
+ of the `items` table, as specified by the `--log-to` switch. As you can see, most common checks can be performed using the supported switches:
47
46
  ```
48
47
  --not-null
49
48
  --unique
50
49
  --consistent-by
51
50
  --cross-reference
52
51
  ```
53
- For more generic check, we support some other switches:
52
+ # Custom Validation
53
+ For more customized checks, we support some other switches.
54
54
 
55
55
  The `--match="field/pattern/"` switch tells the program to check if value of a `field` matches the provided `pattern` (which is a regular expression).
56
56
  For example:
@@ -61,22 +61,20 @@ For example:
61
61
  # Check if value of status is either 'A' or 'I' (any other value is not allowed)
62
62
  ivalidate --match="status/^(A|I)$/"
63
63
  ```
64
- In case you need even more customized validation other than the supported ones (match, unique, not-null, cross-reference...)
65
- then `--query` switch may be helpful. For example:
64
+ In case you need even more customized validation other than the supported ones (`match`, `unique`, `not-null`, `cross-reference`...)
65
+ then the `--query` switch may come in handy. For example:
66
66
  ```
67
- ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
67
+ ivalidate --query="start_date >= string_to_date('01/02/2014') -- invalid date"
68
68
  ```
69
69
  You can also use `--rquery` which is the reversed counterpart of `--query`
70
- For example, the following two checks are equivalent:
70
+ For example, the following two checks are equivalent; both mark any record whose `start_date < '01/02/2014'` as "invalid date":
71
71
  ```
72
- ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
73
- ivalidate --rquery="string_to_date(start_date) < '01/02/2014' -- invalid date"
72
+ ivalidate --query="start_date >= string_to_date('01/02/2014') -- invalid date"
73
+ ivalidate --rquery="start_date < string_to_date('01/02/2014') -- invalid date"
74
74
  ```
75
- (mark any record whose `start_date < '01/02/2014'` as "invalid date")
76
75
 
77
76
  Note: run `ivalidate --help` to see the full list of supported switches
78
77
 
79
-
80
78
  # Put it all together
81
79
  You can put several `ivalidate` commands (for several data tables) in one single bash/sh file.
82
80
  Besides `ivalidate`, we also support some other utilities to:
@@ -84,6 +82,6 @@ Besides `ivalidate`, we also support some other utilities to:
84
82
  + Modify data tables
85
83
  + Generate summary reports
86
84
 
87
- For a full example, see our `sample.sh`
85
+ See our `sample.sh` for a comprehensive example
88
86
 
89
87
 
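The `--consistent-by` criterion in the README above (one `vendor_code` corresponds to exactly one `vendor_name`, and vice versa) can be illustrated in plain Ruby. This is a minimal in-memory sketch, not the gem's implementation, which runs SQL against PostgreSQL; the `inconsistent_ids` helper and the sample rows are hypothetical.

```ruby
# Hypothetical in-memory rows standing in for the `items` table.
ROWS = [
  { id: 1, vendor_code: 'V1', vendor_name: 'Acme' },
  { id: 2, vendor_code: 'V1', vendor_name: 'Acme Corp' },
  { id: 3, vendor_code: 'V2', vendor_name: 'Globex' },
  { id: 4, vendor_code: nil,  vendor_name: 'Initech' }
].freeze

# Returns the ids of rows that share a `key` value but disagree on `dependent`.
# Rows where either value is nil or blank are skipped.
def inconsistent_ids(rows, key, dependent)
  present = rows.reject { |r| [r[key], r[dependent]].any? { |v| v.to_s.strip.empty? } }
  present.group_by { |r| r[key] }
         .select { |_, group| group.map { |r| r[dependent] }.uniq.size > 1 }
         .values.flatten.map { |r| r[:id] }
end

p inconsistent_ids(ROWS, :vendor_code, :vendor_name) # same code, different names
p inconsistent_ids(ROWS, :vendor_name, :vendor_code) # same name, different codes
```

Running the check in both directions mirrors the two `--consistent-by` switches in the sample command.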
data/README2.md ADDED
@@ -0,0 +1,37 @@
1
+ ### Overview
2
+ A validation criteria file has the following general structure:
3
+ ```yaml
4
+ table:
5
+ - field: field1, field2, etc.
6
+ validations:
7
+ - rule:
8
+ code:
9
+ error:
10
+ impact:
11
+ solution:
12
+ ```
13
+
14
+ ##### Explanation:
15
+ + A `table` has one or more `field` entries
16
+ + A `field` has a `validations` section containing one or more rules
17
+ + Each rule contains `rule` (required) and the corresponding `code`, `error`, `solution`, `impact`, `priority` (optional)
18
+ + Rule: written in the format covered below
19
+ + The other fields are free text on a single line; use \n as the line break
20
+ + The value of `field` can be a single field name or a set of several field names, separated by commas
21
+
22
+ ## Writing rules
23
+ Supported rules include:
24
+
25
+ | Rule | Description | Example |
26
+ | ---- | ----------- | ------- |
27
+ | `not null` | The value of the corresponding field must not be empty | |
28
+ | `unique` | The value of the corresponding field must be unique within the table | |
29
+ | `matches "/regexp/"` | The field value must satisfy the format defined by `regexp` | |
30
+ | `not matches "/regexp/"` | Reverse counterpart of `matches` | |
31
+ | `consistent by "ref"` | The value of the corresponding field must be consistent with `ref` | |
32
+ | `cross references "table.field"` | The field value must reference another field, `table.field` | |
33
+ | `custom query "query"` | Uses a custom SQL `query` (for complex business logic that cannot be expressed with the other rules) | |
34
+ | `reverse query "query"` | Reverse counterpart of `custom query` | |
35
+
36
+ ## Others
37
+ TBD
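Each `rule` string in the table above starts with a rule keyword, optionally followed by a quoted argument. Below is a minimal sketch of splitting such a string into its parts, reusing the regular expression from `bin/ivalidate2` in this release. The standalone `parse_rule` function and its `ArgumentError` are assumptions for illustration; the script's own method logs the error and exits instead.

```ruby
# Rule keywords, copied from SUPPORTED_RULES_REGEXP in bin/ivalidate2.
RULES = /^\s*(not null|cross references|matches|not matches|custom query|reverse query|unique|consistent by)\s*/

# Splits a rule string into [type, args]; args is '' for rules without an argument.
def parse_rule(rule)
  type = rule[RULES]
  raise ArgumentError, "Invalid rule: #{rule}" if type.nil?

  # Drop the keyword, then strip one leading and one trailing quote.
  args = rule.sub(RULES, '').gsub(/(^\s*["']|["']\s*$)/, '')
  [type.strip, args]
end

p parse_rule('not null')
p parse_rule('matches "/^(A|I)$/"')
p parse_rule('cross references "vendors.code"')
```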
data/bin/ivalidate2 ADDED
@@ -0,0 +1,426 @@
1
+ #!/usr/bin/env ruby
2
+ # DATA VALIDATOR
3
+ #
4
+ # @author Nghi Pham
5
+ # @date April 2014
6
+ #
7
+ # Data validation includes:
8
+ # * Uniqueness
9
+ # * Integrity (cross reference)
10
+ # * Data type: numeric, text, enum, etc.
11
+ # * Data format: text size, text values, enum, inclusion, exclusion, etc.
12
+ #
13
+ # Run ivalidate2 --help for guidelines/examples
14
+ #
15
+ require 'optparse'
16
+ require 'active_record'
17
+ require 'rubygems'
18
+ require 'logger'
19
+
20
+ $options = {}
21
+ parser = OptionParser.new("", 24) do |opts|
22
+ opts.banner = "\nProgram: Data Validator\nAuthor: MCKI\n\n"
23
+
24
+ opts.on("-c", "--config FILE", "Configuration file") do |v|
25
+ $options[:config] = v
26
+ end
27
+
28
+ opts.on("-h", "--host HOST", "PostgreSQL host") do |v|
29
+ $options[:host] = v
30
+ end
31
+
32
+ opts.on("-d", "--database DATABASE", "PostgreSQL database") do |v|
33
+ $options[:database] = v
34
+ end
35
+
36
+ opts.on("-u", "--username USER", "PostgreSQL username") do |v|
37
+ $options[:username] = v
38
+ end
39
+
40
+ opts.on("-p", "--password PASSWORD", "PostgreSQL password") do |v|
41
+ $options[:password] = v
42
+ end
43
+
44
+ opts.on("-l", "--listen PORT", "PostgreSQL listen port (default to 5432)") do |v|
45
+ $options[:listen] = v
46
+ end
47
+
48
+ opts.on_tail('--help', 'Displays this help') do
49
+ puts opts, "", help
50
+ exit
51
+ end
52
+ end
53
+
54
+ def help
55
+ return ''
56
+ end
57
+
58
+ begin
59
+ parser.parse!
60
+ rescue SystemExit => ex
61
+ exit
62
+ end
63
+
64
+ # Load parameters from ENVIRONMENT if exist
65
+ $options[:host] ||= ENV['HOST']
66
+ $options[:username] ||= ENV['USERNAME']
67
+ $options[:password] ||= ENV['PASSWORD']
68
+ $options[:listen] ||= ENV['LISTEN']
69
+ $options[:database] ||= ENV['DATABASE']
70
+
71
+ # validate parameters
72
+ if $options[:config].nil?
73
+ puts "\nPlease specify config file: -c\n\n"
74
+ exit
75
+ end
76
+
77
+ if $options[:host].nil?
78
+ puts "\nPlease specify host name: -h\n\n"
79
+ exit
80
+ end
81
+
82
+ if $options[:database].nil?
83
+ puts "\nPlease specify PostgreSQL database name: -d\n\n"
84
+ exit
85
+ end
86
+
87
+ if $options[:username].nil?
88
+ puts "\nPlease specify PostgreSQL username: -u\n\n"
89
+ exit
90
+ end
91
+
92
+ # Default value
93
+ $options[:listen] ||= 5432
94
+
95
+ # Database connection
96
+ ActiveRecord::Base.establish_connection(
97
+ 'adapter' => 'postgresql',
98
+ 'host' => $options[:host],
99
+ 'database' => $options[:database],
100
+ 'username' => $options[:username],
101
+ 'password' => $options[:password],
102
+ 'port' => $options[:listen],
103
+ 'timeout' => 15000
104
+ )
105
+
106
+ class String
107
+ def not_null_sql
108
+ a = self.split(/\s*,\s*/)
109
+ sql = a.map{|s|
110
+ "#{s} IS NOT NULL AND length(trim(#{s}::text)) <> 0"
111
+ }.join(" AND ")
112
+
113
+ "(#{sql})"
114
+ end
115
+
116
+ def lower
117
+ a = self.split(/\s*,\s*/)
118
+ sql = a.map{|s|
119
+ "lower(#{s})"
120
+ }.join(",")
121
+
122
+ sql
123
+ end
124
+
125
+ def black; "\033[30m#{self}\033[0m" end
126
+ def red; "\033[31m#{self}\033[0m" end
127
+ def green; "\033[32m#{self}\033[0m" end
128
+ def brown; "\033[33m#{self}\033[0m" end
129
+ def blue; "\033[34m#{self}\033[0m" end
130
+ def magenta; "\033[35m#{self}\033[0m" end
131
+ def cyan; "\033[36m#{self}\033[0m" end
132
+ def gray; "\033[37m#{self}\033[0m" end
133
+ end
134
+
135
+ class Logger
136
+ alias_method :_old_info, :info
137
+ alias_method :_old_warn, :warn
138
+ alias_method :_old_error, :error
139
+
140
+
141
+ def error(msg)
142
+ _old_error(msg.red)
143
+ end
144
+
145
+ def info(msg)
146
+ _old_info(msg.green)
147
+ end
148
+
149
+ def warn(msg)
150
+ _old_warn(msg.brown)
151
+ end
152
+ end
153
+
154
+ module IData
155
+ class Validator
156
+ SUPPORTED_RULES_REGEXP = /^\s*(not null|cross references|matches|not matches|custom query|reverse query|unique|consistent by)\s*/
157
+ DEFAULT_ERROR_FIELD = 'validation_errors'
158
+ META_TABLE = 'validation_meta'
159
+ SUMMARY_TABLE = 'summary'
160
+
161
+ def initialize(file)
162
+ @config = YAML.load_file(file)
163
+ @logger = Logger.new(STDOUT)
164
+ @rules = []
165
+ @logger.formatter = proc do |severity, datetime, progname, msg|
166
+ "#{severity}: #{datetime} - #{msg}\n"
167
+ end
168
+ @config.each do |table, fields|
169
+ unless table_exists?(table)
170
+ @logger.warn "Table #{table} does not exist!"
171
+ end
172
+ @logger.info "Validating table #{table}"
173
+ fields.each do |field|
174
+ field['validations'].each do |rule|
175
+ p rule
176
+ type, args = parse_rule(rule['rule'])
177
+ options = rule.merge('table' => table, 'field' => field['field'], 'type' => type, 'args' => args)
178
+ options['code'] = Digest::SHA1.hexdigest([table, field['field'], rule['rule']].join(""))
179
+ @rules << options
180
+ end
181
+ end
182
+ end
183
+ end
184
+
185
+ def validate!
186
+ # reset the meta table
187
+ execute("DROP TABLE IF EXISTS #{META_TABLE}")
188
+
189
+ # validate
190
+ @logger.info "Validation started!"
191
+ @rules.each {|r| validate(r) }
192
+
193
+ # create meta table
194
+ create_table_from_array(@rules, table_name: META_TABLE, drop_table: true, extra_fields: ['impact', 'solution', 'count', 'percentage'])
195
+
196
+ # Done!
197
+ @logger.info "Validation done!"
198
+ end
199
+
200
+ def create_table_from_array(entries, options = {})
201
+ raise "Please specify :table_name" unless options[:table_name]
202
+
203
+ extra_fields = options[:extra_fields] || []
204
+ columns = entries.inject([]) {|x, i| x += i.keys } + extra_fields
205
+ columns.uniq!
206
+
207
+ if options[:drop_table]
208
+ execute "DROP TABLE IF EXISTS #{options[:table_name]}"
209
+ end
210
+
211
+ execute "CREATE TABLE IF NOT EXISTS #{options[:table_name]} ( #{columns.map{|c| quote_col_name(c.to_s) + ' VARCHAR' }.join(', ')} )"
212
+
213
+ insert_sql = entries.map { |r|
214
+ "INSERT INTO #{options[:table_name]}(#{columns.map{|c| quote_col_name(c.to_s)}.join(', ')}) VALUES(#{ columns.map{|c| quote(r[c])}.join(',') });"
215
+ }.join("")
216
+
217
+ execute insert_sql
218
+ end
219
+
220
+ def validate(options)
221
+ unless table_exists?(options['table'])
222
+ return
223
+ end
224
+
225
+ add_error_field(options)
226
+ case options['type']
227
+
228
+ when 'not null'
229
+ validate_not_null(options)
230
+ when 'custom query'
231
+ validate_custom_query(options)
232
+ when 'reverse query'
233
+ validate_reverse_query(options)
234
+ when 'custom query reversed'
235
+ validate_custom_query(options)
236
+ when 'matches'
237
+ validate_match(options)
238
+ when 'cross references'
239
+ validate_cross_reference(options)
240
+ when 'consistent by'
241
+ validate_consistent_by(options)
242
+ when 'unique'
243
+ validate_unique(options)
244
+ else
245
+ raise "Rule not recognized"
246
+ end
247
+ rescue Exception => ex
248
+ @logger.warn ex.message.split(/[\n]/).first.strip
249
+ end
250
+
251
+ def report!
252
+ sql = @rules.map {|r|
253
+ "(SELECT unnest(string_to_array(#{DEFAULT_ERROR_FIELD}, ' || ')) as code, count(*), round((count(*) * 100)::numeric / (SELECT count(*) FROM #{r['table']}), 2)::varchar || '%' AS percentage FROM #{r['table']} GROUP BY code)"
254
+ }
255
+
256
+ execute("
257
+ UPDATE #{META_TABLE} meta
258
+ SET count = stat.count,
259
+ percentage = stat.percentage
260
+ FROM (#{sql.join(" UNION ")}) stat
261
+ WHERE meta.code = stat.code"
262
+ )
263
+ end
264
+
265
+ private
266
+ def add_error_field(options)
267
+ error_field = options['log_to'] || DEFAULT_ERROR_FIELD
268
+ execute("ALTER TABLE #{options['table']} ADD COLUMN #{error_field} VARCHAR DEFAULT '';")
269
+ rescue Exception => ex
270
+ # @todo
271
+ end
272
+
273
+ def parse_rule(rule)
274
+ # @todo
275
+ type = rule[SUPPORTED_RULES_REGEXP]
276
+ if type.nil?
277
+ @logger.error "Invalid rule: #{rule}"
278
+ exit(0)
279
+ end
280
+
281
+ type.strip!
282
+ args = rule.gsub(SUPPORTED_RULES_REGEXP, '').gsub(/(^\s*["']|["']\s*$)/, "")
283
+ return type, args
284
+ end
285
+
286
+ def validate_not_null(options)
287
+ @logger.info "Validating data presence: #{options['table']}.[#{options['field']}]"
288
+ options['error'] ||= "[#{options['field']}] is null"
289
+ execute <<-eos
290
+ #{ update_sql(options) }
291
+ WHERE #{options['field']} IS NULL OR length(trim(#{options['field']})) = 0;
292
+ eos
293
+ end
294
+
295
+ def validate_custom_query(options)
296
+ @logger.info "Validating with custom query: #{options['args'][0..50]}#{(options['args'].size > 50) ? '...' : ''}"
297
+ options['error'] ||= "Unknown"
298
+ execute <<-eos
299
+ #{ update_sql(options) }
300
+ WHERE NOT (#{options['args']})
301
+ eos
302
+ end
303
+
304
+ def validate_reverse_query(options)
305
+ @logger.info "Validating with custom query: #{options['args'][0..50]}#{(options['args'].size > 50) ? '...' : ''}"
306
+ options['error'] ||= "Unknown"
307
+ execute <<-eos
308
+ #{ update_sql(options) }
309
+ WHERE (#{options['args']})
310
+ eos
311
+ end
312
+
313
+ def validate_consistent_by(options)
314
+ @logger.info "Validating integrity: #{options['table']}.[#{options['field']}] #{options['rule']}"
315
+ options['error'] ||= "Same [#{options['field']}] but different [#{options['args']}]"
316
+
317
+ f1_case = f1 = options['field']
318
+ f2_case = f2 = options['args']
319
+
320
+ if options['case_insensitive']
321
+ f1_case = f1_case.lower
322
+ f2_case = f2_case.lower
323
+ end
324
+
325
+ execute <<-eos
326
+ #{ update_sql(options) }
327
+ WHERE id IN (
328
+ SELECT unnest(array_agg(id)) FROM #{options['table']}
329
+ WHERE #{f1.not_null_sql} AND #{f2.not_null_sql}
330
+ GROUP BY #{f2_case}
331
+ HAVING COUNT(distinct #{f1_case}) > 1
332
+ );
333
+ eos
334
+ end
335
+
336
+ def validate_unique(options)
337
+ @logger.info "Validating uniqueness: #{options['table']}.[#{options['field']}]"
338
+ options['error'] ||= "[#{options['field']}] is not unique"
339
+
340
+ if options['case_insensitive']
341
+ f_lower = options['field'].lower
342
+ else
343
+ f_lower = options['field']
344
+ end
345
+
346
+ execute <<-eos
347
+ #{ update_sql(options) }
348
+ WHERE id IN (
349
+ SELECT unnest(array_agg(id)) FROM #{options['table']} GROUP BY #{f_lower}
350
+ HAVING count(*) > 1
351
+ ) AND #{options['field'].not_null_sql};
352
+ eos
353
+ end
354
+
355
+ def validate_cross_reference(options)
356
+ @logger.info "Validating reference: #{options['table']}.[#{options['field']}] #{options['rule']}"
357
+
358
+ options['error'] ||= "[#{options['field']}] does not reference [#{options['args']}]"
359
+
360
+ field = options['field']
361
+ ref_table, ref_field = options['args'].split(/[\.]/)
362
+
363
+ if options['args'].split(/[\.]/).size != 2
364
+ raise "Invalid rule"
365
+ exit(0)
366
+ end
367
+
368
+ if options['case_insensitive']
369
+ join_condition = "on lower(origin.#{field}) = lower(target.#{ref_field})"
370
+ else
371
+ join_condition = "on origin.#{field}::text = target.#{ref_field}::text"
372
+ end
373
+
374
+ # @todo: poor performance here, think of a better SQL!!!
375
+ execute <<-eos
376
+ #{ update_sql(options) }
377
+ WHERE #{field} IN (
378
+ SELECT origin.#{field} from #{options['table']} origin LEFT JOIN #{ref_table} target
379
+ #{join_condition}
380
+ where target.#{ref_field} is null
381
+ ) AND #{field} IS NOT NULL AND length(trim(#{field})) <> 0;
382
+ eos
383
+ end
384
+
385
+ def validate_match(options)
386
+ @logger.info "Validating regexp: #{options['table']}.[#{options['field']}] #{options['rule']}"
387
+ options['error'] ||= "[#{options['field']}] does not match #{options['args']}"
388
+ execute <<-eos
389
+ #{ update_sql(options) }
390
+ WHERE #{options['field']} IS NOT NULL AND length(trim(#{options['field']})) <> 0 AND #{options['field']} !~ '#{options['args']}';
391
+ eos
392
+ end
393
+
394
+ def update_sql(options)
395
+ log_to = options['log_to'] || DEFAULT_ERROR_FIELD
396
+ sql = "UPDATE #{options['table']} SET #{log_to} = array_to_string(string_to_array(#{log_to}, ' || ') || string_to_array(#{quote(options['code'])}, ' || '), ' || ')"
397
+ end
398
+
399
+ def execute(sql)
400
+ ActiveRecord::Base.connection.execute(sql)
401
+ end
402
+
403
+ def quote(str = "")
404
+ ActiveRecord::Base.connection.quote(str)
405
+ end
406
+
407
+ def quote_col_name(str = "")
408
+ ActiveRecord::Base.connection.quote_column_name(str)
409
+ end
410
+
411
+ def table_exists?(table)
412
+ results = execute "SELECT * FROM pg_tables WHERE schemaname='public' AND tablename = #{quote(table)};"
413
+ return !results.first.nil?
414
+ end
415
+
416
+ def drop_table(table_name)
417
+ execute "DROP TABLE IF EXISTS #{table_name}"
418
+ end
419
+ end
420
+ end
421
+
422
+
423
+ x = IData::Validator.new $options[:config]
424
+ x.validate!
425
+ x.report!
426
+
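The `update_sql` and `report!` methods above store the codes of failed rules in a single text column, joined by `' || '`, and later count rows per code. A rough Ruby equivalent of that bookkeeping is sketched below. `append_code` and `summarize` are hypothetical names, and in the script the real work happens in PostgreSQL via `string_to_array`, `array_to_string` and `unnest`.

```ruby
SEPARATOR = ' || '

# Appends a rule code to the current value of the error column.
def append_code(current, code)
  (current.to_s.split(SEPARATOR) + [code]).join(SEPARATOR)
end

# Counts rows per code and expresses each count as a percentage of all rows.
def summarize(error_values)
  total = error_values.size
  counts = error_values.flat_map { |v| v.split(SEPARATOR) }
                       .each_with_object(Hash.new(0)) { |code, acc| acc[code] += 1 }
  counts.each_with_object({}) do |(code, n), result|
    result[code] = { count: n, percentage: format('%.2f%%', n * 100.0 / total) }
  end
end

# Four rows; the error column defaults to an empty string.
errors = ['', '', '', '']
errors[0] = append_code(errors[0], 'a1')
errors[0] = append_code(errors[0], 'b2')
errors[1] = append_code(errors[1], 'a1')

p errors
p summarize(errors)
```

Keeping all codes in one delimited column avoids a separate errors table, at the cost of splitting the string again whenever a report is built.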