idata 0.1.26 → 0.1.27

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 7615b5632f23a4d6a9f12ae44516ba1a985c71d9
-  data.tar.gz: 1de25681e830eba21537288a82a36b5d0a4658e6
+  metadata.gz: a77794735a66403f2d15fb44b25a24ae954a07d0
+  data.tar.gz: 6ffa6f9e348842a760bdc4b1ea5895fad91bd1c8
 SHA512:
-  metadata.gz: 8c38b43eac63a08f792b2fee967ec6aaa59e12e446a92641f4c5c7fb6a783ff210ac2925ba5051d4a27b2e7c0bd7e7f1678a6e7df43d9040d5294c44a0fa13a1
-  data.tar.gz: 7f8a19d9c36edb2a5a1d2521b457cb631671755838f626f13b0f9f12339812c71ceebfb2af445350ef779158e6748fd4320115f537fcd4b3e8b934c800224eff
+  metadata.gz: add59efee5a88c2f914008e585d0e3bb6f988a25b56e31a3314155ec9b93c99390d659f4f9967645d4857345815486ebdbf01cbb4af66cb24f7bcd4d40a56b19
+  data.tar.gz: 9b0d4e913980fc6fdefad215a5bafe3d59effce4da0ac256e17da63b6a53d09a71b4c42c964418dea3d17b0b54ec73ea5203915c6d7faffa2c73c181b9e9cec4
data/README.md CHANGED
@@ -1,29 +1,77 @@
-# Idata
+# OVERVIEW
+We provide some useful utilities for validating data in a PostgreSQL data table.
+These utilities can be used as simple terminal commands and can be installed with:
 
-TODO: Write a gem description
+    gem install idata
 
-## Installation
+Prerequisites:
+* PostgreSQL 9.0 or above
+* Ruby 2.0 or above
 
-Add this line to your application's Gemfile:
+# USAGE
+Suppose we have an `items` table, and we want to validate its records against certain criteria such as:
 
-    gem 'idata'
+* `item_id` must not be null
+* `item_title` must not be null
+* The composite `[item_id, item_title]` must be unique
+* One `item_id` corresponds to only ONE `item_title` (in other words, there must not be two items with different titles but the same `item_id`), and vice versa
+* `vendor_code` must reference the `code` column in the `vendors` table
 
-And then execute:
+Then the validation command could be:
+```
+ivalidate --host=localhost --user=postgres --database=mydb --table=items --log-to=validation_errors \
+  --not-null="item_id" \
+  --not-null="item_title" \
+  --unique="item_id,item_title" \
+  --consistent-by="item_id|item_title" \
+  --consistent-by="item_title|item_id" \
+  --cross-reference="vendor_code|vendors.code"
+```
+Validation results for every single record are logged to an additional column named `validation_errors`
+on the `items` table, as specified by the `--log-to` switch.
 
-    $ bundle
+As you can see, most common checks can be performed using the supported switches:
+```
+--not-null
+--unique
+--consistent-by
+--cross-reference
+```
+For more generic checks, we support some other switches:
 
-Or install it yourself as:
+The `--match="field/pattern/"` switch tells the program to check whether the value of `field` matches the provided `pattern` (a regular expression).
+For example:
+```
+# Check if item_id is a number:
+ivalidate --match="item_id/[0-9]+/"
+
+# Check if the value of status is either 'A' or 'I' (any other value is not allowed)
+ivalidate --match="status/^(A|I)$/"
+```
+In case you need even more customized validation than the supported checks (match, unique, not-null, cross-reference...),
+the `--query` switch may be helpful. For example:
+```
+ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
+```
+You can also use `--rquery`, the reversed counterpart of `--query`.
+For example, the following two checks are equivalent:
+```
+ivalidate --query="string_to_date(start_date) >= '01/02/2014' -- invalid date"
+ivalidate --rquery="string_to_date(start_date) < '01/02/2014' -- invalid date"
+```
+(mark any record whose `start_date < '01/02/2014'` as "invalid date")
 
-    $ gem install idata
+Note: run `ivalidate --help` to see the full list of supported switches.
 
-## Usage
 
-TODO: Write usage instructions here
+# PUT IT ALL TOGETHER
+You can put several `ivalidate` commands (for several data tables) in one single bash/sh file.
+Besides `ivalidate`, we also provide some other utilities to:
++ Load data from text files into SQL tables
++ Modify data tables
++ Generate summary reports
+
+For a full example, see our `sample.sh`
 
-## Contributing
 
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
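As an aside on the `--match` switch documented above: per the README and the SQL in `bin/ivalidate`, a value is flagged only when it is present, non-blank, and fails the regex (the `WHERE` clause is `field IS NOT NULL AND length(trim(field)) <> 0 AND field !~ 'regexp'`). A minimal Ruby sketch of that rule, with a hypothetical helper name (`match_failure?` is not part of the gem):

```ruby
# Sketch of the --match rule: a value fails only when it is present,
# non-blank, and does not match the pattern. NULL/blank values are
# deliberately skipped, mirroring ivalidate's WHERE clause.
def match_failure?(value, pattern)
  return false if value.nil?          # NULL values are skipped
  return false if value.strip.empty?  # blank values are skipped
  !Regexp.new(pattern).match?(value)  # flag non-matching values only
end

# The README's status check, --match="status/^(A|I)$/"
match_failure?('A', '^(A|I)$')   # => false (allowed value)
match_failure?('X', '^(A|I)$')   # => true  (flagged)
match_failure?(nil, '^(A|I)$')   # => false (null is --not-null's job)
```

Note the caveat that Ruby and PostgreSQL regex dialects differ in corner cases (e.g. `^`/`$` anchoring), so this is an approximation of the server-side `!~` check, not a replacement for it.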
data/bin/ivalidate CHANGED
@@ -69,6 +69,10 @@ parser = OptionParser.new("", 24) do |opts|
   opts.on("--log-to FIELD", "Field to log error to") do |v|
     $options[:log_to] = v
   end
+
+  opts.on("--pretty", "Use more human-readable error message") do |v|
+    $options[:pretty] = v
+  end
 
   opts.on("-h", "--host HOST", "PostgreSQL host") do |v|
     $options[:host] = v
@@ -270,8 +274,16 @@ $options[:not_null].each do |field|
   eos
 
   ActiveRecord::Base.connection.execute(not_null_sql)
+
 rescue Exception => ex
-  puts " --> *** ERROR ***: #{ex.message.split("\n").first }"
+  if ex.message.include?('PG::UndefinedColumn')
+    not_null_sql = <<-eos
+      UPDATE #{$options[:table]} SET #{$options[:log_to]} = array_to_string(string_to_array(#{$options[:log_to]}, ' || ') || string_to_array('Column `#{field}` is missing', ' || '), ' || ')
+    eos
+    ActiveRecord::Base.connection.execute(not_null_sql)
+  else
+    puts " --> *** ERROR ***: #{ex.message.split("\n").first }"
+  end
 end
@@ -283,9 +295,11 @@ $options[:match].each do |value|
   field = value[/^[^\/]+/]
   regexp = value[/(?<=\/).*(?=\/)/]
   puts "Checking REGEXP matching: #{field} ~ #{regexp}"
-
+
+  msg = $options[:pretty] ? "Invalid #{field}" : "#{field} does not match [#{regexp.gsub("'", "''")}]"
+
   match_sql = <<-eos
-    UPDATE #{$options[:table]} SET #{$options[:log_to]} = array_to_string(string_to_array(#{$options[:log_to]}, ' || ') || string_to_array('#{field} does not match [#{regexp.gsub("'", "''")}]', ' || '), ' || ')
+    UPDATE #{$options[:table]} SET #{$options[:log_to]} = array_to_string(string_to_array(#{$options[:log_to]}, ' || ') || string_to_array('#{msg}', ' || '), ' || ')
     WHERE #{field} IS NOT NULL AND length(trim(#{field})) <> 0 AND #{field} !~ '#{regexp}';
   eos
data/lib/idata/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Idata
-  VERSION = "0.1.26"
+  VERSION = "0.1.27"
 end
data/sample.sh ADDED
@@ -0,0 +1,143 @@
+# Idea: to validate data in a text file, we first load it into a data table,
+# then use the validation utilities to validate the table.
+#
+# The following steps are performed by this script:
+# * 1 Load raw text files into corresponding SQL tables
+# * 2 Perform validation
+# * 3 Generate reports
+#
+# @author NghiPM
+# @date May 2014
+
+###################################################################################
+# SET UP ENVIRONMENT VARIABLES
+###################################################################################
+# Instead of passing PostgreSQL credentials as parameters to every validation command,
+# you can set the corresponding environment variables, which are picked up by those commands
+export HOST="localhost"
+export USERNAME="postgres"
+export PASSWORD="postgres"
+export DATABASE="northeast_georgia"
+export LISTEN=5432
+
+# Input file paths and corresponding table names
+FVENDOR="VendorMaster.csv"
+VENDOR="vendors"
+
+FITEM="ItemMaster.csv"
+ITEM="items"
+
+# Specify a temp folder for writing temporary outputs
+# and the path to the output summary report
+TMP="/tmp"
+REPORT="/tmp/report.xls"
+
+###################################################################################
+# STEP 1 - Load raw text files into corresponding SQL tables
+###################################################################################
+
+# Load data from VendorMaster.csv into the vendors table
+# and from ItemMaster.csv into the items table.
+# Note: instead of using the iload utility, you can use PostgreSQL's COPY command
+iload -i "$FVENDOR" -t "$VENDOR" -f csv
+iload -i "$FITEM" -t "$ITEM" -f csv
+
+###################################################################################
+# STEP 2 - Perform validation, log the results to an additional field
+###################################################################################
+# Validate the vendors table
+ivalidate --table=$VENDOR --log-to=validation_errors \
+  --not-null="vendor_code" \
+  --not-null="vendor_name" \
+  --unique="vendor_code" \
+  --unique="vendor_name" \
+  --match="vendor_code/[a-zA-Z0-9]/" \
+  --match="vendor_name/[a-zA-Z0-9]/" \
+  --consistent-by="vendor_code|vendor_name" \
+  --consistent-by="vendor_name|vendor_code" \
+  --consistent-by="country_code|country_name" \
+  --consistent-by="country_name|country_code"
+
+# Validate the items table
+ivalidate --table=$ITEM \
+  --log-to=validation_errors \
+  --not-null="item_id" \
+  --match="item_id/[a-zA-Z0-9]/" \
+  --not-null="item_desc" \
+  --match="item_desc/[a-zA-Z0-9]/" \
+  --not-null="item_uom" \
+  --not-null="default_uom" \
+  --not-null="item_price" \
+  --not-null="item_qoe" \
+  --not-null="corp_id" \
+  --not-null="corp_name" \
+  --not-null="vendor_code" \
+  --not-null="vendor_name" \
+  --not-null="mfr_number" \
+  --not-null="mfr_name" \
+  --not-null="active" \
+  --match="corp_id/[a-zA-Z0-9]/" \
+  --match="corp_name/[a-zA-Z0-9]/" \
+  --match="vendor_code/[a-zA-Z0-9]/" \
+  --match="vendor_name/[a-zA-Z0-9]/" \
+  --match="mfr_number/[a-zA-Z0-9]/" \
+  --match="mfr_name/[a-zA-Z0-9]/" \
+  --match="active/^(1|2|3|A|I)$/" \
+  --consistent-by="corp_id|corp_name" \
+  --consistent-by="corp_name|corp_id" \
+  --consistent-by="vendor_code|vendor_name" \
+  --consistent-by="vendor_name|vendor_code" \
+  --cross-reference="vendor_code|$VENDOR.vendor_code" \
+  --cross-reference="vendor_name|$VENDOR.vendor_name"
+
+###################################################################################
+# STEP 3 - Generate summary report
+###################################################################################
+# After the validation step above, an additional field named validation_errors
+# is added to every table. Whenever a record fails a validation criterion, a
+# corresponding error is logged to this field.
+# One record may have more than one error logged.
+#
+# You can simply look at the validation_errors field to see the errors associated with a record.
+#
+# To make a MORE comprehensive report, we can:
+# 1 Create a summary table which tells us how many errors were found, how many records are associated with each...
+# 2 Extract the first 1000 sample records for every error
+# 3 Put it all together into one single Excel report
+
+
+# 1) Create the error summary table and write it to /tmp/summary.csv
+# This can be done using the iexport utility, which can generate a CSV file from a data table or from a custom query.
+# Run iexport --help for more information
+iexport --output="$TMP/summary.csv" -f csv --no-quote-empty --quotes --headers \
+  --query="(select '$FVENDOR' as input_file, unnest(string_to_array(validation_errors, ' || ')) as error, count(*), round((count(*) * 100)::numeric / (select count(*) from $VENDOR), 2)::varchar || '%' as percentage from $VENDOR group by error order by error) union
+           (select '$FITEM' as input_file, unnest(string_to_array(validation_errors, ' || ')) as error, count(*), round((count(*) * 100)::numeric / (select count(*) from $ITEM), 2)::varchar || '%' as percentage from $ITEM group by error order by error)"
+
+# 2) Export the first 1000 records of every error in the vendors table
+# Write the results to /tmp/vendors.csv
+iexport --table=$VENDOR --output="$TMP/$VENDOR.csv" -f csv --no-quote-empty --quotes --headers \
+  --query="select * from (select ROW_NUMBER() OVER (PARTITION BY error) AS group_index, *
+           FROM ( select unnest(string_to_array(validation_errors, ' || ')) as error, * from
+           $VENDOR order by id ) as main) as tmp
+           where group_index <= 1000" \
+  --exclude="id, validation_errors, group_index"
+
+# 3) Export the first 1000 records of every error in the items table
+# Write the results to /tmp/items.csv
+iexport --table=$ITEM --output="$TMP/$ITEM.csv" -f csv --no-quote-empty --quotes --headers \
+  --query="select * from (select ROW_NUMBER() OVER (PARTITION BY error) AS group_index, *
+           FROM ( select unnest(string_to_array(validation_errors, ' || ')) as error, * from
+           $ITEM order by id ) as main) as tmp
+           where group_index <= 1000" \
+  --exclude="id, validation_errors, group_index"
+
+# 4) Put the above 3 CSV files into one Excel file /tmp/report.xls
+# This can be done using imerge, which takes a list of CSV files and puts them into corresponding sheets
+# of one single Excel file
+imerge --output=$REPORT \
+  --input="Summary:$TMP/summary.csv" \
+  --input="$FVENDOR:$TMP/$VENDOR.csv" \
+  --input="ItemMaster:$TMP/$ITEM.csv"
+
+# CLEANUP
+# Remember to drop the temporary tables you created (items and vendors)
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: idata
 version: !ruby/object:Gem::Version
-  version: 0.1.26
+  version: 0.1.27
 platform: ruby
 authors:
 - Nghi Pham
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-07-04 00:00:00.000000000 Z
+date: 2014-07-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -92,10 +92,10 @@ files:
 - bin/ipatch
 - bin/isanitize
 - bin/ivalidate
-- guideline/Guideline.docx
 - idata.gemspec
 - lib/idata.rb
 - lib/idata/version.rb
+- sample.sh
 homepage: http://bolero.vn
 licenses:
 - MIT