utf8_sanitizer 0.0.2.pre.rc.04 → 1.01
- checksums.yaml +4 -4
- data/.rspec_status +4 -4
- data/README.md +65 -56
- data/Rakefile +6 -2
- data/lib/utf8_sanitizer/csv/seeds_dirty_1.csv +2 -0
- data/lib/utf8_sanitizer/utf.rb +55 -47
- data/lib/utf8_sanitizer/version.rb +1 -1
- data/lib/utf8_sanitizer.rb +4 -20
- data/utf8_sanitizer.gemspec +5 -5
- metadata +24 -21
- data/lib/utf8_sanitizer/seed.rb +0 -74
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0de542eedb064eda2b7a85eda41c674dc7ae822e5bff19f0758bf7418ace84ee
+  data.tar.gz: b252e6c2aa92f32c2068ba59fb8f0719afedb034c47c8513c3017dda9588933b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8bf27abb9db602ab114a0606cb83b1e16d931bec77901381e5ad851334b58027c13cd1aa5942607334cf57b66a1394ff50984d62f17c8bac21cdcafe7907e074
+  data.tar.gz: 1e8bcbb08ef7bd8db08af0a79ba9dc06d824634197f005abfe5fc8e390703be0729a0c1ce568378d4fc22266db871e635df4b2237f54bf4eb3b023235bfea5d4
data/.rspec_status
CHANGED
@@ -1,4 +1,4 @@
-example_id
-
-./spec/utf8_sanitizer_spec.rb[1:1]
-./spec/utf8_sanitizer_spec.rb[1:2] |
+example_id                           | status | run_time               |
+------------------------------------ | ------ | ---------------------- |
+./spec/utf8_sanitizer_spec.rb[1:1]   | passed | 2 minutes 40.4 seconds |
+./spec/utf8_sanitizer_spec.rb[1:2:1] | passed | 0.0017 seconds         |
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Utf8Sanitizer
 
-Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
+Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.
 
 Example:
 ```
@@ -35,7 +35,9 @@ Or install it yourself as:
 
 ## Usage
 
-
+Options for UTF8 Sanitizing data:
+1. CSV Parsing
+2. Data Hash of strings
 
 #### 1. CSV Parsing
 This is a good option if you are having problems with a CSV containing non-UTF8 characters. Pass your file_path as a hash like below. Hash MUST be a SYMBOL and named `:file_path`. If not, default seeds will be passed as the system detects empty user input and thinks user is trying to run built-in seed data for testing.
@@ -47,29 +49,23 @@ sanitized_data = Utf8Sanitizer.sanitize(args)
 #### 2. Hash of Strings
 This is a good option if you are scraping data or cleaning up existing databases. Pass your data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below and can be any size from one to many tens of thousands. The hashes inside the data array can be named anything from crm contact data like below, stats, recipes, or any custom hashes as long as they are in an array and resemble the syntax and structure like below.
 ```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-sanitized_data = Utf8Sanitizer.sanitize(data:
-```
-
-#### 3. Run Seed Data to Test
-If you want to run built-in seed data to first test, simply run as below without passing args.
-```
-sanitized_data = Utf8Sanitizer.sanitize
+array_of_hashes = [ { url: 'abc_autos_example.com',
+                      act_name: 'ABC Aut\x92os',
+                      street: '123 E Main St\r\n',
+                      city: 'Austin',
+                      state: 'TX',
+                      zip: '78735',
+                      phone: '(888) 555-1234\r\n' },
+                    { url: 'xyz_trucks_example',
+                      act_name: 'XYZ Aut\xC1os',
+                      street: '456 W Main St\r\n',
+                      city: 'Austin',
+                      state: 'TX',
+                      zip: '78735',
+                      phone: '(800) 555-5678\r\n' },
+                  }]
+
+sanitized_data = Utf8Sanitizer.sanitize(data: array_of_hashes)
 ```
 
 ### Returned Sanitized Data Format
@@ -79,39 +75,52 @@ The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows`
 
 `:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
 
-`:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
+`:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
+
+**You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
 
-`:pollute_seeds` is only for running seed data. It injects each row with non-UTF8 and extra whitespace for testing. It can be ignored and will only run if your input is nil, which tells the system that you are intentionally trying to run seed data for testing.
 ```
-{:
-{
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-:
-
-
-
-
-
-
-
-
-
-
-
-
+{ stats:
+  {
+    total_rows: 2,
+    header_row: 1,
+    valid_rows: 2,
+    error_rows: 0,
+    defective_rows: 0,
+    perfect_rows: 0,
+    encoded_rows: 2,
+    wchar_rows: 2
+  },
+  file_path: nil,
+  data:
+  {
+    valid_data:
+    [
+      { row_id: '1',
+        utf_status: 'encoded, wchar',
+        url: 'abc_autos_example.com',
+        act_name: 'ABC Autos Example',
+        street: '123 E Main St',
+        city: 'Austin',
+        state: 'TX',
+        zip: '78735',
+        phone: '(888) 555-1234' },
+      { row_id: '2',
+        utf_status: 'encoded, wchar',
+        url: 'xyz_trucks_example',
+        act_name: 'XYZ Trucks Example',
+        street: '456 W Main St',
+        city: 'Austin',
+        state: 'TX',
+        zip: '78735',
+        phone: '(800) 555-4321' }
+    ],
+    encoded_data: [{ row_id: 1, text: "1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n" },
+                   { row_id: 2, text: "2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n" }],
+    defective_data: [],
+    error_data: []
+  }
+}
 ```
 
 ## Development
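The README hunk above documents the returned hash and how to reach `[:data][:valid_data]` and `[:data][:encoded_data]`. A minimal sketch of that navigation follows. It does not call the gem: `SAMPLE_RESULT` is a hand-written, shortened stand-in for the documented return value, and `cleaned_names` and `flagged_row_ids` are hypothetical helper names.

```ruby
# Hand-written stand-in shaped like the hash the README documents.
# It is an assumption for illustration, not output captured from the gem.
SAMPLE_RESULT = {
  stats: { total_rows: 2, header_row: 1, valid_rows: 2, error_rows: 0,
           defective_rows: 0, perfect_rows: 0, encoded_rows: 2, wchar_rows: 2 },
  file_path: nil,
  data: {
    valid_data: [
      { row_id: '1', utf_status: 'encoded, wchar', act_name: 'ABC Autos Example' },
      { row_id: '2', utf_status: 'encoded, wchar', act_name: 'XYZ Trucks Example' }
    ],
    encoded_data: [
      { row_id: 1, text: "1,abc_autos_example.com,ABC Autos Example\x98_\xC0\r\n" },
      { row_id: 2, text: "2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example\r\n" }
    ],
    defective_data: [],
    error_data: []
  }
}.freeze

# Hypothetical helper: the cleaned values live under [:data][:valid_data].
def cleaned_names(result)
  result.dig(:data, :valid_data).map { |row| row[:act_name] }
end

# Hypothetical helper: the original, unsanitized lines are kept under
# [:data][:encoded_data], keyed by row_id for before/after comparison.
def flagged_row_ids(result)
  result.dig(:data, :encoded_data).map { |row| row[:row_id] }
end
```

With the real gem the same two lookups would presumably be applied to whatever variable holds the result of `Utf8Sanitizer.sanitize`.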
data/Rakefile
CHANGED
@@ -10,10 +10,14 @@ task :test => :spec
 task :console do
   require 'irb'
   require 'irb/completion'
-  require 'utf8_sanitizer'
+  require 'utf8_sanitizer'
   require "active_support/all"
   ARGV.clear
-
+  orig_hashes = [{ :row_id=>"1", :url=>"stanleykaufman.com", :act_name=>"Stanley Chevrolet Kaufman\x99_\xCC", :street=>"825 E Fair St", :city=>"Kaufman", :state=>"TX", :zip=>"75142", :phone=>"(888) 457-4391\r\n" }]
+
+  # sanitized_data = Utf8Sanitizer.sanitize(file_path: './lib/utf8_sanitizer/csv/seeds_dirty_1.csv')
+  # sanitized_data = Utf8Sanitizer.sanitize(data: orig_hashes)
   sanitized_data = Utf8Sanitizer.sanitize
+  puts sanitized_data.inspect
   IRB.start
 end
data/lib/utf8_sanitizer/utf.rb
CHANGED
@@ -1,8 +1,10 @@
 # frozen_string_literal: false
-
+require 'csv'
 
 module Utf8Sanitizer
   class UTF
+    attr_accessor :headers, :valid_rows, :encoded_rows, :row_id, :data_hash, :defective_rows, :error_rows
+
     def initialize(args={})
       @valid_rows = []
       @encoded_rows = []
@@ -13,19 +15,6 @@ module Utf8Sanitizer
       @data_hash = {}
     end
 
-    #################### * VALIDATE DATA * ####################
-    def validate_data(args={})
-      args = args.slice(:file_path, :data, :pollute_seeds)
-      args = args.compact
-      @seed = Seed.new if args[:pollute_seeds]
-      file_path = args[:file_path]
-      data = args[:data]
-
-      utf_result = validate_csv(file_path) if file_path
-      utf_result = validate_hashes(data) if data
-      utf_result
-    end
-
     #################### * COMPILE RESULTS * ####################
     def compile_results
       utf_status = @valid_rows.map { |hsh| hsh[:utf_status] }
@@ -35,44 +24,38 @@ module Utf8Sanitizer
       perfect = groups['perfect']
 
       header_row_count = @headers.any? ? 1 : 0
+
       utf_result = {
         stats: { total_rows: @row_id, header_row: header_row_count, valid_rows: @valid_rows.count, error_rows: @error_rows.count, defective_rows: @defective_rows.count, perfect_rows: perfect, encoded_rows: @encoded_rows.count, wchar_rows: wchar },
         data: { valid_data: @valid_rows, encoded_data: @encoded_rows, defective_data: @defective_rows, error_data: @error_rows }
       }
+      utf_result
     end
 
-
-
-
-
-
-
-
-
-
-
-
-              @data_hash.merge!(row_to_hsh(row))
-              @valid_rows << @data_hash
-            end
-          end
-        end
-      rescue StandardError => error
-        @error_rows << { row_id: @row_id, text: error.message }
-      end
-      compile_results
+
+    #################### * VALIDATE DATA * ####################
+    def validate_data(args={})
+      args = args.slice(:file_path, :data)
+      args = args.compact
+      file_path = args[:file_path]
+      data = args[:data]
+
+      utf_result = validate_csv(file_path) if file_path
+      utf_result = validate_hashes(data) if data
+      utf_result
     end
 
     #################### * VALIDATE HASHES * ####################
     def validate_hashes(orig_hashes)
       return unless orig_hashes.present?
       begin
-        process_hash_row(orig_hashes.first) ##
-        orig_hashes.each { |hsh| process_hash_row(hsh) } ##
+        process_hash_row(orig_hashes.first) ## keys for headers.
+        orig_hashes.each { |hsh| process_hash_row(hsh) } ## values
       rescue StandardError => error
         @error_rows << { row_id: @row_id, text: error.message }
       end
-      compile_results ## handles returns.
+      results = compile_results ## handles returns.
+      results
     end
 
     ### process_hash_row - helper VALIDATE HASHES ###
@@ -86,7 +69,9 @@ module Utf8Sanitizer
       end
 
       file_line = keys_or_values.join(',')
-
+      validated_line = utf_filter(check_utf(file_line))
+      res = line_parse(validated_line)
+      res
     end
 
     ### line_parse - helper VALIDATE HASHES ###
@@ -105,9 +90,9 @@ module Utf8Sanitizer
 
     #################### * CHECK UTF * ####################
     def check_utf(text)
-      return
-      text = @seed.pollute_seeds(text) if @seed && @headers.any?
+      return if text.nil?
       results = { text: text, encoded: nil, wchar: nil, error: nil }
+
       begin
         if !text.valid_encoding?
           encoded = text.chars.select(&:valid_encoding?).join
@@ -128,7 +113,7 @@ module Utf8Sanitizer
     #################### * UTF FILTER * ####################
     def utf_filter(utf)
       return unless utf.present?
-      puts utf.inspect
+      # puts utf.inspect
       utf_status = utf.except(:text).compact.keys
       utf_status = utf_status&.map(&:to_s)&.join(', ')
       utf_status = 'perfect' if utf_status.blank?
@@ -145,6 +130,30 @@ module Utf8Sanitizer
       line
     end
 
+
+    #################### * VALIDATE CSV * ####################
+    def validate_csv(file_path)
+      return unless file_path.present?
+      File.open(file_path).each do |file_line|
+        validated_line = utf_filter(check_utf(file_line))
+        @row_id += 1
+        if validated_line
+          CSV.parse(validated_line) do |row|
+            if @headers.empty?
+              @headers = row
+            else
+              @data_hash.merge!(row_to_hsh(row))
+              @valid_rows << @data_hash
+            end
+          end
+        end
+      rescue StandardError => error
+        @error_rows << { row_id: @row_id, text: error.message }
+      end
+      utf_results = compile_results
+    end
+
+
     ############# !! HELPERS BELOW !! #############
     ############# KEY VALUE CONVERTERS #############
     def row_to_hsh(row)
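The `validate_csv` method added in this hunk reads the file one line at a time, cleans each line, and only then hands it to `CSV.parse`, treating the first parsed row as headers. The sketch below illustrates that per-line pattern in isolation. It is an assumption-laden simplification, not the gem's code: `parse_lines` is a hypothetical name, it reads from any IO object instead of a file path, it strips invalid bytes inline instead of calling `check_utf` and `utf_filter`, and it rescues only `CSV::MalformedCSVError`.

```ruby
require 'csv'
require 'stringio'

# Hypothetical, simplified take on the per-line loop in validate_csv.
def parse_lines(io)
  headers = []
  rows = []
  errors = []
  row_id = 0

  io.each_line do |line|
    row_id += 1
    begin
      # Drop characters that are invalid in the line's encoding, then trim
      # the trailing newline or carriage return before parsing.
      clean = line.chars.select(&:valid_encoding?).join.strip
      CSV.parse(clean) do |fields|
        if headers.empty?
          headers = fields.map(&:to_sym)  # first parsed row becomes the headers
        else
          rows << headers.zip(fields).to_h.merge(row_id: row_id)
        end
      end
    rescue CSV::MalformedCSVError => e
      # A bad line is recorded and skipped; later lines are still processed.
      errors << { row_id: row_id, text: e.message }
    end
  end

  { headers: headers, rows: rows, errors: errors }
end
```

Parsing line by line means one malformed row costs only that row, at the price of not supporting quoted fields that span several physical lines.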
@@ -152,15 +161,14 @@ module Utf8Sanitizer
       h.symbolize_keys
     end
 
-    def val_hsh(cols, hsh)
-      keys = hsh.keys
-      keys.each { |key| hsh.delete(key) unless cols.include?(key) }
-      hsh
-    end
-
     def make_groups_from_array(array)
       array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1; }
     end
 
+    # def val_hsh(cols, hsh)
+    #   keys = hsh.keys
+    #   keys.each { |key| hsh.delete(key) unless cols.include?(key) }
+    #   hsh
+    # end
   end
 end
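The `check_utf` and `utf_filter` hunks above show only fragments: invalid characters are dropped with `text.chars.select(&:valid_encoding?).join`, and the non-nil flags are joined into a `utf_status` string that falls back to `'perfect'`. A minimal plain-Ruby sketch of that idea follows. `scrub_line` and `status_for` are hypothetical names, the whitespace rule (collapse runs, trim ends) is an assumption because the gem's own rule is not visible in the diff, and ActiveSupport helpers such as `except` and `blank?` are replaced with core methods.

```ruby
# Hypothetical helper, not the gem's check_utf. It mirrors the
# chars.select(&:valid_encoding?) fragment visible in the diff.
def scrub_line(text)
  return if text.nil?

  result = { text: text, encoded: nil, wchar: nil }

  unless text.valid_encoding?
    # Keep only the characters that are valid in the string's encoding.
    result[:text] = text.chars.select(&:valid_encoding?).join
    result[:encoded] = true
  end

  # Assumed whitespace rule: collapse runs (including \r\n and tabs) and trim.
  squeezed = result[:text].gsub(/\s+/, ' ').strip
  if squeezed != result[:text]
    result[:text] = squeezed
    result[:wchar] = true
  end

  result
end

# Hypothetical helper echoing utf_filter: join the names of the flags that
# were set, or report 'perfect' when none were.
def status_for(result)
  flags = result.reject { |key, value| key == :text || value.nil? }.keys
  flags.empty? ? 'perfect' : flags.map(&:to_s).join(', ')
end
```

Invalid bytes are removed before the regular expression runs because `gsub` raises on strings with broken encoding.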
data/lib/utf8_sanitizer.rb
CHANGED
@@ -1,29 +1,13 @@
-require
-require 'utf8_sanitizer/seed'
+require 'utf8_sanitizer/version'
 require 'utf8_sanitizer/utf'
 require 'pry'
 
 module Utf8Sanitizer
-
   ## Args must include :data or :file_path, else seeds will run by default.
-  def self.sanitize(args={})
-    keys = args.compact.keys
+  def self.sanitize(args = {})
     input = { stats: nil, file_path: nil, data: nil }.merge(args)
 
-
-
-      ## Toggle data[:file_path] & data[:data] to test csv parsing or data hashes.
-      # input[:file_path] = Seed.new.grab_seed_file_path
-      input[:data] = Seed.new.grab_seed_hashes
-
-      ## For Testing: Pollute_seeds adds non-utf8 chars to each line.
-      input[:pollute_seeds] = true
-    end
-
-    ## Sanitizes input hash, then merges results to original input hash, and returns as sanitized_data.
-    sanitized_data = input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
-    sanitized_data
+    return input unless input.compact.any?
+    sanitized_data = input.merge(Utf8Sanitizer::UTF.new.validate_data(input))
   end
-
-
 end
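With the seed fallback removed, the new `sanitize` returns the all-nil default hash when neither `:file_path` nor `:data` is supplied, and it uses the non-mutating `merge` for the result. The sketch below isolates that guard. `sanitize_sketch` and `fake_validate` are hypothetical stand-ins (the latter replaces `Utf8Sanitizer::UTF#validate_data`), so this shows only the control flow, not the gem's output.

```ruby
DEFAULT_INPUT = { stats: nil, file_path: nil, data: nil }.freeze

# Hypothetical stand-in for Utf8Sanitizer::UTF#validate_data.
def fake_validate(input)
  rows = input[:data] || []
  { stats: { total_rows: rows.size }, data: { valid_data: rows } }
end

# Mirrors the guard in the new self.sanitize: with no usable arguments the
# defaults come back untouched and no validation runs.
def sanitize_sketch(args = {})
  input = DEFAULT_INPUT.merge(args)
  return input unless input.compact.any?

  # merge (not merge!) builds a new hash, leaving `input` unchanged.
  input.merge(fake_validate(input))
end
```

Note that passing `data: nil` explicitly behaves the same as passing nothing, because `compact` drops nil values before `any?` is checked.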
data/utf8_sanitizer.gemspec
CHANGED
@@ -14,11 +14,11 @@ Gem::Specification.new do |spec|
   spec.homepage = 'https://github.com/4rlm/utf8_sanitizer'
   spec.license = 'MIT'
 
-  spec.summary =
-  spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
+  spec.summary = 'Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.'
+  spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
 
   if spec.respond_to?(:metadata)
-    spec.metadata['allowed_push_host'] =
+    spec.metadata['allowed_push_host'] = 'https://rubygems.org'
   else
     raise 'RubyGems 2.0 or newer is required to protect against ' \
       'public gem pushes.'
@@ -31,7 +31,7 @@ Gem::Specification.new do |spec|
   spec.bindir = 'exe'
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ['lib']
-  spec.post_install_message =
+  spec.post_install_message = 'Thanks for installing utf8_sanitizer!'
 
   spec.required_ruby_version = '~> 2.5.1'
   spec.add_dependency 'activesupport', '~> 5.2', '>= 5.2.0'
@@ -40,11 +40,11 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency 'byebug', '~> 10.0', '>= 10.0.2'
   spec.add_development_dependency 'class_indexer', '~> 0.3.0'
   spec.add_development_dependency 'irbtools', '~> 2.2', '>= 2.2.1'
+  spec.add_development_dependency 'pry', '~> 0.11.3'
   spec.add_development_dependency 'rake', '~> 12.3', '>= 12.3.1'
   spec.add_development_dependency 'rspec', '~> 3.7'
   spec.add_development_dependency 'rubocop', '~> 0.56.0'
   spec.add_development_dependency 'ruby-beautify', '~> 0.97.4'
-  spec.add_development_dependency "pry", "~> 0.11.3"
   # spec.add_runtime_dependency 'library', '~> 2.2'
   # spec.add_dependency 'activerecord', '>= 3.0'
   # spec.add_dependency 'actionpack', '>= 3.0'
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: utf8_sanitizer
 version: !ruby/object:Gem::Version
-  version:
+  version: '1.01'
 platform: ruby
 authors:
 - Adam Booth
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-06-
+date: 2018-06-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -104,6 +104,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: 2.2.1
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.11.3
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.11.3
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -166,22 +180,8 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.97.4
-- !ruby/object:Gem::Dependency
-  name: pry
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.11.3
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.11.3
 description: |-
-  Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
+  Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.
    Example: ABC Au\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\n\r\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234
 email:
 - 4rlm@protonmail.ch
@@ -203,12 +203,12 @@ files:
 - lib/utf8_sanitizer/csv/extensions.csv
 - lib/utf8_sanitizer/csv/seeds_clean.csv
 - lib/utf8_sanitizer/csv/seeds_dirty.csv
+- lib/utf8_sanitizer/csv/seeds_dirty_1.csv
 - lib/utf8_sanitizer/csv/seeds_mega.csv
 - lib/utf8_sanitizer/csv/seeds_mini.csv
 - lib/utf8_sanitizer/csv/seeds_mini.csv,
 - lib/utf8_sanitizer/csv/seeds_mini_10.csv
 - lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv
-- lib/utf8_sanitizer/seed.rb
 - lib/utf8_sanitizer/utf.rb
 - lib/utf8_sanitizer/version.rb
 - utf8_sanitizer.gemspec
@@ -228,13 +228,16 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: 2.5.1
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
-summary: Removes invalid UTF8 characters & extra whitespace
+summary: Removes invalid UTF8 characters & extra whitespace (carriage returns, new
+  lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating
+  row numbers containing non-UTF8 and extra whitespace, and before and after to compare
+  changes.
 test_files: []
data/lib/utf8_sanitizer/seed.rb
DELETED
@@ -1,74 +0,0 @@
-require 'csv'
-
-module Utf8Sanitizer
-  class Seed
-    def initialize(args={})
-      # @pollute_seeds = args.fetch(:pollute_seeds, false)
-      # @seed_hashes = args.fetch(:seed_hashes, false)
-      # @seed_csv = args.fetch(:seed_csv, false)
-    end
-
-    def pollute_seeds(text)
-      list = ['h∑', 'lÔ', "\x92", "\x98", "\x99", "\xC0", "\xC1", "\xC2", "\xCC", "\xDD", "\xE5", "\xF8"]
-      index = text.length / 2
-      var = "#{list.sample}_#{list.sample}"
-      text.insert(index, var)
-      text.insert(-1, "\r\n")
-      text
-    end
-
-    def grab_seed_file_path
-      # "./lib/utf8_sanitizer/csv/seeds_clean.csv"
-      "./lib/utf8_sanitizer/csv/seeds_dirty.csv"
-      # "./lib/utf8_sanitizer/csv/seeds_mega.csv"
-      # "./lib/utf8_sanitizer/csv/seeds_mini.csv"
-      # "./lib/utf8_sanitizer/csv/seeds_mini_10.csv"
-      # './lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv'
-    end
-
-    ### Sample Hashes for validate_data
-    def grab_seed_hashes
-      [{ row_id: 1,
-         url: 'stanleykaufman.com',
-         act_name: 'Stanley Chevrolet Kaufman',
-         street: '825 E Fair St',
-         city: 'Kaufman',
-         state: 'TX',
-         zip: '75142',
-         phone: '(888) 457-4391' },
-       { row_id: 2,
-         url: 'leepartyka',
-         act_name: 'Lee Partyka Chevrolet Mazda Isuzu Truck',
-         street: '200 Skiff St',
-         city: 'Hamden',
-         state: 'CT',
-         zip: '6518',
-         phone: '(203) 288-7761' },
-       { row_id: 3,
-         url: 'burienhonda.fake.not.net.com',
-         act_name: 'Honda of Burien 15026 1st Avenue South, Burien, WA 98148',
-         street: '15026 1st Avenue South',
-         city: 'Burien',
-         state: 'WA',
-         zip: '98148',
-         phone: '(206) 246-9700' },
-       { row_id: 4,
-         url: 'cortlandchryslerdodgejeep.com',
-         act_name: 'Cortland Chrysler Dodge Jeep RAM',
-         street: '3878 West Rd',
-         city: 'Cortland',
-         state: 'NY',
-         zip: '13045',
-         phone: '(877) 279-3113' },
-       { row_id: 5,
-         url: 'imperialmotors.net',
-         act_name: 'Imperial Motors',
-         street: '4839 Virginia Beach Blvd',
-         city: 'Virginia Beach',
-         state: 'VA',
-         zip: '23462',
-         phone: '(757) 490-3651' }]
-    end
-
-  end
-end