utf8_sanitizer 0.0.2.pre.rc.04 → 1.01

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 05155a4029ddd8224888b1482972b0640b4a81ef0990d7bd8f29311a5798d309
4
- data.tar.gz: 6d3bd40596087c6e8dfa41934a44ffd2459f6bf738abf77c66aa20ce4591b641
3
+ metadata.gz: 0de542eedb064eda2b7a85eda41c674dc7ae822e5bff19f0758bf7418ace84ee
4
+ data.tar.gz: b252e6c2aa92f32c2068ba59fb8f0719afedb034c47c8513c3017dda9588933b
5
5
  SHA512:
6
- metadata.gz: 23d64036dec061d1290d186069adfe018825bbe16a89f733994bce7eb6793fd5f0d402d9cc00c57a590fcb24f6b2f14890cac0611ae54f73bec0cc9b39e060d4
7
- data.tar.gz: 972355c619c5c6ef4753df81d7a94f18b4f0e0622555e3fb4c3d227d5a31477ebeea015a10682281ace975bb322daed2752f9cf71123c19b194429893f1b6f78
6
+ metadata.gz: 8bf27abb9db602ab114a0606cb83b1e16d931bec77901381e5ad851334b58027c13cd1aa5942607334cf57b66a1394ff50984d62f17c8bac21cdcafe7907e074
7
+ data.tar.gz: 1e8bcbb08ef7bd8db08af0a79ba9dc06d824634197f005abfe5fc8e390703be0729a0c1ce568378d4fc22266db871e635df4b2237f54bf4eb3b023235bfea5d4
data/.rspec_status CHANGED
@@ -1,4 +1,4 @@
1
- example_id | status | run_time |
2
- ---------------------------------- | ------ | --------------- |
3
- ./spec/utf8_sanitizer_spec.rb[1:1] | passed | 0.00112 seconds |
4
- ./spec/utf8_sanitizer_spec.rb[1:2] | failed | 0.02392 seconds |
1
+ example_id | status | run_time |
2
+ ------------------------------------ | ------ | ---------------------- |
3
+ ./spec/utf8_sanitizer_spec.rb[1:1] | passed | 2 minutes 40.4 seconds |
4
+ ./spec/utf8_sanitizer_spec.rb[1:2:1] | passed | 0.0017 seconds |
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # Utf8Sanitizer
2
2
 
3
- Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
3
+ Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.
4
4
 
5
5
  Example:
6
6
  ```
@@ -35,7 +35,9 @@ Or install it yourself as:
35
35
 
36
36
  ## Usage
37
37
 
38
- You have three options for UTF8 Sanitizing your data: CSV Parsing, Data Hash of strings, or run default seed data to test.
38
+ Options for UTF8 Sanitizing data:
39
+ 1. CSV Parsing
40
+ 2. Data Hash of strings
39
41
 
40
42
  #### 1. CSV Parsing
41
43
  This is a good option if you are having problems with a CSV containing non-UTF8 characters. Pass your file_path as a hash like below. Hash MUST be a SYMBOL and named `:file_path`. If not, default seeds will be passed as the system detects empty user input and thinks user is trying to run built-in seed data for testing.
@@ -47,29 +49,23 @@ sanitized_data = Utf8Sanitizer.sanitize(args)
47
49
  #### 2. Hash of Strings
48
50
  This is a good option if you are scraping data or cleaning up existing databases. Pass your data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below and can be any size from one to many tens of thousands. The hashes inside the data array can be named anything from crm contact data like below, stats, recipes, or any custom hashes as long as they are in an array and resemble the syntax and structure like below.
49
51
  ```
50
- data_hash = [ { url: 'abc_autos_example.com',
51
- act_name: 'ABC Aut\x92os',
52
- street: '123 E Main St\r\n',
53
- city: 'Austin',
54
- state: 'TX',
55
- zip: '78735',
56
- phone: '(888) 555-1234\r\n' },
57
- { url: 'xyz_trucks_example',
58
- act_name: 'XYZ Aut\xC1os',
59
- street: '456 W Main St\r\n',
60
- city: 'Austin',
61
- state: 'TX',
62
- zip: '78735',
63
- phone: '(800) 555-5678\r\n' },
64
- }]
65
-
66
- sanitized_data = Utf8Sanitizer.sanitize(data: data_hash)
67
- ```
68
-
69
- #### 3. Run Seed Data to Test
70
- If you want to run built-in seed data to first test, simply run as below without passing args.
71
- ```
72
- sanitized_data = Utf8Sanitizer.sanitize
52
+ array_of_hashes = [ { url: 'abc_autos_example.com',
53
+ act_name: 'ABC Aut\x92os',
54
+ street: '123 E Main St\r\n',
55
+ city: 'Austin',
56
+ state: 'TX',
57
+ zip: '78735',
58
+ phone: '(888) 555-1234\r\n' },
59
+ { url: 'xyz_trucks_example',
60
+ act_name: 'XYZ Aut\xC1os',
61
+ street: '456 W Main St\r\n',
62
+ city: 'Austin',
63
+ state: 'TX',
64
+ zip: '78735',
65
+ phone: '(800) 555-5678\r\n' },
66
+ }]
67
+
68
+ sanitized_data = Utf8Sanitizer.sanitize(data: array_of_hashes)
73
69
  ```
74
70
 
75
71
  ### Returned Sanitized Data Format
@@ -79,39 +75,52 @@ The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows`
79
75
 
80
76
  `:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
81
77
 
82
- `:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`. **You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
78
+ `:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
79
+
80
+ **You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
83
81
 
84
- `:pollute_seeds` is only for running seed data. It injects each row with non-UTF8 and extra whitespace for testing. It can be ignored and will only run if your input is nil, which tells the system that you are intentionally trying to run seed data for testing.
85
82
  ```
86
- {:stats=>
87
- {:total_rows=>2, :header_row=>1, :valid_rows=>2, :error_rows=>0, :defective_rows=>0, :perfect_rows=>0, :encoded_rows=>2, :wchar_rows=>2},
88
- :file_path=>nil,
89
- :data=>
90
- {:valid_data=>
91
- [{:row_id=>"1",
92
- :utf_status=>"encoded, wchar",
93
- :url=>"abc_autos_example.com",
94
- :act_name=>"ABC Autos Example",
95
- :street=>"123 E Main St",
96
- :city=>"Austin",
97
- :state=>"TX",
98
- :zip=>"78735",
99
- :phone=>"(888) 555-1234"},
100
- {:row_id=>"2",
101
- :utf_status=>"encoded, wchar",
102
- :url=>"xyz_trucks_example",
103
- :act_name=>"XYZ Trucks Example",
104
- :street=>"456 W Main St",
105
- :city=>"Austin",
106
- :state=>"TX",
107
- :zip=>"78735",
108
- :phone=>"(800) 555-4321"}],
109
- :encoded_data=>
110
- [{:row_id=>1, :text=>"1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n"},
111
- {:row_id=>2, :text=>"2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n"}],
112
- :defective_data=>[],
113
- :error_data=>[]},
114
- :pollute_seeds=>true}
83
+ { stats:
84
+ {
85
+ total_rows: 2,
86
+ header_row: 1,
87
+ valid_rows: 2,
88
+ error_rows: 0,
89
+ defective_rows: 0,
90
+ perfect_rows: 0,
91
+ encoded_rows: 2,
92
+ wchar_rows: 2
93
+ },
94
+ file_path: nil,
95
+ data:
96
+ {
97
+ valid_data:
98
+ [
99
+ { row_id: '1',
100
+ utf_status: 'encoded, wchar',
101
+ url: 'abc_autos_example.com',
102
+ act_name: 'ABC Autos Example',
103
+ street: '123 E Main St',
104
+ city: 'Austin',
105
+ state: 'TX',
106
+ zip: '78735',
107
+ phone: '(888) 555-1234' },
108
+ { row_id: '2',
109
+ utf_status: 'encoded, wchar',
110
+ url: 'xyz_trucks_example',
111
+ act_name: 'XYZ Trucks Example',
112
+ street: '456 W Main St',
113
+ city: 'Austin',
114
+ state: 'TX',
115
+ zip: '78735',
116
+ phone: '(800) 555-4321' }
117
+ ],
118
+ encoded_data: [{ row_id: 1, text: "1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n" },
119
+ { row_id: 2, text: "2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n" }],
120
+ defective_data: [],
121
+ error_data: []
122
+ }
123
+ }
115
124
  ```
116
125
 
117
126
  ## Development
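A minimal sketch of reading the return value described in the README hunk above. The hash literal is a trimmed, hand-built copy of the README's sample output so the snippet runs without the gem installed; in real use it would come from `Utf8Sanitizer.sanitize(...)`. The names `valid_names` and `encoded_ids` are illustrative only.

```ruby
# Trimmed copy of the README's sample return value (hand-built, not produced by the gem).
sanitized_data = {
  stats: { total_rows: 2, header_row: 1, valid_rows: 2, error_rows: 0,
           defective_rows: 0, perfect_rows: 0, encoded_rows: 2, wchar_rows: 2 },
  file_path: nil,
  data: {
    valid_data: [
      { row_id: '1', utf_status: 'encoded, wchar', act_name: 'ABC Autos Example' },
      { row_id: '2', utf_status: 'encoded, wchar', act_name: 'XYZ Trucks Example' }
    ],
    encoded_data: [
      { row_id: 1, text: "1,abc_autos_example.com,ABC Autos Example\x98_\xC0\r\n" },
      { row_id: 2, text: "2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example\r\n" }
    ],
    defective_data: [],
    error_data: []
  }
}

# Cleaned rows live under [:data][:valid_data].
valid_rows  = sanitized_data[:data][:valid_data]
valid_names = valid_rows.map { |row| row[:act_name] }

# Original (pre-sanitizing) text of each non-UTF8 row lives under [:data][:encoded_data].
encoded_ids = sanitized_data[:data][:encoded_data].map { |row| row[:row_id] }

puts "valid names: #{valid_names.inspect}"
puts "rows that needed re-encoding: #{encoded_ids.inspect}"
```

Comparing `:encoded_data` against `:valid_data` by row id is one way to get the before-and-after view the new README sentence mentions.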
data/Rakefile CHANGED
@@ -10,10 +10,14 @@ task :test => :spec
10
10
  task :console do
11
11
  require 'irb'
12
12
  require 'irb/completion'
13
- require 'utf8_sanitizer' # You know what to do.
13
+ require 'utf8_sanitizer'
14
14
  require "active_support/all"
15
15
  ARGV.clear
16
- # sanitized_data = Utf8Sanitizer.sanitize(file_path: "./lib/utf8_sanitizer/csv/seeds_mini.csv")
16
+ orig_hashes = [{ :row_id=>"1", :url=>"stanleykaufman.com", :act_name=>"Stanley Chevrolet Kaufman\x99_\xCC", :street=>"825 E Fair St", :city=>"Kaufman", :state=>"TX", :zip=>"75142", :phone=>"(888) 457-4391\r\n" }]
17
+
18
+ # sanitized_data = Utf8Sanitizer.sanitize(file_path: './lib/utf8_sanitizer/csv/seeds_dirty_1.csv')
19
+ # sanitized_data = Utf8Sanitizer.sanitize(data: orig_hashes)
17
20
  sanitized_data = Utf8Sanitizer.sanitize
21
+ puts sanitized_data.inspect
18
22
  IRB.start
19
23
  end
data/lib/utf8_sanitizer/csv/seeds_dirty_1.csv ADDED
@@ -0,0 +1,2 @@
1
+ url,act_name,street,city,state,zip,phone
2
+ http://www.courtesyfordsales.com,Courtesy Ford,__����__����____1410 West Pine Street Hattiesburg,Wexford,MS,39401,512-555-1212
data/lib/utf8_sanitizer/utf.rb CHANGED
@@ -1,8 +1,10 @@
1
1
  # frozen_string_literal: false
2
- # require 'csv'
2
+ require 'csv'
3
3
 
4
4
  module Utf8Sanitizer
5
5
  class UTF
6
+ attr_accessor :headers, :valid_rows, :encoded_rows, :row_id, :data_hash, :defective_rows, :error_rows
7
+
6
8
  def initialize(args={})
7
9
  @valid_rows = []
8
10
  @encoded_rows = []
@@ -13,19 +15,6 @@ module Utf8Sanitizer
13
15
  @data_hash = {}
14
16
  end
15
17
 
16
- #################### * VALIDATE DATA * ####################
17
- def validate_data(args={})
18
- args = args.slice(:file_path, :data, :pollute_seeds)
19
- args = args.compact
20
- @seed = Seed.new if args[:pollute_seeds]
21
- file_path = args[:file_path]
22
- data = args[:data]
23
-
24
- utf_result = validate_csv(file_path) if file_path
25
- utf_result = validate_hashes(data) if data
26
- utf_result
27
- end
28
-
29
18
  #################### * COMPILE RESULTS * ####################
30
19
  def compile_results
31
20
  utf_status = @valid_rows.map { |hsh| hsh[:utf_status] }
@@ -35,44 +24,38 @@ module Utf8Sanitizer
35
24
  perfect = groups['perfect']
36
25
 
37
26
  header_row_count = @headers.any? ? 1 : 0
27
+
38
28
  utf_result = {
39
29
  stats: { total_rows: @row_id, header_row: header_row_count, valid_rows: @valid_rows.count, error_rows: @error_rows.count, defective_rows: @defective_rows.count, perfect_rows: perfect, encoded_rows: @encoded_rows.count, wchar_rows: wchar },
40
30
  data: { valid_data: @valid_rows, encoded_data: @encoded_rows, defective_data: @defective_rows, error_data: @error_rows }
41
31
  }
32
+ utf_result
42
33
  end
43
34
 
44
- #################### * VALIDATE CSV * ####################
45
- def validate_csv(file_path)
46
- return unless file_path.present?
47
- File.open(file_path).each do |file_line|
48
- validated_line = utf_filter(check_utf(file_line))
49
- @row_id += 1
50
- if validated_line
51
- CSV.parse(validated_line) do |row|
52
- if @headers.empty?
53
- @headers = row
54
- else
55
- @data_hash.merge!(row_to_hsh(row))
56
- @valid_rows << @data_hash
57
- end
58
- end
59
- end
60
- rescue StandardError => error
61
- @error_rows << { row_id: @row_id, text: error.message }
62
- end
63
- compile_results
35
+
36
+ #################### * VALIDATE DATA * ####################
37
+ def validate_data(args={})
38
+ args = args.slice(:file_path, :data)
39
+ args = args.compact
40
+ file_path = args[:file_path]
41
+ data = args[:data]
42
+
43
+ utf_result = validate_csv(file_path) if file_path
44
+ utf_result = validate_hashes(data) if data
45
+ utf_result
64
46
  end
65
47
 
66
48
  #################### * VALIDATE HASHES * ####################
67
49
  def validate_hashes(orig_hashes)
68
50
  return unless orig_hashes.present?
69
51
  begin
70
- process_hash_row(orig_hashes.first) ## re keys for headers.
71
- orig_hashes.each { |hsh| process_hash_row(hsh) } ## re values
52
+ process_hash_row(orig_hashes.first) ## keys for headers.
53
+ orig_hashes.each { |hsh| process_hash_row(hsh) } ## values
72
54
  rescue StandardError => error
73
55
  @error_rows << { row_id: @row_id, text: error.message }
74
56
  end
75
- compile_results ## handles returns.
57
+ results = compile_results ## handles returns.
58
+ results
76
59
  end
77
60
 
78
61
  ### process_hash_row - helper VALIDATE HASHES ###
@@ -86,7 +69,9 @@ module Utf8Sanitizer
86
69
  end
87
70
 
88
71
  file_line = keys_or_values.join(',')
89
- line_parse(utf_filter(check_utf(file_line)))
72
+ validated_line = utf_filter(check_utf(file_line))
73
+ res = line_parse(validated_line)
74
+ res
90
75
  end
91
76
 
92
77
  ### line_parse - helper VALIDATE HASHES ###
@@ -105,9 +90,9 @@ module Utf8Sanitizer
105
90
 
106
91
  #################### * CHECK UTF * ####################
107
92
  def check_utf(text)
108
- return unless text.present?
109
- text = @seed.pollute_seeds(text) if @seed && @headers.any?
93
+ return if text.nil?
110
94
  results = { text: text, encoded: nil, wchar: nil, error: nil }
95
+
111
96
  begin
112
97
  if !text.valid_encoding?
113
98
  encoded = text.chars.select(&:valid_encoding?).join
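The hunk above cuts off right after the line that drops invalid characters. As a rough illustration of that technique, here is a standalone sketch. `check_utf_sketch` is a hypothetical name, the `results` hash only imitates the shape visible in the hunk, and the whitespace step (`gsub(/\s+/, ' ').strip`) is an assumption, since the gem's actual whitespace handling is not shown in this diff.

```ruby
# Hypothetical helper; not the gem's check_utf.
def check_utf_sketch(text)
  return if text.nil?

  results = { text: text, encoded: nil, wchar: nil }

  unless text.valid_encoding?
    # Same idea as the hunk: keep only characters that are valid in the string's encoding.
    text = text.chars.select(&:valid_encoding?).join
    results[:encoded] = true # assumption: flag the row as having contained invalid bytes
  end

  # Assumption: collapse whitespace runs (\r, \n, tabs, spaces) and trim the ends.
  # This runs after the invalid bytes are gone, because regex matching raises on
  # strings with an invalid encoding.
  squeezed = text.gsub(/\s+/, ' ').strip
  results[:wchar] = true if squeezed != text
  results[:text] = squeezed
  results
end

dirty = check_utf_sketch("ABC Aut\x92os\r\n")
clean = check_utf_sketch('Austin')
puts dirty.inspect
puts clean.inspect
```

Ruby's built-in `String#scrub` with an empty replacement is another way to drop invalid byte sequences; the character-by-character filter is kept here because it is what the hunk shows.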
@@ -128,7 +113,7 @@ module Utf8Sanitizer
128
113
  #################### * UTF FILTER * ####################
129
114
  def utf_filter(utf)
130
115
  return unless utf.present?
131
- puts utf.inspect
116
+ # puts utf.inspect
132
117
  utf_status = utf.except(:text).compact.keys
133
118
  utf_status = utf_status&.map(&:to_s)&.join(', ')
134
119
  utf_status = 'perfect' if utf_status.blank?
@@ -145,6 +130,30 @@ module Utf8Sanitizer
145
130
  line
146
131
  end
147
132
 
133
+
134
+ #################### * VALIDATE CSV * ####################
135
+ def validate_csv(file_path)
136
+ return unless file_path.present?
137
+ File.open(file_path).each do |file_line|
138
+ validated_line = utf_filter(check_utf(file_line))
139
+ @row_id += 1
140
+ if validated_line
141
+ CSV.parse(validated_line) do |row|
142
+ if @headers.empty?
143
+ @headers = row
144
+ else
145
+ @data_hash.merge!(row_to_hsh(row))
146
+ @valid_rows << @data_hash
147
+ end
148
+ end
149
+ end
150
+ rescue StandardError => error
151
+ @error_rows << { row_id: @row_id, text: error.message }
152
+ end
153
+ utf_results = compile_results
154
+ end
155
+
156
+
148
157
  ############# !! HELPERS BELOW !! #############
149
158
  ############# KEY VALUE CONVERTERS #############
150
159
  def row_to_hsh(row)
@@ -152,15 +161,14 @@ module Utf8Sanitizer
152
161
  h.symbolize_keys
153
162
  end
154
163
 
155
- def val_hsh(cols, hsh)
156
- keys = hsh.keys
157
- keys.each { |key| hsh.delete(key) unless cols.include?(key) }
158
- hsh
159
- end
160
-
161
164
  def make_groups_from_array(array)
162
165
  array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1; }
163
166
  end
164
167
 
168
+ # def val_hsh(cols, hsh)
169
+ # keys = hsh.keys
170
+ # keys.each { |key| hsh.delete(key) unless cols.include?(key) }
171
+ # hsh
172
+ # end
165
173
  end
166
174
  end
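To make the stats counting in `compile_results` easier to follow, here is a small standalone sketch of the `make_groups_from_array` helper shown above, applied to a made-up list of `:utf_status` values. The sample statuses and the `wchar` count are assumptions for illustration; the lines that compute `wchar` in the gem are outside the hunks in this diff.

```ruby
# Same body as the helper in the diff: count occurrences of each element.
def make_groups_from_array(array)
  array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
end

# Made-up status strings in the format the README shows ('perfect', 'encoded, wchar', ...).
statuses = ['perfect', 'encoded, wchar', 'wchar', 'perfect']

groups  = make_groups_from_array(statuses)
perfect = groups['perfect']

# Assumption: a row counts toward wchar if its status mentions 'wchar'.
wchar = statuses.count { |status| status.include?('wchar') }

puts groups.inspect
puts "perfect=#{perfect} wchar=#{wchar}"
```

Because the hash is built with `Hash.new(0)`, looking up a status that never occurred returns `0` rather than `nil`, which keeps `perfect_rows` numeric when no row is perfect.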
data/lib/utf8_sanitizer/version.rb CHANGED
@@ -1,4 +1,4 @@
1
1
  module Utf8Sanitizer
2
2
  # VERSION = "0.0.1-rc.1"
3
- VERSION = "0.0.2.pre.rc.04"
3
+ VERSION = '1.01'.freeze
4
4
  end
data/lib/utf8_sanitizer.rb CHANGED
@@ -1,29 +1,13 @@
1
- require "utf8_sanitizer/version"
2
- require 'utf8_sanitizer/seed'
1
+ require 'utf8_sanitizer/version'
3
2
  require 'utf8_sanitizer/utf'
4
3
  require 'pry'
5
4
 
6
5
  module Utf8Sanitizer
7
-
8
6
  ## Args must include :data or :file_path, else seeds will run by default.
9
- def self.sanitize(args={})
10
- keys = args.compact.keys
7
+ def self.sanitize(args = {})
11
8
  input = { stats: nil, file_path: nil, data: nil }.merge(args)
12
9
 
13
- ## Grabs seeds if :data or :file_path empty.
14
- unless (keys & [:data, :file_path]).any?
15
- ## Toggle data[:file_path] & data[:data] to test csv parsing or data hashes.
16
- # input[:file_path] = Seed.new.grab_seed_file_path
17
- input[:data] = Seed.new.grab_seed_hashes
18
-
19
- ## For Testing: Pollute_seeds adds non-utf8 chars to each line.
20
- input[:pollute_seeds] = true
21
- end
22
-
23
- ## Sanitizes input hash, then merges results to original input hash, and returns as sanitized_data.
24
- sanitized_data = input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
25
- sanitized_data
10
+ return input unless input.compact.any?
11
+ sanitized_data = input.merge(Utf8Sanitizer::UTF.new.validate_data(input))
26
12
  end
27
-
28
-
29
13
  end
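The rewritten `sanitize` replaces the seed-data fallback with an early return. The sketch below reproduces only that control flow so it can be run without the gem. `sanitize_sketch` and `validate_data_stub` are hypothetical names, and the stub's return value is invented to stand in for `Utf8Sanitizer::UTF.new.validate_data`.

```ruby
# Hypothetical stand-in for Utf8Sanitizer::UTF#validate_data; returns a result-shaped hash.
def validate_data_stub(input)
  rows = Array(input[:data])
  { stats: { total_rows: rows.size }, data: { valid_data: rows } }
end

# Mirrors the two new lines in the hunk: merge defaults, bail out if nothing was passed.
def sanitize_sketch(args = {})
  input = { stats: nil, file_path: nil, data: nil }.merge(args)
  return input unless input.compact.any?

  input.merge(validate_data_stub(input))
end

empty_call = sanitize_sketch
data_call  = sanitize_sketch(data: [{ city: 'Austin' }])

puts empty_call.inspect
puts data_call.inspect
```

Read this way, calling `sanitize` with no arguments now appears to return the default hash with all-`nil` values instead of running seed data, so the unchanged comment above the method ("else seeds will run by default") looks out of date. Note also that the merged result overwrites the caller's `:data` array with the `:data` results hash.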
data/utf8_sanitizer.gemspec CHANGED
@@ -14,11 +14,11 @@ Gem::Specification.new do |spec|
14
14
  spec.homepage = 'https://github.com/4rlm/utf8_sanitizer'
15
15
  spec.license = 'MIT'
16
16
 
17
- spec.summary = "Removes invalid UTF8 characters & extra whitespace from csv or strings."
18
- spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
17
+ spec.summary = 'Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.'
18
+ spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
19
19
 
20
20
  if spec.respond_to?(:metadata)
21
- spec.metadata['allowed_push_host'] = "https://rubygems.org"
21
+ spec.metadata['allowed_push_host'] = 'https://rubygems.org'
22
22
  else
23
23
  raise 'RubyGems 2.0 or newer is required to protect against ' \
24
24
  'public gem pushes.'
@@ -31,7 +31,7 @@ Gem::Specification.new do |spec|
31
31
  spec.bindir = 'exe'
32
32
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
33
33
  spec.require_paths = ['lib']
34
- spec.post_install_message = "Thanks for installing utf8_sanitizer!"
34
+ spec.post_install_message = 'Thanks for installing utf8_sanitizer!'
35
35
 
36
36
  spec.required_ruby_version = '~> 2.5.1'
37
37
  spec.add_dependency 'activesupport', '~> 5.2', '>= 5.2.0'
@@ -40,11 +40,11 @@ Gem::Specification.new do |spec|
40
40
  spec.add_development_dependency 'byebug', '~> 10.0', '>= 10.0.2'
41
41
  spec.add_development_dependency 'class_indexer', '~> 0.3.0'
42
42
  spec.add_development_dependency 'irbtools', '~> 2.2', '>= 2.2.1'
43
+ spec.add_development_dependency 'pry', '~> 0.11.3'
43
44
  spec.add_development_dependency 'rake', '~> 12.3', '>= 12.3.1'
44
45
  spec.add_development_dependency 'rspec', '~> 3.7'
45
46
  spec.add_development_dependency 'rubocop', '~> 0.56.0'
46
47
  spec.add_development_dependency 'ruby-beautify', '~> 0.97.4'
47
- spec.add_development_dependency "pry", "~> 0.11.3"
48
48
  # spec.add_runtime_dependency 'library', '~> 2.2'
49
49
  # spec.add_dependency 'activerecord', '>= 3.0'
50
50
  # spec.add_dependency 'actionpack', '>= 3.0'
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: utf8_sanitizer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2.pre.rc.04
4
+ version: '1.01'
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adam Booth
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2018-06-01 00:00:00.000000000 Z
11
+ date: 2018-06-21 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: activesupport
@@ -104,6 +104,20 @@ dependencies:
104
104
  - - ">="
105
105
  - !ruby/object:Gem::Version
106
106
  version: 2.2.1
107
+ - !ruby/object:Gem::Dependency
108
+ name: pry
109
+ requirement: !ruby/object:Gem::Requirement
110
+ requirements:
111
+ - - "~>"
112
+ - !ruby/object:Gem::Version
113
+ version: 0.11.3
114
+ type: :development
115
+ prerelease: false
116
+ version_requirements: !ruby/object:Gem::Requirement
117
+ requirements:
118
+ - - "~>"
119
+ - !ruby/object:Gem::Version
120
+ version: 0.11.3
107
121
  - !ruby/object:Gem::Dependency
108
122
  name: rake
109
123
  requirement: !ruby/object:Gem::Requirement
@@ -166,22 +180,8 @@ dependencies:
166
180
  - - "~>"
167
181
  - !ruby/object:Gem::Version
168
182
  version: 0.97.4
169
- - !ruby/object:Gem::Dependency
170
- name: pry
171
- requirement: !ruby/object:Gem::Requirement
172
- requirements:
173
- - - "~>"
174
- - !ruby/object:Gem::Version
175
- version: 0.11.3
176
- type: :development
177
- prerelease: false
178
- version_requirements: !ruby/object:Gem::Requirement
179
- requirements:
180
- - - "~>"
181
- - !ruby/object:Gem::Version
182
- version: 0.11.3
183
183
  description: |-
184
- Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
184
+ Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating row numbers containing non-UTF8 and extra whitespace, and before and after to compare changes.
185
185
  Example: ABC Au\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\n\r\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234
186
186
  email:
187
187
  - 4rlm@protonmail.ch
@@ -203,12 +203,12 @@ files:
203
203
  - lib/utf8_sanitizer/csv/extensions.csv
204
204
  - lib/utf8_sanitizer/csv/seeds_clean.csv
205
205
  - lib/utf8_sanitizer/csv/seeds_dirty.csv
206
+ - lib/utf8_sanitizer/csv/seeds_dirty_1.csv
206
207
  - lib/utf8_sanitizer/csv/seeds_mega.csv
207
208
  - lib/utf8_sanitizer/csv/seeds_mini.csv
208
209
  - lib/utf8_sanitizer/csv/seeds_mini.csv,
209
210
  - lib/utf8_sanitizer/csv/seeds_mini_10.csv
210
211
  - lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv
211
- - lib/utf8_sanitizer/seed.rb
212
212
  - lib/utf8_sanitizer/utf.rb
213
213
  - lib/utf8_sanitizer/version.rb
214
214
  - utf8_sanitizer.gemspec
@@ -228,13 +228,16 @@ required_ruby_version: !ruby/object:Gem::Requirement
228
228
  version: 2.5.1
229
229
  required_rubygems_version: !ruby/object:Gem::Requirement
230
230
  requirements:
231
- - - ">"
231
+ - - ">="
232
232
  - !ruby/object:Gem::Version
233
- version: 1.3.1
233
+ version: '0'
234
234
  requirements: []
235
235
  rubyforge_project:
236
236
  rubygems_version: 2.7.6
237
237
  signing_key:
238
238
  specification_version: 4
239
- summary: Removes invalid UTF8 characters & extra whitespace from csv or strings.
239
+ summary: Removes invalid UTF8 characters & extra whitespace (carriage returns, new
240
+ lines, tabs, spaces, etc.) from csv or strings. Also provides detailed report indicating
241
+ row numbers containing non-UTF8 and extra whitespace, and before and after to compare
242
+ changes.
240
243
  test_files: []
data/lib/utf8_sanitizer/seed.rb DELETED
@@ -1,74 +0,0 @@
1
- require 'csv'
2
-
3
- module Utf8Sanitizer
4
- class Seed
5
- def initialize(args={})
6
- # @pollute_seeds = args.fetch(:pollute_seeds, false)
7
- # @seed_hashes = args.fetch(:seed_hashes, false)
8
- # @seed_csv = args.fetch(:seed_csv, false)
9
- end
10
-
11
- def pollute_seeds(text)
12
- list = ['h∑', 'lÔ', "\x92", "\x98", "\x99", "\xC0", "\xC1", "\xC2", "\xCC", "\xDD", "\xE5", "\xF8"]
13
- index = text.length / 2
14
- var = "#{list.sample}_#{list.sample}"
15
- text.insert(index, var)
16
- text.insert(-1, "\r\n")
17
- text
18
- end
19
-
20
- def grab_seed_file_path
21
- # "./lib/utf8_sanitizer/csv/seeds_clean.csv"
22
- "./lib/utf8_sanitizer/csv/seeds_dirty.csv"
23
- # "./lib/utf8_sanitizer/csv/seeds_mega.csv"
24
- # "./lib/utf8_sanitizer/csv/seeds_mini.csv"
25
- # "./lib/utf8_sanitizer/csv/seeds_mini_10.csv"
26
- # './lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv'
27
- end
28
-
29
- ### Sample Hashes for validate_data
30
- def grab_seed_hashes
31
- [{ row_id: 1,
32
- url: 'stanleykaufman.com',
33
- act_name: 'Stanley Chevrolet Kaufman',
34
- street: '825 E Fair St',
35
- city: 'Kaufman',
36
- state: 'TX',
37
- zip: '75142',
38
- phone: '(888) 457-4391' },
39
- { row_id: 2,
40
- url: 'leepartyka',
41
- act_name: 'Lee Partyka Chevrolet Mazda Isuzu Truck',
42
- street: '200 Skiff St',
43
- city: 'Hamden',
44
- state: 'CT',
45
- zip: '6518',
46
- phone: '(203) 288-7761' },
47
- { row_id: 3,
48
- url: 'burienhonda.fake.not.net.com',
49
- act_name: 'Honda of Burien 15026 1st Avenue South, Burien, WA 98148',
50
- street: '15026 1st Avenue South',
51
- city: 'Burien',
52
- state: 'WA',
53
- zip: '98148',
54
- phone: '(206) 246-9700' },
55
- { row_id: 4,
56
- url: 'cortlandchryslerdodgejeep.com',
57
- act_name: 'Cortland Chrysler Dodge Jeep RAM',
58
- street: '3878 West Rd',
59
- city: 'Cortland',
60
- state: 'NY',
61
- zip: '13045',
62
- phone: '(877) 279-3113' },
63
- { row_id: 5,
64
- url: 'imperialmotors.net',
65
- act_name: 'Imperial Motors',
66
- street: '4839 Virginia Beach Blvd',
67
- city: 'Virginia Beach',
68
- state: 'VA',
69
- zip: '23462',
70
- phone: '(757) 490-3651' }]
71
- end
72
-
73
- end
74
- end