utf8_sanitizer 0.0.2.pre.rc.03 → 0.0.2.pre.rc.04

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 1128d1d63dfb289df84978713ef4fc7f8efc1bfb7d6784ab7accd85684824bff
- data.tar.gz: 54111bdd0d5e3e1c6f97892c5404d68cbaf283068f77a3372e3bc6facb7cd9d3
+ metadata.gz: 05155a4029ddd8224888b1482972b0640b4a81ef0990d7bd8f29311a5798d309
+ data.tar.gz: 6d3bd40596087c6e8dfa41934a44ffd2459f6bf738abf77c66aa20ce4591b641
  SHA512:
- metadata.gz: 0e64c64ab3dafaa19fdc1cac3fae947b8aafe909bde3d747b8a23b8516ad5224769ae690605074a22e85082275e799173e866d58963ef951b62d7183d9503ae7
- data.tar.gz: '0813f8565e88f697799fd3882ed883595c89d5af7a6042a792b1918fe4e97524ed0ba6da5ac0e1e21b0c91dc7ef89766122000cdeffaaa08505fbfbb300a63c5'
+ metadata.gz: 23d64036dec061d1290d186069adfe018825bbe16a89f733994bce7eb6793fd5f0d402d9cc00c57a590fcb24f6b2f14890cac0611ae54f73bec0cc9b39e060d4
+ data.tar.gz: 972355c619c5c6ef4753df81d7a94f18b4f0e0622555e3fb4c3d227d5a31477ebeea015a10682281ace975bb322daed2752f9cf71123c19b194429893f1b6f78
data/README.md CHANGED
@@ -1,8 +1,21 @@
  # Utf8Sanitizer
 
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/utf8_sanitizer`. To experiment with that code, run `bin/console` for an interactive prompt.
+ Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
+
+ Example:
+ ```
+ "ABC Au\xC1tos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234\n\r\n"
+ ```
+
+ Returns:
+ ```
+ "ABC Autos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234"
+ ```
+
+ Removed:
+ Non-UTF8: `\xC1`
+ Extra whitespace: `\n\r\n`
 
- TODO: Delete this and the text above, and describe your gem
 
  ## Installation
 
@@ -22,7 +35,84 @@ Or install it yourself as:
 
  ## Usage
 
- TODO: Write usage instructions here
+ You have three options for UTF8 Sanitizing your data: CSV Parsing, Data Hash of strings, or run default seed data to test.
+
+ #### 1. CSV Parsing
+ This is a good option if you are having problems with a CSV containing non-UTF8 characters. Pass your file_path as a hash like below. Hash MUST be a SYMBOL and named `:file_path`. If not, default seeds will be passed as the system detects empty user input and thinks user is trying to run built-in seed data for testing.
+ ```
+ args = {file_path: "./path/to/your_csv.csv"}
+ sanitized_data = Utf8Sanitizer.sanitize(args)
+ ```
+
+ #### 2. Hash of Strings
+ This is a good option if you are scraping data or cleaning up existing databases. Pass your data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below and can be any size from one to many tens of thousands. The hashes inside the data array can be named anything from crm contact data like below, stats, recipes, or any custom hashes as long as they are in an array and resemble the syntax and structure like below.
+ ```
+ data_hash = [ { url: 'abc_autos_example.com',
+ act_name: 'ABC Aut\x92os',
+ street: '123 E Main St\r\n',
+ city: 'Austin',
+ state: 'TX',
+ zip: '78735',
+ phone: '(888) 555-1234\r\n' },
+ { url: 'xyz_trucks_example',
+ act_name: 'XYZ Aut\xC1os',
+ street: '456 W Main St\r\n',
+ city: 'Austin',
+ state: 'TX',
+ zip: '78735',
+ phone: '(800) 555-5678\r\n' },
+ }]
+
+ sanitized_data = Utf8Sanitizer.sanitize(data: data_hash)
+ ```
+
+ #### 3. Run Seed Data to Test
+ If you want to run built-in seed data to first test, simply run as below without passing args.
+ ```
+ sanitized_data = Utf8Sanitizer.sanitize
+ ```
+
+ ### Returned Sanitized Data Format
+ The returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+
+ The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows` will usually be the same number which refer to the rows which are beyond repair (very rare). Otherwise, the results will be `:valid_rows` if they were perfect or successfully sanitized, including `:encoded_rows` which refers to the number of rows that contained non-utf8 characters, and `:wchar_rows` which is short for 'whitespace character rows'.
+
+ `:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
+
+ `:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`. **You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
+
+ `:pollute_seeds` is only for running seed data. It injects each row with non-UTF8 and extra whitespace for testing. It can be ignored and will only run if your input is nil, which tells the system that you are intentionally trying to run seed data for testing.
+ ```
+ {:stats=>
+ {:total_rows=>2, :header_row=>1, :valid_rows=>2, :error_rows=>0, :defective_rows=>0, :perfect_rows=>0, :encoded_rows=>2, :wchar_rows=>2},
+ :file_path=>nil,
+ :data=>
+ {:valid_data=>
+ [{:row_id=>"1",
+ :utf_status=>"encoded, wchar",
+ :url=>"abc_autos_example.com",
+ :act_name=>"ABC Autos Example",
+ :street=>"123 E Main St",
+ :city=>"Austin",
+ :state=>"TX",
+ :zip=>"78735",
+ :phone=>"(888) 555-1234"},
+ {:row_id=>"2",
+ :utf_status=>"encoded, wchar",
+ :url=>"xyz_trucks_example",
+ :act_name=>"XYZ Trucks Example",
+ :street=>"456 W Main St",
+ :city=>"Austin",
+ :state=>"TX",
+ :zip=>"78735",
+ :phone=>"(800) 555-4321"}],
+ :encoded_data=>
+ [{:row_id=>1, :text=>"1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n"},
+ {:row_id=>2, :text=>"2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n"}],
+ :defective_data=>[],
+ :error_data=>[]},
+ :pollute_seeds=>true}
+ ```
 
  ## Development
 
@@ -32,7 +122,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/utf8_sanitizer. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/4rlm/utf8_sanitizer. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
  ## License
 
@@ -40,4 +130,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
 
  ## Code of Conduct
 
- Everyone interacting in the Utf8Sanitizer project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/utf8_sanitizer/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the Utf8Sanitizer project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/4rlm/utf8_sanitizer/blob/master/CODE_OF_CONDUCT.md).
File without changes
@@ -1,4 +1,4 @@
  module Utf8Sanitizer
  # VERSION = "0.0.1-rc.1"
- VERSION = "0.0.2.pre.rc.03"
+ VERSION = "0.0.2.pre.rc.04"
  end
@@ -21,7 +21,8 @@ module Utf8Sanitizer
  end
 
  ## Sanitizes input hash, then merges results to original input hash, and returns as sanitized_data.
- input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
+ sanitized_data = input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
+ sanitized_data
  end
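
Note on the hunk above: Ruby's `Hash#merge!` mutates the receiver and returns it, so naming the result `sanitized_data` and returning it explicitly should leave the method's return value unchanged. A minimal sketch, where `input` and `result` are hypothetical stand-ins (the real `validate_data` output is not reproduced here):

```ruby
# Hypothetical stand-ins; the real hash comes from
# Utf8Sanitizer::UTF.new.validate_data(input), which is not shown here.
input  = { data: [{ act_name: 'ABC Autos' }] }
result = { stats: { total_rows: 1 } }

# Old form: the value of the last expression (the merge! call) is returned.
old_return = input.merge!(result)

# New form: assign to a local, then return the local.
sanitized_data = input.merge!(result)
new_return = sanitized_data

puts old_return.equal?(input)   # merge! returns the receiver itself
puts new_return.equal?(input)   # the explicit local is the same object
puts input.key?(:stats)         # the caller's hash was mutated in place
```

Because `merge!` works in place, the caller's original hash also gains the merged keys in both versions.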
26
27
 
27
28
 
@@ -15,7 +15,7 @@ Gem::Specification.new do |spec|
  spec.license = 'MIT'
 
  spec.summary = "Removes invalid UTF8 characters & extra whitespace from csv or strings."
- spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv, or strings.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
+ spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
 
  if spec.respond_to?(:metadata)
  spec.metadata['allowed_push_host'] = "https://rubygems.org"
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: utf8_sanitizer
  version: !ruby/object:Gem::Version
- version: 0.0.2.pre.rc.03
+ version: 0.0.2.pre.rc.04
  platform: ruby
  authors:
  - Adam Booth
@@ -181,7 +181,7 @@ dependencies:
  - !ruby/object:Gem::Version
  version: 0.11.3
  description: |-
- Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv, or strings.
+ Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
  Example: ABC Au\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\n\r\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234
  email:
  - 4rlm@protonmail.ch
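
For readers who want to see what the description's example amounts to, the transformation can be approximated with core Ruby string methods. This is a rough sketch only, not the gem's implementation: `rough_sanitize` is a hypothetical helper name, and the gem's own handling of rows, stats, and CSV input is not modelled.

```ruby
# Hypothetical helper, NOT part of utf8_sanitizer's API. It approximates the
# description's example using only core String methods.
def rough_sanitize(text)
  text.dup
      .force_encoding(Encoding::UTF_8) # treat the bytes as UTF-8
      .scrub('')                       # drop byte sequences that are invalid UTF-8
      .gsub(/\s+/, ' ')                # collapse runs of whitespace (\r, \n, \t, spaces)
      .strip                           # trim leading/trailing whitespace
end

raw = "ABC Au\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\n\r\n"

puts raw.valid_encoding?                  # false: \xC1 is never valid in UTF-8
puts rough_sanitize(raw)                  # ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234
puts rough_sanitize(raw).valid_encoding?  # true
```

`String#scrub` needs Ruby 2.1 or later. Collapsing internal whitespace to a single space is an assumption about what "extra whitespace" means here.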
@@ -205,6 +205,7 @@ files:
  - lib/utf8_sanitizer/csv/seeds_dirty.csv
  - lib/utf8_sanitizer/csv/seeds_mega.csv
  - lib/utf8_sanitizer/csv/seeds_mini.csv
+ - lib/utf8_sanitizer/csv/seeds_mini.csv,
  - lib/utf8_sanitizer/csv/seeds_mini_10.csv
  - lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv
  - lib/utf8_sanitizer/seed.rb