utf8_sanitizer 0.0.2.pre.rc.03 → 0.0.2.pre.rc.04
- checksums.yaml +4 -4
- data/README.md +95 -5
- data/lib/utf8_sanitizer/csv/seeds_mini.csv, +0 -0
- data/lib/utf8_sanitizer/version.rb +1 -1
- data/lib/utf8_sanitizer.rb +2 -1
- data/utf8_sanitizer.gemspec +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 05155a4029ddd8224888b1482972b0640b4a81ef0990d7bd8f29311a5798d309
+data.tar.gz: 6d3bd40596087c6e8dfa41934a44ffd2459f6bf738abf77c66aa20ce4591b641
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 23d64036dec061d1290d186069adfe018825bbe16a89f733994bce7eb6793fd5f0d402d9cc00c57a590fcb24f6b2f14890cac0611ae54f73bec0cc9b39e060d4
+data.tar.gz: 972355c619c5c6ef4753df81d7a94f18b4f0e0622555e3fb4c3d227d5a31477ebeea015a10682281ace975bb322daed2752f9cf71123c19b194429893f1b6f78
data/README.md
CHANGED
@@ -1,8 +1,21 @@
 # Utf8Sanitizer
 
-
+Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
+
+Example:
+```
+"ABC Au\xC1tos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234\n\r\n"
+```
+
+Returns:
+```
+"ABC Autos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234"
+```
+
+Removed:
+Non-UTF8: `\xC1`
+Extra whitespace: `\n\r\n`
 
-TODO: Delete this and the text above, and describe your gem
 
 ## Installation
 
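The README example added in this hunk can be approximated in plain Ruby. The following is a minimal sketch that uses only core `String` methods; it is not the gem's implementation, and the helper name `rough_sanitize` is hypothetical.

```ruby
# Hypothetical helper, not part of utf8_sanitizer: drops invalid UTF-8
# bytes and collapses runs of whitespace using core String methods only.
def rough_sanitize(text)
  text
    .scrub('')          # remove byte sequences that are invalid in the string's encoding
    .gsub(/\s+/, ' ')   # collapse newlines, carriage returns, tabs and repeated spaces
    .strip              # trim leading and trailing whitespace
end

dirty = "ABC Au\xC1tos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234\n\r\n"
clean = rough_sanitize(dirty)
puts clean
```

The gem additionally reports per-row statistics and keeps the original text of rows that needed repair, which this sketch does not attempt.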
@@ -22,7 +35,84 @@ Or install it yourself as:
 
 ## Usage
 
-
+You have three options for UTF8 Sanitizing your data: CSV Parsing, Data Hash of strings, or run default seed data to test.
+
+#### 1. CSV Parsing
+This is a good option if you are having problems with a CSV containing non-UTF8 characters. Pass your file_path as a hash like below. Hash MUST be a SYMBOL and named `:file_path`. If not, default seeds will be passed as the system detects empty user input and thinks user is trying to run built-in seed data for testing.
+```
+args = {file_path: "./path/to/your_csv.csv"}
+sanitized_data = Utf8Sanitizer.sanitize(args)
+```
+
+#### 2. Hash of Strings
+This is a good option if you are scraping data or cleaning up existing databases. Pass your data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below and can be any size from one to many tens of thousands. The hashes inside the data array can be named anything from crm contact data like below, stats, recipes, or any custom hashes as long as they are in an array and resemble the syntax and structure like below.
+```
+data_hash = [ { url: 'abc_autos_example.com',
+act_name: 'ABC Aut\x92os',
+street: '123 E Main St\r\n',
+city: 'Austin',
+state: 'TX',
+zip: '78735',
+phone: '(888) 555-1234\r\n' },
+{ url: 'xyz_trucks_example',
+act_name: 'XYZ Aut\xC1os',
+street: '456 W Main St\r\n',
+city: 'Austin',
+state: 'TX',
+zip: '78735',
+phone: '(800) 555-5678\r\n' },
+}]
+
+sanitized_data = Utf8Sanitizer.sanitize(data: data_hash)
+```
+
+#### 3. Run Seed Data to Test
+If you want to run built-in seed data to first test, simply run as below without passing args.
+```
+sanitized_data = Utf8Sanitizer.sanitize
+```
+
+### Returned Sanitized Data Format
+The returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+
+The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows` will usually be the same number which refer to the rows which are beyond repair (very rare). Otherwise, the results will be `:valid_rows` if they were perfect or successfully sanitized, including `:encoded_rows` which refers to the number of rows that contained non-utf8 characters, and `:wchar_rows` which is short for 'whitespace character rows'.
+
+`:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
+
+`:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`. **You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
+
+`:pollute_seeds` is only for running seed data. It injects each row with non-UTF8 and extra whitespace for testing. It can be ignored and will only run if your input is nil, which tells the system that you are intentionally trying to run seed data for testing.
+```
+{:stats=>
+{:total_rows=>2, :header_row=>1, :valid_rows=>2, :error_rows=>0, :defective_rows=>0, :perfect_rows=>0, :encoded_rows=>2, :wchar_rows=>2},
+:file_path=>nil,
+:data=>
+{:valid_data=>
+[{:row_id=>"1",
+:utf_status=>"encoded, wchar",
+:url=>"abc_autos_example.com",
+:act_name=>"ABC Autos Example",
+:street=>"123 E Main St",
+:city=>"Austin",
+:state=>"TX",
+:zip=>"78735",
+:phone=>"(888) 555-1234"},
+{:row_id=>"2",
+:utf_status=>"encoded, wchar",
+:url=>"xyz_trucks_example",
+:act_name=>"XYZ Trucks Example",
+:street=>"456 W Main St",
+:city=>"Austin",
+:state=>"TX",
+:zip=>"78735",
+:phone=>"(800) 555-4321"}],
+:encoded_data=>
+[{:row_id=>1, :text=>"1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n"},
+{:row_id=>2, :text=>"2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n"}],
+:defective_data=>[],
+:error_data=>[]},
+:pollute_seeds=>true}
+```
 
 ## Development
 
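One detail is worth checking before reusing the "Hash of Strings" snippet from this hunk: Ruby interprets `\x..`, `\r` and `\n` escapes only inside double-quoted strings. The sketch below does not call the gem; it only compares the two quoting styles, and the variable names are illustrative.

```ruby
# Single quotes keep the backslash sequence as literal characters.
literal = 'ABC Aut\x92os'
# Double quotes turn \x92 into one raw byte, which is invalid on its own in UTF-8.
raw = "ABC Aut\x92os"

puts literal.bytesize          # 13: the characters \, x, 9 and 2 are all kept
puts literal.valid_encoding?   # true
puts raw.bytesize              # 10: \x92 became a single byte
puts raw.valid_encoding?       # false
```

If the sample data is meant to contain genuinely invalid bytes, it would probably need double-quoted literals. Separately, the snippet as printed ends with `}]` after the second hash has already been closed with `},`, which looks like one closing brace too many.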
@@ -32,7 +122,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/4rlm/utf8_sanitizer. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 ## License
 
@@ -40,4 +130,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
 
 ## Code of Conduct
 
-Everyone interacting in the Utf8Sanitizer project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/
+Everyone interacting in the Utf8Sanitizer project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/4rlm/utf8_sanitizer/blob/master/CODE_OF_CONDUCT.md).
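The return structure documented in the README's "Returned Sanitized Data Format" section is a nested hash, so its categories can be reached with ordinary hash access. The sketch below uses a hand-built hash that mirrors part of the README's sample output instead of calling the gem, so the literal values are placeholders.

```ruby
# Hand-built stand-in for the value the README says Utf8Sanitizer.sanitize returns.
sanitized_data = {
  stats: { total_rows: 2, valid_rows: 2, error_rows: 0, encoded_rows: 2, wchar_rows: 2 },
  file_path: nil,
  data: {
    valid_data: [
      { row_id: '1', utf_status: 'encoded, wchar', act_name: 'ABC Autos Example' },
      { row_id: '2', utf_status: 'encoded, wchar', act_name: 'XYZ Trucks Example' }
    ],
    encoded_data: [],
    defective_data: [],
    error_data: []
  }
}

# Rows that were already clean or were repaired successfully.
valid_rows = sanitized_data[:data][:valid_data]
names = valid_rows.map { |row| row[:act_name] }
puts names.inspect

# dig returns nil instead of raising when an intermediate key is missing.
error_count = sanitized_data.dig(:stats, :error_rows)
puts error_count
```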
data/lib/utf8_sanitizer/csv/seeds_mini.csv,
File without changes
data/lib/utf8_sanitizer.rb
CHANGED
@@ -21,7 +21,8 @@ module Utf8Sanitizer
 end
 
 ## Sanitizes input hash, then merges results to original input hash, and returns as sanitized_data.
-input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
+sanitized_data = input.merge!(Utf8Sanitizer::UTF.new.validate_data(input))
+sanitized_data
 end
 
 
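The change in this hunk names the return value without altering it: `Hash#merge!` returns its receiver, so the method's last expression evaluates to the same hash object before and after the edit. A small sketch with made-up data (the hashes below are placeholders, not the gem's real values):

```ruby
input = { file_path: nil }
result = { stats: { total_rows: 2 } }   # placeholder for the validation result

sanitized_data = input.merge!(result)

puts sanitized_data.equal?(input)   # true: merge! mutates and returns the receiver
puts input.key?(:stats)             # true: the caller's hash was modified in place
```

A side effect worth keeping in mind is that the hash passed in by the caller is itself mutated, since `merge!` works in place.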
data/utf8_sanitizer.gemspec
CHANGED
@@ -15,7 +15,7 @@ Gem::Specification.new do |spec|
 spec.license = 'MIT'
 
 spec.summary = "Removes invalid UTF8 characters & extra whitespace from csv or strings."
-spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv
+spec.description = "Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.\n Example: ABC Au\\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\\n\\r\\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234"
 
 if spec.respond_to?(:metadata)
 spec.metadata['allowed_push_host'] = "https://rubygems.org"
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: utf8_sanitizer
 version: !ruby/object:Gem::Version
-version: 0.0.2.pre.rc.
+version: 0.0.2.pre.rc.04
 platform: ruby
 authors:
 - Adam Booth
@@ -181,7 +181,7 @@ dependencies:
 - !ruby/object:Gem::Version
 version: 0.11.3
 description: |-
-Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv
+Removes invalid UTF8 characters & extra whitespace (carriage returns, new lines, tabs, spaces, etc.) from csv or strings.
 Example: ABC Au\xC1tos,123 E Main St,Anytown,TX,75142,(888) 555-1234\n\r\n => ABC Autos,123 E Main St,Anytown,TX,75142,(888) 555-1234
 email:
 - 4rlm@protonmail.ch
@@ -205,6 +205,7 @@ files:
 - lib/utf8_sanitizer/csv/seeds_dirty.csv
 - lib/utf8_sanitizer/csv/seeds_mega.csv
 - lib/utf8_sanitizer/csv/seeds_mini.csv
+- lib/utf8_sanitizer/csv/seeds_mini.csv,
 - lib/utf8_sanitizer/csv/seeds_mini_10.csv
 - lib/utf8_sanitizer/csv/seeds_mini_2_bug.csv
 - lib/utf8_sanitizer/seed.rb