utf8_sanitizer 1.01 → 1.02
- checksums.yaml +4 -4
- data/README.md +27 -8
- data/lib/utf8_sanitizer/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f0e4ff707e58e6238c6f62c2b8818b783406240a40e537663e928a914198e199
+  data.tar.gz: cff25ea7735710a5ed4d8c5fe1f779688d9f9117880e710e998c2eaef5ac8e3a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6b37f0b41b0f4340d350438580e83c78719ffc83c1fc91cf0580eb921354ba9e661215bd710ee4f7391bb9920dde812b0735c4c367ea0381b60f9304a3770ae
+  data.tar.gz: 8785995ab6dc235136a9e83f5ffb7e4d74fe5a4cee2d67539b0b0dd864542f6e959c10156e22e3ece65020e3f58c41e369b90267fdf2c6d0a86b41e0ad156254
data/README.md
CHANGED
@@ -40,14 +40,27 @@ Options for UTF8 Sanitizing data:
 2. Data Hash of strings
 
 #### 1. CSV Parsing
-
+To clean CSV file containing non-UTF8 characters, pass file_path as a hash like below. Hash MUST meet the following guidelines:
+
+a. key as a SYMBOL `:` (not key as string)
+
+b. named `:file_path`
+
+c. be an Absolute Path from root `./`
+
+d. be a hash `{file_path: "./path/to/your_csv.csv"}`
+
+e. passed to `Utf8Sanitizer.sanitize()`
+
+Syntax Example Below:
 ```
-
-sanitized_data = Utf8Sanitizer.sanitize(args)
+sanitized_data = Utf8Sanitizer.sanitize({file_path: "./path/to/your_csv.csv"})
 ```
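
Guidelines a through e above can be checked before the gem is ever called. The sketch below is a hypothetical pre-flight helper: `csv_args_ok?` is not part of utf8_sanitizer, and the `.csv` extension check is an added assumption rather than something the README requires.

```ruby
# Hypothetical pre-flight check; NOT part of utf8_sanitizer itself.
def csv_args_ok?(args)
  return false unless args.is_a?(Hash)        # (d) the argument must be a hash
  return false unless args.key?(:file_path)   # (a), (b) symbol key named :file_path

  path = args[:file_path]
  # Assumption: the value is a String that points at a .csv file.
  path.is_a?(String) && path.end_with?('.csv')
end

good = { file_path: './path/to/your_csv.csv' }
bad  = { 'file_path' => './path/to/your_csv.csv' } # string key, not a symbol

puts csv_args_ok?(good) # true
puts csv_args_ok?(bad)  # false
```

A hash that passes this check has the shape shown in the README example and could then be handed to `Utf8Sanitizer.sanitize`.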
 
 #### 2. Hash of Strings
-
+To clean existing databases, web form submissions, or scraped data, pass input data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below.
+
+Below is just an example. Your input hash keys inside the parent data array can be named anything (not limited to url, act_name, street, etc.), but must be hashes inside a parent array like the below structure and syntax.
 ```
 array_of_hashes = [ { url: 'abc_autos_example.com',
 act_name: 'ABC Aut\x92os',
@@ -65,19 +78,25 @@ array_of_hashes = [ { url: 'abc_autos_example.com',
 phone: '(800) 555-5678\r\n' },
 }]
 
-sanitized_data = Utf8Sanitizer.sanitize(data: array_of_hashes)
+sanitized_data = Utf8Sanitizer.sanitize({data: array_of_hashes})
 ```
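
The sample values above contain the two kinds of problems the README describes: a stray non-UTF8 byte (`\x92`) and trailing whitespace (`\r\n`). The sketch below uses only Ruby's standard library to show what those problems look like. It is not utf8_sanitizer's implementation, and treating the stray byte as Windows-1252 is an assumption made for the illustration.

```ruby
# Illustration only: plain Ruby stdlib, NOT utf8_sanitizer's implementation.
raw_name  = "ABC Aut\x92os"        # double quotes, so \x92 is a real byte
raw_phone = "(800) 555-5678\r\n"

puts raw_name.valid_encoding?      # false: a lone 0x92 byte is not valid UTF-8

# Assumption: the byte came from Windows-1252, where 0x92 is a right single quote.
fixed_name  = raw_name.encode('UTF-8', 'Windows-1252')
fixed_phone = raw_phone.strip      # drops the trailing "\r\n"

puts fixed_name.valid_encoding?    # true
puts fixed_phone.inspect           # "(800) 555-5678"
```

Note that in the README sample the strings are single-quoted, so `\x92` and `\r\n` there are literal backslash sequences; double quotes are used here to produce the actual bytes.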
 
 ### Returned Sanitized Data Format
-The returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+The returned data will contain a detailed report of the row or line numbers where UTF8 violations and extra white space were located. The broad categories in the returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+
+IMPORTANT: `:valid_data` is the clean, converted output from your CSV or strings input, directly accessible via `sanitized_data[:data][:valid_data]`.
+
+Returned data also indicates if the input data was successfully encoded. In rare cases the data is beyond repair, and will be listed in the `:error` category.
+
+Each non-UTF8 row will be included in its original syntax like the example below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
 
 The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows` will usually be the same number which refer to the rows which are beyond repair (very rare). Otherwise, the results will be `:valid_rows` if they were perfect or successfully sanitized, including `:encoded_rows` which refers to the number of rows that contained non-utf8 characters, and `:wchar_rows` which is short for 'whitespace character rows'.
 
 `:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
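
The access paths named above can be tried against a stand-in hash. The sketch below is hand-written for illustration: the key names come from the README text, but every value, and the exact shape of each row, is an assumption rather than real gem output.

```ruby
# Hand-written stand-in for a returned hash; all values are invented.
sanitized_data = {
  stats: { valid_rows: 2, encoded_rows: 1, wchar_rows: 1,
           defective_rows: 0, error_rows: 0 },
  file_path: nil,
  data: {
    valid_data:     [{ act_name: 'ABC Autos' }, { act_name: 'XYZ Motors' }],
    encoded_data:   [{ act_name: 'ABC Aut\x92os' }], # original, pre-cleaning text
    defective_data: [],
    error_data:     []
  }
}

# The two access paths the README calls out explicitly.
clean_rows = sanitized_data[:data][:valid_data]
flagged    = sanitized_data[:data][:encoded_data]

stats = sanitized_data[:stats]
puts "#{stats[:valid_rows]} valid, #{stats[:encoded_rows]} re-encoded"
puts clean_rows.map { |row| row[:act_name] }.inspect
puts flagged.size
```

Whatever name is chosen for the local variable, the nested keys (`[:data][:valid_data]`, `[:data][:encoded_data]`, and so on) stay the same.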
 
-
+Below is an example of the returned data (`:stats`, `:file_path`, `:data`)
 
-
+**`sanitized_data` is a local variable, which you can name anything you like, but it must be assigned in the following syntax: `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
 
 ```
 { stats:
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: utf8_sanitizer
 version: !ruby/object:Gem::Version
-  version: '1.
+  version: '1.02'
 platform: ruby
 authors:
 - Adam Booth
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-06-
+date: 2018-06-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport