utf8_sanitizer 1.01 → 1.02

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 0de542eedb064eda2b7a85eda41c674dc7ae822e5bff19f0758bf7418ace84ee
- data.tar.gz: b252e6c2aa92f32c2068ba59fb8f0719afedb034c47c8513c3017dda9588933b
+ metadata.gz: f0e4ff707e58e6238c6f62c2b8818b783406240a40e537663e928a914198e199
+ data.tar.gz: cff25ea7735710a5ed4d8c5fe1f779688d9f9117880e710e998c2eaef5ac8e3a
  SHA512:
- metadata.gz: 8bf27abb9db602ab114a0606cb83b1e16d931bec77901381e5ad851334b58027c13cd1aa5942607334cf57b66a1394ff50984d62f17c8bac21cdcafe7907e074
- data.tar.gz: 1e8bcbb08ef7bd8db08af0a79ba9dc06d824634197f005abfe5fc8e390703be0729a0c1ce568378d4fc22266db871e635df4b2237f54bf4eb3b023235bfea5d4
+ metadata.gz: a6b37f0b41b0f4340d350438580e83c78719ffc83c1fc91cf0580eb921354ba9e661215bd710ee4f7391bb9920dde812b0735c4c367ea0381b60f9304a3770ae
+ data.tar.gz: 8785995ab6dc235136a9e83f5ffb7e4d74fe5a4cee2d67539b0b0dd864542f6e959c10156e22e3ece65020e3f58c41e369b90267fdf2c6d0a86b41e0ad156254
data/README.md CHANGED
@@ -40,14 +40,27 @@ Options for UTF8 Sanitizing data:
  2. Data Hash of strings
 
  #### 1. CSV Parsing
- This is a good option if you are having problems with a CSV containing non-UTF8 characters. Pass your file_path as a hash like below. Hash MUST be a SYMBOL and named `:file_path`. If not, default seeds will be passed as the system detects empty user input and thinks user is trying to run built-in seed data for testing.
+ To clean CSV file containing non-UTF8 characters, pass file_path as a hash like below. Hash MUST meet the following guidelines:
+
+ a. key as a SYMBOL `:` (not key as string)
+
+ b. named `:file_path`
+
+ c. be an Absolute Path from root `./`
+
+ d. be a hash `{file_path: "./path/to/your_csv.csv"}`
+
+ e. passed to `Utf8Sanitizer.sanitize()`
+
+ Syntax Example Below:
  ```
- args = {file_path: "./path/to/your_csv.csv"}
- sanitized_data = Utf8Sanitizer.sanitize(args)
+ sanitized_data = Utf8Sanitizer.sanitize({file_path: "./path/to/your_csv.csv"})
  ```
 
  #### 2. Hash of Strings
- This is a good option if you are scraping data or cleaning up existing databases. Pass your data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below and can be any size from one to many tens of thousands. The hashes inside the data array can be named anything from crm contact data like below, stats, recipes, or any custom hashes as long as they are in an array and resemble the syntax and structure like below.
+ To clean existing databases, web form submissions, or scraped data, pass input data as a hash like below. Hash MUST be a SYMBOL and named `:data`. The value of `:data` should be an array of hashes like below.
+
+ Below is just an example. Your input hash keys inside the parent data array can be named anything (not limited to url, act_name, street, etc.), but must be hashes inside a parent array like the below structure and syntax.
  ```
  array_of_hashes = [ { url: 'abc_autos_example.com',
  act_name: 'ABC Aut\x92os',
@@ -65,19 +78,25 @@ array_of_hashes = [ { url: 'abc_autos_example.com',
  phone: '(800) 555-5678\r\n' },
  }]
 
- sanitized_data = Utf8Sanitizer.sanitize(data: array_of_hashes)
+ sanitized_data = Utf8Sanitizer.sanitize({data: array_of_hashes})
  ```
 
  ### Returned Sanitized Data Format
- The returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+ The returned data will contain a detailed report of the row or line numbers where UTF8 violations and extra white space were located. The broad categories in the returned data will be in hash format with the following keys: `:stats`, `:file_path`, `:data` like below.
+
+ IMPORTANT: `:valid_data` is the clean, converted output from your CSV or strings input, directly accessible via `sanitized_data[:data][:valid_data]`.
+
+ Returned data also indicates if the input data was successfully encoded. In rare cases the data is beyond repair, and will be listed in the `:error` category.
+
+ Each non-UTF8 row will be included in its original syntax like the example below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
 
  The `:stats` are a breakdown of the results. `:defective_rows` and `:error_rows` will usually be the same number which refer to the rows which are beyond repair (very rare). Otherwise, the results will be `:valid_rows` if they were perfect or successfully sanitized, including `:encoded_rows` which refers to the number of rows that contained non-utf8 characters, and `:wchar_rows` which is short for 'whitespace character rows'.
 
  `:data` is broken down into the following categories: `:valid_data`, `:encoded_data`, `:defective_data`, and `:error_data`.
 
- `:valid_data` is the most important data and you can access it with `sanitized_data[:data][:valid_data]`. Each non-UTF8 row will be included in its original syntax like below and can be accessed directly via `sanitized_data[:data][:encoded_data]`.
+ Below is an example of the returned data (`:stats`, `:file_path`, `:data`)
 
- **You can change the name of `sanitized_data` to anything you like, but it must be followed with `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
+ **`sanitized_data` is a local variable, which you can name anything you like, but it must be assigned in the following syntax: `[:data][:valid_data]` and `[:data][:encoded_data]`, etc.**
 
  ```
  { stats:
@@ -1,4 +1,4 @@
  module Utf8Sanitizer
  # VERSION = "0.0.1-rc.1"
- VERSION = '1.01'.freeze
+ VERSION = '1.02'.freeze
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: utf8_sanitizer
  version: !ruby/object:Gem::Version
- version: '1.01'
+ version: '1.02'
  platform: ruby
  authors:
  - Adam Booth
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-06-21 00:00:00.000000000 Z
+ date: 2018-06-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport