scrub_db 0.0.1.pre.rc.03 → 2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 38b27fd85ba16c4f14ca0542699874b3a101f17eb1faf10fadf163cee9e43d20
4
- data.tar.gz: 321ca87b878ac6e66c7788da976e28dc49fa14d83e77d41f19aafb6c26c4ad7f
3
+ metadata.gz: e6140658a28c9843df6f94a6c5948df622edffeaa905bc37232b229e4ac30b62
4
+ data.tar.gz: 56858739677c75f755583d51e69db49634822671a4e0c2eb346ea537682674c6
5
5
  SHA512:
6
- metadata.gz: 6f6c11ca3c7b1575c23d811cdf9e977b1dd2a0a198533260c42e5fd18f3146c1d8b4bdcb73615c094f34a2463c670ec26ee5a4a74ac3fa29ce9f4e0c6949f156
7
- data.tar.gz: 9ed5a5cbecd487b94f6a99b65e666baba8dc19938d0b111fdd573790aa530d627b79403759996d4ac2188690b6bab9b37e17834f5e948f33638767c05d7c7704
6
+ metadata.gz: 4a606e4be1bc35a3530ede12fa66af81970628c3fe7504f50045bac27a4933b1248e06e9f18362fe573d2b8b28cfd10a2e3b797e6a1795ed1f92711506ca4057
7
+ data.tar.gz: e6b10af82247ebebd24357a9c04393a1e16fbd9708cfba703cb5c764138a336fa8f63824b5d2e3f8c3337eeb9c7083739593006de5a2c7510c4ef41ed6b8902b
data/.rspec_status ADDED
@@ -0,0 +1,4 @@
1
+ example_id | status | run_time |
2
+ ---------------------------- | ------ | --------------- |
3
+ ./spec/scrub_db_spec.rb[1:1] | passed | 0.00148 seconds |
4
+ ./spec/scrub_db_spec.rb[1:2] | failed | 0.01648 seconds |
data/README.md CHANGED
@@ -1,5 +1,5 @@
1
1
  # ScrubDb
2
- #### Scrub data with your custom criteria. Returns detailed reporting.
2
+ #### Scrub your database, API data, web scraping data, and web form submissions based on your custom criteria. Allows for different criteria for different jobs. Returns detailed reporting to zero in on your data with ease, efficiency, and greater insight. Also lets you pre-format data before scrubbing to normalize and standardize your data sets, e.g., uniform URL patterns.
3
3
 
4
4
  ## Installation
5
5
 
@@ -19,17 +19,65 @@ Or install it yourself as:
19
19
 
20
20
  ## Usage
21
21
 
22
- More methods coming soon. Currently, Scrub Array of URLs is fully functional.
22
+ ### I. Usage Overview
23
+
24
+
25
+ #### Step 1: Load Your Scrub Criteria:
26
+
27
+ ##### 1. For String Criteria
28
+ ```
29
+ strings_criteria = {
30
+ pos_criteria: %w[your positive criteria here],
31
+ neg_criteria: %w[your negative criteria here]
32
+ }
33
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
34
+ ```
35
+
36
+ ##### 2. For Web Criteria
37
+ ```
38
+ webs_criteria = {
39
+ pos_criteria: %w[your positive criteria here],
40
+ neg_criteria: %w[your negative criteria here]
41
+ }
42
+ webs_obj = ScrubDb::Webs.new(webs_criteria)
43
+ ```
44
+
45
+
46
+
47
+ #### Step 2: Load Your Data to Scrub:
48
+
49
+ ##### Methods available to scrub data:
50
+
51
+ ##### 1. Scrub URLs:
52
+ ```
53
+ scrub_web_obj = ScrubDb::Webs.new(criteria)
54
+ scrubbed_web_hashes = scrub_web_obj.scrub_urls(urls)
55
+ ```
56
+
57
+ ##### 2. Scrub Strings:
58
+ ```
59
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
60
+ scrubbed_strings = strings_obj.scrub_strings(array_of_strings)
61
+ ```
62
+
63
+ ##### 3. Scrub Proper Strings:
64
+ ```
65
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
66
+ scrubbed_prop_strings = strings_obj.scrub_proper_strings(array_of_props)
67
+ ```
68
+
69
+
70
+ ### II. Usage Details
23
71
 
24
72
  ### 1. Scrub Array of URLs:
25
73
  This is an example of scrubbing auto dealership URLs. We only want URLs based in the US with paths that point to staff pages. Most of our URLs are good, but we want to confirm that they all meet our requirements.
26
74
 
27
75
  ### A. Pass in Scrub Criteria
28
- First step is to load your web criteria in hash format. It's not required to enter all the keys below, but for those you are using, each key must be a symbol and be exactly the same as the ones below. The values must each be an array of strings.
76
+ The first step is to load your Webs criteria in hash format. You don't need to include every key shown below, but each key you do use must be a symbol that exactly matches the ones below, and each value must be an array of strings.
29
77
 
30
78
  ```
31
79
  criteria = {
32
- neg_urls: %w[pprov avis budget collis eat],
80
+ neg_urls: %w[aprov avis budget collis eat],
33
81
  pos_urls: %w[acura audi bmw bentley],
34
82
  neg_paths: %w[buy bye call cash cheap click collis cont distrib],
35
83
  pos_paths: %w[team staff management],
@@ -37,17 +85,18 @@ criteria = {
37
85
  pos_exts: %w[com net]
38
86
  }
39
87
 
40
- web_obj = ScrubDb::Web.new(criteria)
88
+ scrub_web_obj = ScrubDb::Webs.new(criteria)
41
89
  ```
42
90
 
43
91
  ### B. Pass in URLs List
92
+
44
93
  Next, pass your list of URLs to `scrub_urls(urls)` with the syntax below.
45
94
 
46
95
  ```
47
96
  urls = %w[
97
+ austinchevrolet.not.real
48
98
  smith_acura.com/staff
49
99
  abcrepair.ca
50
- austinchevrolet.not.real
51
100
  hertzrentals.com/review
52
101
  londonhyundai.uk/fleet
53
102
  http://www.townbuick.net/staff
@@ -62,7 +111,7 @@ urls = %w[
62
111
  www.www.yellowpages.com/business
63
112
  ]
64
113
 
65
- scrubbed_web_hashes = web_obj.scrub_urls(urls)
114
+ scrubbed_web_hashes = scrub_web_obj.scrub_urls(urls)
66
115
  ```
67
116
 
68
117
  ### C. Returned Results
@@ -71,7 +120,20 @@ Notice that the URLs in the list above are NOT uniformly formatted. ScrubDb lev
71
120
  ```
72
121
  scrubbed_web_hashes = [
73
122
  {
74
- web_status: 'formatted',
123
+ web_status: 'invalid',
124
+ url: 'austinchevrolet.not.real',
125
+ url_f: nil,
126
+ url_path: nil,
127
+ web_neg: 'error: ext.invalid [not, real]',
128
+ url_exts: [],
129
+ neg_exts: [],
130
+ pos_exts: [],
131
+ neg_paths: [],
132
+ pos_paths: [],
133
+ neg_urls: [],
134
+ pos_urls: []
135
+ },
136
+ { web_status: 'formatted',
75
137
  url: 'smith_acura.com/staff',
76
138
  url_f: 'http://www.smith_acura.com',
77
139
  url_path: '/staff',
@@ -84,8 +146,7 @@ scrubbed_web_hashes = [
84
146
  neg_urls: [],
85
147
  pos_urls: ['acura']
86
148
  },
87
- {
88
- web_status: 'formatted',
149
+ { web_status: 'formatted',
89
150
  url: 'abcrepair.ca',
90
151
  url_f: 'http://www.abcrepair.ca',
91
152
  url_path: nil,
@@ -98,8 +159,7 @@ scrubbed_web_hashes = [
98
159
  neg_urls: ['repair'],
99
160
  pos_urls: []
100
161
  },
101
- {
102
- web_status: 'formatted',
162
+ { web_status: 'formatted',
103
163
  url: 'hertzrentals.com/review',
104
164
  url_f: 'http://www.hertzrentals.com',
105
165
  url_path: '/review',
@@ -112,8 +172,7 @@ scrubbed_web_hashes = [
112
172
  neg_urls: ['hertz, rent'],
113
173
  pos_urls: []
114
174
  },
115
- {
116
- web_status: 'formatted',
175
+ { web_status: 'formatted',
117
176
  url: 'londonhyundai.uk/fleet',
118
177
  url_f: 'http://www.londonhyundai.uk',
119
178
  url_path: '/fleet',
@@ -126,8 +185,7 @@ scrubbed_web_hashes = [
126
185
  neg_urls: [],
127
186
  pos_urls: ['hyundai']
128
187
  },
129
- {
130
- web_status: 'formatted',
188
+ { web_status: 'formatted',
131
189
  url: 'http://www.townbuick.net/staff',
132
190
  url_f: 'http://www.townbuick.net',
133
191
  url_path: nil,
@@ -140,8 +198,7 @@ scrubbed_web_hashes = [
140
198
  neg_urls: [],
141
199
  pos_urls: ['buick']
142
200
  },
143
- {
144
- web_status: 'formatted',
201
+ { web_status: 'formatted',
145
202
  url: 'http://youtube.com/download',
146
203
  url_f: 'http://www.youtube.com',
147
204
  url_path: nil,
@@ -154,8 +211,7 @@ scrubbed_web_hashes = [
154
211
  neg_urls: ['youtube'],
155
212
  pos_urls: []
156
213
  },
157
- {
158
- web_status: 'formatted',
214
+ { web_status: 'formatted',
159
215
  url: 'www.madridinfiniti.es/collision',
160
216
  url_f: 'http://www.madridinfiniti.es',
161
217
  url_path: '/collision',
@@ -168,8 +224,20 @@ scrubbed_web_hashes = [
168
224
  neg_urls: [],
169
225
  pos_urls: ['infiniti']
170
226
  },
171
- {
172
- web_status: 'formatted',
227
+ { web_status: 'invalid',
228
+ url: 'www.mitsubishideals.sofake',
229
+ url_f: nil,
230
+ url_path: nil,
231
+ web_neg: 'error: ext.invalid [sofake]',
232
+ url_exts: [],
233
+ neg_exts: [],
234
+ pos_exts: [],
235
+ neg_paths: [],
236
+ pos_paths: [],
237
+ neg_urls: [],
238
+ pos_urls: []
239
+ },
240
+ { web_status: 'formatted',
173
241
  url: 'www.dallassubaru.com.sofake',
174
242
  url_f: 'http://www.dallassubaru.com',
175
243
  url_path: nil,
@@ -182,8 +250,7 @@ scrubbed_web_hashes = [
182
250
  neg_urls: [],
183
251
  pos_urls: ['subaru']
184
252
  },
185
- {
186
- web_status: 'formatted',
253
+ { web_status: 'formatted',
187
254
  url: 'www.quickeats.net/contact_us',
188
255
  url_f: 'http://www.quickeats.net',
189
256
  url_path: '/contact_us',
@@ -196,8 +263,7 @@ scrubbed_web_hashes = [
196
263
  neg_urls: ['eat, quick'],
197
264
  pos_urls: []
198
265
  },
199
- {
200
- web_status: 'formatted',
266
+ { web_status: 'formatted',
201
267
  url: 'www.school.edu/teachers',
202
268
  url_f: 'http://www.school.edu',
203
269
  url_path: '/teachers',
@@ -210,8 +276,20 @@ scrubbed_web_hashes = [
210
276
  neg_urls: [],
211
277
  pos_urls: []
212
278
  },
213
- {
214
- web_status: 'formatted',
279
+ { web_status: 'invalid',
280
+ url: 'www.www.nissancars/inventory',
281
+ url_f: nil,
282
+ url_path: nil,
283
+ web_neg: 'error: ext.none',
284
+ url_exts: [],
285
+ neg_exts: [],
286
+ pos_exts: [],
287
+ neg_paths: [],
288
+ pos_paths: [],
289
+ neg_urls: [],
290
+ pos_urls: []
291
+ },
292
+ { web_status: 'formatted',
215
293
  url: 'www.www.toyotatown.net/staff/management',
216
294
  url_f: 'http://www.toyotatown.net',
217
295
  url_path: '/staff/management',
@@ -220,12 +298,11 @@ scrubbed_web_hashes = [
220
298
  neg_exts: [],
221
299
  pos_exts: ['net'],
222
300
  neg_paths: [],
223
- pos_paths: ['staff, management'],
301
+ pos_paths: ['management, staff'],
224
302
  neg_urls: [],
225
303
  pos_urls: ['toyota']
226
304
  },
227
- {
228
- web_status: 'formatted',
305
+ { web_status: 'formatted',
229
306
  url: 'www.www.yellowpages.com/business',
230
307
  url_f: 'http://www.yellowpages.com',
231
308
  url_path: '/business',
@@ -242,6 +319,313 @@ scrubbed_web_hashes = [
242
319
  ```
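
Since every returned hash carries the same reporting keys, downstream filtering stays simple. A minimal sketch (illustrative only, not part of the gem; it assumes only the keys shown in the sample output above):

```
# Triage the scrubbed hashes by the reported status and criteria keys.
valid, invalid = scrubbed_web_hashes.partition { |hsh| hsh[:web_status] == 'formatted' }

flagged = valid.select do |hsh|
  hsh[:neg_urls].any? || hsh[:neg_paths].any? || hsh[:neg_exts].any?
end

keepers = valid - flagged
```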
243
320
 
244
321
 
322
+ ### 2. Scrub Array of Strings:
323
+
324
+ You can scrub an array of strings with or without formatting.
325
+ For proper strings like those below (account and business names, job titles, article titles, brands, locations, etc.), you might prefer the proper scrub method, but these examples use the same criteria and the same array of strings to illustrate the difference.
326
+
327
+ Continuing with the auto dealership example above, the following examples scrub the auto dealership account names. We want to prioritize our data by which records match our positive criteria, which match our negative criteria, and which are neutral.
328
+
329
+ ### A. Pass in Scrub Criteria
330
+ The first step is to load your Strings criteria in hash format. You don't need to include every key shown below, but each key you do use must be a symbol that exactly matches the ones below, and each value must be an array of strings.
331
+
332
+ ```
333
+ strings_criteria = {
334
+ neg_urls: %w[aprov avis budget collis eat],
335
+ pos_urls: %w[acura audi bmw bentley],
336
+ neg_paths: %w[buy bye call cash cheap click collis cont distrib],
337
+ pos_paths: %w[team staff management],
338
+ neg_exts: %w[au ca edu es gov in ru uk us],
339
+ pos_exts: %w[com net]
340
+ }
341
+
342
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
343
+ ```
344
+
345
+ ### B. Pass in Strings List
346
+
347
+ Next, pass your list of strings to `scrub_strings(strings)` with the syntax below.
348
+
349
+ ```
350
+ array_of_strings = [
351
+ 'quick auto approval, inc',
352
+ 'the gmc and bmw-world of AUSTIN tx',
353
+ 'DOWNTOWN CAR REPAIR, INC',
354
+ 'TEXAS TRAVEL, CO',
355
+ '123 Car-world Kia OF CHICAGO IL',
356
+ 'Main Street Ford in DALLAS tX',
357
+ 'broad st fiat of houston',
358
+ 'hot-deal auto insurance',
359
+ 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
360
+ 'Young Gmc Trucks',
361
+ 'youmans Chevrolet',
362
+ 'yazell chevy',
363
+ 'quick cAr LUBE',
364
+ 'yAtEs AuTo maLL',
365
+ 'YADKIN VALLEY COLLISION CO',
366
+ 'XIT FORD INC'
367
+ ]
368
+
369
+ scrubbed_strings = strings_obj.scrub_strings(array_of_strings)
370
+ ```
371
+
372
+ ### C. Returned Results
373
+
374
+ ```
375
+ scrubbed_strings = [
376
+ {
377
+ string: 'quick auto approval, inc',
378
+ pos_criteria: [],
379
+ neg_criteria: ['approv, quick']
380
+ },
381
+ {
382
+ string: 'the gmc and bmw-world of AUSTIN tx',
383
+ pos_criteria: ['bmw, gmc'],
384
+ neg_criteria: []
385
+ },
386
+ {
387
+ string: 'DOWNTOWN CAR REPAIR, INC',
388
+ pos_criteria: [],
389
+ neg_criteria: ['repair']
390
+ },
391
+ {
392
+ string: 'TEXAS TRAVEL, CO',
393
+ pos_criteria: [],
394
+ neg_criteria: ['travel']
395
+ },
396
+ {
397
+ string: '123 Car-world Kia OF CHICAGO IL',
398
+ pos_criteria: ['kia'],
399
+ neg_criteria: []
400
+ },
401
+ {
402
+ string: 'Main Street Ford in DALLAS tX',
403
+ pos_criteria: ['ford'],
404
+ neg_criteria: []
405
+ },
406
+ {
407
+ string: 'broad st fiat of houston',
408
+ pos_criteria: ['fiat'],
409
+ neg_criteria: []
410
+ },
411
+ {
412
+ string: 'hot-deal auto insurance',
413
+ pos_criteria: [],
414
+ neg_criteria: ['insur']
415
+ },
416
+ {
417
+ string: 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
418
+ pos_criteria: [],
419
+ neg_criteria: ['budget']
420
+ },
421
+ {
422
+ string: 'Young Gmc Trucks',
423
+ pos_criteria: ['gmc'],
424
+ neg_criteria: []
425
+ },
426
+ {
427
+ string: 'youmans Chevrolet',
428
+ pos_criteria: ['chevrolet'],
429
+ neg_criteria: []
430
+ },
431
+ {
432
+ string: 'yazell chevy',
433
+ pos_criteria: [],
434
+ neg_criteria: []
435
+ },
436
+ {
437
+ string: 'quick cAr LUBE',
438
+ pos_criteria: [],
439
+ neg_criteria: ['lube, quick']
440
+ },
441
+ {
442
+ string: 'yAtEs AuTo maLL',
443
+ pos_criteria: [],
444
+ neg_criteria: []
445
+ },
446
+ {
447
+ string: 'YADKIN VALLEY COLLISION CO',
448
+ pos_criteria: [],
449
+ neg_criteria: ['collis']
450
+ },
451
+ {
452
+ string: 'XIT FORD INC',
453
+ pos_criteria: ['ford'],
454
+ neg_criteria: []
455
+ }
456
+ ]
457
+ ```
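
To act on this report, you can bucket the records the way the intro above describes: positive matches, negative matches, and neutral. A rough sketch (illustrative only, not gem code; it assumes the `pos_criteria`/`neg_criteria` keys shown above):

```
negative = scrubbed_strings.select { |hsh| hsh[:neg_criteria].any? }
positive = scrubbed_strings.select { |hsh| hsh[:pos_criteria].any? && hsh[:neg_criteria].empty? }
neutral  = scrubbed_strings - positive - negative
```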
458
+
459
+
460
+ ### 3. Scrub Array of Proper Strings:
461
+ This method is designed for scrubbing proper strings, like account and business names, job titles, article titles, brands, locations, etc.
462
+
463
+ This method is identical to example 2 above (Scrub Array of Strings), except that it first formats the strings with the `Utf8Sanitizer` and `CrmFormatter` gems, then passes the results to the method above to scrub. So it's a 2-in-1 method: Format + Scrub. Because it treats your strings as proper nouns, compare the results of the two methods to determine which is most suitable for your data.
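
Because both methods accept the same input, it can help to run them side by side before choosing one. A rough comparison sketch (illustrative only, not gem code; it reuses `strings_obj` and `array_of_strings` from section 2 and assumes both methods return results in input order):

```
plain   = strings_obj.scrub_strings(array_of_strings)
propers = strings_obj.scrub_proper_strings(array_of_strings)

plain.zip(propers).each do |str_hsh, prop_hsh|
  puts "#{prop_hsh[:proper_f]} | plain neg: #{str_hsh[:neg_criteria]} | proper neg: #{prop_hsh[:neg_criteria]}"
end
```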
464
+
465
+ ### A. Pass in Scrub Criteria
466
+ The first step is to load your Strings criteria in hash format. You don't need to include every key shown below, but each key you do use must be a symbol that exactly matches the ones below, and each value must be an array of strings.
467
+
468
+ ```
469
+ strings_criteria = {
470
+ neg_urls: %w[aprov avis budget collis eat],
471
+ pos_urls: %w[acura audi bmw bentley],
472
+ neg_paths: %w[buy bye call cash cheap click collis cont distrib],
473
+ pos_paths: %w[team staff management],
474
+ neg_exts: %w[au ca edu es gov in ru uk us],
475
+ pos_exts: %w[com net]
476
+ }
477
+
478
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
479
+ ```
480
+
481
+ ### B. Pass in Strings List
482
+
483
+ Next, pass your list of strings to `scrub_proper_strings(strings)` with the syntax below.
484
+
485
+ ```
486
+ array_of_strings = [
487
+ 'quick auto approval, inc',
488
+ 'the gmc and bmw-world of AUSTIN tx',
489
+ 'DOWNTOWN CAR REPAIR, INC',
490
+ 'TEXAS TRAVEL, CO',
491
+ '123 Car-world Kia OF CHICAGO IL',
492
+ 'Main Street Ford in DALLAS tX',
493
+ 'broad st fiat of houston',
494
+ 'hot-deal auto insurance',
495
+ 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
496
+ 'Young Gmc Trucks',
497
+ 'youmans Chevrolet',
498
+ 'yazell chevy',
499
+ 'quick cAr LUBE',
500
+ 'yAtEs AuTo maLL',
501
+ 'YADKIN VALLEY COLLISION CO',
502
+ 'XIT FORD INC'
503
+ ]
504
+
505
+ scrubbed_proper_strings = strings_obj.scrub_proper_strings(array_of_strings)
506
+ ```
507
+
508
+ ### C. Returned Results
509
+
510
+ ```
511
+ scrubbed_proper_strings = [
512
+ {
513
+ proper_status: 'formatted',
514
+ proper: 'quick auto approval, inc',
515
+ proper_f: 'Quick Auto Approval, Inc',
516
+ pos_criteria: [],
517
+ neg_criteria: ['approv, quick']
518
+ },
519
+ {
520
+ proper_status: 'formatted',
521
+ proper: 'the gmc and bmw-world of AUSTIN tx',
522
+ proper_f: 'The GMC and BMW-World of Austin TX',
523
+ pos_criteria: ['bmw, gmc'],
524
+ neg_criteria: []
525
+ },
526
+ {
527
+ proper_status: 'formatted',
528
+ proper: 'DOWNTOWN CAR REPAIR, INC',
529
+ proper_f: 'Downtown Car Repair, Inc',
530
+ pos_criteria: [],
531
+ neg_criteria: ['repair']
532
+ },
533
+ {
534
+ proper_status: 'formatted',
535
+ proper: 'TEXAS TRAVEL, CO',
536
+ proper_f: 'Texas Travel, Co',
537
+ pos_criteria: [],
538
+ neg_criteria: ['travel']
539
+ },
540
+ {
541
+ proper_status: 'formatted',
542
+ proper: '123 Car-world Kia OF CHICAGO IL',
543
+ proper_f: '123 Car-World Kia of Chicago IL',
544
+ pos_criteria: ['kia'],
545
+ neg_criteria: []
546
+ },
547
+ {
548
+ proper_status: 'formatted',
549
+ proper: 'Main Street Ford in DALLAS tX',
550
+ proper_f: 'Main Street Ford in Dallas TX',
551
+ pos_criteria: ['ford'],
552
+ neg_criteria: []
553
+ },
554
+ {
555
+ proper_status: 'formatted',
556
+ proper: 'broad st fiat of houston',
557
+ proper_f: 'Broad St Fiat of Houston',
558
+ pos_criteria: ['fiat'],
559
+ neg_criteria: []
560
+ },
561
+ {
562
+ proper_status: 'formatted',
563
+ proper: 'hot-deal auto insurance',
564
+ proper_f: 'Hot-Deal Auto Insurance',
565
+ pos_criteria: [],
566
+ neg_criteria: ['insur']
567
+ },
568
+ {
569
+ proper_status: 'formatted',
570
+ proper: 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
571
+ proper_f: 'Budget - Automotores Zona & Franca, Inc',
572
+ pos_criteria: [],
573
+ neg_criteria: ['budget']
574
+ },
575
+ {
576
+ proper_status: 'formatted',
577
+ proper: 'Young Gmc Trucks',
578
+ proper_f: 'Young GMC Trucks',
579
+ pos_criteria: ['gmc'],
580
+ neg_criteria: []
581
+ },
582
+ {
583
+ proper_status: 'formatted',
584
+ proper: 'youmans Chevrolet',
585
+ proper_f: 'Youmans Chevrolet',
586
+ pos_criteria: ['chevrolet'],
587
+ neg_criteria: []
588
+ },
589
+ {
590
+ proper_status: 'formatted',
591
+ proper: 'yazell chevy',
592
+ proper_f: 'Yazell Chevy',
593
+ pos_criteria: [],
594
+ neg_criteria: []
595
+ },
596
+ {
597
+ proper_status: 'formatted',
598
+ proper: 'quick cAr LUBE',
599
+ proper_f: 'Quick Car Lube',
600
+ pos_criteria: [],
601
+ neg_criteria: ['lube, quick']
602
+ },
603
+ {
604
+ proper_status: 'formatted',
605
+ proper: 'yAtEs AuTo maLL',
606
+ proper_f: 'Yates Auto Mall',
607
+ pos_criteria: [],
608
+ neg_criteria: []
609
+ },
610
+ {
611
+ proper_status: 'formatted',
612
+ proper: 'YADKIN VALLEY COLLISION CO',
613
+ proper_f: 'Yadkin Valley Collision Co',
614
+ pos_criteria: [],
615
+ neg_criteria: ['collis']
616
+ },
617
+ {
618
+ proper_status: 'formatted',
619
+ proper: 'XIT FORD INC',
620
+ proper_f: 'Xit Ford Inc',
621
+ pos_criteria: ['ford'],
622
+ neg_criteria: []
623
+ }
624
+ ]
625
+
626
+ ```
627
+
628
+
245
629
  ## Author
246
630
 
247
631
  Adam J Booth - [4rlm](https://github.com/4rlm)
data/Rakefile CHANGED
@@ -1,7 +1,7 @@
1
1
  require "bundler/gem_tasks"
2
2
  require "rspec/core/rake_task"
3
3
  require 'scrub_db'
4
- require 'web_criteria'
4
+ require 'webs_criteria'
5
5
 
6
6
 
7
7
  RSpec::Core::RakeTask.new(:spec)
@@ -17,17 +17,81 @@ task :console do
17
17
  require "active_support/all"
18
18
  ARGV.clear
19
19
 
20
- scrubbed_urls = scrub_sample_urls
21
- binding.pry
20
+ scrubbed_webs = run_scrub_webs
21
+ # scrubbed_strings = run_scrub_strings
22
+ # scrubbed_proper_strings = run_scrub_proper_strings
23
+ # binding.pry
22
24
 
23
25
  IRB.start
24
26
  end
25
27
 
26
- def scrub_sample_urls
28
+
29
+ def run_scrub_strings
30
+ strings_criteria = {
31
+ pos_criteria: WebsCriteria.seed_pos_urls,
32
+ neg_criteria: WebsCriteria.seed_neg_urls
33
+ }
34
+
35
+ array_of_strings = [
36
+ 'quick auto approval, inc',
37
+ 'the gmc and bmw-world of AUSTIN tx',
38
+ 'DOWNTOWN CAR REPAIR, INC',
39
+ 'TEXAS TRAVEL, CO',
40
+ '123 Car-world Kia OF CHICAGO IL',
41
+ 'Main Street Ford in DALLAS tX',
42
+ 'broad st fiat of houston',
43
+ 'hot-deal auto insurance',
44
+ 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
45
+ 'Young Gmc Trucks',
46
+ 'youmans Chevrolet',
47
+ 'yazell chevy',
48
+ 'quick cAr LUBE',
49
+ 'yAtEs AuTo maLL',
50
+ 'YADKIN VALLEY COLLISION CO',
51
+ 'XIT FORD INC'
52
+ ]
53
+
54
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
55
+ scrubbed_strings = strings_obj.scrub_strings(array_of_strings)
56
+ end
57
+
58
+
59
+ def run_scrub_proper_strings
60
+ strings_criteria = {
61
+ pos_criteria: WebsCriteria.seed_pos_urls,
62
+ neg_criteria: WebsCriteria.seed_neg_urls
63
+ }
64
+
65
+ array_of_propers = [
66
+ 'quick auto approval, inc',
67
+ 'the gmc and bmw-world of AUSTIN tx',
68
+ 'DOWNTOWN CAR REPAIR, INC',
69
+ 'TEXAS TRAVEL, CO',
70
+ '123 Car-world Kia OF CHICAGO IL',
71
+ 'Main Street Ford in DALLAS tX',
72
+ 'broad st fiat of houston',
73
+ 'hot-deal auto insurance',
74
+ 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
75
+ 'Young Gmc Trucks',
76
+ 'youmans Chevrolet',
77
+ 'yazell chevy',
78
+ 'quick cAr LUBE',
79
+ 'yAtEs AuTo maLL',
80
+ 'YADKIN VALLEY COLLISION CO',
81
+ 'XIT FORD INC'
82
+ ]
83
+
84
+ strings_obj = ScrubDb::Strings.new(strings_criteria)
85
+ scrubbed_proper_strings = strings_obj.scrub_proper_strings(array_of_propers)
86
+ end
87
+
88
+
89
+
90
+ def run_scrub_webs
27
91
  urls = %w[
92
+ austinchevrolet.not.real
28
93
  smith_acura.com/staff
29
94
  abcrepair.ca
30
- austinchevrolet.not.real
31
95
  hertzrentals.com/review
32
96
  londonhyundai.uk/fleet
33
97
  http://www.townbuick.net/staff
@@ -42,6 +106,6 @@ def scrub_sample_urls
42
106
  www.www.yellowpages.com/business
43
107
  ]
44
108
 
45
- web_obj = ScrubDb::Web.new(WebCriteria.all_web_criteria)
46
- scrubbed_webs = web_obj.scrub_urls(urls)
109
+ webs_obj = ScrubDb::Webs.new(WebsCriteria.all_scrub_web_criteria)
110
+ scrubbed_webs = webs_obj.scrub_urls(urls)
47
111
  end
data/junk.rb ADDED
@@ -0,0 +1,114 @@
1
+ [
2
+ {
3
+ proper_status: 'formatted',
4
+ proper: 'quick auto approval, inc',
5
+ proper_f: 'Quick Auto Approval, Inc',
6
+ pos_criteria: [],
7
+ neg_criteria: ['approv, quick']
8
+ },
9
+ {
10
+ proper_status: 'formatted',
11
+ proper: 'the gmc and bmw-world of AUSTIN tx',
12
+ proper_f: 'The GMC and BMW-World of Austin TX',
13
+ pos_criteria: ['bmw, gmc'],
14
+ neg_criteria: []
15
+ },
16
+ {
17
+ proper_status: 'formatted',
18
+ proper: 'DOWNTOWN CAR REPAIR, INC',
19
+ proper_f: 'Downtown Car Repair, Inc',
20
+ pos_criteria: [],
21
+ neg_criteria: ['repair']
22
+ },
23
+ {
24
+ proper_status: 'formatted',
25
+ proper: 'TEXAS TRAVEL, CO',
26
+ proper_f: 'Texas Travel, Co',
27
+ pos_criteria: [],
28
+ neg_criteria: ['travel']
29
+ },
30
+ {
31
+ proper_status: 'formatted',
32
+ proper: '123 Car-world Kia OF CHICAGO IL',
33
+ proper_f: '123 Car-World Kia of Chicago IL',
34
+ pos_criteria: ['kia'],
35
+ neg_criteria: []
36
+ },
37
+ {
38
+ proper_status: 'formatted',
39
+ proper: 'Main Street Ford in DALLAS tX',
40
+ proper_f: 'Main Street Ford in Dallas TX',
41
+ pos_criteria: ['ford'],
42
+ neg_criteria: []
43
+ },
44
+ {
45
+ proper_status: 'formatted',
46
+ proper: 'broad st fiat of houston',
47
+ proper_f: 'Broad St Fiat of Houston',
48
+ pos_criteria: ['fiat'],
49
+ neg_criteria: []
50
+ },
51
+ {
52
+ proper_status: 'formatted',
53
+ proper: 'hot-deal auto insurance',
54
+ proper_f: 'Hot-Deal Auto Insurance',
55
+ pos_criteria: [],
56
+ neg_criteria: ['insur']
57
+ },
58
+ {
59
+ proper_status: 'formatted',
60
+ proper: 'BUDGET - AUTOMOTORES ZONA & FRANCA, INC',
61
+ proper_f: 'Budget - Automotores Zona & Franca, Inc',
62
+ pos_criteria: [],
63
+ neg_criteria: ['budget']
64
+ },
65
+ {
66
+ proper_status: 'formatted',
67
+ proper: 'Young Gmc Trucks',
68
+ proper_f: 'Young GMC Trucks',
69
+ pos_criteria: ['gmc'],
70
+ neg_criteria: []
71
+ },
72
+ {
73
+ proper_status: 'formatted',
74
+ proper: 'youmans Chevrolet',
75
+ proper_f: 'Youmans Chevrolet',
76
+ pos_criteria: ['chevrolet'],
77
+ neg_criteria: []
78
+ },
79
+ {
80
+ proper_status: 'formatted',
81
+ proper: 'yazell chevy',
82
+ proper_f: 'Yazell Chevy',
83
+ pos_criteria: [],
84
+ neg_criteria: []
85
+ },
86
+ {
87
+ proper_status: 'formatted',
88
+ proper: 'quick cAr LUBE',
89
+ proper_f: 'Quick Car Lube',
90
+ pos_criteria: [],
91
+ neg_criteria: ['lube, quick']
92
+ },
93
+ {
94
+ proper_status: 'formatted',
95
+ proper: 'yAtEs AuTo maLL',
96
+ proper_f: 'Yates Auto Mall',
97
+ pos_criteria: [],
98
+ neg_criteria: []
99
+ },
100
+ {
101
+ proper_status: 'formatted',
102
+ proper: 'YADKIN VALLEY COLLISION CO',
103
+ proper_f: 'Yadkin Valley Collision Co',
104
+ pos_criteria: [],
105
+ neg_criteria: ['collis']
106
+ },
107
+ {
108
+ proper_status: 'formatted',
109
+ proper: 'XIT FORD INC',
110
+ proper_f: 'Xit Ford Inc',
111
+ pos_criteria: ['ford'],
112
+ neg_criteria: []
113
+ }
114
+ ]
@@ -5,47 +5,66 @@ module ScrubDb
5
5
 
6
6
  def initialize(args={})
7
7
  @args = args
8
- # @global_hash = grab_global_hash
9
8
  @empty_criteria = args.empty?
10
9
  end
11
10
 
12
11
  def scrub_oa(hash, target, oa_name, include_or_equal)
13
12
  return hash unless oa_name.present? && !@empty_criteria && target.present?
14
- criteria = @args.fetch(oa_name.to_sym, [])
13
+ criteria = fetch_criteria(oa_name)
15
14
 
16
15
  return hash unless criteria.any?
17
- tars = target.is_a?(::String) ? target.split(', ') : target
18
- binding.pry if !tars.present?
16
+ target = prep_target(target)
17
+ tars = target_to_tars(target)
18
+ scrub_matches = match_criteria(tars, include_or_equal, criteria)
19
+ string_match = stringify_matches(scrub_matches)
20
+ hash = match_to_hash(hash, string_match, oa_name)
21
+ end
22
+
23
+ def match_to_hash(hsh, match, oa_name)
24
+ return hsh unless match.present?
25
+ hsh[oa_name.to_sym] << match
26
+ hsh
27
+ end
19
28
 
29
+ def stringify_matches(matches=[])
30
+ string_match = matches&.uniq&.sort&.join(', ') if matches.any?
31
+ end
32
+
33
+ def fetch_criteria(oa_name)
34
+ criteria = @args.fetch(oa_name.to_sym, [])
35
+ criteria = criteria&.map(&:downcase)
36
+ end
37
+
38
+
39
+ def match_criteria(tars, include_or_equal, criteria)
20
40
  scrub_matches = tars.map do |tar|
21
- return hash unless criteria.present?
22
41
  if include_or_equal == 'include'
23
- criteria.select { |crit| crit if tar.include?(crit) }.join(', ')
42
+ criteria.map { |crit| crit if tar.include?(crit) }
24
43
  elsif include_or_equal == 'equal'
25
- criteria.select { |crit| crit if tar == crit }.join(', ')
44
+ criteria.map { |crit| crit if tar == crit }
26
45
  end
27
46
  end
47
+ scrub_matches = scrub_matches.flatten.compact
48
+ end
28
49
 
29
- scrub_match = scrub_matches&.uniq&.sort&.join(', ')
30
- return hash unless scrub_match.present?
31
-
32
- hash[oa_name.to_sym] << scrub_match
33
- hash
50
+ def prep_target(target)
51
+ target = target.join if target.is_a?(Array)
52
+ target = target.downcase
53
+ target = target.gsub(',', ' ')
54
+ target = target.gsub('-', ' ')
55
+ target = target.squeeze(' ')
56
+ end
34
57
 
35
- ### Delete below after testing above. ###
36
- # scrub_match = scrub_matches&.uniq&.sort&.join(', ')
37
- # return hash unless scrub_match.present?
38
- # if oa_name.include?('web_neg')
39
- # hash[:web_neg] << "#{oa_name}: #{scrub_match}"
40
- # else
41
- # hash[:web_pos] << "#{oa_name}: #{scrub_match}"
42
- # end
58
+ def target_to_tars(target)
59
+ tars = target.is_a?(::String) ? target.split(' ') : target
43
60
  end
61
+
62
+
44
63
  ######################################
45
64
 
46
65
 
47
66
  # def grab_global_hash
48
- # keys = %i[row_id act_name street city state zip full_addr phone url street_f city_f state_f zip_f full_addr_f phone_f url_f url_path web_neg address_status phone_status web_status utf_status]
67
+ # keys = %i[row_id act_name street city state zip full_addr phone url street_f city_f state_f zip_f full_addr_f phone_f url_f url_path ScrubWeb_neg address_status phone_status ScrubWeb_status utf_status]
49
68
  # @global_hash = Hash[keys.map { |a| [a, nil] }]
50
69
  # end
51
70
 
@@ -0,0 +1,52 @@
1
+
2
+
3
+ module ScrubDb
4
+ class Strings
5
+ # attr_accessor :headers, :valid_rows, :encoded_rows, :row_id, :data_hash, :defective_rows, :error_rows
6
+
7
+ def initialize(criteria={})
8
+ @empty_criteria = criteria&.empty?
9
+ @filter = ScrubDb::Filter.new(criteria) unless @empty_criteria
10
+ end
11
+
12
+ def scrub_proper_strings(props=[])
13
+ prop_hashes = CrmFormatter.format_propers(props)
14
+ prop_hashes = merge_criteria(prop_hashes)
15
+ prop_hashes.map! { |prop_hsh| scrub_hash(prop_hsh) }
16
+ end
17
+
18
+ def scrub_strings(strings=[])
19
+ str_hashes = strings_to_hashes(strings)
20
+ str_hashes = merge_criteria(str_hashes)
21
+ str_hashes.map! { |str_hsh| scrub_hash(str_hsh) }
22
+ end
23
+
24
+ def strings_to_hashes(strings)
25
+ str_hashes = strings.map { |str| { string: str } }
26
+ end
27
+
28
+ def merge_criteria(hashes)
29
+ hashes.map do |hsh|
30
+ hsh.merge({ pos_criteria: [], neg_criteria: [] })
31
+ end
32
+ end
33
+
34
+ def scrub_hash(hsh)
35
+ str = hsh[:string]
36
+ prop = hsh[:proper_f]
37
+
38
+ if str.present?
39
+ hsh = @filter.scrub_oa(hsh, str, 'neg_criteria', 'include')
40
+ hsh = @filter.scrub_oa(hsh, str, 'pos_criteria', 'include')
41
+ end
42
+
43
+ if prop.present?
44
+ hsh = @filter.scrub_oa(hsh, prop, 'neg_criteria', 'include')
45
+ hsh = @filter.scrub_oa(hsh, prop, 'pos_criteria', 'include')
46
+ end
47
+ hsh
48
+ end
49
+
50
+ end
51
+
52
+ end
@@ -1,3 +1,3 @@
1
1
  module ScrubDb
2
- VERSION = "0.0.1.pre.rc.03"
2
+ VERSION = "2.0"
3
3
  end
@@ -0,0 +1,70 @@
1
+
2
+
3
+ module ScrubDb
4
+ class Webs
5
+ # attr_accessor :headers, :valid_rows, :encoded_rows, :row_id, :data_hash, :defective_rows, :error_rows
6
+
7
+ def initialize(criteria={})
8
+ @empty_criteria = criteria&.empty?
9
+ @filter = ScrubDb::Filter.new(criteria) unless @empty_criteria
10
+ end
11
+
12
+ def scrub_urls(urls=[])
13
+ formatted_url_hashes = CrmFormatter.format_urls(urls)
14
+ formatted_url_hashes = merge_criteria_hashes(formatted_url_hashes)
15
+ formatted_url_hashes = pre_scrub(formatted_url_hashes)
16
+ end
17
+
18
+ def pre_scrub(hashes)
19
+ hashes = hashes.map do |hsh|
20
+ if hsh[:url_f].present?
21
+ hsh[:url_exts] = extract_exts(hsh)
22
+ hsh = scrub_url_hash(hsh)
23
+ end
24
+ hsh
25
+ end
26
+ end
27
+
28
+ def merge_criteria_hashes(hashes)
29
+ hashes.map! do |url_hash|
30
+ merge_criteria_hash(url_hash)
31
+ end
32
+ end
33
+
34
+ def merge_criteria_hash(url_hash)
35
+ url_hash.merge!(
36
+ {
37
+ url_exts: [],
38
+ neg_exts: [],
39
+ pos_exts: [],
40
+ neg_paths: [],
41
+ pos_paths: [],
42
+ neg_urls: [],
43
+ pos_urls: []
44
+ }
45
+ )
46
+ end
47
+
48
+ def extract_exts(url_hash)
49
+ uri_parts = URI(url_hash[:url_f]).host&.split('.')
50
+ url_exts = uri_parts[2..-1]
51
+ end
52
+
53
+ def scrub_url_hash(url_hash)
54
+ url = url_hash[:url_f]
55
+ path = url_hash[:url_path]
56
+ href = url_hash[:href]
57
+ url_exts = url_hash[:url_exts]
58
+
59
+ url_hash = @filter.scrub_oa(url_hash, url_exts, 'neg_exts', 'equal')
60
+ url_hash = @filter.scrub_oa(url_hash, url_exts, 'pos_exts', 'equal')
61
+ url_hash = @filter.scrub_oa(url_hash, url, 'neg_urls', 'include')
62
+ url_hash = @filter.scrub_oa(url_hash, url, 'pos_urls', 'include')
63
+ url_hash = @filter.scrub_oa(url_hash, path, 'neg_paths', 'include')
64
+ url_hash = @filter.scrub_oa(url_hash, path, 'pos_paths', 'include')
65
+ url_hash
66
+ end
67
+
68
+ end
69
+
70
+ end
data/lib/scrub_db.rb CHANGED
@@ -1,5 +1,6 @@
1
1
  require "scrub_db/version"
2
- require 'scrub_db/web'
2
+ require 'scrub_db/webs'
3
+ require 'scrub_db/strings'
3
4
  require 'scrub_db/filter'
4
5
  require 'pry'
5
6
  require 'crm_formatter'
@@ -1,8 +1,8 @@
1
- # WebCriteria.new.all_web_criteria
1
+ # WebsCriteria.new.all_scrub_web_criteria
2
2
 
3
- class WebCriteria
3
+ class WebsCriteria
4
4
 
5
- def self.all_web_criteria
5
+ def self.all_scrub_web_criteria
6
6
  {
7
7
  neg_urls: seed_neg_urls,
8
8
  pos_urls: seed_pos_urls,
@@ -46,10 +46,10 @@ class WebCriteria
46
46
  # end
47
47
 
48
48
 
49
- # ##Rails C: StartCrm.run_webs
49
+ # ##Rails C: StartCrm.run_scrub_webs
50
50
  # def self.get_urls
51
51
  # urls = %w(approvedautosales.org autosmartfinance.com leessummitautorepair.net melodytoyota.com northeastacura.com gemmazda.com)
52
- # urls += %w(website.com website.business.site website website.fake website.fake.com website.com.fake)
52
+ # urls += %w(Scrubwebsite.com Scrubwebsite.business.site Scrubwebsite Scrubwebsite.fake Scrubwebsite.fake.com Scrubwebsite.com.fake)
53
53
  # end
54
54
 
55
55
  end
data/scrub_db.gemspec CHANGED
@@ -12,8 +12,8 @@ Gem::Specification.new do |spec|
12
12
  spec.homepage = 'https://github.com/4rlm/scrub_db'
13
13
  spec.license = "MIT"
14
14
 
15
- spec.summary = %q{Scrub data with your custom criteria. Returns detailed reporting.}
16
- spec.description = %q{Scrub data with your custom criteria. Returns detailed reporting. Rspecs coming soon.}
15
+ spec.summary = %q{Scrub your database, API data, web scraping data, and web form submissions based on your custom criteria. Allows for different criteria for different jobs. Returns detailed reporting to zero in on your data with ease, efficiency, and greater insight.}
16
+ spec.description = %q{Scrub your database, API data, web scraping data, and web form submissions based on your custom criteria. Allows for different criteria for different jobs. Returns detailed reporting to zero in on your data with ease, efficiency, and greater insight. Also lets you pre-format data before scrubbing to normalize and standardize your data sets, e.g., uniform URL patterns.}
17
17
 
18
18
  if spec.respond_to?(:metadata)
19
19
  spec.metadata['allowed_push_host'] = 'https://rubygems.org'
@@ -42,7 +42,7 @@ Gem::Specification.new do |spec|
42
42
  # spec.add_dependency "activesupport-inflector", ['~> 0.1.0']
43
43
 
44
44
  spec.add_dependency "utf8_sanitizer", "~> 2.0"
45
- spec.add_dependency "crm_formatter", "~> 2.4"
45
+ spec.add_dependency "crm_formatter", "~> 2.6"
46
46
 
47
47
  spec.add_development_dependency 'bundler', '~> 1.16', '>= 1.16.2'
48
48
  spec.add_development_dependency 'byebug', '~> 10.0', '>= 10.0.2'
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: scrub_db
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1.pre.rc.03
4
+ version: '2.0'
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adam Booth
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2018-06-27 00:00:00.000000000 Z
11
+ date: 2018-06-29 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: activesupport
@@ -50,14 +50,14 @@ dependencies:
50
50
  requirements:
51
51
  - - "~>"
52
52
  - !ruby/object:Gem::Version
53
- version: '2.4'
53
+ version: '2.6'
54
54
  type: :runtime
55
55
  prerelease: false
56
56
  version_requirements: !ruby/object:Gem::Requirement
57
57
  requirements:
58
58
  - - "~>"
59
59
  - !ruby/object:Gem::Version
60
- version: '2.4'
60
+ version: '2.6'
61
61
  - !ruby/object:Gem::Dependency
62
62
  name: bundler
63
63
  requirement: !ruby/object:Gem::Requirement
@@ -208,8 +208,11 @@ dependencies:
208
208
  - - "~>"
209
209
  - !ruby/object:Gem::Version
210
210
  version: 0.97.4
211
- description: Scrub data with your custom criteria. Returns detailed reporting. Rspecs
212
- coming soon.
211
+ description: Scrub your database, api data, web scraping data, and web form submissions
212
+ based on your custom criteria. Allows for different criteria for different
213
+ jobs. Returns detailed reporting to zero in on your data with ease, efficiency,
214
+ and greater insight. Also lets you pre-format data before scrubbing to
215
+ normalize and standardize your data sets, e.g., uniform URL patterns
213
216
  email:
214
217
  - 4rlm@protonmail.ch
215
218
  executables: []
@@ -218,6 +221,7 @@ extra_rdoc_files: []
218
221
  files:
219
222
  - ".gitignore"
220
223
  - ".rspec"
224
+ - ".rspec_status"
221
225
  - ".travis.yml"
222
226
  - CODE_OF_CONDUCT.md
223
227
  - Gemfile
@@ -226,11 +230,13 @@ files:
226
230
  - Rakefile
227
231
  - bin/console
228
232
  - bin/setup
233
+ - junk.rb
229
234
  - lib/scrub_db.rb
230
235
  - lib/scrub_db/filter.rb
236
+ - lib/scrub_db/strings.rb
231
237
  - lib/scrub_db/version.rb
232
- - lib/scrub_db/web.rb
233
- - lib/web_criteria.rb
238
+ - lib/scrub_db/webs.rb
239
+ - lib/webs_criteria.rb
234
240
  - scrub_db.gemspec
235
241
  homepage: https://github.com/4rlm/scrub_db
236
242
  licenses:
@@ -248,13 +254,16 @@ required_ruby_version: !ruby/object:Gem::Requirement
248
254
  version: 2.5.1
249
255
  required_rubygems_version: !ruby/object:Gem::Requirement
250
256
  requirements:
251
- - - ">"
257
+ - - ">="
252
258
  - !ruby/object:Gem::Version
253
- version: 1.3.1
259
+ version: '0'
254
260
  requirements: []
255
261
  rubyforge_project:
256
262
  rubygems_version: 2.7.6
257
263
  signing_key:
258
264
  specification_version: 4
259
- summary: Scrub data with your custom criteria. Returns detailed reporting.
265
+ summary: Scrub your database, api data, web scraping data, and web form submissions
266
+ based on your custom criteria. Allows for different criteria for different
267
+ jobs. Returns detailed reporting to zero in on your data with ease, efficiency,
268
+ and greater insight.
260
269
  test_files: []
data/lib/scrub_db/web.rb DELETED
@@ -1,108 +0,0 @@
1
-
2
-
3
- module ScrubDb
4
- class Web
5
- # attr_accessor :headers, :valid_rows, :encoded_rows, :row_id, :data_hash, :defective_rows, :error_rows
6
-
7
- def initialize(criteria={})
8
- @empty_criteria = criteria&.empty?
9
- @filter = ScrubDb::Filter.new(criteria) unless @empty_criteria
10
- end
11
-
12
- def scrub_urls(urls=[])
13
- formatted_url_hashes = CrmFormatter.format_urls(urls)
14
- formatted_url_hashes = merge_criteria_hashes(formatted_url_hashes)
15
-
16
- formatted_url_hashes.map! do |url_hash|
17
- if url_hash[:web_status] != 'invalid' && url_hash[:url_f].present?
18
- url_hash[:url_exts] = extract_exts(url_hash)
19
- url_hash = scrub_url_hash(url_hash)
20
- end
21
- end
22
- end
23
-
24
- def merge_criteria_hashes(hashes)
25
- hashes.map! do |url_hash|
26
- merge_criteria_hash(url_hash)
27
- end
28
- end
29
-
30
- def merge_criteria_hash(url_hash)
31
- url_hash.merge!(
32
- {
33
- url_exts: [],
34
- neg_exts: [],
35
- pos_exts: [],
36
- neg_paths: [],
37
- pos_paths: [],
38
- neg_urls: [],
39
- pos_urls: []
40
- }
41
- )
42
- end
43
-
44
- def extract_exts(url_hash)
45
- uri_parts = URI(url_hash[:url_f]).host&.split('.')
46
- url_exts = uri_parts[2..-1]
47
- end
48
-
49
- def scrub_url_hash(url_hash)
50
- url = url_hash[:url_f]
51
- path = url_hash[:url_path]
52
- href = url_hash[:href]
53
- url_exts = url_hash[:url_exts]
54
-
55
- url_hash = @filter.scrub_oa(url_hash, url_exts, 'neg_exts', 'equal')
56
- url_hash = @filter.scrub_oa(url_hash, url_exts, 'pos_exts', 'equal')
57
- url_hash = @filter.scrub_oa(url_hash, url, 'neg_urls', 'include')
58
- url_hash = @filter.scrub_oa(url_hash, url, 'pos_urls', 'include')
59
- url_hash = @filter.scrub_oa(url_hash, path, 'neg_paths', 'include')
60
- url_hash = @filter.scrub_oa(url_hash, path, 'pos_paths', 'include')
61
- url_hash
62
- end
63
-
64
- # def remove_invalid_links(link)
65
- # link_hsh = { link: link, valid_link: nil, flags: nil }
66
- # return link_hsh unless link.present?
67
- # @neg_paths += get_symbs
68
- # flags = @neg_paths.select { |red| link&.include?(red) }
69
- # flags << "below #{2}" if link.length < 2
70
- # flags << "over #{100}" if link.length > 100
71
- # flags = flags.flatten.compact
72
- # valid_link = flags.any? ? nil : link
73
- # link_hsh[:valid_link] = valid_link
74
- # link_hsh[:flags] = flags.join(', ')
75
- # binding.pry
76
- # link_hsh
77
- # end
78
-
79
- # def remove_invalid_hrefs(href)
80
- # href_hsh = { href: href, valid_href: nil, flags: nil }
81
- # return href_hsh unless href.present?
82
- # @neg_hrefs += get_symbs
83
- # href = href.split('|').join(' ')
84
- # href = href.split('/').join(' ')
85
- # href&.gsub!('(', ' ')
86
- # href&.gsub!(')', ' ')
87
- # href&.gsub!('[', ' ')
88
- # href&.gsub!(']', ' ')
89
- # href&.gsub!(',', ' ')
90
- # href&.gsub!("'", ' ')
91
- #
92
- # flags = []
93
- # flags << "over #{100}" if href.length > 100
94
- # invalid_text = Regexp.new(/[0-9]/)
95
- # flags << invalid_text&.match(href)
96
- # href = href&.downcase
97
- # href = href&.strip
98
- #
99
- # flags << @neg_hrefs.select { |red| href&.include?(red) }
100
- # flags = flags.flatten.compact.uniq
101
- # href_hsh[:valid_href] = href unless flags.any?
102
- # href_hsh[:flags] = flags.join(', ')
103
- # href_hsh
104
- # end
105
-
106
- end
107
-
108
- end