crm_formatter 1.0.6.pre.rc.1 → 1.0.7.pre.rc.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '031873afa102ba537d240f6c86c828bd29c75d85bf5063af6f6f5b715398203f'
- data.tar.gz: a4c08972f40519a94ac636b5c5f287c68bb485bb6ae3c1126ffcc80a61dd3b2a
+ metadata.gz: c1a9e605ef7ec90b0c88e80a32cd22fa604a9cc810fe559eb19c389c72c0e494
+ data.tar.gz: 07b4436cf1d31125e44a0bd71424b7c7c21a50954faa8e14b2792f08fd7391d9
  SHA512:
- metadata.gz: 6a37c47288003b2593422fc199d6875dfb47a0104d48d53b1bb2366a59c7038229082cb17d5b1adc14b14f4227d704c07cfd6672adb5678c7b7744a0c6fb8169
- data.tar.gz: 4ca91da1f810d5d93451fd6d36d82878185fc33df5f32e7318537a4fa5367eb9cf91ac22c038261bbb38340cf2e207424ae9f68175ba7c80174a34e3f0662cbd
+ metadata.gz: 26357b48e1933ed4a421c9fc8ce1bd01a097d553e7d1810910bb7b891496f341ce456a45cff6313189adf1321e7fe95aad25bf9d4bf6471301dfa61bf4c637c3
+ data.tar.gz: bb86e8cdcc5ea5526e16d7963687e4dd058e3097bdc2b88f6b65284f5cd282ab7f6890304abf90477e1f7503c0a0023b89f48b41c994467bc0f52feffc8509b5
data/.gitignore CHANGED
@@ -11,3 +11,5 @@ crm_formatter-*.gem
  .DS_Store
  .idea/
  .txt
+ .csv
+ !extensions.csv
data/README.md CHANGED
@@ -1,14 +1,21 @@
1
- # CRMFormatter
2
1
 
3
- Reformat and Normalize CRM Contact Data, Addresses, Phones, Emails and URLs.
4
- Please note, that this gem is a rapid work in process. It is from a collection of modules currently being used on a production app, but decided to open them up. The tests have not yet been written for the gem and there will still be changes in the near future. Documentation is limited, but is coming. Here are some basic points below to help you get started.
2
+ # **CRM Formatter**
3
+
4
+ #### Reformat and Normalize CRM Contact Data, such as Addresses, Phones and URLs.
5
+
6
+ **CRM Formatter** was originally designed to curate high-volume enterprise-scale asynchronous web scraping via Nokogiri, Mechanize, and Delayed_job. Web Scraping *aka Web Harvesting / Data Mining* is notoriously unreliable *sticky* work with endless edge-cases to overcome. Accurately, yet efficiently curating such data is a constant and evolving task, and will continue to be the core functionality of **CRM Formatter**.
7
+ However, it also plays an integral role in routine app functions such as formatting, normalizing, and scrubbing existing databases, and in cleaning submitted form data before it is saved to the database via model callbacks such as `before_validation` or `before_save`.
8
+
9
+ ###### The **CRM Formatter** Gem is currently in `--pre` versioning, aka **beta mode** with frequent updates. Formal tests in the gem environment are still on the way.
10
+ However, **CRM Formatter** has been developed continuously for several years and is a reliable and integral part of a production CRM data verification app. The process of isolating the various modules into a consolidated open source gem has just recently begun, so documentation is still limited, but is frequently being added and refined.
5
11
 
6
12
  ## Getting Started
13
+ **CRM Formatter** is compatible with Rails 4.2, 5.0, 5.1, and 5.2 on Ruby 2.2 and later.
7
14
 
8
15
  In your Gemfile add:
9
16
 
10
17
  ```
11
- gem 'crm_formatter', '~> 1.0.5.pre.rc.1'
18
+ gem 'crm_formatter', '~> 1.0.6.pre.rc.1'
12
19
  ```
13
20
 
14
21
  Or to install locally:
@@ -17,42 +24,99 @@ Or to install locally:
17
24
  gem install crm_formatter --pre
18
25
  ```
19
26
 
20
- ### Prerequisites
21
-
22
- CRMFormatter is optimized for Rails 5.2 and Ruby 2.5, but has worked well in older versions too.
23
-
24
- ### Architecture
25
-
26
- CRMFormatter is the top level Module wrapper comprising of 3 Classes, Address, Phone and Web. There is also a Helpers Module for shared tasks.
27
-
28
-
29
- ### Usage
30
-
31
- CRMFormatter could be used anywhere in an app for various reasons. Most commonly from an app's helper methods, controllers or model. If you wanted to ensure all new data is properly formatted before saving, methods could be accessed from the model via before_save, callbacks, or validations. In addition to CRMFormatter formatting CRM data, it also provides detailed reports on what if anything has changed and it always preserves the original data submitted. Here are a few examples below of how it could be used.
32
-
33
- The gem can be used both en mass to clean up an entire database, or fully-integrated into an app's regular environment.
34
- They were however, designed for an app that Harvests business data for sales and marketing teams, so they work perfectly with NokoGiri and Mechanize!
35
-
36
- ** These are just examples below, not strict usage guides ...
27
+ ## Usage
28
+
29
+ ##### Usage is organized into three sections, Overview, Methods and Examples.
30
+
31
+ ### I. Overview
32
+
33
+ #### 1. Access and Integration
34
+ ##### **CRM Formatter** is simple to use and can be accessed from your app's concerns, controllers, helpers, lib, models, or services. The best location depends on the scope and size of your application and server.
35
+ * Simple form submission validations: model callback typically ideal.
36
+ * Database normalizing tasks: wrapper method in concerns, helpers, or lib typically ideal.
37
+ * Long-running processes such as web scraping or high-volume API calls (Google, LinkedIn, Twitter): lib or services might be ideal (multithreaded and asynchronous is even better).
38
+
39
+ #### 2. Hash Response
40
+ ##### Formatted data will always be returned as a hash with the following key-value pairs:
41
+ * The originally submitted data as the first pair.
42
+ * Formatted data in the remaining pairs.
43
+ * A boolean (true/false) pair indicating whether the original and formatted data differ.
44
+
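As a rough illustration of that response shape, the sketch below builds such a hash in plain Ruby. The method name `build_response` and the keys `:original`, `:formatted`, and `:changed` are hypothetical and chosen only for this example; the gem's own key names differ by class (for instance `:url_path` and `:formatted_url` for URLs).

```ruby
# Hypothetical helper, not part of crm_formatter: mimics the response shape
# described above (original value first, formatted value, then a changed flag).
def build_response(original)
  formatted = original.to_s.strip.downcase
  {
    original: original,              # the submitted data, preserved untouched
    formatted: formatted,            # the normalized version
    changed: original != formatted   # true when the two differ
  }
end

response = build_response('  HTTP://Example.com ')
puts response.inspect
```

Because the original value is always kept in the hash, a caller can decide later whether to accept the formatted value or fall back to what was submitted.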
45
+ #### 3. Optional Arguments *OA*
46
+ ##### A class can be instantiated with optional arguments *OA*.
47
+ * OA house the criteria by which you'd like to scrub your data.
48
+ * Each is either 'Pos' or 'Neg', for more accurate reporting of your scrubbing results.
49
+ * List of available Web OA is below, and each accepts data in the hash datatype, aka 'keyword-args'.
50
+ * For example, you might want to know which URLs contain 'twitter', 'facebook', or 'linkedin' either to focus on developing a list of business social media links, or perhaps you want to use such a list to better avoid such links.
51
+ * *OA is currently only available for the Web class.*
52
+ * *Address OA & Phone OA will be available in a future release.*
53
+
54
+ ### II. Methods
55
+ ##### **CRM Formatter**'s top level module is `CRMFormatter` and contains the following three classes:
56
+ 1. Address: `CRMFormatter::Address.new`
57
+ 2. Phone: `CRMFormatter::Phone.new`
58
+ 3. Web: `CRMFormatter::Web.new`
59
+
60
+ ###### Then assign the above to a variable name of your choosing.
61
+ `addr_formatter = CRMFormatter::Address.new`
62
+ `@addr_formatter = CRMFormatter::Address.new`
63
+
64
+ ###### Web accepts optional arguments *OA* as a Hash (with Key-Value pairs)
65
+ Without OA: Instantiate normally if not using OA.
66
+ `web_formatter = CRMFormatter::Web.new`
67
+
68
+ With OA: Follow the steps to use Web OA:
69
+ 1. Available Web OA and the required Key-Value naming and datatypes.
70
+ * Only list the OA K-V Pairs you're using. No need to list empty values. It's not all or nothing. These are empty to illustrate the expected datatypes.
71
+
72
+ Below is how the OA are received in the Web class at initialization.
73
+ **3. Web Examples at the very bottom has a very detailed example including how OA can be used.**
74
+ ```
75
+ def initialize(args={})
76
+ @empty_oa = args.empty?
77
+ @pos_urls = args.fetch(:pos_urls, [])
78
+ @neg_urls = args.fetch(:neg_urls, [])
79
+ @pos_links = args.fetch(:pos_links, [])
80
+ @neg_links = args.fetch(:neg_links, [])
81
+ @pos_hrefs = args.fetch(:pos_hrefs, [])
82
+ @neg_hrefs = args.fetch(:neg_hrefs, [])
83
+ @pos_exts = args.fetch(:pos_exts, [])
84
+ @neg_exts = args.fetch(:neg_exts, [])
85
+ @min_length = args.fetch(:min_length, 2)
86
+ @max_length = args.fetch(:max_length, 100)
87
+ end
88
+ ```
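The `args.fetch(key, default)` calls above mean that any OA key you leave out simply falls back to its default. Here is a minimal stand-alone sketch of that pattern, using a hypothetical `OptionsDemo` class rather than the gem's `Web` class:

```ruby
# Hypothetical class for illustration only; it mirrors the fetch-with-default
# pattern shown above but is not the gem's Web class.
class OptionsDemo
  attr_reader :empty_oa, :neg_urls, :min_length, :max_length

  def initialize(args = {})
    @empty_oa   = args.empty?                     # true when no OA were passed
    @neg_urls   = args.fetch(:neg_urls, [])       # defaults to an empty list
    @min_length = args.fetch(:min_length, 2)
    @max_length = args.fetch(:max_length, 100)
  end
end

defaults = OptionsDemo.new
custom   = OptionsDemo.new(neg_urls: %w[loan rent], max_length: 30)

puts defaults.max_length      # default is used when the key is omitted
puts custom.neg_urls.inspect  # supplied list replaces the empty default
```

Using `fetch` with a default, rather than `args[:key]`, keeps omitted keys from turning into `nil` values later in the class.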
37
89
 
38
- ### Usage by Class & Methods
39
- * The examples at the bottom of the page are rather verbose, so first, here is a list of methods available to you.
90
+ Example: Below is the syntax for how to use OA.
91
+ There are both positive and negative criteria. They work the same way and could be combined into one array if you prefer, but they are intended to let you scrub data against negative criteria and match it against positive criteria separately.
92
+ ```
93
+ oa_args = { neg_urls: %w(approv insur invest loan quick rent repair),
94
+ neg_links: %w(buy call cash cheap click gas insta),
95
+ neg_hrefs: %w(after anounc apply approved blog buy call click),
96
+ neg_exts: %w(au ca edu es gov in ru uk us),
97
+ min_length: 0,
98
+ max_length: 30
99
+ }
100
+
101
+ @web_formatter = CRMFormatter::Web.new(oa_args)
102
+ ```
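To give a feel for how positive and negative criteria might be reported, here is a small plain-Ruby sketch. The helper `matches_for` is hypothetical and is not the gem's internal comparison method; it only imitates the "list_name: term, term" strings that appear in the `:neg` and `:pos` arrays of the sample results in this README.

```ruby
# Hypothetical helper: reports which criteria appear in a URL, in a format
# similar to the "neg_urls: a, b" strings shown in this README's results.
def matches_for(url, list_name, criteria)
  hits = criteria.select { |term| url.include?(term) }
  hits.empty? ? nil : "#{list_name}: #{hits.sort.join(', ')}"
end

url = 'https://www.example-chryslerjeep-service.com'
puts matches_for(url, 'pos_urls', %w[jeep chrysler ford])
puts matches_for(url, 'neg_urls', %w[service loan])
```

Keeping the two lists separate lets one URL report both a positive and a negative match, which a single combined list could not distinguish.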
40
103
 
41
- # Address
104
+ #### 1. Address Methods
42
105
 
43
- * 'get_full_address' takes a hash of address parts then runs each through their respective formatters, then also adds an additional feature of combining them into a long full address string, and indicates if there were any changes from the original version and newly formatted.
106
+ `get_full_address()` takes a hash of address parts, runs each part through its respective formatter, combines the results into a single full address string, and indicates whether the formatted version differs from the original.
44
107
 
45
108
  ```
46
109
  addr_formatter = CRMFormatter::Address.new
47
110
 
48
- addr_formatter.get_full_address(full_address_hsh)
49
- full_address_hsh = {street: street, city: city, state: state, zip: zip}
111
+ full_address_hash = {street: street, city: city, state: state, zip: zip}
50
112
 
51
- addr_formatter.format_street(street)
113
+ addr_formatter.get_full_address(full_address_hash)
52
114
 
53
- addr_formatter.format_city(city)
115
+ addr_formatter.format_street(street_string)
54
116
 
55
- addr_formatter.format_state(state)
117
+ addr_formatter.format_city(city_string)
118
+
119
+ addr_formatter.format_state(state_string)
56
120
 
57
121
  addr_formatter.format_zip(zip)
58
122
 
@@ -60,12 +124,11 @@ They were however, designed for an app that Harvests business data for sales and
60
124
 
61
125
  addr_formatter.compare_versions(original, formatted)
62
126
 
63
-
64
127
  ```
65
128
 
66
- # Phone
129
+ #### Phone Methods
67
130
 
68
- * Subtle but important distinction between 'format_phone' which simply puts a phone in any format, like 555-123-4567 into normalized (555) 123-4567, and 'validate_phone' which also uses 'format_phone' to normalize its output, but is mainly tasked with determining if the phone number seem legitimate. If you know for sure that it is a phone number, but just want to normalize then first try format_phone. If you are doing web scraping or throwing in strings of text mixed with phones, then validate_phone might work better.
131
+ There is a subtle but important distinction between `format_phone` and `validate_phone`. `format_phone` simply normalizes a phone number in any format, such as 555-123-4567, into (555) 123-4567. `validate_phone` also uses `format_phone` to normalize its output, but is mainly tasked with determining whether the phone number seems legitimate. If you know for sure that the value is a phone number and just want to normalize it, try `format_phone` first. If you are web scraping or passing in strings of text mixed with phones, `validate_phone` might work better.
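The difference between normalizing a known phone number and pulling one out of mixed text can be sketched in plain Ruby. The two methods below, `normalize_phone` and `extract_phone`, are hypothetical stand-ins assuming 10-digit US numbers; the gem's own `format_phone` and `validate_phone` may behave differently.

```ruby
# Hypothetical stand-ins for illustration; not the gem's implementation.

# Normalizes a string assumed to be a 10-digit US phone number.
def normalize_phone(phone)
  digits = phone.to_s.gsub(/\D/, '')
  return nil unless digits.length == 10

  "(#{digits[0, 3]}) #{digits[3, 3]}-#{digits[6, 4]}"
end

# Pulls the first phone-like pattern out of mixed text, then normalizes it.
def extract_phone(text)
  match = text.to_s[/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/]
  match && normalize_phone(match)
end

puts normalize_phone('555-123-4567')
puts extract_phone('Call us today at 555.123.4567 for details')
```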
69
132
 
70
133
  ```
71
134
  ph_formatter = CRMFormatter::Phone.new
@@ -76,9 +139,7 @@ They were however, designed for an app that Harvests business data for sales and
76
139
 
77
140
  ```
78
141
 
79
- # Web
80
-
81
-
142
+ #### Web Methods
82
143
 
83
144
  ```
84
145
  web_formatter = CRMFormatter::Web.new
@@ -95,22 +156,12 @@ They were however, designed for an app that Harvests business data for sales and
95
156
 
96
157
  ```
97
158
 
159
+ ### III. Examples
160
+ Some of the examples are excessively verbose to help illustrate the datatypes and processes. Here are a few guidelines and tips:
161
+ **3. Web Examples at the very bottom is the most detailed and recent. It might be a good place to start.**
162
+ *These are just examples below, not strict usage guides ...*
98
163
 
99
-
100
-
101
-
102
-
103
-
104
-
105
-
106
-
107
- * Data will always be returned as hashes, with your original, modified, and details about what has changed.
108
- * You may pass optional arguments at initialization to provide lists of data to match against, for example a list of words that if in a URL, it would automatically report as junk (but still keeping your original in tact.)
109
-
110
- Typically, you will want to create your own method as a wrapper for the Gem methods, like below...
111
-
112
- ## Address
113
- ** The examples below are rather verbose, so you can make them much more compact of course.
164
+ #### 1. Address Examples
114
165
 
115
166
  ```
116
167
  def self.run_adrs
@@ -132,9 +183,7 @@ end
132
183
 
133
184
  ```
134
185
 
135
-
136
-
137
- ## Phone
186
+ #### 2. Phone Examples
138
187
 
139
188
  In the phone example, format_all_phone_in_my_db could be a custom wrapper method which, when called from the Rails console or from a front-end GUI process, could grab all phones in the db meeting certain criteria to be scrubbed. The results will always be in hash format, such as `phone_hash` below.
140
189
 
@@ -153,140 +202,99 @@ end
153
202
  phone_hash = { phone: "555-123-4567", valid_phone: "(555) 123-4567", phone_edit: true }
154
203
  ```
155
204
 
205
+ #### 3. Web Examples
156
206
 
157
- ## Web
158
-
159
- In the example below, you might write a wrapper method named anything you like, such as 'format_a_url' and 'clean_many_websites' to pass in urls to be formatted.
207
+ The steps below will show you an option for how you could integrate larger processes in your app.
208
+ 1. Create a wrapper method you can call from an action or Rails C. In this example, a new class was also created in Lib for that purpose, as there could be related methods to create.
209
+ * These examples only use the `CRMFormatter::Web.new.format_url(url)` method. Several additional methods are available. Documentation is on the way, but in the meantime, try out the example below, then experiment with the others too.
160
210
 
161
211
  ```
162
- @web = CRMFormatter::Web.new
212
+ # /app/lib/start_crm.rb
163
213
 
164
- def format_a_url(url)
165
- hsh = @web.format_url(url)
166
- end
214
+ class StartCrm
167
215
 
168
- def clean_many_websites(array_of_urls)
216
+ ##Rails C: StartCrm.run_webs
217
+ def self.run_webs
218
+ oa_args = get_args
219
+ web = CRMFormatter::Web.new(oa_args)
169
220
 
170
- hashes = array_of_urls.map do |url|
171
- @web.format_url(url)
172
- end
221
+ formatted_url_hashes = get_urls.map do |url|
222
+ url_hash = web.format_url(url)
223
+ end
173
224
 
225
+ formatted_url_hashes
174
226
  end
175
- ```
176
-
177
-
178
- ### Additional Usage Examples
179
-
180
- ### Webs
181
- * Another example below.
182
227
 
228
+ end
183
229
  ```
184
- def self.run_webs
185
-
186
- url_flags = %w(approv avis budget business collis eat enterprise facebook financ food google gourmet hertz hotel hyatt insur invest loan lube mobility motel motorola parts quick rent repair restaur rv ryder service softwar travel twitter webhost yellowpages yelp youtube)
187
-
188
- link_flags = %w(: .biz .co .edu .gov .jpg .net // anounc book business buy bye call cash cheap click collis cont distrib download drop event face feature feed financ find fleet form gas generat graphic hello home hospi hour hours http info insta)
189
-
190
- href_flags = %w(? .com .jpg @ * after anounc apply approved blog book business buy call care career cash charit cheap check click)
230
+ 2. Make sure to modify your application config file to recognize your new class.
191
231
 
192
- extension_flags = %w(au ca edu es gov in ru uk us)
193
-
194
- args = { url_flags: url_flags, link_flags: link_flags, href_flags: href_flags, extension_flags: extension_flags }
195
- web = CRMFormatter::Web.new(args)
232
+ ```
233
+ #/app/config/application.rb
196
234
 
197
- urls = Accounts.where.not(url: nil).pluck(:url)
235
+ config.eager_load_paths << Rails.root.join('lib/**')
236
+ config.eager_load_paths += Dir["#{config.root}/lib/**/"]
237
+ ```
238
+ 3. Create your db query or put together a list of URLs to process, along with any OA to include. The below example is very verbose, but designed to be helpful. In reality, you might have various criteria saved in the db rather than writing it out.
239
+ In this example, we have auto dealer URLs. In this process, we're focusing on franchise dealers.
240
+ ```
241
+ def self.get_args
242
+ neg_urls = %w(approv avis budget collis eat enterprise facebook financ food google gourmet hertz hotel hyatt insur invest loan lube mobility motel motorola parts quick rent repair restaur rv ryder service softwar travel twitter webhost yellowpages yelp youtube)
198
243
 
199
- validated_url_hashes = urls.map { |url| web.format_url(url) }
200
- valid_urls = validated_url_hashes.map { |hsh| hsh[:valid_url] }.compact
201
- extracted_link_hashes = urls.map { |url| web.extract_link(url) }
202
- links = extracted_link_hashes.map { |hsh| hsh[:link] }.compact
203
- validated_link_hashes = links.map { |link| web.remove_invalid_links(link) }
204
- hrefs = ["Hot Inventory", "Join Our Sale!", "Don't Wait till Later", "Apply Today!", "No Cash Down!"]
205
- validated_href_hashes = hrefs.map { |href| web.remove_invalid_hrefs(href) }
244
+ pos_urls = ["acura", "alfa romeo", "aston martin", "audi", "bmw", "bentley", "bugatti", "buick", "cdjr", "cadillac", "chevrolet", "chrysler", "dodge", "ferrari", "fiat", "ford", "gmc", "group", "group", "honda", "hummer", "hyundai", "infiniti", "isuzu", "jaguar", "jeep", "kia", "lamborghini", "lexus", "lincoln", "lotus", "mini", "maserati", "mazda", "mclaren", "mercedes-benz", "mitsubishi", "nissan", "porsche", "ram", "rolls-royce", "saab", "scion", "smart", "subaru", "suzuki", "toyota", "volkswagen", "volvo"]
206
245
 
246
+ neg_exts = %w(au ca edu es gov in ru uk us)
247
+ oa_args = {neg_urls: neg_urls, pos_urls: pos_urls, neg_exts: neg_exts}
207
248
  end
208
249
 
250
+ def self.get_urls
251
+ urls = ["https://www.stevXXXXXXmitsubishiserviceandpartscenter.com", "https://www.perXXXXXXchryslerjeepcenterville.com", "http://www.peXXXXXXchryslerjeepcenterville.com", "http://www.colXXXXXXchryslerdodgejeepram.com"]
252
+ end
209
253
  ```
254
+ 4. Run your class and wrapper method in Rails C. By creating the wrapper method, you have set up the entire process to run like a runner. In reality, you might have several different criteria accessible from a GUI or even running in Cron Jobs.
210
255
 
211
- ### Fully Integrated into an App Example
212
- ** The gem is currently being used within another app in the following way...
256
+ `2.5.1 :001 > StartCrm.run_webs`
213
257
 
214
- ```
215
- @web_formatter.convert_to_scheme_host(url)
216
- @web_formatter.format_url(url)
217
- ```
218
- The two methods above, which are many available to you in the gem are being used below.
219
- ```
220
- curl_result[:response_code] = result&.response_code.to_s
221
- web_hsh = @web_formatter.format_url(result&.last_effective_url)
258
+ 5. Results are always returned in a Hash, like below. The URLs are slightly obfuscated out of respect (it's not a bug). These examples come from a large DB that runs on a loop 24/7 and reaches each organization about once a week, so the data is already fairly up to date and there aren't any big changes below. There are still a few things to point out.
222
259
 
223
- if web_hsh[:formatted_url].present?
224
- curl_result[:verified_url] = @web_formatter.convert_to_scheme_host(web_hsh[:formatted_url])
225
- end
260
+ * `:is_reformatted` indicates (true/false) whether `:url_path` and `:formatted_url` differ. If false, they are either the same, or `:url_path` had significant errors that prevented it from being formatted, in which case `:formatted_url` will be nil. In reality, some URLs are so far off that they can't be reliably reformatted, so it is better to let them pass only when we are confident they are reliable.
226
261
 
227
- ```
262
+ * `:url_path` is the url originally submitted by the client. It can include directory paths on the end too, such as '/careers/' or '/about-us/'.
228
263
 
229
- # Above is an isolated sliver of the larger environment shown below...
264
+ * `:formatted_url` is the formatted version of `:url_path`. It will be stripped of additional paths, such as '/deals/' or '/staff/'. Also, people often omit 'http://' and 'www' in CRMs, which can cause errors for users or Mechanize web scrapers, so those will always be included to ensure consistency. In our production app we follow up the formatting with URL redirect following, and our configuration requires the entire path, so it will always be included. The redirect-following gem is already being worked on and will be released as an additional gem shortly.
230
265
 
231
- ```
232
- def start_curl(url, timeout)
233
-
234
- curl_result = { verified_url: nil, response_code: nil, curl_err: nil }
235
- if url.present?
236
- result = nil
237
-
238
- begin # Curl Exception Handling
239
- begin # Timeout Exception Handling
240
- Timeout.timeout(timeout) do
241
- puts "\n\n=== WAITING FOR CURL RESPONSE ==="
242
- result = Curl::Easy.perform(url) do |curl|
243
- curl.follow_location = true
244
- curl.useragent = "curb"
245
- curl.connect_timeout = timeout
246
- curl.enable_cookies = true
247
- curl.head = true #testing - new
248
- end # result
249
-
250
- curl_result[:response_code] = result&.response_code.to_s
251
- web_hsh = @web_formatter.format_url(result&.last_effective_url)
252
-
253
- if web_hsh[:formatted_url].present?
254
- curl_result[:verified_url] = @web_formatter.convert_to_scheme_host(web_hsh[:formatted_url])
255
- end
256
- end
257
-
258
- rescue Timeout::Error # Timeout Exception Handling
259
- curl_result[:curl_err] = "Error: Timeout"
260
- end
261
-
262
- rescue LoadError => e # Curl Exception Handling
263
- curl_err = error_parser("Error: #{$!.message}")
264
- # CheckInt.new.check_int if curl_err.include?('TCP')
265
- curl_result[:curl_err] = curl_err
266
- end
267
- else ## If no url present?
268
- curl_result[:curl_err] = 'URL Nil'
269
- end
266
+ * `:neg` is an array of all the errors and negative, undesirable criteria to scrub against. If you include the criteria in OA `neg_urls:`, like above, it will automatically scrub and report. Regardless, any errors will also be included in there. So, if the url was not ultimately formatted, there will be details regarding why in `:neg`.
270
267
 
271
- print_result(curl_result)
272
- curl_result
273
- end
274
- ```
275
-
276
-
277
- ### Another Example from a Production Environment...
268
+ * `:pos` is the opposite, which highlights positive criteria you might be looking for. It too is available in OA via `pos_urls:`, like above.
278
269
 
279
270
  ```
280
- def format_url(url)
281
- url_hash = @web_formatter.format_url(url)
282
- url_hash.merge!({ verified_url: nil, url_redirected: false, response_code: nil, url_sts: nil, url_date: Time.now, wx_date: nil, timeout: nil })
283
- url_hash = evaluate_formatted_url(url_hash)
284
- end
285
-
271
+ [ {:is_reformatted=>false,
272
+ :url_path=>"https://www.steXXXXXXmitsubishiserviceandpartscenter.com",
273
+ :formatted_url=>"https://www.steXXXXXXmitsubishiserviceandpartscenter.com",
274
+ :neg=>["neg_urls: parts, rv, service"],
275
+ :pos=>["pos_urls: mitsubishi"]},
276
+
277
+ {:is_reformatted=>false,
278
+ :url_path=>"https://www.perXXXXXXchryslerjeepcenterville.com",
279
+ :formatted_url=>"https://www.perXXXXXXchryslerjeepcenterville.com",
280
+ :neg=>["neg_urls: rv"],
281
+ :pos=>["pos_urls: chrysler, jeep"]},
282
+
283
+ {:is_reformatted=>false,
284
+ :url_path=>"http://www.pXXXXXXchryslerjeepcenterville.com",
285
+ :formatted_url=>"http://www.XXXXXXechryslerjeepcenterville.com",
286
+ :neg=>["neg_urls: rv"],
287
+ :pos=>["pos_urls: chrysler, jeep"]},
288
+
289
+ {:is_reformatted=>false,
290
+ :url_path=>"http://www.colXXXXXXchryslerdodgejeepram.com",
291
+ :formatted_url=>"http://www.colXXXXXXchryslerdodgejeepram.com",
292
+ :neg=>["neg_urls: rv"],
293
+ :pos=>["pos_urls: chrysler, dodge, jeep, ram"]}
294
+ ]
286
295
  ```
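Once you have an array of result hashes like the one above, a common next step is to separate the URLs that were formatted from those that were rejected. The sketch below uses the same key layout (`:url_path`, `:formatted_url`, `:neg`, `:pos`), but the data is made up for illustration and no gem methods are called.

```ruby
# Illustrative data only: these hashes follow the key layout described above,
# but the URLs and values are invented for this sketch.
results = [
  { is_reformatted: true, url_path: 'example.com/about-us/',
    formatted_url: 'http://www.example.com', neg: [], pos: ['pos_urls: example'] },
  { is_reformatted: false, url_path: 'not a url',
    formatted_url: nil, neg: ['error: syntax'], pos: [] }
]

# Split into URLs that were formatted and those rejected with an error.
formatted, rejected = results.partition { |hash| hash[:formatted_url] }

formatted.each { |hash| puts "OK:  #{hash[:formatted_url]}" }
rejected.each  { |hash| puts "BAD: #{hash[:url_path]} (#{hash[:neg].join('; ')})" }
```

Checking `:formatted_url` for nil, rather than `:is_reformatted`, is the safer test here, since `:is_reformatted` is also false for URLs that were already well formed.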
287
296
 
288
297
 
289
-
290
298
  ## Author
291
299
 
292
300
  Adam J Booth - [4rlm](https://github.com/4rlm)
@@ -1,3 +1,32 @@
1
+
2
+
3
+ ========= Quick Reference =========
4
+ * delete current gem build record.
5
+ * up count gemspec version
6
+ $ gem build crm_formatter.gemspec
7
+ * Uninstall Locally
8
+ $ gem list
9
+ $ gem uninstall crm_formatter
10
+ $ gem install crm_formatter-1.0.6.pre.rc.1.gem
11
+ * Test Locally
12
+ * Edit Gem Tester App Gemfile
13
+
14
+ * git add / commit / push
15
+ $ gem push crm_formatter-1.0.6.pre.rc.1.gem
16
+
17
+
18
+ ========= TO DO =========
19
+ 1. Red Flags: Change to 'OA'
20
+ 2. Change Meth from 'remove' to 'compare_oa'
21
+ 3. Zip can be integer?
22
+ 4. Copy formatted results for README.
23
+ 5. Continue and finish README.
24
+ 6. Test Changes locally before pushing.
25
+ 7. Push.
26
+ =========================
27
+
28
+
29
+
1
30
  ==== FIRST CREATE APP TO TEST GEM ====
2
31
  -----------------------------------
3
32
  rails new gem_tester
@@ -38,9 +67,9 @@ $ gem build crm_formatter.gemspec
38
67
  Successfully built RubyGem
39
68
  Name: crm_formatter
40
69
  Version: 1.0.5.pre.rc.1
41
- File: crm_formatter-1.0.5.pre.rc.1.gem
70
+ File: crm_formatter-1.0.6.pre.rc.1.gem
42
71
  -----------------------------------
43
- $ gem install crm_formatter-1.0.5.pre.rc.1.gem
72
+ $ gem install crm_formatter-1.0.6.pre.rc.1.gem
44
73
  Successfully installed crm_formatter-1.0.5.pre.rc.1
45
74
  Parsing documentation for crm_formatter-1.0.5.pre.rc.1
46
75
  Installing ri documentation for crm_formatter-1.0.5.pre.rc.1
@@ -56,7 +85,7 @@ $ irb
56
85
  $ curl -u adamjbooth https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials; chmod 0600 ~/.gem/credentials
57
86
  Enter host password for user 'adamjbooth': RG<usual>rg
58
87
  -----------------------------------
59
- $ gem push crm_formatter-1.0.5.pre.rc.1.gem
88
+ $ gem push crm_formatter-1.0.6.pre.rc.1.gem
60
89
  Pushing gem to https://rubygems.org...
61
90
  Successfully registered gem: crm_formatter (1.0.4.pre.rc.1)
62
91
  -----------------------------------
@@ -0,0 +1 @@
1
+ aaa
@@ -1,3 +1,3 @@
  module CRMFormatter
- VERSION = "1.0.6-rc.1"
+ VERSION = "1.0.7-rc.1"
  end
@@ -1,171 +1,309 @@
1
+ require 'csv'
2
+
1
3
  module CRMFormatter
2
4
  class Web
3
5
 
4
- def initialize(args={})
5
- @url_flags = args.fetch(:url_flags, [])
6
- @link_flags = args.fetch(:link_flags, [])
7
- @href_flags = args.fetch(:href_flags, [])
8
- @extension_flags = args.fetch(:extension_flags, [])
9
- @length_min = args.fetch(:length_min, 2)
10
- @length_max = args.fetch(:length_max, 100)
6
+ def initialize(args={})
7
+ @empty_oa = args.empty?
8
+ @pos_urls = args.fetch(:pos_urls, [])
9
+ @neg_urls = args.fetch(:neg_urls, [])
10
+ @pos_links = args.fetch(:pos_links, [])
11
+ @neg_links = args.fetch(:neg_links, [])
12
+ @pos_hrefs = args.fetch(:pos_hrefs, [])
13
+ @neg_hrefs = args.fetch(:neg_hrefs, [])
14
+ @pos_exts = args.fetch(:pos_exts, [])
15
+ @neg_exts = args.fetch(:neg_exts, [])
16
+ @min_length = args.fetch(:min_length, 2)
17
+ @max_length = args.fetch(:max_length, 100)
18
+ end
19
+
20
+ def banned_symbols
21
+ banned_symbols = ["!", "$", "%", "'", "(", ")", "*", "+", ",", "<", ">", "@", "[", "]", "^", "{", "}", "~"]
22
+ end
23
+
24
+ ##Call: StartCrm.run_webs
25
+ def format_url(url)
26
+ prep_result = prep_for_uri(url)
27
+ url_hash = prep_result[:url_hash]
28
+ url = prep_result[:url]
29
+ url = nil if has_errors(url_hash)
30
+
31
+ if url.present?
32
+ uri_result = run_uri(url_hash, url)
33
+ url_hash = uri_result[:url_hash]
34
+ url = uri_result[:url]
35
+ (url = nil if has_errors(url_hash)) if url.present?
36
+ end
37
+
38
+ url_hash[:formatted_url] = url
39
+ url_hash = check_reformatted_status(url_hash) if url.present?
40
+ url_hash
41
+ end
42
+
43
+
44
+ def check_reformatted_status(url_hash)
45
+ formatted = url_hash[:formatted_url]
46
+ if formatted.present?
47
+ url_hash[:is_reformatted] = url_hash[:url_path] != formatted
11
48
  end
49
+ url_hash
50
+ end
51
+
52
+
53
+ def has_errors(url_hash)
54
+ errors = url_hash[:neg].map { |neg| neg.include?('error') }
55
+ errors.any?
56
+ end
57
+
58
+
59
+ ##Call: StartCrm.run_webs
60
+ def prep_for_uri(url)
61
+ url_hash = { is_reformatted: false, url_path: url, formatted_url: nil, neg: [], pos: [] }
62
+ begin
63
+ url = url&.split('|')&.first
64
+ url = url&.split('\\')&.first
65
+ url&.gsub!(/\P{ASCII}/, '')
66
+ url = url&.downcase&.strip
67
+
68
+ 2.times { remove_ww3(url) } if url.present?
69
+ url = remove_slashes(url) if url.present?
70
+ url&.strip!
12
71
 
13
- def format_url(url)
14
- url_hsh = {url_path: url, formatted_url: nil, url_edit: false }
15
72
  if url.present?
16
- begin
17
- url = url&.split('|')&.first
18
- url = url&.split('\\')&.first
19
- url&.gsub!(/\P{ASCII}/, '')
20
- url = url&.downcase&.strip
21
- return url_hsh if url&.length < @length_min
73
+ url = nil if url.include?(' ')
74
+ url = url[0..-2] if url.present? && url[-1] == '/'
75
+ end
22
76
 
23
- 2.times { remove_ww3(url) } if url.present?
24
- url = remove_slashes(url) if url.present?
25
- url&.strip!
77
+ url = nil if url.present? && banned_symbols.any? {|symb| url&.include?(symb) }
26
78
 
27
- return url_hsh if !url.present? || url&.include?(' ')
28
- url = url[0..-2] if url[-1] == '/'
79
+ if url.present?
80
+ url_hash = compare_criteria(url_hash, url, 'pos_urls', 'include') if !@empty_oa
81
+ url_hash = compare_criteria(url_hash, url, 'neg_urls', 'include') if !@empty_oa
82
+ else
83
+ url_hash[:neg] << "error: syntax"
84
+ url_hash[:formatted_url] = url
85
+ end
29
86
 
30
- symbs = ['(', ')', '[', ']', '{', '}', '*', '@', '^', '$', '+', '!', '<', '>', '~', ',', "'"]
87
+ rescue Exception => e
88
+ url_hash[:neg] << "error: #{e}"
89
+ url = nil
90
+ url_hash
91
+ end
31
92
 
32
- return url_hsh if symbs.any? {|symb| url&.include?(symb) }
93
+ prep_result = { url_hash: url_hash, url: url }
94
+ end

- uri = URI(url)
- if uri.present?
- host_parts = uri.host&.split(".")

- if @extension_flags.any?
- bad_host_sts = host_parts&.map { |part| TRUE if @extension_flags.any? {|ext| part == ext } }&.compact&.first
- return url_hsh if bad_host_sts
- end
+ ##Call: StartCrm.run_webs
+ def run_uri(url_hash, url)
+ begin
+ uri = URI(url)
+ host_parts = uri.host&.split(".")

- host = uri.host
- scheme = uri.scheme
- url = "#{scheme}://#{host}" if host.present? && scheme.present?
- url = "http://#{url}" if url[0..3] != "http"
- url = url.gsub("//", "//www.") if !url.include?("www.")
+ url_hash = compare_criteria(url_hash, host_parts, 'pos_exts', 'equal') if !@empty_oa
+ url_hash = compare_criteria(url_hash, host_parts, 'neg_exts', 'equal') if !@empty_oa

- return url_hsh if @url_flags.any? { |bad_text| url&.include?(bad_text) }
+ host = uri.host
+ scheme = uri.scheme
+ url = "#{scheme}://#{host}" if host.present? && scheme.present?
+ url = "http://#{url}" if url[0..3] != "http"
+ url = url.gsub("//", "//www.") if !url.include?("www.")
+ samp_url = convert_to_scheme_host(url)

- url_hsh[:formatted_url] = convert_to_scheme_host(url) if url.present?
- url_hsh[:url_edit] = url_hsh[:formatted_url] != url_hsh[:url_path]
- end
- rescue
- return url_hsh
+ url = convert_to_scheme_host(url) if url.present?
+ url_extens_result = check_url_extens(url_hash, url)
+ url_hash = url_extens_result[:url_hash]
+ url = url_extens_result[:url]
+
+ rescue Exception => e
+ url_hash[:neg] << "error: #{e}"
+ url = nil
+ url_hash
+ end
+
+ uri_result = { url_hash: url_hash, url: url }
+ end
+
+
+ #Source: http://www.iana.org/domains/root/db
+ #Text: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
+ def check_url_extens(url_hash, url)
+ if url.present?
+ url_extens = URI(url).host&.split(".")[2..-1]
+ if url_extens.count > 1
+ file_path = "./lib/crm_formatter/extensions.csv"
+ extens_list = CSV.read(file_path).flatten
+ valid_url_extens = extens_list & url_extens
+
+ if valid_url_extens.count != 1
+ extens_str = valid_url_extens.map { |ext| ".#{ext}" }.join(', ')
+ url_hash[:neg] << "error: exts.count > 1 [#{extens_str}]"
+ url = nil
  end
  end
- url_hsh
  end

+ url_hash[:formatted_url] = url
+ url_extens_result = {url_hash: url_hash, url: url}
+ end

- ###### Supporting Methods Below #######

- #CALL: Formatter.new.remove_ww3(url)
- def remove_ww3(url)
- if url.present?
- url.split('.').map { |part| url.gsub!(part,'www') if part.scan(/ww[0-9]/).any? }
- url&.gsub!("www.www", "www")
+ ## This process, compare_criteria only runs if client OA args were passed at initialization.
+ ## Results listed in url_hash[:neg]/[:pos], and don't impact or hinder final formatted url.
+ ## Simply adds more details about user's preferences and criteria for the url are.
+
+ def compare_criteria(hash, target, list_name, include_or_equal)
+ unless @empty_oa
+ if list_name.present?
+ criteria_list = instance_variable_get("@#{list_name}")
+
+ if criteria_list.present?
+ if target.is_a?(::String)
+ tars = target.split(', ')
+ else
+ tars = target
+ end
+
+ pn_matches = tars.map do |tar|
+ if criteria_list.present?
+ if include_or_equal == 'include'
+ criteria_list.select { |el| el if tar.include?(el) }.join(', ')
+ elsif include_or_equal == 'equal'
+ criteria_list.select { |el| el if tar == el }.join(', ')
+ end
+ end
+ end
+
+ pn_match = pn_matches&.uniq&.sort&.join(', ')
+ if pn_match.present?
+ if list_name.include?('neg')
+ hash[:neg] << "#{list_name}: #{pn_match}"
+ else
+ hash[:pos] << "#{list_name}: #{pn_match}"
+ end
+ end
+ end
+
  end
  end
+
+ hash
+ end

+ ###### Supporting Methods Below #######

- # For rare cases w/ urls with mistaken double slash twice.
- def remove_slashes(url)
- if url.present? && url.include?('//')
- parts = url.split('//')
- return parts[0..1].join if parts.length > 2
- end
+ def extract_link(url_path)
+ url_hash = format_url(url_path)
+ url = url_hash[:formatted_url]
+ link = url_path
+ link_hsh = {url_path: url_path, url: url, link: nil }
+ if url.present? && link.present? && link.length > @min_length
+ url = strip_down_url(url)
+ link = strip_down_url(link)
+ link&.gsub!(url, '')
+ link = link&.split('.net')&.last
+ link = link&.split('.com')&.last
+ link = link&.split('.org')&.last
+ link = "/#{link.split("/").reject(&:empty?).join("/")}" if link.present?
+ link_hsh[:link] = link if link.present? && link.length > @min_length
+ end
+ link_hsh
+ end
+
+
+ def strip_down_url(url)
+ if url.present?
+ url = url.downcase.strip
+ url = url.gsub('www.', '')
+ url = url.split('://')
+ url = url[-1]
  return url
  end
+ end


- def extract_link(url_path)
- url_hsh = format_url(url_path)
- url = url_hsh[:formatted_url]
- link = url_path
- link_hsh = {url_path: url_path, url: url, link: nil }
- if url.present? && link.present? && link.length > @length_min
- url = strip_down_url(url)
- link = strip_down_url(link)
- link&.gsub!(url, '')
- link = link&.split('.net')&.last
- link = link&.split('.com')&.last
- link = link&.split('.org')&.last
- link = "/#{link.split("/").reject(&:empty?).join("/")}" if link.present?
- link_hsh[:link] = link if link.present? && link.length > @length_min
- end
- link_hsh
+ def remove_invalid_links(link)
+ link_hsh = {link: link, valid_link: nil, flags: nil }
+ if link.present?
+ @neg_links += get_symbs
+ flags = @neg_links.select { |red| link&.include?(red) }
+ flags << "below #{@min_length}" if link.length < @min_length
+ flags << "over #{@max_length}" if link.length > @max_length
+ flags = flags.flatten.compact
+ flags.any? ? valid_link = nil : valid_link = link
+ link_hsh[:valid_link] = valid_link
+ link_hsh[:flags] = flags.join(', ')
  end
+ link_hsh
+ end


- def strip_down_url(url)
- if url.present?
- url = url.downcase.strip
- url = url.gsub('www.', '')
- url = url.split('://')
- url = url[-1]
- return url
- end
+ def remove_invalid_hrefs(href)
+ href_hsh = {href: href, valid_href: nil, flags: nil }
+ if href.present?
+ @neg_hrefs += get_symbs
+ href = href.split('|').join(' ')
+ href = href.split('/').join(' ')
+ href&.gsub!("(", ' ')
+ href&.gsub!(")", ' ')
+ href&.gsub!("[", ' ')
+ href&.gsub!("]", ' ')
+ href&.gsub!(",", ' ')
+ href&.gsub!("'", ' ')
+
+ flags = []
+ flags << "over #{@max_length}" if href.length > @max_length
+ invalid_text = Regexp.new(/[0-9]/)
+ flags << invalid_text&.match(href)
+ href = href&.downcase
+ href = href&.strip
+
+ flags << @neg_hrefs.select { |red| href&.include?(red) }
+ flags = flags.flatten.compact.uniq
+ href_hsh[:valid_href] = href unless flags.any?
+ href_hsh[:flags] = flags.join(', ')
  end
+ href_hsh
+ end


- def remove_invalid_links(link)
- link_hsh = {link: link, valid_link: nil, flags: nil }
- if link.present?
- invalid_chars = ['(', ')', '[', ']', '{', '}', '*', '@', '^', '$', '%', '+', '!', '<', '>', '~', ',', "'"]
- @link_flags += invalid_chars
- flags = @link_flags.select { |red| link&.include?(red) }
- flags << "below #{@length_min}" if link.length < @length_min
- flags << "over #{@length_max}" if link.length > @length_max
- flags = flags.flatten.compact
- flags.any? ? valid_link = nil : valid_link = link
- link_hsh[:valid_link] = valid_link
- link_hsh[:flags] = flags.join(', ')
- end
- link_hsh
+ def convert_to_scheme_host(url)
+ if url.present?
+ uri = URI(url)
+ scheme = uri&.scheme
+ host = uri&.host
+ url = "#{scheme}://#{host}" if (scheme.present? && host.present?)
+ return url
  end
+ end


- def remove_invalid_hrefs(href)
- href_hsh = {href: href, valid_href: nil, flags: nil }
- if href.present?
- symbs = ['{', '}', '*', '@', '^', '$', '%', '+', '!', '<', '>', '~']
- @href_flags += symbs
- href = href.split('|').join(' ')
- href = href.split('/').join(' ')
- href&.gsub!("(", ' ')
- href&.gsub!(")", ' ')
- href&.gsub!("[", ' ')
- href&.gsub!("]", ' ')
- href&.gsub!(",", ' ')
- href&.gsub!("'", ' ')
-
- flags = []
- flags << "over #{@length_max}" if href.length > @length_max
- invalid_text = Regexp.new(/[0-9]/)
- flags << invalid_text&.match(href)
- href = href&.downcase
- href = href&.strip
-
- flags << @href_flags.select { |red| href&.include?(red) }
- flags = flags.flatten.compact.uniq
- href_hsh[:valid_href] = href unless flags.any?
- href_hsh[:flags] = flags.join(', ')
- end
- href_hsh
+ #CALL: Formatter.new.remove_ww3(url)
+ def remove_ww3(url)
+ if url.present?
+ url.split('.').map { |part| url.gsub!(part,'www') if part.scan(/ww[0-9]/).any? }
+ url&.gsub!("www.www", "www")
  end
+ end


- def convert_to_scheme_host(url)
- if url.present?
- uri = URI(url)
- scheme = uri&.scheme
- host = uri&.host
- url = "#{scheme}://#{host}" if (scheme.present? && host.present?)
- return url
- end
+ # For rare cases w/ urls with mistaken double slash twice.
+ def remove_slashes(url)
+ if url.present? && url.include?('//')
+ parts = url.split('//')
+ return parts[0..1].join if parts.length > 2
  end
+ return url
+ end
+
+ ##Call: StartCrm.run_webs
+ # def get_ext_list
+ # # Source: http://www.iana.org/domains/root/db
+ # # .txt list: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
+ # file_path = "./lib/crm_formatter/extensions.csv"
+ # extensions = CSV.read(file_path)
+ # end
+

  end
  end
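
Note on the extension check added in this release: the new check_url_extens method takes the host labels that follow the first two (for example "www" and the site name), intersects them with the IANA-derived list in extensions.csv, and rejects the URL when there is more than one trailing label and the number of recognised extensions is not exactly one. The following is a minimal standalone sketch of that idea, not the gem's code. The method name check_url_extens_sketch and the SAMPLE_EXTENSIONS list are hypothetical stand-ins (the gem reads its list from the CSV file), and plain Ruby nil checks are used in place of ActiveSupport's present?.

```ruby
require 'uri'

# Hypothetical stand-in for the contents of lib/crm_formatter/extensions.csv.
SAMPLE_EXTENSIONS = %w[com net org co uk].freeze

# Sketch of the extension check, assuming url_hash carries a :neg array.
# Returns the same result shape as the diff: { url_hash:, url: }.
def check_url_extens_sketch(url_hash, url, extensions = SAMPLE_EXTENSIONS)
  host = url.nil? ? nil : URI(url).host

  # Labels after the first two; fall back to [] when the host is missing
  # or has fewer than two labels (Array#[] returns nil in that case).
  labels = host ? (host.split('.')[2..-1] || []) : []

  if labels.count > 1
    valid = extensions & labels
    if valid.count != 1
      listed = valid.map { |ext| ".#{ext}" }.join(', ')
      url_hash[:neg] << "error: exts.count > 1 [#{listed}]"
      url = nil
    end
  end

  url_hash[:formatted_url] = url
  { url_hash: url_hash, url: url }
end

ok  = check_url_extens_sketch({ neg: [], pos: [] }, 'http://www.example.com')
bad = check_url_extens_sketch({ neg: [], pos: [] }, 'http://www.example.com.net')
puts ok[:url].inspect
puts bad[:url_hash][:neg].inspect
```

The fallback to an empty array is a design choice of this sketch: it lets hosts with fewer than three labels, or URLs without a parsable host, pass through without being flagged and without raising.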
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: crm_formatter
  version: !ruby/object:Gem::Version
- version: 1.0.6.pre.rc.1
+ version: 1.0.7.pre.rc.1
  platform: ruby
  authors:
  - Adam Booth
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-05-14 00:00:00.000000000 Z
+ date: 2018-05-15 00:00:00.000000000 Z
  dependencies: []
  description: Reformat and Normalize CRM Contact Data, Addresses, Phones, Emails and
  URLs. Originally developed for proprietary use in an enterprise software suite. Recently
@@ -30,6 +30,7 @@ files:
  - gem_notes_crm_formatter.txt
  - lib/crm_formatter.rb
  - lib/crm_formatter/address.rb
+ - lib/crm_formatter/extensions.csv
  - lib/crm_formatter/helpers.rb
  - lib/crm_formatter/phone.rb
  - lib/crm_formatter/version.rb