crm_formatter 1.0.6.pre.rc.1 → 1.0.7.pre.rc.1
- checksums.yaml +4 -4
- data/.gitignore +2 -0
- data/README.md +167 -159
- data/gem_notes_crm_formatter.txt +32 -3
- data/lib/crm_formatter/extensions.csv +1 -0
- data/lib/crm_formatter/version.rb +1 -1
- data/lib/crm_formatter/web.rb +264 -126
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c1a9e605ef7ec90b0c88e80a32cd22fa604a9cc810fe559eb19c389c72c0e494
+  data.tar.gz: 07b4436cf1d31125e44a0bd71424b7c7c21a50954faa8e14b2792f08fd7391d9
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 26357b48e1933ed4a421c9fc8ce1bd01a097d553e7d1810910bb7b891496f341ce456a45cff6313189adf1321e7fe95aad25bf9d4bf6471301dfa61bf4c637c3
+  data.tar.gz: bb86e8cdcc5ea5526e16d7963687e4dd058e3097bdc2b88f6b65284f5cd282ab7f6890304abf90477e1f7503c0a0023b89f48b41c994467bc0f52feffc8509b5
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,21 @@
|
|
1
|
-
# CRMFormatter
|
2
1
|
|
3
|
-
|
4
|
-
|
2
|
+
# **CRM Formatter**
|
3
|
+
|
4
|
+
#### Reformat and Normalize CRM Contact Data, such as Addresses, Phones and URLs.
|
5
|
+
|
6
|
+
**CRM Formatter** was originally designed to curate high-volume, enterprise-scale asynchronous web scraping via Nokogiri, Mechanize, and Delayed_job. Web scraping (*aka web harvesting / data mining*) is notoriously unreliable, *sticky* work with endless edge cases to overcome. Curating such data accurately and efficiently is a constant, evolving task, and it will continue to be the core functionality of **CRM Formatter**.
|
7
|
+
It also plays an integral role in routine app functions, such as formatting, normalizing, and even scrubbing existing databases and submitted form data before saving to the database, via model callbacks such as `before_validation` or `before_save`.
|
8
|
+
|
9
|
+
###### The **CRM Formatter** Gem is currently in `--pre` versioning, aka **beta mode** with frequent updates. Formal tests in the gem environment are still on the way.
|
10
|
+
However, **CRM Formatter** has been developed continuously for several years and is a reliable and integral part of a production CRM data verification app. The process of isolating the various modules into a consolidated open source gem has just recently begun, so documentation is still limited, but is frequently being added and refined.
|
5
11
|
|
6
12
|
## Getting Started
|
13
|
+
**CRM Formatter** is compatible with Rails 4.2, 5.0, 5.1, and 5.2 on Ruby 2.2 and later.
|
7
14
|
|
8
15
|
In your Gemfile add:
|
9
16
|
|
10
17
|
```
|
11
|
-
gem 'crm_formatter', '~> 1.0.
|
18
|
+
gem 'crm_formatter', '~> 1.0.6.pre.rc.1'
|
12
19
|
```
|
13
20
|
|
14
21
|
Or to install locally:
|
@@ -17,42 +24,99 @@ Or to install locally:
|
|
17
24
|
gem install crm_formatter --pre
|
18
25
|
```
|
19
26
|
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
###
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
27
|
+
## Usage
|
28
|
+
|
29
|
+
##### Usage is organized into three sections, Overview, Methods and Examples.
|
30
|
+
|
31
|
+
### I. Overview
|
32
|
+
|
33
|
+
#### 1. Access and Integration
|
34
|
+
##### Using **CRM Formatter** in your app is simple. It can be accessed from your app's concerns, controllers, helpers, lib, models, or services; the best place depends on the scope, location, and size of your application and server.
|
35
|
+
* Simple form submission validations: model callback typically ideal.
|
36
|
+
* Database normalizing tasks: wrapper method in concerns, helpers, or lib typically ideal.
|
37
|
+
* Long-running processes like web scraping or high-volume API calls (Google, LinkedIn, or Twitter): lib or services might be ideal (multithreaded and asynchronous is even better).
|
38
|
+
|
39
|
+
#### 2. Hash Response
|
40
|
+
##### Formatted data will always be returned as a hash with the following key-value pairs:
|
41
|
+
* The originally submitted data as the first pair.
|
42
|
+
* Formatted data in the remaining pairs.
|
43
|
+
* A boolean (true/false) pair indicating whether the original and formatted data differ.
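
As a rough illustration of that response shape, here is a minimal sketch. `build_response` and its key names (`:original`, `:formatted`, `:changed`) are hypothetical and only mirror the three points above; the gem's classes use their own key names.

```ruby
# Hypothetical helper, not part of the gem: key names are illustrative only.
def build_response(original, formatted)
  {
    original: original,             # the originally submitted data comes first
    formatted: formatted,           # formatted data follows
    changed: original != formatted  # boolean: do the two versions differ?
  }
end

response = build_response('555-123-4567', '(555) 123-4567')
puts response.inspect
```

Because the original value is always kept in the hash, a caller can decide whether to save the formatted value or keep what was submitted.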
|
44
|
+
|
45
|
+
#### 3. Optional Arguments *OA*
|
46
|
+
##### A class can be instantiated with optional arguments *OA*.
|
47
|
+
* OA house the criteria by which you'd like to scrub your data.
|
48
|
+
* Each is either 'Pos' or 'Neg', for more accurate reporting of your scrubbing results.
|
49
|
+
* List of available Web OA is below, and each accepts data in the hash datatype, aka 'keyword-args'.
|
50
|
+
* For example, you might want to know which URLs contain 'twitter', 'facebook', or 'linkedin' either to focus on developing a list of business social media links, or perhaps you want to use such a list to better avoid such links.
|
51
|
+
* *OA is currently only available for the Web class.*
|
52
|
+
* *Address OA & Phone OA will be available in a future release.*
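
To make the 'Pos' / 'Neg' idea concrete, here is a minimal sketch of substring matching against both lists. `match_criteria` and the sample URL are assumptions for illustration; this is not the gem's API, only the general technique of checking which listed terms a URL contains.

```ruby
# Hypothetical helper, not part of the gem: reports which positive and
# negative terms appear anywhere in the URL string.
def match_criteria(url, pos_urls: [], neg_urls: [])
  {
    pos: pos_urls.select { |term| url.include?(term) }.sort,
    neg: neg_urls.select { |term| url.include?(term) }.sort
  }
end

matches = match_criteria(
  'http://www.examplejeepcenterville.com', # made-up URL
  pos_urls: %w(chrysler jeep),
  neg_urls: %w(rv parts)
)
puts matches.inspect
```

Note that plain substring matching is broad: 'rv' matches inside 'centerville', so short terms may flag more URLs than intended.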
|
53
|
+
|
54
|
+
### II. Methods
|
55
|
+
##### **CRM Formatter**'s top-level module is `CRMFormatter`, which contains the following three classes:
|
56
|
+
1. Address: `CRMFormatter::Address.new`
|
57
|
+
2. Phone: `CRMFormatter::Phone.new`
|
58
|
+
3. Web: `CRMFormatter::Web.new`
|
59
|
+
|
60
|
+
###### Then assign the above to a variable name of your choosing.
|
61
|
+
`addr_formatter = CRMFormatter::Address.new`
|
62
|
+
`@addr_formatter = CRMFormatter::Address.new`
|
63
|
+
|
64
|
+
###### Web accepts optional arguments *OA* as a Hash (with Key-Value pairs)
|
65
|
+
Without OA: Instantiate normally if not using OA.
|
66
|
+
`web_formatter = CRMFormatter::Web.new`
|
67
|
+
|
68
|
+
With OA: Follow the steps to use Web OA:
|
69
|
+
1. Available Web OA and the required Key-Value naming and datatypes.
|
70
|
+
* Only list the OA K-V Pairs you're using. No need to list empty values. It's not all or nothing. These are empty to illustrate the expected datatypes.
|
71
|
+
|
72
|
+
Below is how the OA are received in the Web class at initialization.
|
73
|
+
**3. Web Examples at the very bottom has a very detailed example including how OA can be used.**
|
74
|
+
```
|
75
|
+
def initialize(args={})
|
76
|
+
@empty_oa = args.empty?
|
77
|
+
@pos_urls = args.fetch(:pos_urls, [])
|
78
|
+
@neg_urls = args.fetch(:neg_urls, [])
|
79
|
+
@pos_links = args.fetch(:pos_links, [])
|
80
|
+
@neg_links = args.fetch(:neg_links, [])
|
81
|
+
@pos_hrefs = args.fetch(:pos_hrefs, [])
|
82
|
+
@neg_hrefs = args.fetch(:neg_hrefs, [])
|
83
|
+
@pos_exts = args.fetch(:pos_exts, [])
|
84
|
+
@neg_exts = args.fetch(:neg_exts, [])
|
85
|
+
@min_length = args.fetch(:min_length, 2)
|
86
|
+
@max_length = args.fetch(:max_length, 100)
|
87
|
+
end
|
88
|
+
```
|
37
89
|
|
38
|
-
|
39
|
-
|
90
|
+
Example: Below is the syntax for how to use OA.
|
91
|
+
OA come in both positive and negative forms. They work the same way and could be combined in a single array if you prefer, but they are intended to help you scrub data against negative criteria and for positive criteria.
|
92
|
+
```
|
93
|
+
oa_args = { neg_urls: %w(approv insur invest loan quick rent repair),
|
94
|
+
neg_links: %w(buy call cash cheap click gas insta),
|
95
|
+
neg_hrefs: %w(after anounc apply approved blog buy call click),
|
96
|
+
neg_exts: %w(au ca edu es gov in ru uk us),
|
97
|
+
min_length: 0,
|
98
|
+
max_length: 30
|
99
|
+
}
|
100
|
+
|
101
|
+
@web_formatter = CRMFormatter::Web.new(oa_args)
|
102
|
+
```
|
40
103
|
|
41
|
-
|
104
|
+
#### 1. Address Methods
|
42
105
|
|
43
|
-
|
106
|
+
`get_full_address()` takes a hash of address parts, runs each part through its respective formatter, combines them into a single full address string, and indicates whether the newly formatted version differs from the original.
|
44
107
|
|
45
108
|
```
|
46
109
|
addr_formatter = CRMFormatter::Address.new
|
47
110
|
|
48
|
-
|
49
|
-
full_address_hsh = {street: street, city: city, state: state, zip: zip}
|
111
|
+
full_address_hash = {street: street, city: city, state: state, zip: zip}
|
50
112
|
|
51
|
-
addr_formatter.
|
113
|
+
addr_formatter.get_full_address(full_address_hash)
|
52
114
|
|
53
|
-
addr_formatter.
|
115
|
+
addr_formatter.format_street(street_string)
|
54
116
|
|
55
|
-
addr_formatter.
|
117
|
+
addr_formatter.format_city(city_string)
|
118
|
+
|
119
|
+
addr_formatter.format_state(state_string)
|
56
120
|
|
57
121
|
addr_formatter.format_zip(zip)
|
58
122
|
|
@@ -60,12 +124,11 @@ They were however, designed for an app that Harvests business data for sales and
|
|
60
124
|
|
61
125
|
addr_formatter.compare_versions(original, formatted)
|
62
126
|
|
63
|
-
|
64
127
|
```
|
65
128
|
|
66
|
-
|
129
|
+
#### Phone Methods
|
67
130
|
|
68
|
-
|
131
|
+
There is a subtle but important distinction between `format_phone`, which simply converts a phone in any format, like 555-123-4567, into the normalized (555) 123-4567, and `validate_phone`, which also uses `format_phone` to normalize its output but is mainly tasked with determining whether the phone number seems legitimate. If you know for sure that the input is a phone number and just want to normalize it, try `format_phone` first. If you are web scraping or passing in strings of text mixed with phones, `validate_phone` might work better.
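
The difference can be sketched in plain Ruby. Both helpers below (`format_phone_sketch`, `validate_phone_sketch`) are hypothetical and assume 10-digit US-style numbers; they illustrate the format-versus-validate split and are not the gem's implementation.

```ruby
# Hypothetical: normalize input that is already known to be a phone number.
def format_phone_sketch(raw)
  digits = raw.to_s.gsub(/\D/, '')                  # keep digits only
  digits = digits[1..-1] if digits.length == 11 && digits.start_with?('1')
  return nil unless digits.length == 10             # assumes 10-digit numbers
  "(#{digits[0, 3]}) #{digits[3, 3]}-#{digits[6, 4]}"
end

# Hypothetical: look for something phone-like inside free text, then normalize it.
def validate_phone_sketch(text)
  candidate = text.to_s[/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/]
  candidate && format_phone_sketch(candidate)
end

puts format_phone_sketch('555-123-4567')
puts validate_phone_sketch('Call 555.123.4567 today')
```

The first helper trusts its input; the second has to decide whether the text contains a plausible phone at all, which is why it suits scraped or mixed text.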
|
69
132
|
|
70
133
|
```
|
71
134
|
ph_formatter = CRMFormatter::Phone.new
|
@@ -76,9 +139,7 @@ They were however, designed for an app that Harvests business data for sales and
|
|
76
139
|
|
77
140
|
```
|
78
141
|
|
79
|
-
|
80
|
-
|
81
|
-
|
142
|
+
#### Web Methods
|
82
143
|
|
83
144
|
```
|
84
145
|
web_formatter = CRMFormatter::Web.new
|
@@ -95,22 +156,12 @@ They were however, designed for an app that Harvests business data for sales and
|
|
95
156
|
|
96
157
|
```
|
97
158
|
|
159
|
+
### III. Examples
|
160
|
+
Some of the examples are excessively verbose to help illustrate the datatypes and processes. Here are a few guidelines and tips:
|
161
|
+
**3. Web Examples at the very bottom is the most detailed and recent. It might be a good place to start.**
|
162
|
+
*These are just examples below, not strict usage guides ...*
|
98
163
|
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
* Data will always be returned as hashes, with your original, modified, and details about what has changed.
|
108
|
-
* You may pass optional arguments at initialization to provide lists of data to match against, for example a list of words that if in a URL, it would automatically report as junk (but still keeping your original in tact.)
|
109
|
-
|
110
|
-
Typically, you will want to create your own method as a wrapper for the Gem methods, like below...
|
111
|
-
|
112
|
-
## Address
|
113
|
-
** The examples below are rather verbose, so you can make them much more compact of course.
|
164
|
+
#### 1. Address Examples
|
114
165
|
|
115
166
|
```
|
116
167
|
def self.run_adrs
|
@@ -132,9 +183,7 @@ end
|
|
132
183
|
|
133
184
|
```
|
134
185
|
|
135
|
-
|
136
|
-
|
137
|
-
## Phone
|
186
|
+
#### 2. Phone Examples
|
138
187
|
|
139
188
|
In the phone example, `format_all_phone_in_my_db` could be a custom wrapper method which, when called from Rails C or from a front-end GUI process, grabs all phones in the db meeting certain criteria to be scrubbed. The results will always be in hash format, such as `phone_hash` below.
|
140
189
|
|
@@ -153,140 +202,99 @@ end
|
|
153
202
|
phone_hash = { phone: '555-123-4567', valid_phone: '(555) 123-4567', phone_edit: true }
|
154
203
|
```
|
155
204
|
|
205
|
+
#### 3. Web Examples
|
156
206
|
|
157
|
-
|
158
|
-
|
159
|
-
|
207
|
+
The steps below will show you an option for how you could integrate larger processes in your app.
|
208
|
+
1. Create a wrapper method you can call from an action or Rails C. In this example, a new class was also created in Lib for that purpose, as there could be related methods to create.
|
209
|
+
* These examples only use the `CRMFormatter::Web.new.format_url(url)` method. Several additional methods are available. Documentation is on the way, but in the meantime, try out the example below, then experiment with the others too.
|
160
210
|
|
161
211
|
```
|
162
|
-
|
212
|
+
# /app/lib/start_crm.rb
|
163
213
|
|
164
|
-
|
165
|
-
hsh = @web.format_url(url)
|
166
|
-
end
|
214
|
+
class StartCrm
|
167
215
|
|
168
|
-
|
216
|
+
##Rails C: StartCrm.run_webs
|
217
|
+
def self.run_webs
|
218
|
+
oa_args = get_args
|
219
|
+
web = CRMFormatter::Web.new(oa_args)
|
169
220
|
|
170
|
-
|
171
|
-
|
172
|
-
|
221
|
+
formatted_url_hashes = get_urls.map do |url|
|
222
|
+
url_hash = web.format_url(url)
|
223
|
+
end
|
173
224
|
|
225
|
+
formatted_url_hashes
|
174
226
|
end
|
175
|
-
```
|
176
|
-
|
177
|
-
|
178
|
-
### Additional Usage Examples
|
179
|
-
|
180
|
-
### Webs
|
181
|
-
* Another example below.
|
182
227
|
|
228
|
+
end
|
183
229
|
```
|
184
|
-
|
185
|
-
|
186
|
-
url_flags = %w(approv avis budget business collis eat enterprise facebook financ food google gourmet hertz hotel hyatt insur invest loan lube mobility motel motorola parts quick rent repair restaur rv ryder service softwar travel twitter webhost yellowpages yelp youtube)
|
187
|
-
|
188
|
-
link_flags = %w(: .biz .co .edu .gov .jpg .net // anounc book business buy bye call cash cheap click collis cont distrib download drop event face feature feed financ find fleet form gas generat graphic hello home hospi hour hours http info insta)
|
189
|
-
|
190
|
-
href_flags = %w(? .com .jpg @ * after anounc apply approved blog book business buy call care career cash charit cheap check click)
|
230
|
+
2. Make sure to modify your application config file to recognize your new class.
|
191
231
|
|
192
|
-
|
193
|
-
|
194
|
-
args = { url_flags: url_flags, link_flags: link_flags, href_flags: href_flags, extension_flags: extension_flags }
|
195
|
-
web = CRMFormatter::Web.new(args)
|
232
|
+
```
|
233
|
+
#/app/config/application.rb
|
196
234
|
|
197
|
-
|
235
|
+
config.eager_load_paths << Rails.root.join('lib/**')
|
236
|
+
config.eager_load_paths += Dir["#{config.root}/lib/**/"]
|
237
|
+
```
|
238
|
+
3. Create your db query or put together a list of URLs to process, along with any OA to include. The below example is very verbose, but designed to be helpful. In reality, you might have various criteria saved in the db rather than writing it out.
|
239
|
+
In this example, we have auto dealer URLs. In this process, we're focusing on franchise dealers.
|
240
|
+
```
|
241
|
+
def self.get_args
|
242
|
+
neg_urls = %w(approv avis budget collis eat enterprise facebook financ food google gourmet hertz hotel hyatt insur invest loan lube mobility motel motorola parts quick rent repair restaur rv ryder service softwar travel twitter webhost yellowpages yelp youtube)
|
198
243
|
|
199
|
-
|
200
|
-
valid_urls = validated_url_hashes.map { |hsh| hsh[:valid_url] }.compact
|
201
|
-
extracted_link_hashes = urls.map { |url| web.extract_link(url) }
|
202
|
-
links = extracted_link_hashes.map { |hsh| hsh[:link] }.compact
|
203
|
-
validated_link_hashes = links.map { |link| web.remove_invalid_links(link) }
|
204
|
-
hrefs = ["Hot Inventory", "Join Our Sale!", "Don't Wait till Later", "Apply Today!", "No Cash Down!"]
|
205
|
-
validated_href_hashes = hrefs.map { |href| web.remove_invalid_hrefs(href) }
|
244
|
+
pos_urls = ["acura", "alfa romeo", "aston martin", "audi", "bmw", "bentley", "bugatti", "buick", "cdjr", "cadillac", "chevrolet", "chrysler", "dodge", "ferrari", "fiat", "ford", "gmc", "group", "group", "honda", "hummer", "hyundai", "infiniti", "isuzu", "jaguar", "jeep", "kia", "lamborghini", "lexus", "lincoln", "lotus", "mini", "maserati", "mazda", "mclaren", "mercedes-benz", "mitsubishi", "nissan", "porsche", "ram", "rolls-royce", "saab", "scion", "smart", "subaru", "suzuki", "toyota", "volkswagen", "volvo"]
|
206
245
|
|
246
|
+
neg_exts = %w(au ca edu es gov in ru uk us)
|
247
|
+
oa_args = {neg_urls: neg_urls, pos_urls: pos_urls, neg_exts: neg_exts}
|
207
248
|
end
|
208
249
|
|
250
|
+
def self.get_urls
|
251
|
+
urls = ["https://www.stevXXXXXXmitsubishiserviceandpartscenter.com", "https://www.perXXXXXXchryslerjeepcenterville.com", "http://www.peXXXXXXchryslerjeepcenterville.com", "http://www.colXXXXXXchryslerdodgejeepram.com"]
|
252
|
+
end
|
209
253
|
```
|
254
|
+
4. Run your class and wrapper method in Rails C. By creating the wrapper method, you have set up the entire process to run like a runner. In reality, you might have several different criteria accessible from a GUI or even running in Cron Jobs.
|
210
255
|
|
211
|
-
|
212
|
-
** The gem is currently being used within another app in the following way...
|
256
|
+
`2.5.1 :001 > StartCrm.run_webs`
|
213
257
|
|
214
|
-
|
215
|
-
@web_formatter.convert_to_scheme_host(url)
|
216
|
-
@web_formatter.format_url(url)
|
217
|
-
```
|
218
|
-
The two methods above, which are many available to you in the gem are being used below.
|
219
|
-
```
|
220
|
-
curl_result[:response_code] = result&.response_code.to_s
|
221
|
-
web_hsh = @web_formatter.format_url(result&.last_effective_url)
|
258
|
+
5. Results are always in hash format, like below. The URLs are slightly obfuscated out of respect (it's not a bug). These examples come from a large DB that runs on a loop 24/7 and reaches each organization about once a week, so the data is already fairly up to date and there are no big changes below, but there are still a few things to point out.
|
222
259
|
|
223
|
-
|
224
|
-
curl_result[:verified_url] = @web_formatter.convert_to_scheme_host(web_hsh[:formatted_url])
|
225
|
-
end
|
260
|
+
* `:is_reformatted` is a boolean indicating whether `:url_path` and `:formatted_url` differ. If false, either they are the same, or `:url_path` had significant errors that prevented it from being formatted, in which case `:formatted_url` is nil. Some URLs are so far off that they can't be reliably reformatted, so it is better to let them pass only when we are confident they are reliable.
|
226
261
|
|
227
|
-
|
262
|
+
* `:url_path` is the URL originally submitted by the client. It can include directory links on the end too, such as '/careers/' or '/about-us/'.
|
228
263
|
|
229
|
-
|
264
|
+
* `:formatted_url` is the formatted version of `:url_path`. It is stripped of additional paths such as '/deals/' or '/staff/'. People often omit 'http://' and 'www' in CRMs, which can sometimes cause errors for users or Mechanize-based web scrapers, so both are always included to ensure consistency. In our production app, formatting is followed by URL redirect following, and our configuration requires the entire path, so it will always be included. The redirect-following gem is already being worked on and will be released as an additional gem shortly.
|
230
265
|
|
231
|
-
|
232
|
-
def start_curl(url, timeout)
|
233
|
-
|
234
|
-
curl_result = { verified_url: nil, response_code: nil, curl_err: nil }
|
235
|
-
if url.present?
|
236
|
-
result = nil
|
237
|
-
|
238
|
-
begin # Curl Exception Handling
|
239
|
-
begin # Timeout Exception Handling
|
240
|
-
Timeout.timeout(timeout) do
|
241
|
-
puts "\n\n=== WAITING FOR CURL RESPONSE ==="
|
242
|
-
result = Curl::Easy.perform(url) do |curl|
|
243
|
-
curl.follow_location = true
|
244
|
-
curl.useragent = "curb"
|
245
|
-
curl.connect_timeout = timeout
|
246
|
-
curl.enable_cookies = true
|
247
|
-
curl.head = true #testing - new
|
248
|
-
end # result
|
249
|
-
|
250
|
-
curl_result[:response_code] = result&.response_code.to_s
|
251
|
-
web_hsh = @web_formatter.format_url(result&.last_effective_url)
|
252
|
-
|
253
|
-
if web_hsh[:formatted_url].present?
|
254
|
-
curl_result[:verified_url] = @web_formatter.convert_to_scheme_host(web_hsh[:formatted_url])
|
255
|
-
end
|
256
|
-
end
|
257
|
-
|
258
|
-
rescue Timeout::Error # Timeout Exception Handling
|
259
|
-
curl_result[:curl_err] = "Error: Timeout"
|
260
|
-
end
|
261
|
-
|
262
|
-
rescue LoadError => e # Curl Exception Handling
|
263
|
-
curl_err = error_parser("Error: #{$!.message}")
|
264
|
-
# CheckInt.new.check_int if curl_err.include?('TCP')
|
265
|
-
curl_result[:curl_err] = curl_err
|
266
|
-
end
|
267
|
-
else ## If no url present?
|
268
|
-
curl_result[:curl_err] = 'URL Nil'
|
269
|
-
end
|
266
|
+
* `:neg` is an array of all the errors and negative, undesirable criteria to scrub against. If you include the criteria in OA `neg_urls:`, like above, it will automatically scrub and report. Regardless, any errors will also be included in there. So, if the url was not ultimately formatted, there will be details regarding why in `:neg`.
|
270
267
|
|
271
|
-
|
272
|
-
curl_result
|
273
|
-
end
|
274
|
-
```
|
275
|
-
|
276
|
-
|
277
|
-
### Another Example from a Production Environment...
|
268
|
+
* `:pos` is the opposite, which highlights positive criteria you might be looking for. It too is available in OA via `pos_urls:`, like above.
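
The keys described above can be illustrated with a small standalone sketch built on Ruby's standard `URI` library. `sketch_format_url` is a hypothetical helper, not the gem's `format_url`; it only shows how a scheme and 'www' might be added, the path dropped, and `:is_reformatted` derived.

```ruby
require 'uri'

# Hypothetical helper, not part of the gem.
def sketch_format_url(url_path)
  result = { is_reformatted: false, url_path: url_path, formatted_url: nil, neg: [], pos: [] }
  cleaned = url_path.to_s.strip.downcase

  # Reject input that cannot be a single URL.
  if cleaned.empty? || cleaned.include?(' ')
    result[:neg] << 'error: syntax'
    return result
  end

  # Add a scheme when the submitter omitted it.
  cleaned = "http://#{cleaned}" unless cleaned.start_with?('http://', 'https://')

  begin
    uri = URI.parse(cleaned)
  rescue URI::InvalidURIError => e
    result[:neg] << "error: #{e.message}"
    return result
  end

  host = uri.host
  if host.nil? || host.empty?
    result[:neg] << 'error: syntax'
    return result
  end

  # Keep only scheme and host, and make sure 'www.' is present.
  host = "www.#{host}" unless host.start_with?('www.')
  result[:formatted_url] = "#{uri.scheme}://#{host}"
  result[:is_reformatted] = result[:formatted_url] != url_path
  result
end

puts sketch_format_url('example.com/careers/').inspect
```

A URL that is already in scheme-plus-host form comes back unchanged with `:is_reformatted` false, while unusable input leaves `:formatted_url` nil and records the reason in `:neg`.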
|
278
269
|
|
279
270
|
```
|
280
|
-
|
281
|
-
|
282
|
-
|
283
|
-
|
284
|
-
|
285
|
-
|
271
|
+
[ {:is_reformatted=>false,
|
272
|
+
:url_path=>"https://www.steXXXXXXmitsubishiserviceandpartscenter.com",
|
273
|
+
:formatted_url=>"https://www.steXXXXXXmitsubishiserviceandpartscenter.com",
|
274
|
+
:neg=>["neg_urls: parts, rv, service"],
|
275
|
+
:pos=>["pos_urls: mitsubishi"]},
|
276
|
+
|
277
|
+
{:is_reformatted=>false,
|
278
|
+
:url_path=>"https://www.perXXXXXXchryslerjeepcenterville.com",
|
279
|
+
:formatted_url=>"https://www.perXXXXXXchryslerjeepcenterville.com",
|
280
|
+
:neg=>["neg_urls: rv"],
|
281
|
+
:pos=>["pos_urls: chrysler, jeep"]},
|
282
|
+
|
283
|
+
{:is_reformatted=>false,
|
284
|
+
:url_path=>"http://www.pXXXXXXchryslerjeepcenterville.com",
|
285
|
+
:formatted_url=>"http://www.XXXXXXechryslerjeepcenterville.com",
|
286
|
+
:neg=>["neg_urls: rv"],
|
287
|
+
:pos=>["pos_urls: chrysler, jeep"]},
|
288
|
+
|
289
|
+
{:is_reformatted=>false,
|
290
|
+
:url_path=>"http://www.colXXXXXXchryslerdodgejeepram.com",
|
291
|
+
:formatted_url=>"http://www.colXXXXXXchryslerdodgejeepram.com",
|
292
|
+
:neg=>["neg_urls: rv"],
|
293
|
+
:pos=>["pos_urls: chrysler, dodge, jeep, ram"]}
|
294
|
+
]
|
286
295
|
```
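
Once results like these come back, a caller will usually want to separate them. The sketch below assumes result hashes with the keys shown above; the sample data and variable names are made up for illustration.

```ruby
# Made-up sample results using the same keys as the output above.
results = [
  { is_reformatted: false, url_path: 'https://www.example-dealer.com',
    formatted_url: 'https://www.example-dealer.com', neg: [], pos: ['pos_urls: jeep'] },
  { is_reformatted: true, url_path: 'example-parts.com/deals/',
    formatted_url: 'http://www.example-parts.com', neg: ['neg_urls: parts'], pos: [] },
  { is_reformatted: false, url_path: 'not a url',
    formatted_url: nil, neg: ['error: syntax'], pos: [] }
]

# URLs that could not be formatted have a nil :formatted_url.
formatted, unformatted = results.partition { |hash| hash[:formatted_url] }

# Among formatted URLs, separate those that matched negative criteria.
clean, flagged = formatted.partition { |hash| hash[:neg].empty? }

puts "clean: #{clean.size}, flagged: #{flagged.size}, unformatted: #{unformatted.size}"
```

Checking `:formatted_url` first matters because `:is_reformatted` is false both for unchanged URLs and for URLs that could not be formatted.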
|
287
296
|
|
288
297
|
|
289
|
-
|
290
298
|
## Author
|
291
299
|
|
292
300
|
Adam J Booth - [4rlm](https://github.com/4rlm)
|
data/gem_notes_crm_formatter.txt
CHANGED
@@ -1,3 +1,32 @@
|
|
1
|
+
|
2
|
+
|
3
|
+
========= Quick Reference =========
|
4
|
+
* delete current gem build record.
|
5
|
+
* up count gemspec version
|
6
|
+
$ gem build crm_formatter.gemspec
|
7
|
+
* Uninstall Locally
|
8
|
+
$ gem list
|
9
|
+
$ gem uninstall crm_formatter
|
10
|
+
$ gem install crm_formatter-1.0.6.pre.rc.1.gem
|
11
|
+
* Test Locally
|
12
|
+
* Edit Gem Tester App Gemfile
|
13
|
+
|
14
|
+
* git add / commit / push
|
15
|
+
$ gem push crm_formatter-1.0.6.pre.rc.1.gem
|
16
|
+
|
17
|
+
|
18
|
+
========= TO DO =========
|
19
|
+
1. Red Flags: Change to 'OA'
|
20
|
+
2. Change Meth from 'remove' to 'compare_oa'
|
21
|
+
3. Zip can be integer?
|
22
|
+
4. Copy formatted results for README.
|
23
|
+
5. Continue and finish README.
|
24
|
+
6. Test Changes locally before pushing.
|
25
|
+
7. Push.
|
26
|
+
=========================
|
27
|
+
|
28
|
+
|
29
|
+
|
1
30
|
==== FIRST CREATE APP TO TEST GEM ====
|
2
31
|
-----------------------------------
|
3
32
|
rails new gem_tester
|
@@ -38,9 +67,9 @@ $ gem build crm_formatter.gemspec
|
|
38
67
|
Successfully built RubyGem
|
39
68
|
Name: crm_formatter
|
40
69
|
Version: 1.0.5.pre.rc.1
|
41
|
-
File: crm_formatter-1.0.
|
70
|
+
File: crm_formatter-1.0.6.pre.rc.1.gem
|
42
71
|
-----------------------------------
|
43
|
-
$ gem install crm_formatter-1.0.
|
72
|
+
$ gem install crm_formatter-1.0.6.pre.rc.1.gem
|
44
73
|
Successfully installed crm_formatter-1.0.5.pre.rc.1
|
45
74
|
Parsing documentation for crm_formatter-1.0.5.pre.rc.1
|
46
75
|
Installing ri documentation for crm_formatter-1.0.5.pre.rc.1
|
@@ -56,7 +85,7 @@ $ irb
|
|
56
85
|
$ curl -u adamjbooth https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials; chmod 0600 ~/.gem/credentials
|
57
86
|
Enter host password for user 'adamjbooth': RG<usual>rg
|
58
87
|
-----------------------------------
|
59
|
-
$ gem push crm_formatter-1.0.
|
88
|
+
$ gem push crm_formatter-1.0.6.pre.rc.1.gem
|
60
89
|
Pushing gem to https://rubygems.org...
|
61
90
|
Successfully registered gem: crm_formatter (1.0.4.pre.rc.1)
|
62
91
|
-----------------------------------
|
@@ -0,0 +1 @@
|
|
1
|
+
aaa
|
data/lib/crm_formatter/web.rb
CHANGED
@@ -1,171 +1,309 @@
|
|
1
|
+
require 'csv'
|
2
|
+
|
1
3
|
module CRMFormatter
|
2
4
|
class Web
|
3
5
|
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
6
|
+
def initialize(args={})
|
7
|
+
@empty_oa = args.empty?
|
8
|
+
@pos_urls = args.fetch(:pos_urls, [])
|
9
|
+
@neg_urls = args.fetch(:neg_urls, [])
|
10
|
+
@pos_links = args.fetch(:pos_links, [])
|
11
|
+
@neg_links = args.fetch(:neg_links, [])
|
12
|
+
@pos_hrefs = args.fetch(:pos_hrefs, [])
|
13
|
+
@neg_hrefs = args.fetch(:neg_hrefs, [])
|
14
|
+
@pos_exts = args.fetch(:pos_exts, [])
|
15
|
+
@neg_exts = args.fetch(:neg_exts, [])
|
16
|
+
@min_length = args.fetch(:min_length, 2)
|
17
|
+
@max_length = args.fetch(:max_length, 100)
|
18
|
+
end
|
19
|
+
|
20
|
+
def banned_symbols
|
21
|
+
banned_symbols = ["!", "$", "%", "'", "(", ")", "*", "+", ",", "<", ">", "@", "[", "]", "^", "{", "}", "~"]
|
22
|
+
end
|
23
|
+
|
24
|
+
##Call: StartCrm.run_webs
|
25
|
+
def format_url(url)
|
26
|
+
prep_result = prep_for_uri(url)
|
27
|
+
url_hash = prep_result[:url_hash]
|
28
|
+
url = prep_result[:url]
|
29
|
+
url = nil if has_errors(url_hash)
|
30
|
+
|
31
|
+
if url.present?
|
32
|
+
uri_result = run_uri(url_hash, url)
|
33
|
+
url_hash = uri_result[:url_hash]
|
34
|
+
url = uri_result[:url]
|
35
|
+
(url = nil if has_errors(url_hash)) if url.present?
|
36
|
+
end
|
37
|
+
|
38
|
+
url_hash[:formatted_url] = url
|
39
|
+
url_hash = check_reformatted_status(url_hash) if url.present?
|
40
|
+
url_hash
|
41
|
+
end
|
42
|
+
|
43
|
+
|
44
|
+
def check_reformatted_status(url_hash)
|
45
|
+
formatted = url_hash[:formatted_url]
|
46
|
+
if formatted.present?
|
47
|
+
url_hash[:is_reformatted] = url_hash[:url_path] != formatted
|
11
48
|
end
|
49
|
+
url_hash
|
50
|
+
end
|
51
|
+
|
52
|
+
|
53
|
+
def has_errors(url_hash)
|
54
|
+
errors = url_hash[:neg].map { |neg| neg.include?('error') }
|
55
|
+
errors.any?
|
56
|
+
end
|
57
|
+
|
58
|
+
|
59
|
+
##Call: StartCrm.run_webs
|
60
|
+
def prep_for_uri(url)
|
61
|
+
url_hash = { is_reformatted: false, url_path: url, formatted_url: nil, neg: [], pos: [] }
|
62
|
+
begin
|
63
|
+
url = url&.split('|')&.first
|
64
|
+
url = url&.split('\\')&.first
|
65
|
+
url&.gsub!(/\P{ASCII}/, '')
|
66
|
+
url = url&.downcase&.strip
|
67
|
+
|
68
|
+
2.times { remove_ww3(url) } if url.present?
|
69
|
+
url = remove_slashes(url) if url.present?
|
70
|
+
url&.strip!
|
12
71
|
|
13
|
-
def format_url(url)
|
14
|
-
url_hsh = {url_path: url, formatted_url: nil, url_edit: false }
|
15
72
|
if url.present?
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
url&.gsub!(/\P{ASCII}/, '')
|
20
|
-
url = url&.downcase&.strip
|
21
|
-
return url_hsh if url&.length < @length_min
|
73
|
+
url = nil if url.include?(' ')
|
74
|
+
url = url[0..-2] if url.present? && url[-1] == '/'
|
75
|
+
end
|
22
76
|
|
23
|
-
|
24
|
-
url = remove_slashes(url) if url.present?
|
25
|
-
url&.strip!
|
77
|
+
url = nil if url.present? && banned_symbols.any? {|symb| url&.include?(symb) }
|
26
78
|
|
27
|
-
|
28
|
-
|
79
|
+
if url.present?
|
80
|
+
url_hash = compare_criteria(url_hash, url, 'pos_urls', 'include') if !@empty_oa
|
81
|
+
url_hash = compare_criteria(url_hash, url, 'neg_urls', 'include') if !@empty_oa
|
82
|
+
else
|
83
|
+
url_hash[:neg] << "error: syntax"
|
84
|
+
url_hash[:formatted_url] = url
|
85
|
+
end
|
29
86
|
|
30
|
-
|
87
|
+
rescue Exception => e
|
88
|
+
url_hash[:neg] << "error: #{e}"
|
89
|
+
url = nil
|
90
|
+
url_hash
|
91
|
+
end
|
31
92
|
|
32
|
-
|
93
|
+
prep_result = { url_hash: url_hash, url: url }
|
94
|
+
end
|
33
95
|
|
34
|
-
uri = URI(url)
|
35
|
-
if uri.present?
|
36
|
-
host_parts = uri.host&.split(".")
|
37
96
|
|
38
|
-
|
39
|
-
|
40
|
-
|
41
|
-
|
97
|
+
##Call: StartCrm.run_webs
|
98
|
+
def run_uri(url_hash, url)
|
99
|
+
begin
|
100
|
+
uri = URI(url)
|
101
|
+
host_parts = uri.host&.split(".")
|
42
102
|
|
43
|
-
|
44
|
-
|
45
|
-
url = "#{scheme}://#{host}" if host.present? && scheme.present?
|
46
|
-
url = "http://#{url}" if url[0..3] != "http"
|
47
|
-
url = url.gsub("//", "//www.") if !url.include?("www.")
|
103
|
+
url_hash = compare_criteria(url_hash, host_parts, 'pos_exts', 'equal') if !@empty_oa
|
104
|
+
url_hash = compare_criteria(url_hash, host_parts, 'neg_exts', 'equal') if !@empty_oa
|
48
105
|
|
49
|
-
|
106
|
+
host = uri.host
|
107
|
+
scheme = uri.scheme
|
108
|
+
url = "#{scheme}://#{host}" if host.present? && scheme.present?
|
109
|
+
url = "http://#{url}" if url[0..3] != "http"
|
110
|
+
url = url.gsub("//", "//www.") if !url.include?("www.")
|
111
|
+
samp_url = convert_to_scheme_host(url)
|
50
112
|
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
113
|
+
url = convert_to_scheme_host(url) if url.present?
|
114
|
+
url_extens_result = check_url_extens(url_hash, url)
|
115
|
+
url_hash = url_extens_result[:url_hash]
|
116
|
+
url = url_extens_result[:url]
|
117
|
+
|
118
|
+
rescue Exception => e
|
119
|
+
url_hash[:neg] << "error: #{e}"
|
120
|
+
url = nil
|
121
|
+
url_hash
|
122
|
+
end
|
123
|
+
|
124
|
+
uri_result = { url_hash: url_hash, url: url }
|
125
|
+
end
|
126
|
+
|
127
|
+
|
128
|
+
#Source: http://www.iana.org/domains/root/db
|
129
|
+
#Text: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
|
130
|
+
def check_url_extens(url_hash, url)
|
131
|
+
if url.present?
|
132
|
+
url_extens = URI(url).host&.split(".")[2..-1]
|
133
|
+
if url_extens.count > 1
|
134
|
+
file_path = "./lib/crm_formatter/extensions.csv"
|
135
|
+
extens_list = CSV.read(file_path).flatten
|
136
|
+
valid_url_extens = extens_list & url_extens
|
137
|
+
|
138
|
+
if valid_url_extens.count != 1
|
139
|
+
extens_str = valid_url_extens.map { |ext| ".#{ext}" }.join(', ')
|
140
|
+
url_hash[:neg] << "error: exts.count > 1 [#{extens_str}]"
|
141
|
+
url = nil
|
56
142
|
end
|
57
143
|
end
|
58
|
-
url_hsh
|
59
144
|
end
|
60
145
|
|
146
|
+
url_hash[:formatted_url] = url
|
147
|
+
url_extens_result = {url_hash: url_hash, url: url}
|
148
|
+
end
|
61
149
|
|
62
|
-
###### Supporting Methods Below #######
|
63
150
|
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
151
|
+
## This process, compare_criteria only runs if client OA args were passed at initialization.
|
152
|
+
## Results listed in url_hash[:neg]/[:pos], and don't impact or hinder final formatted url.
|
153
|
+
## Simply adds more detail about the user's preferences and criteria for the url.
|
154
|
+
|
155
|
+
def compare_criteria(hash, target, list_name, include_or_equal)
|
156
|
+
unless @empty_oa
|
157
|
+
if list_name.present?
|
158
|
+
criteria_list = instance_variable_get("@#{list_name}")
|
159
|
+
|
160
|
+
if criteria_list.present?
|
161
|
+
if target.is_a?(::String)
|
162
|
+
tars = target.split(', ')
|
163
|
+
else
|
164
|
+
tars = target
|
165
|
+
end
|
166
|
+
|
167
|
+
pn_matches = tars.map do |tar|
|
168
|
+
if criteria_list.present?
|
169
|
+
if include_or_equal == 'include'
|
170
|
+
criteria_list.select { |el| el if tar.include?(el) }.join(', ')
|
171
|
+
elsif include_or_equal == 'equal'
|
172
|
+
criteria_list.select { |el| el if tar == el }.join(', ')
|
173
|
+
end
|
174
|
+
end
|
175
|
+
end
|
176
|
+
|
177
|
+
pn_match = pn_matches&.uniq&.sort&.join(', ')
|
178
|
+
if pn_match.present?
|
179
|
+
if list_name.include?('neg')
|
180
|
+
hash[:neg] << "#{list_name}: #{pn_match}"
|
181
|
+
else
|
182
|
+
hash[:pos] << "#{list_name}: #{pn_match}"
|
183
|
+
end
|
184
|
+
end
|
185
|
+
end
|
186
|
+
|
69
187
|
end
|
70
188
|
end
|
189
|
+
|
190
|
+
hash
|
191
|
+
end
|
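The matching logic in `compare_criteria` can be sketched without instance variables or ActiveSupport. This is a sketch under assumptions: `match_criteria` and `record_matches` are hypothetical names, and the criteria list is passed in directly instead of being looked up with `instance_variable_get`. It uses `flat_map`, so targets with no match contribute nothing, whereas the original joins each target's matches into a string first.

```ruby
# Collects the criteria that match any target, either by substring
# ("include") or by exact comparison ("equal").
def match_criteria(targets, criteria, mode)
  targets = targets.split(", ") if targets.is_a?(String)
  matches = targets.flat_map do |target|
    case mode
    when "include" then criteria.select { |item| target.include?(item) }
    when "equal"   then criteria.select { |item| target == item }
    else []
    end
  end
  matches.uniq.sort
end

# Records matches under :neg when the list name contains "neg", else under :pos.
def record_matches(hash, targets, list_name, criteria, mode)
  found = match_criteria(targets, criteria, mode)
  return hash if found.empty?
  key = list_name.include?("neg") ? :neg : :pos
  hash[key] << "#{list_name}: #{found.join(', ')}"
  hash
end
```

As in the original, the result only annotates the hash. It never changes the formatted url.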
71
192
|
|
193
|
+
###### Supporting Methods Below #######
|
72
194
|
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
195
|
+
def extract_link(url_path)
|
196
|
+
url_hash = format_url(url_path)
|
197
|
+
url = url_hash[:formatted_url]
|
198
|
+
link = url_path
|
199
|
+
link_hsh = {url_path: url_path, url: url, link: nil }
|
200
|
+
if url.present? && link.present? && link.length > @min_length
|
201
|
+
url = strip_down_url(url)
|
202
|
+
link = strip_down_url(link)
|
203
|
+
link&.gsub!(url, '')
|
204
|
+
link = link&.split('.net')&.last
|
205
|
+
link = link&.split('.com')&.last
|
206
|
+
link = link&.split('.org')&.last
|
207
|
+
link = "/#{link.split("/").reject(&:empty?).join("/")}" if link.present?
|
208
|
+
link_hsh[:link] = link if link.present? && link.length > @min_length
|
209
|
+
end
|
210
|
+
link_hsh
|
211
|
+
end
|
212
|
+
|
213
|
+
|
214
|
+
def strip_down_url(url)
|
215
|
+
if url.present?
|
216
|
+
url = url.downcase.strip
|
217
|
+
url = url.gsub('www.', '')
|
218
|
+
url = url.split('://')
|
219
|
+
url = url[-1]
|
79
220
|
return url
|
80
221
|
end
|
222
|
+
end
|
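`extract_link` and `strip_down_url` together reduce a full URL to its path relative to the site root. The sketch below shows the same idea with hypothetical helpers `strip_down` and `relative_link`. It differs from the diff in two labeled ways: `www.` is removed only at the start of the host rather than everywhere in the string, and the root is removed by prefix comparison rather than by splitting on `.net`, `.com`, and `.org`.

```ruby
# Lowercases a url and drops the scheme and a leading "www.".
def strip_down(url)
  return nil if url.nil? || url.strip.empty?
  bare = url.downcase.strip.split("://").last.to_s
  bare.sub(/\Awww\./, "")
end

# Returns the path of url_path relative to root_url, with empty
# segments removed, or nil when there is no path to report.
def relative_link(url_path, root_url)
  root = strip_down(root_url)
  full = strip_down(url_path)
  return nil if root.nil? || full.nil? || !full.start_with?(root)
  segments = full[root.length..-1].split("/").reject(&:empty?)
  segments.empty? ? nil : "/#{segments.join('/')}"
end
```

The prefix comparison avoids truncating paths that happen to contain `.com` or `.org` themselves, which the three chained `split` calls would do.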
81
223
|
|
82
224
|
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
link = link&.split('.org')&.last
|
95
|
-
link = "/#{link.split("/").reject(&:empty?).join("/")}" if link.present?
|
96
|
-
link_hsh[:link] = link if link.present? && link.length > @length_min
|
97
|
-
end
|
98
|
-
link_hsh
|
225
|
+
def remove_invalid_links(link)
|
226
|
+
link_hsh = {link: link, valid_link: nil, flags: nil }
|
227
|
+
if link.present?
|
228
|
+
@neg_links += get_symbs
|
229
|
+
flags = @neg_links.select { |red| link&.include?(red) }
|
230
|
+
flags << "below #{@min_length}" if link.length < @min_length
|
231
|
+
flags << "over #{@max_length}" if link.length > @max_length
|
232
|
+
flags = flags.flatten.compact
|
233
|
+
flags.any? ? valid_link = nil : valid_link = link
|
234
|
+
link_hsh[:valid_link] = valid_link
|
235
|
+
link_hsh[:flags] = flags.join(', ')
|
99
236
|
end
|
237
|
+
link_hsh
|
238
|
+
end
|
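The flagging in `remove_invalid_links` can be sketched as a pure function. `flag_link`, its keyword arguments, and the default lengths are hypothetical and chosen for illustration. Unlike `@neg_links += get_symbs`, which reassigns the instance variable on every call, the sketch never modifies the list it is given.

```ruby
# Flags a link that contains an unwanted term or falls outside the
# allowed length range; valid_link is set only when nothing is flagged.
def flag_link(link, neg_terms:, min_length: 2, max_length: 100)
  result = { link: link, valid_link: nil, flags: nil }
  return result if link.nil? || link.empty?
  flags = neg_terms.select { |term| link.include?(term) }
  flags << "below #{min_length}" if link.length < min_length
  flags << "over #{max_length}" if link.length > max_length
  result[:valid_link] = link if flags.empty?
  result[:flags] = flags.join(", ")
  result
end
```

A link such as `/about-us` passes with an empty `:flags` string, while `/cart/login` checked against `["login", "cart"]` is rejected with both terms listed.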
100
239
|
|
101
240
|
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
241
|
+
def remove_invalid_hrefs(href)
|
242
|
+
href_hsh = {href: href, valid_href: nil, flags: nil }
|
243
|
+
if href.present?
|
244
|
+
@neg_hrefs += get_symbs
|
245
|
+
href = href.split('|').join(' ')
|
246
|
+
href = href.split('/').join(' ')
|
247
|
+
href&.gsub!("(", ' ')
|
248
|
+
href&.gsub!(")", ' ')
|
249
|
+
href&.gsub!("[", ' ')
|
250
|
+
href&.gsub!("]", ' ')
|
251
|
+
href&.gsub!(",", ' ')
|
252
|
+
href&.gsub!("'", ' ')
|
253
|
+
|
254
|
+
flags = []
|
255
|
+
flags << "over #{@max_length}" if href.length > @max_length
|
256
|
+
invalid_text = Regexp.new(/[0-9]/)
|
257
|
+
flags << invalid_text&.match(href)
|
258
|
+
href = href&.downcase
|
259
|
+
href = href&.strip
|
260
|
+
|
261
|
+
flags << @neg_hrefs.select { |red| href&.include?(red) }
|
262
|
+
flags = flags.flatten.compact.uniq
|
263
|
+
href_hsh[:valid_href] = href unless flags.any?
|
264
|
+
href_hsh[:flags] = flags.join(', ')
|
110
265
|
end
|
266
|
+
href_hsh
|
267
|
+
end
|
111
268
|
|
112
269
|
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
flags << "over #{@length_max}" if link.length > @length_max
|
121
|
-
flags = flags.flatten.compact
|
122
|
-
flags.any? ? valid_link = nil : valid_link = link
|
123
|
-
link_hsh[:valid_link] = valid_link
|
124
|
-
link_hsh[:flags] = flags.join(', ')
|
125
|
-
end
|
126
|
-
link_hsh
|
270
|
+
def convert_to_scheme_host(url)
|
271
|
+
if url.present?
|
272
|
+
uri = URI(url)
|
273
|
+
scheme = uri&.scheme
|
274
|
+
host = uri&.host
|
275
|
+
url = "#{scheme}://#{host}" if (scheme.present? && host.present?)
|
276
|
+
return url
|
127
277
|
end
|
278
|
+
end
|
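A standalone sketch of the scheme-and-host reduction, using only the standard `uri` library. `scheme_and_host` is a hypothetical name, and returning `nil` for unparseable input is an assumption of this sketch rather than behavior shown in the diff.

```ruby
require "uri"

# Reduces a url to "scheme://host". Urls without both parts are returned
# unchanged, and unparseable input yields nil.
def scheme_and_host(url)
  return nil if url.nil? || url.strip.empty?
  uri = URI.parse(url.strip)
  return url if uri.scheme.nil? || uri.host.nil?
  "#{uri.scheme}://#{uri.host}"
rescue URI::InvalidURIError
  nil
end
```

Note that, like the method in the diff, this drops any port, path, and query string, so `https://www.example.com/about?x=1` becomes `https://www.example.com`.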
128
279
|
|
129
280
|
|
130
|
-
|
131
|
-
|
132
|
-
|
133
|
-
|
134
|
-
|
135
|
-
href = href.split('|').join(' ')
|
136
|
-
href = href.split('/').join(' ')
|
137
|
-
href&.gsub!("(", ' ')
|
138
|
-
href&.gsub!(")", ' ')
|
139
|
-
href&.gsub!("[", ' ')
|
140
|
-
href&.gsub!("]", ' ')
|
141
|
-
href&.gsub!(",", ' ')
|
142
|
-
href&.gsub!("'", ' ')
|
143
|
-
|
144
|
-
flags = []
|
145
|
-
flags << "over #{@length_max}" if href.length > @length_max
|
146
|
-
invalid_text = Regexp.new(/[0-9]/)
|
147
|
-
flags << invalid_text&.match(href)
|
148
|
-
href = href&.downcase
|
149
|
-
href = href&.strip
|
150
|
-
|
151
|
-
flags << @href_flags.select { |red| href&.include?(red) }
|
152
|
-
flags = flags.flatten.compact.uniq
|
153
|
-
href_hsh[:valid_href] = href unless flags.any?
|
154
|
-
href_hsh[:flags] = flags.join(', ')
|
155
|
-
end
|
156
|
-
href_hsh
|
281
|
+
#CALL: Formatter.new.remove_ww3(url)
|
282
|
+
def remove_ww3(url)
|
283
|
+
if url.present?
|
284
|
+
url.split('.').map { |part| url.gsub!(part,'www') if part.scan(/ww[0-9]/).any? }
|
285
|
+
url&.gsub!("www.www", "www")
|
157
286
|
end
|
287
|
+
end
|
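`remove_ww3` rewrites numbered prefixes such as `ww2.` or `ww3.` to `www.`, and it does so by mutating its argument. If the input still carries its scheme, the part matched by `split('.')` is `http://ww2`, so the scheme is replaced along with the prefix. A non-mutating alternative, sketched here under the hypothetical name `normalize_ww_prefix`, anchors the match to the start of the host and returns the new string:

```ruby
# Replaces a leading "ww<digits>." host prefix with "www.", preserving
# any scheme. Returns the input unchanged when there is nothing to fix.
def normalize_ww_prefix(url)
  return url if url.nil? || url.empty?
  url.sub(%r{(\A|://)ww\d+\.}, '\1www.')
end
```

This sketch handles only a prefix at the start of the host. The `www.www` cleanup in the original is not reproduced.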
158
288
|
|
159
289
|
|
160
|
-
|
161
|
-
|
162
|
-
|
163
|
-
|
164
|
-
|
165
|
-
url = "#{scheme}://#{host}" if (scheme.present? && host.present?)
|
166
|
-
return url
|
167
|
-
end
|
290
|
+
# For rare cases where a url mistakenly contains a second double slash.
|
291
|
+
def remove_slashes(url)
|
292
|
+
if url.present? && url.include?('//')
|
293
|
+
parts = url.split('//')
|
294
|
+
return parts[0..1].join if parts.length > 2
|
168
295
|
end
|
296
|
+
return url
|
297
|
+
end
|
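As written, `remove_slashes` keeps only the first two pieces of the split and joins them without a separator, so `http://example.com//about` becomes `http:example.com`. If the intent is to keep the scheme separator and collapse stray double slashes in the rest of the url, one possible approach (a different technique from the one in the diff, with the hypothetical name `collapse_extra_slashes`) is:

```ruby
# Collapses runs of slashes after the scheme separator into a single
# slash, leaving "scheme://" itself intact.
def collapse_extra_slashes(url)
  return url if url.nil? || url.empty?
  scheme, separator, rest = url.partition("://")
  # No scheme separator: collapse slashes across the whole string.
  return url.gsub(%r{/{2,}}, "/") if separator.empty?
  "#{scheme}#{separator}#{rest.gsub(%r{/{2,}}, '/')}"
end
```

Whether the path after the second double slash should be kept or discarded is a design decision the diff does not settle. This sketch keeps it.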
298
|
+
|
299
|
+
##Call: StartCrm.run_webs
|
300
|
+
# def get_ext_list
|
301
|
+
# # Source: http://www.iana.org/domains/root/db
|
302
|
+
# # .txt list: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
|
303
|
+
# file_path = "./lib/crm_formatter/extensions.csv"
|
304
|
+
# extensions = CSV.read(file_path)
|
305
|
+
# end
|
306
|
+
|
169
307
|
|
170
308
|
end
|
171
309
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: crm_formatter
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.7.pre.rc.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Adam Booth
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2018-05-
|
11
|
+
date: 2018-05-15 00:00:00.000000000 Z
|
12
12
|
dependencies: []
|
13
13
|
description: Reformat and Normalize CRM Contact Data, Addresses, Phones, Emails and
|
14
14
|
URLs. Originally developed for proprietary use in an enterprise software suite. Recently
|
@@ -30,6 +30,7 @@ files:
|
|
30
30
|
- gem_notes_crm_formatter.txt
|
31
31
|
- lib/crm_formatter.rb
|
32
32
|
- lib/crm_formatter/address.rb
|
33
|
+
- lib/crm_formatter/extensions.csv
|
33
34
|
- lib/crm_formatter/helpers.rb
|
34
35
|
- lib/crm_formatter/phone.rb
|
35
36
|
- lib/crm_formatter/version.rb
|