RubyGems - brilliant_web_scraper - Versions diffs - 0.1 → 0.2 - Mend

brilliant_web_scraper 0.1 → 0.2

Files changed (22)

checksums.yaml +4 -4
data/Gemfile.lock +90 -0
data/README.md +6 -9
data/brilliant_web_scraper.gemspec +12 -7
data/lib/brilliant_web_scraper.rb +2 -1
data/lib/parsers/description_helper.rb +1 -1
data/lib/parsers/emails.rb +2 -2
data/lib/parsers/facebook_profile.rb +1 -1
data/lib/parsers/redirected_to.rb +2 -1
data/lib/parsers/twitter_profile.rb +1 -1
data/lib/parsers/vimeo_profile.rb +1 -1
data/lib/scraper/scrape_helper.rb +27 -27
data/lib/scraper/scrape_request.rb +12 -8
data/lib/version.rb +1 -1
data/spec/lib/parsers/emails_spec.rb +4 -0
data/spec/lib/parsers/facebook_profile_spec.rb +1 -0
data/spec/lib/parsers/redirected_to_spec.rb +209 -191
data/spec/lib/parsers/twitter_profile_spec.rb +1 -0
data/spec/lib/parsers/vimeo_profile_spec.rb +6 -4
metadata +21 -46
data/brilliant_web_scraper-1.0.0.gem +0 -0
data/brilliant_web_scraper-1.0.gem +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: efbe9d1a0688fd10e200d972b56c3e2ec86203f1
-  data.tar.gz: 20cce1c52197f11dcea73813831bb4172829ddaa
+  metadata.gz: 9ad5219a19dcfc311bed756a83d82fd3758bd71a
+  data.tar.gz: c085eb2a96b8eb503cd44edc87821823cf0ad965
 SHA512:
-  metadata.gz: 638c34f7efbc963613f4bb841abbf183bf134ee3197bebc99f9403ba7864befd44243f53092a9aa3ba7ea58314475b61d6671816e8e3f8ef4deb7f49b6f0ef52
-  data.tar.gz: f91110f69e8228de408aa0c35050fe6137fac22bdb93ff86be3c70d380e1cf57534f50f78e5d585cb63e421ff0fc51aa04089d38e8a86ab3a6ca305659dc909a
+  metadata.gz: 4a5c3c9dd78f3e123b0c279a04c2d0d262c58aef8d0086668c37aea8e52b6ec8da98d6d242debddf600922dcfed3d377bd159e62fc9bb1e7390b1e62e881eb2b
+  data.tar.gz: de051e60d7b90bde7984871d1b39e52f46be6366910591f0ec31c434a41037a9cc1274989dd50755cbf6e03d0e9f498a3271aad173b12a8ea3fd6a578eb8fbc9

data/Gemfile.lock ADDED

@@ -0,0 +1,90 @@
+PATH
+  remote: .
+  specs:
+    brilliant_web_scraper (0.2)
+      charlock_holmes (~> 0.7.6)
+      nesty (~> 1.0, >= 1.0.1)
+      rest-client (~> 2.0, >= 2.0.2)
+GEM
+  remote: http://rubygems.org/
+  specs:
+    addressable (2.6.0)
+      public_suffix (>= 2.0.2, < 4.0)
+    ast (2.4.0)
+    charlock_holmes (0.7.6)
+    coderay (1.1.2)
+    crack (0.4.3)
+      safe_yaml (~> 1.0.0)
+    diff-lcs (1.3)
+    domain_name (0.5.20190701)
+      unf (>= 0.0.5, < 1.0.0)
+    hashdiff (1.0.0)
+    http-accept (1.7.0)
+    http-cookie (1.0.3)
+      domain_name (~> 0.5)
+    jaro_winkler (1.5.3)
+    method_source (0.9.2)
+    mime-types (3.2.2)
+      mime-types-data (~> 3.2015)
+    mime-types-data (3.2019.0331)
+    nesty (1.0.2)
+    netrc (0.11.0)
+    parallel (1.17.0)
+    parser (2.6.3.0)
+      ast (~> 2.4.0)
+    pry (0.12.2)
+      coderay (~> 1.1.0)
+      method_source (~> 0.9.0)
+    public_suffix (3.1.1)
+    rainbow (3.0.0)
+    rest-client (2.1.0)
+      http-accept (>= 1.7.0, < 2.0)
+      http-cookie (>= 1.0.2, < 2.0)
+      mime-types (>= 1.16, < 4.0)
+      netrc (~> 0.8)
+    rspec (3.8.0)
+      rspec-core (~> 3.8.0)
+      rspec-expectations (~> 3.8.0)
+      rspec-mocks (~> 3.8.0)
+    rspec-core (3.8.2)
+      rspec-support (~> 3.8.0)
+    rspec-expectations (3.8.4)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.8.0)
+    rspec-mocks (3.8.1)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.8.0)
+    rspec-support (3.8.2)
+    rubocop (0.73.0)
+      jaro_winkler (~> 1.5.1)
+      parallel (~> 1.10)
+      parser (>= 2.6)
+      rainbow (>= 2.2.2, < 4.0)
+      ruby-progressbar (~> 1.7)
+      unicode-display_width (>= 1.4.0, < 1.7)
+    ruby-progressbar (1.10.1)
+    safe_yaml (1.0.5)
+    unf (0.1.4)
+      unf_ext
+    unf_ext (0.0.7.6)
+    unicode-display_width (1.6.0)
+    vcr (3.0.3)
+    webmock (2.3.2)
+      addressable (>= 2.3.6)
+      crack (>= 0.3.2)
+      hashdiff
+PLATFORMS
+  ruby
+DEPENDENCIES
+  brilliant_web_scraper!
+  pry (~> 0.12.2)
+  rspec (~> 3.5)
+  rubocop (~> 0.73.0)
+  vcr (~> 3.0, >= 3.0.1)
+  webmock (~> 2.1)
+BUNDLED WITH
+   1.16.6

data/README.md CHANGED

@@ -1,14 +1,11 @@
-# WebScraper [![Build Status](https://api.travis-ci.com/bkotu6717/brilliant_web_scraper.svg)](https://travis-ci.com/bkotu6717/brilliant_web_scraper)
+# BrilliantWebScraper [![Build Status](https://api.travis-ci.com/bkotu6717/brilliant_web_scraper.svg)](https://travis-ci.com/bkotu6717/brilliant_web_scraper)[![Maintainability](https://api.codeclimate.com/v1/badges/15a8a6e117f11bd94376/maintainability)](https://codeclimate.com/github/bkotu6717/brilliant_web_scraper/maintainability)
-A decent web scraping gem. Scrapes website description, social profiles, contact details, youtube channels.
-It accepts a URL or Domain as input and gets it's title, descrptios, social profiles, YouTube channels and it's current URL if got redirected.
+A decent web scraping gem. Scrapes website title, description, social profiles such as linkedin, facebook, twitter, instgram, vimeo, pinterest, youtube channel and contact details such as emails, phone numbers.
 ## See it in action!
-You can try WebScraper live at this little demo: [https://brilliantweb-scraper-demo.herokuapp.com](https://brilliant-web-scraper-demo.herokuapp.com)
+You can try BrillaintWebScraper live at this little demo: [https://brilliant-web-scraper-demo.herokuapp.com](https://brilliant-web-scraper-demo.herokuapp.com)
 ## Installation
@@ -21,11 +18,11 @@ gem 'brilliant_web_scraper'
 ## Usage
-Initialize a BrilliantWebScraper instance for an URL, like this:
+Initialize a BrilliantWebScraper instance for an URL, like this with optional timeouts, default connection_timeout and read_timeouts are 10s, 10s respectively:
 ```ruby
 require 'brilliant_web_scraper'
+results = BrilliantWebScraper.new('http://pwc.com', 5, 5)
 results = BrilliantWebScraper.new('http://pwc.com')
 ```
-If you don't include the scheme on the URL, it is fine:

data/brilliant_web_scraper.gemspec CHANGED

@@ -6,23 +6,28 @@ Gem::Specification.new do |s|
   s.name =  'brilliant_web_scraper'
   s.version = WebScraper::VERSION
   s.licenses = ['Nonstandard']
-  s.summary = 'A decent web scraping ruby library!'
-  s.description = 'Scrapes data such as description, social profiles, contact details'
+  s.summary = 'A decent web scraping ruby gem!'
+  s.description = 'A decent web scraping gem.'\
+                  'Scrapes website\'s title, description,'\
+                  'social profiles such as linkedin, '\
+                  'facebook, twitter, instgram, vimeo,'\
+                  'pinterest, youtube channel and'\
+                  ' contact details such as emails, phone numbers.'
   s.authors = ['Kotu Bhaskara Rao']
   s.email = 'bkotu6717@gmail.com'
   s.require_paths = ['lib']
   s.homepage = 'https://github.com/bkotu6717/brilliant_web_scraper'
   s.files = Dir['**/*'].keep_if { |file|
-    file != "brilliant_web_scraper-#{WebScraper::VERSION}.gem" && File.file?(file)
+    file != "brilliant_web_scraper-#{WebScraper::VERSION}.gem" &&
+    File.file?(file)
   }
   s.required_ruby_version = '>= 2.3.0'
-  s.add_dependency 'nesty', '~> 1.0', '>= 1.0.1'
-  s.add_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
+  s.add_runtime_dependency 'charlock_holmes', '~> 0.7.6'
+  s.add_runtime_dependency 'nesty', '~> 1.0', '>= 1.0.1'
+  s.add_runtime_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
-  s.add_development_dependency 'nesty', '~> 1.0', '>= 1.0.1'
   s.add_development_dependency 'pry', '~> 0.12.2'
-  s.add_development_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
   s.add_development_dependency 'rspec', '~> 3.5'
   s.add_development_dependency 'rubocop', '~> 0.73.0'
   s.add_development_dependency 'vcr', '~> 3.0', '>= 3.0.1'

data/lib/brilliant_web_scraper.rb CHANGED

@@ -2,7 +2,8 @@
 require 'rest-client'
 require 'cgi'
-require 'benchmark'
+require 'charlock_holmes/string'
+require 'timeout'
 current_directory = File.dirname(__FILE__) + '/scraper'
 require File.expand_path(File.join(current_directory, 'errors'))

data/lib/parsers/description_helper.rb CHANGED

@@ -21,7 +21,7 @@ module DescriptionHelper
   def parse_description(descriptions)
     return if descriptions.nil? || descriptions.empty?
-    descriptions = descriptions.reject { |x| x.nil? || x.empty? }
+    descriptions = descriptions.reject { |x| x.nil? || x.empty? || x =~ /^\s*$/}
     descriptions = descriptions.map { |x| unescape_html(x) }
     descriptions.find { |x| (x !~ /^\s*[|-]?\s*$/) }
   end
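
Note on the added `x =~ /^\s*$/` check: in Ruby, `^` and `$` anchor to line boundaries rather than to the whole string, so the pattern also matches a multi-line description that merely contains a blank line. A minimal sketch comparing it with string-level anchors (the constants and the helper `blank_by?` are hypothetical, not part of the gem):

```ruby
LINE_BLANK   = /^\s*$/   # pattern from the diff: anchors apply per line
STRING_BLANK = /\A\s*\z/ # alternative: anchors apply to the whole string

# True when `pattern` matches anywhere in `text`.
def blank_by?(pattern, text)
  !(text =~ pattern).nil?
end

puts blank_by?(LINE_BLANK, "Hello\n\nWorld")
puts blank_by?(STRING_BLANK, "Hello\n\nWorld")
```

Whether the per-line behaviour is intended here is not stated in the diff.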

data/lib/parsers/emails.rb CHANGED

@@ -10,7 +10,7 @@ module Emails
     return if response.nil? || response.empty?
     first_regex = /(?im)mailto:\s*([^\?"',\\<>\s]+)/
-    second_regex = %r{(?im)["'\s><\/]*([\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg)[A-Z]{2,3})["'\s><]}
+    second_regex = %r{(?im)["'\s><\/]*([\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3})["'\s><]}
     first_set = response.scan(first_regex).flatten.compact
     first_set = get_processed_emails(first_set)
     second_set = response.scan(second_regex).flatten.compact
@@ -24,7 +24,7 @@ module Emails
     unescaped_emails = email_set.map { |email| unescape_html(email) }
     return [] if unescaped_emails.empty?
-    email_match_regex = /[\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg)[A-Z]{2,3}/im
+    email_match_regex = /[\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3}/im
     unescaped_emails.select { |data| data =~ email_match_regex }
   end
 end
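
The extended lookahead in both patterns rejects asset file names such as `favicon@2x.ico` that otherwise look like addresses. A minimal sketch with a deliberately simplified pattern that keeps only the extension lookahead and drops the placeholder-domain exclusions (`SIMPLE_EMAIL` and `looks_like_email?` are illustrative names, not part of the gem):

```ruby
# Simplified stand-in for the gem's pattern: local part, "@", domain,
# then a 2-3 letter suffix that must not start with an asset extension.
SIMPLE_EMAIL = /[\w._%-]+@[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3}/i

# True when `text` contains something the simplified pattern accepts.
def looks_like_email?(text)
  !(text =~ SIMPLE_EMAIL).nil?
end

puts looks_like_email?('favicon@2x.ico')
puts looks_like_email?('info@acme.com')
```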

data/lib/parsers/facebook_profile.rb CHANGED

@@ -5,7 +5,7 @@ module FacebookProfile
   def grep_facebook_profile(response)
     return if response.nil? || response.empty?
-    facebook_url_regex = /(https?:\/\/(?:www\.)?(?:facebook|fb)\.com\/(?!tr\?|(?:[\/\w\d]*(?:photo|sharer?|like(?:box)?|offsite_event|plugins|permalink|home|search))\.php|\d+\/fbml|(?:dialog|hashtag|plugins|sharer|login|recover|security|help|v\d+\.\d+)\/|(?:privacy|#|your-profile|yourfacebookpage)\/?|home\?)[^"'<>\&\s]+)/im
+    facebook_url_regex = /(https?:\/\/(?:www\.)?(?:facebook|fb)\.com\/(?!tr\?|(?:[\/\w\d]*(?:photo|sharer?|like(?:box)?|offsite_event|plugins|permalink|home|search))\.php|\d+\/fbml|(?:dialog|hashtag|plugins|sharer|login|recover|security|help|images|v\d+\.\d+)\/|(?:privacy|#|your-profile|yourfacebookpage)\/?|home\?)[^"'<>\&\s]+)/im
     response.scan(facebook_url_regex).flatten.compact.uniq
   end
 end

data/lib/parsers/redirected_to.rb CHANGED

@@ -2,6 +2,7 @@
 # Fetch latest url of the given website
 module RedirectedTo
+  include UnescapeHtmlHelper
   def grep_redirected_to_url(response)
     return if response.nil? || response.empty?
@@ -18,7 +19,7 @@ module RedirectedTo
       url = parser(web_urls)
       break unless url.nil?
     end
-    url
+    unescape_html(url)
   end
   private
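
The change above passes the detected URL through `unescape_html`, whose implementation is not part of this diff. As a rough sketch of the effect, assuming the helper behaves like the standard library's `CGI.unescapeHTML` and tolerates `nil` (the name `decode_url` is hypothetical):

```ruby
require 'cgi'

# Decode HTML character references in a URL; pass nil through unchanged,
# since no canonical or og:url may have been found.
def decode_url(url)
  return if url.nil?

  CGI.unescapeHTML(url)
end

puts decode_url('https&#x3a;&#x2f;&#x2f;www&#x2e;santanderbank&#x2e;com&#x2f;us&#x2f;personal')
```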

data/lib/parsers/twitter_profile.rb CHANGED

@@ -5,7 +5,7 @@ module TwitterProfile
   def grep_twitter_profile(response)
     return if response.nil? || response.empty?
-    twitter_regex = %r{(?im)(https?:\/\/(?:www\.)?twitter\.com\/(?!(?:share|download|search|home|login|privacy)(?:\?|\/|\b)|(?:hashtag|i|javascripts|statuses|#!|intent)\/|(?:#|'|%))[^"'&\?<>\s\\]+)}
+    twitter_regex = %r{(?im)(https?:\/\/(?:www\.)?twitter\.com\/(?!\{\{)(?!(?:share|download|search|home|login|privacy)(?:\?|\/|\b)|(?:hashtag|i|javascripts|statuses|#!|intent)\/|(?:#|'|%))[^"'&\?<>\s\\]+)}
     response.scan(twitter_regex).flatten.compact.uniq
   end
 end
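
The added `(?!\{\{)` lookahead skips links whose path starts with an unrendered template placeholder. A minimal sketch with a shortened pattern that keeps only that lookahead (`TWITTER_URL` and `twitter_links` are illustrative names, and the other exclusions from the gem's regex are omitted):

```ruby
# Shortened stand-in for the gem's pattern: a twitter.com URL whose path
# does not begin with "{{".
TWITTER_URL = /https?:\/\/(?:www\.)?twitter\.com\/(?!\{\{)[^"'&?<>\s\\]+/i

# Return the unique Twitter URLs found in an HTML fragment.
def twitter_links(html)
  html.scan(TWITTER_URL).uniq
end

html = '<a href="https://twitter.com/{{../user.screen_name}}/status/{{../id_str}}">' \
       '<a href="https://twitter.com/rubygems">'
puts twitter_links(html).inspect
```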

data/lib/parsers/vimeo_profile.rb CHANGED

@@ -5,7 +5,7 @@ module VimeoProfile
   def grep_vimeo_profile(response)
     return if response.nil? || response.empty?
-    vimeo_regex = %r{(?im)(https?:\/\/(?:www\.)?vimeo\.com\/(?!upgrade|features|enterprise|upload|api)\/?[^"'\&\?<>\s]+)}
+    vimeo_regex = %r{(?im)(https?:\/\/(?:www\.)?vimeo\.com\/(?!upgrade|features|enterprise|upload|api)\/?[^"'\\\&\?<>\s]+)}
     response.scan(vimeo_regex).flatten.compact.uniq
   end
 end

data/lib/scraper/scrape_helper.rb CHANGED

@@ -6,38 +6,38 @@
 # @Social Profiles
 # @Contact Details
 module ScrapeHelper
-  def perform_scrape(url, read_timeout, connection_timeout)
-    response = nil
-    request_duration = Benchmark.measure do
-      response = ScrapeRequest.new(url, read_timeout, connection_timeout)
-    end.real
-    retry_count = 0
-    begin
-      scrape_data = nil
-      scrape_duration = Benchmark.measure do
-        scrape_data = grep_data(response.body)
-      end.real
-      data_hash = {
-        web_request_duration: request_duration,
-        response_scrape_duraton: scrape_duration,
-        scrape_data: scrape_data
-      }
-    rescue ArgumentError => e
-      retry_count += 1
-      raise WebScraper::ParserError, e.message if retry_count > 1
-      response = response.encode('UTF-16be', invalid: :replace, replace: '?')
-      response = response.encode('UTF-8')
-      retry
-    rescue Encoding::CompatibilityError => e
-      raise WebScraper::ParserError, e.message
+  def perform_scrape(url, read_timeout, open_timeout)
+    timeout_in_sec = scraper_timeout(read_timeout, open_timeout)
+    Timeout::timeout(timeout_in_sec) do
+      response = ScrapeRequest.new(url, read_timeout, open_timeout)
+      retry_count = 0
+      body = response.body
+      begin
+        body = body.tr("\000", '')
+        encoding = body.detect_encoding[:encoding]
+        body = body.encode('UTF-8', encoding)
+        grep_data(body)
+      rescue Encoding::UndefinedConversionError, ArgumentError => e
+        retry_count += 1
+        raise WebScraper::ParserError, e.message if retry_count > 1
+        body = body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+        retry
+      rescue Encoding::CompatibilityError => e
+        raise WebScraper::ParserError, e.message
+      rescue StandardError => e
+        raise WebScraper::RequestError, e.message
+      end
     end
-    data_hash
+  rescue Timeout::Error => e
+    raise WebScraper::TimeoutError, e.message
   end
   private
+  def scraper_timeout(read_timeout, open_timeout)
+    ( read_timeout + open_timeout + 1 )
+  end
   def grep_data(response)
     {
       title: grep_title(response),

data/lib/scraper/scrape_request.rb CHANGED

@@ -1,24 +1,28 @@
 # frozen_string_literal: true
-# @Makes actual scrape request, either raises exception or response
+# @Makes actual scrape request, either raises exception or serves response
 module ScrapeRequest
   extend ScrapeExceptions
   class << self
     def new(url, read_timeout, connection_timeout)
+      params_hash = {
+        method: :get,
+        url: url,
+        read_timeout: read_timeout,
+        open_timeout: connection_timeout,
+        max_redirects: 10,
+        verify_ssl: false
+      }
       begin
-        params_hash = {
-          method: :get,
-          url: url,
-          read_timeout: read_timeout,
-          connection_timeout: connection_timeout,
-          headers: { 'accept-encoding': 'identity' }
-        }
         response = RestClient::Request.execute(params_hash)
         content_type = response.headers[:content_type]
         return response if content_type =~ %r{(?i)text\s*\/\s*html}
         exception_message = "Invalid response format received: #{content_type}"
         raise WebScraper::NonHtmlError, exception_message
+      rescue Zlib::DataError
+        params_hash[:headers] = { 'accept-encoding': 'identity' }
+        retry
       rescue *TIMEOUT_EXCEPTIONS => e
         raise WebScraper::TimeoutError, e.message
       rescue *GENERAL_EXCEPTIONS => e

data/lib/version.rb CHANGED

@@ -2,5 +2,5 @@
 # Holds current version number
 module WebScraper
-  VERSION = '0.1'
+  VERSION = '0.2'
 end

data/spec/lib/parsers/emails_spec.rb CHANGED

@@ -27,6 +27,10 @@ describe 'Emails' do
       <a href="mailto:xxx@yyy.zzz">xxx@yyy.zzz</a>
       <a href="mailto:test@test.com">test@test.com</a>
       <a href="mailto:@example.com">@example.com"</a>
+      <a href="mailto:v@201908240100.css">v@201908240100.css"</a>
+      <a href="mailto:v@201908240100.js">v@201908240100.js"</a>
+      <a href="mailto:ajax-loader@2x.gif">ajax-loader@2x.gif"</a>
+      <a href="mailto:favicon@2x.ico">favicon@2x.ico"</a>
   	HTML
   	expect(dummy_object.grep_emails(html.to_s)).to eq([])
   end

data/spec/lib/parsers/facebook_profile_spec.rb CHANGED

@@ -14,6 +14,7 @@ describe 'FaceBook Profile' do
   it 'should not grep any non profile url' do
     html = <<~HTML
+      <a href="https://www.facebook.com/images/fb_icon_325x325.png" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="http://www.facebook.com/2008/fbml" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="https://www.facebook.com/v2.0/dialog/share" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="https://www.facebook.com/plugins/video.php?href=https%3A%2F%2Fwww.facebook.com%2FHFXMooseheads%2Fvideos" target="_blank" class="sqs-svg-icon--wrapper facebook">

data/spec/lib/parsers/redirected_to_spec.rb CHANGED

@@ -3,205 +3,223 @@ require 'spec_helper'
 describe 'Website Redirected To' do
   class DummyTestClass
-  	include RedirectedTo
-	end
+    include RedirectedTo
+  end
   let(:dummy_object) { DummyTestClass.new }
   it 'should return nil for invalid input' do
-		expect(dummy_object.grep_redirected_to_url(nil)).to be_nil
-		expect(dummy_object.grep_redirected_to_url('')).to be_nil
-	end
+    expect(dummy_object.grep_redirected_to_url(nil)).to be_nil
+    expect(dummy_object.grep_redirected_to_url('')).to be_nil
+  end
   describe 'Website grep from link tag' do
-  	describe 'rel attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link rel="canonical" href=''>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link rel="canonical" href='https://www.apple.com/'>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="" itemprop="current_url">
-	  			<link rel="canonical" href='https://www.apple.com/'
-	  			itemprop="current_url" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  end
-	  describe 'href attribute first' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<link href="" rel="canonical" >
-	  			<link href='' rel="canonical" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link href='https://www.apple.com/' rel="canonical">
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link href="" itemprop="current_url" rel="canonical">
-	  			<link href='https://www.apple.com/' rel="canonical"
-	  			itemprop="current_url" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  end
+    describe 'rel attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link rel="canonical" href=''>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link rel="canonical" href='https://www.apple.com/'>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link rel="canonical" href="" itemprop="current_url">
+          <link rel="canonical" href='https://www.apple.com/'
+          itemprop="current_url" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+    end
+    describe 'href attribute first' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <link href="" rel="canonical" >
+          <link href='' rel="canonical" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link href='https://www.apple.com/' rel="canonical">
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link href="" itemprop="current_url" rel="canonical">
+          <link href='https://www.apple.com/' rel="canonical"
+          itemprop="current_url" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+    end
   end
   describe 'Website grep from organization URL' do
-  	describe 'property attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<meta property="og:url" content="" />
-	  			<meta property="og:url" content='' />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link property="og:url" content="">
-	  			<meta property="og:url" content="https://www.dieppe.ca/fr/index.aspx" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link property="og:url" content="" calss="og-url">
-	  			<meta property="og:url" content='https://www.dieppe.ca/fr/index.aspx'
-	  			class="og-url" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  end
-	  describe 'content attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<meta content="" property="og:url" />
-	  			<meta content='' property="og:url"/>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link content="" property="og:url" >
-	  			<meta content="https://www.dieppe.ca/fr/index.aspx" property="og:url"  />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link content="" calss="og-url" property="og:url">
-	  			<meta content='https://www.dieppe.ca/fr/index.aspx'
-	  			class="og-url" property="og:url" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  end
+    describe 'property attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <meta property="og:url" content="" />
+          <meta property="og:url" content='' />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link property="og:url" content="">
+          <meta property="og:url" content="https://www.dieppe.ca/fr/index.aspx" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link property="og:url" content="" calss="og-url">
+          <meta property="og:url" content='https://www.dieppe.ca/fr/index.aspx'
+          class="og-url" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+    end
+    describe 'content attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <meta content="" property="og:url" />
+          <meta content='' property="og:url"/>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link content="" property="og:url" >
+          <meta content="https://www.dieppe.ca/fr/index.aspx" property="og:url"  />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link content="" calss="og-url" property="og:url">
+          <meta content='https://www.dieppe.ca/fr/index.aspx'
+          class="og-url" property="og:url" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+    end
   end
   describe 'grep website' do
-  	it 'it should return nil when link or og:url is absent' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to be_nil
-  	end
-  	it 'should grep one of canonical or og:url' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			   	<link rel="canonical" href="">
-			   	<meta property="og:url" content="" />
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			    <link rel="canonical" href="http://techmologics.com/">
-			    <meta property="og:url" content="http://techmologics.com/" />
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to eq('http://techmologics.com/')
-  	end
-  		it 'should grep one of canonical or og:url whatever it\'s position' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			   	<link  href="" rel="canonical">
-			   	<meta  content="" property="og:url"/>
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			    <link href="http://techmologics.com/" rel="canonical" class="canonical">
-			    <meta content="http://techmologics.com/" property="og:url"/>
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to eq('http://techmologics.com/')
-  	end
+    it 'it should return nil when link or og:url is absent' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to be_nil
+    end
+    it 'should grep one of canonical or og:url' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <link rel="canonical" href="">
+          <meta property="og:url" content="" />
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+          <link rel="canonical" href="http://techmologics.com/">
+          <meta property="og:url" content="http://techmologics.com/" />
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('http://techmologics.com/')
+    end
+    it 'should grep one of canonical or og:url whatever it\'s position' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <link  href="" rel="canonical">
+          <meta  content="" property="og:url"/>
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+          <link href="http://techmologics.com/" rel="canonical" class="canonical">
+          <meta content="http://techmologics.com/" property="og:url"/>
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('http://techmologics.com/')
+    end
+    it 'should decode html entities in the redirected_to url' do
+      html = <<~HTML
+        <meta content="https&#x3a;&#x2f;&#x2f;www&#x2e;santanderbank&#x2e;com&#x2f;us&#x2f;personal" property="og:url"/>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('https://www.santanderbank.com/us/personal')
+    end
   end
-end
+end
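
Note on the redirected_to specs above: they pin down three behaviours. Empty canonical/og:url values are skipped, attribute order inside the tag does not matter, and HTML entities in the URL are decoded. A minimal sketch of a parser that satisfies those cases follows. `RedirectedToSketch` and its regexes are assumptions made for illustration and are not the gem's actual implementation (lib/parsers/redirected_to.rb is not shown in this section).

```ruby
require 'cgi'

# Hypothetical stand-in for the gem's parser; names and regexes are assumptions.
module RedirectedToSketch
  # A <link> or <meta> tag carrying rel="canonical" or property="og:url",
  # with the attributes in any order.
  TAG = /<(?:link|meta)\b[^>]*(?:rel=["']canonical["']|property=["']og:url["'])[^>]*>/i
  # The URL lives in href="..." (link) or content="..." (meta).
  URL_ATTR = /\b(?:href|content)=["']([^"']*)["']/i

  def self.grep_redirected_to_url(html)
    html.scan(TAG).each do |tag|
      url = tag[URL_ATTR, 1].to_s.strip
      next if url.empty? # skip placeholders such as href=""

      # Decode entities such as &#x3a; (":") and &#x2f; ("/").
      return CGI.unescapeHTML(url)
    end
    nil # no usable canonical or og:url tag found
  end
end
```

The first non-empty value wins, so a page with an empty canonical link followed by a populated one still yields the populated URL.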

data/spec/lib/parsers/twitter_profile_spec.rb CHANGED

@@ -31,6 +31,7 @@ describe 'Twitter Profile' do
       <a href=" http://twitter.com/share/" target="_blank">
       <a href="https://twitter.com/#!/Farmer_Brothers" target="_blank">
       <a href="http://twitter.com/javascripts/blogger.js" target="_blank">
+      <a href="https://twitter.com/{{../user.screen_name}}/status/{{../id_str}}" target="_blank">
     HTML
     expect(dummy_object.grep_twitter_profile(html.to_s)).to eq([])
   end

data/spec/lib/parsers/vimeo_profile_spec.rb CHANGED

@@ -30,13 +30,15 @@ describe 'Vimeo Profile' do
       <a href="https://vimeo.com/channels/332103" target="_blank">
       <a href="https://vimeo.com/talech" target="_blank">
       <a href="https://vimeo.com/292173295/fdb8634a35/" target="_blank">
+      <a href="https://vimeo.com/337614648\\" target="_blank">
     HTML
     vimeo_profiles = dummy_object.grep_vimeo_profile(html.to_s)
     expected_profiles = [
-      'https://vimeo.com/107578087',
-      'https://vimeo.com/channels/332103',
-      'https://vimeo.com/talech',
-      'https://vimeo.com/292173295/fdb8634a35/',
+      'https://vimeo.com/107578087',
+      'https://vimeo.com/channels/332103',
+      'https://vimeo.com/talech',
+      'https://vimeo.com/292173295/fdb8634a35/',
+      'https://vimeo.com/337614648'
     ]
     expect(vimeo_profiles).to eq(expected_profiles)
   end
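
Note on the Vimeo spec change above: the new fixture ends in a stray backslash, and the expected result is the URL without it. One way to get that behaviour is sketched below. `vimeo_profiles_sketch` and `VIMEO_URL_SKETCH` are hypothetical names, and the gem's real `grep_vimeo_profile` may work differently.

```ruby
# Hypothetical helper; the gem's real grep_vimeo_profile may differ.
VIMEO_URL_SKETCH = %r{https?://(?:www\.)?vimeo\.com/[^"'\s<>]+}i

def vimeo_profiles_sketch(html)
  html.scan(VIMEO_URL_SKETCH)
      .map { |url| url.sub(/\\+\z/, '') } # drop trailing backslashes only
      .uniq
end
```

Trailing forward slashes are left alone, so a URL such as https://vimeo.com/292173295/fdb8634a35/ comes back unchanged.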

metadata CHANGED

@@ -1,75 +1,69 @@
 --- !ruby/object:Gem::Specification
 name: brilliant_web_scraper
 version: !ruby/object:Gem::Version
-  version: '0.1'
+  version: '0.2'
 platform: ruby
 authors:
 - Kotu Bhaskara Rao
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-08-11 00:00:00.000000000 Z
+date: 2019-08-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: nesty
+  name: charlock_holmes
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 0.7.6
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 0.7.6
 - !ruby/object:Gem::Dependency
-  name: rest-client
+  name: nesty
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '1.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 2.0.2
+        version: 1.0.1
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '1.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 2.0.2
+        version: 1.0.1
 - !ruby/object:Gem::Dependency
-  name: nesty
+  name: rest-client
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
+        version: '2.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.0.1
-  type: :development
+        version: 2.0.2
+  type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
+        version: '2.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 2.0.2
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
@@ -84,26 +78,6 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.12.2
-- !ruby/object:Gem::Dependency
-  name: rest-client
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.2
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.2
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
@@ -166,16 +140,17 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '2.1'
-description: Scrapes data such as description, social profiles, contact details
+description: A decent web scraping gem.Scrapes website's title, description,social
+  profiles such as linkedin, facebook, twitter, instgram, vimeo,pinterest, youtube
+  channel and contact details such as emails, phone numbers.
 email: bkotu6717@gmail.com
 executables: []
 extensions: []
 extra_rdoc_files: []
 files:
 - Gemfile
+- Gemfile.lock
 - README.md
-- brilliant_web_scraper-1.0.0.gem
-- brilliant_web_scraper-1.0.gem
 - brilliant_web_scraper.gemspec
 - lib/brilliant_web_scraper.rb
 - lib/parsers/description_helper.rb
@@ -246,5 +221,5 @@ rubyforge_project:
 rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: A decent web scraping ruby library!
+summary: A decent web scraping ruby gem!
 test_files: []
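
Note on the dependency constraints in the metadata above: the sketch below uses RubyGems' own `Gem::Requirement` to show which versions the listed constraints accept. The constant names and the `allowed?` helper are introduced here for illustration only, and the version numbers being checked are arbitrary examples.

```ruby
require 'rubygems'

# Constraints copied from the 0.2 metadata.
CHARLOCK_REQ = Gem::Requirement.new('~> 0.7.6')          # >= 0.7.6 and < 0.8
NESTY_REQ    = Gem::Requirement.new('~> 1.0', '>= 1.0.1') # >= 1.0.1 and < 2.0

# Hypothetical helper: does this version string satisfy the requirement?
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts allowed?(CHARLOCK_REQ, '0.7.9') # true
puts allowed?(CHARLOCK_REQ, '0.8.0') # false
puts allowed?(NESTY_REQ, '1.0.0')    # false
puts allowed?(NESTY_REQ, '1.4.2')    # true
puts allowed?(NESTY_REQ, '2.0.0')    # false
```

The rest-client constraint (`~> 2.0`, `>= 2.0.2`) follows the same pattern as nesty.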

data/brilliant_web_scraper-1.0.0.gem DELETED

Binary file

data/brilliant_web_scraper-1.0.gem DELETED

Binary file