brilliant_web_scraper 0.1 → 0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

checksums.yaml +4 -4
data/Gemfile.lock +90 -0
data/README.md +6 -9
data/brilliant_web_scraper.gemspec +12 -7
data/lib/brilliant_web_scraper.rb +2 -1
data/lib/parsers/description_helper.rb +1 -1
data/lib/parsers/emails.rb +2 -2
data/lib/parsers/facebook_profile.rb +1 -1
data/lib/parsers/redirected_to.rb +2 -1
data/lib/parsers/twitter_profile.rb +1 -1
data/lib/parsers/vimeo_profile.rb +1 -1
data/lib/scraper/scrape_helper.rb +27 -27
data/lib/scraper/scrape_request.rb +12 -8
data/lib/version.rb +1 -1
data/spec/lib/parsers/emails_spec.rb +4 -0
data/spec/lib/parsers/facebook_profile_spec.rb +1 -0
data/spec/lib/parsers/redirected_to_spec.rb +209 -191
data/spec/lib/parsers/twitter_profile_spec.rb +1 -0
data/spec/lib/parsers/vimeo_profile_spec.rb +6 -4
metadata +21 -46
data/brilliant_web_scraper-1.0.0.gem +0 -0
data/brilliant_web_scraper-1.0.gem +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: efbe9d1a0688fd10e200d972b56c3e2ec86203f1
-  data.tar.gz: 20cce1c52197f11dcea73813831bb4172829ddaa
+  metadata.gz: 9ad5219a19dcfc311bed756a83d82fd3758bd71a
+  data.tar.gz: c085eb2a96b8eb503cd44edc87821823cf0ad965
 SHA512:
-  metadata.gz: 638c34f7efbc963613f4bb841abbf183bf134ee3197bebc99f9403ba7864befd44243f53092a9aa3ba7ea58314475b61d6671816e8e3f8ef4deb7f49b6f0ef52
-  data.tar.gz: f91110f69e8228de408aa0c35050fe6137fac22bdb93ff86be3c70d380e1cf57534f50f78e5d585cb63e421ff0fc51aa04089d38e8a86ab3a6ca305659dc909a
+  metadata.gz: 4a5c3c9dd78f3e123b0c279a04c2d0d262c58aef8d0086668c37aea8e52b6ec8da98d6d242debddf600922dcfed3d377bd159e62fc9bb1e7390b1e62e881eb2b
+  data.tar.gz: de051e60d7b90bde7984871d1b39e52f46be6366910591f0ec31c434a41037a9cc1274989dd50755cbf6e03d0e9f498a3271aad173b12a8ea3fd6a578eb8fbc9

data/Gemfile.lock ADDED

@@ -0,0 +1,90 @@
+PATH
+  remote: .
+  specs:
+    brilliant_web_scraper (0.2)
+      charlock_holmes (~> 0.7.6)
+      nesty (~> 1.0, >= 1.0.1)
+      rest-client (~> 2.0, >= 2.0.2)
+GEM
+  remote: http://rubygems.org/
+  specs:
+    addressable (2.6.0)
+      public_suffix (>= 2.0.2, < 4.0)
+    ast (2.4.0)
+    charlock_holmes (0.7.6)
+    coderay (1.1.2)
+    crack (0.4.3)
+      safe_yaml (~> 1.0.0)
+    diff-lcs (1.3)
+    domain_name (0.5.20190701)
+      unf (>= 0.0.5, < 1.0.0)
+    hashdiff (1.0.0)
+    http-accept (1.7.0)
+    http-cookie (1.0.3)
+      domain_name (~> 0.5)
+    jaro_winkler (1.5.3)
+    method_source (0.9.2)
+    mime-types (3.2.2)
+      mime-types-data (~> 3.2015)
+    mime-types-data (3.2019.0331)
+    nesty (1.0.2)
+    netrc (0.11.0)
+    parallel (1.17.0)
+    parser (2.6.3.0)
+      ast (~> 2.4.0)
+    pry (0.12.2)
+      coderay (~> 1.1.0)
+      method_source (~> 0.9.0)
+    public_suffix (3.1.1)
+    rainbow (3.0.0)
+    rest-client (2.1.0)
+      http-accept (>= 1.7.0, < 2.0)
+      http-cookie (>= 1.0.2, < 2.0)
+      mime-types (>= 1.16, < 4.0)
+      netrc (~> 0.8)
+    rspec (3.8.0)
+      rspec-core (~> 3.8.0)
+      rspec-expectations (~> 3.8.0)
+      rspec-mocks (~> 3.8.0)
+    rspec-core (3.8.2)
+      rspec-support (~> 3.8.0)
+    rspec-expectations (3.8.4)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.8.0)
+    rspec-mocks (3.8.1)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.8.0)
+    rspec-support (3.8.2)
+    rubocop (0.73.0)
+      jaro_winkler (~> 1.5.1)
+      parallel (~> 1.10)
+      parser (>= 2.6)
+      rainbow (>= 2.2.2, < 4.0)
+      ruby-progressbar (~> 1.7)
+      unicode-display_width (>= 1.4.0, < 1.7)
+    ruby-progressbar (1.10.1)
+    safe_yaml (1.0.5)
+    unf (0.1.4)
+      unf_ext
+    unf_ext (0.0.7.6)
+    unicode-display_width (1.6.0)
+    vcr (3.0.3)
+    webmock (2.3.2)
+      addressable (>= 2.3.6)
+      crack (>= 0.3.2)
+      hashdiff
+PLATFORMS
+  ruby
+DEPENDENCIES
+  brilliant_web_scraper!
+  pry (~> 0.12.2)
+  rspec (~> 3.5)
+  rubocop (~> 0.73.0)
+  vcr (~> 3.0, >= 3.0.1)
+  webmock (~> 2.1)
+BUNDLED WITH
+   1.16.6

data/README.md CHANGED

@@ -1,14 +1,11 @@
-# WebScraper [![Build Status](https://api.travis-ci.com/bkotu6717/brilliant_web_scraper.svg)](https://travis-ci.com/bkotu6717/brilliant_web_scraper)
+# BrilliantWebScraper [![Build Status](https://api.travis-ci.com/bkotu6717/brilliant_web_scraper.svg)](https://travis-ci.com/bkotu6717/brilliant_web_scraper)[![Maintainability](https://api.codeclimate.com/v1/badges/15a8a6e117f11bd94376/maintainability)](https://codeclimate.com/github/bkotu6717/brilliant_web_scraper/maintainability)
-A decent web scraping gem. Scrapes website description, social profiles, contact details, youtube channels.
-It accepts a URL or Domain as input and gets it's title, descrptios, social profiles, YouTube channels and it's current URL if got redirected.
+A decent web scraping gem. Scrapes website title, description, social profiles such as linkedin, facebook, twitter, instgram, vimeo, pinterest, youtube channel and contact details such as emails, phone numbers.
 ## See it in action!
-You can try WebScraper live at this little demo: [https://brilliantweb-scraper-demo.herokuapp.com](https://brilliant-web-scraper-demo.herokuapp.com)
+You can try BrillaintWebScraper live at this little demo: [https://brilliant-web-scraper-demo.herokuapp.com](https://brilliant-web-scraper-demo.herokuapp.com)
 ## Installation
@@ -21,11 +18,11 @@ gem 'brilliant_web_scraper'
 ## Usage
-Initialize a BrilliantWebScraper instance for an URL, like this:
+Initialize a BrilliantWebScraper instance for an URL, like this with optional timeouts, default connection_timeout and read_timeouts are 10s, 10s respectively:
 ```ruby
 require 'brilliant_web_scraper'
+results = BrilliantWebScraper.new('http://pwc.com', 5, 5)
 results = BrilliantWebScraper.new('http://pwc.com')
 ```
-If you don't include the scheme on the URL, it is fine:

data/brilliant_web_scraper.gemspec CHANGED

@@ -6,23 +6,28 @@ Gem::Specification.new do |s|
   s.name =  'brilliant_web_scraper'
   s.version = WebScraper::VERSION
   s.licenses = ['Nonstandard']
-  s.summary = 'A decent web scraping ruby library!'
-  s.description = 'Scrapes data such as description, social profiles, contact details'
+  s.summary = 'A decent web scraping ruby gem!'
+  s.description = 'A decent web scraping gem.'\
+                  'Scrapes website\'s title, description,'\
+                  'social profiles such as linkedin, '\
+                  'facebook, twitter, instgram, vimeo,'\
+                  'pinterest, youtube channel and'\
+                  ' contact details such as emails, phone numbers.'
   s.authors = ['Kotu Bhaskara Rao']
   s.email = 'bkotu6717@gmail.com'
   s.require_paths = ['lib']
   s.homepage = 'https://github.com/bkotu6717/brilliant_web_scraper'
   s.files = Dir['**/*'].keep_if { |file|
-    file != "brilliant_web_scraper-#{WebScraper::VERSION}.gem" && File.file?(file)
+    file != "brilliant_web_scraper-#{WebScraper::VERSION}.gem" &&
+    File.file?(file)
   }
   s.required_ruby_version = '>= 2.3.0'
-  s.add_dependency 'nesty', '~> 1.0', '>= 1.0.1'
-  s.add_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
+  s.add_runtime_dependency 'charlock_holmes', '~> 0.7.6'
+  s.add_runtime_dependency 'nesty', '~> 1.0', '>= 1.0.1'
+  s.add_runtime_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
-  s.add_development_dependency 'nesty', '~> 1.0', '>= 1.0.1'
   s.add_development_dependency 'pry', '~> 0.12.2'
-  s.add_development_dependency 'rest-client', '~> 2.0', '>= 2.0.2'
   s.add_development_dependency 'rspec', '~> 3.5'
   s.add_development_dependency 'rubocop', '~> 0.73.0'
   s.add_development_dependency 'vcr', '~> 3.0', '>= 3.0.1'
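
Note on the new `s.description`: it is assembled from adjacent string literals joined by trailing backslashes, and Ruby concatenates such literals verbatim, without inserting a separator. A minimal sketch using a shortened copy of the fragments above (the variable name `description` is only for illustration) shows where words run together when a fragment does not end in a space:

```ruby
# Adjacent string literals continued with a trailing backslash are joined
# exactly as written; Ruby adds no space between the fragments.
description = 'A decent web scraping gem.'\
              'Scrapes website\'s title, description,'\
              'social profiles such as linkedin, '\
              'facebook, twitter'

puts description
```

Only the third fragment ends with a space, so the other joins ("gem.Scrapes", "description,social") appear without one in the published gem description.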

data/lib/brilliant_web_scraper.rb CHANGED

@@ -2,7 +2,8 @@
 require 'rest-client'
 require 'cgi'
-require 'benchmark'
+require 'charlock_holmes/string'
+require 'timeout'
 current_directory = File.dirname(__FILE__) + '/scraper'
 require File.expand_path(File.join(current_directory, 'errors'))

data/lib/parsers/description_helper.rb CHANGED

@@ -21,7 +21,7 @@ module DescriptionHelper
   def parse_description(descriptions)
     return if descriptions.nil? || descriptions.empty?
-    descriptions = descriptions.reject { |x| x.nil? || x.empty? }
+    descriptions = descriptions.reject { |x| x.nil? || x.empty? || x =~ /^\s*$/}
     descriptions = descriptions.map { |x| unescape_html(x) }
     descriptions.find { |x| (x !~ /^\s*[|-]?\s*$/) }
   end
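
A small sketch of what the added `x =~ /^\s*$/` condition changes, using made-up sample descriptions and leaving out the `unescape_html` step (that helper is not shown in this diff). It also illustrates one property of Ruby's anchors worth knowing here: `^` and `$` match at line boundaries, so a multi-line string containing an empty line also matches, whereas `\A...\z` anchors to the whole string. The stricter pattern is shown only for comparison, not as the gem's behavior.

```ruby
# Sample inputs are invented for illustration.
descriptions = ['', nil, "  \n ", ' | ', 'Acme builds rockets']

# The 0.2 filter: drop nil, empty, and whitespace-only entries.
filtered = descriptions.reject { |x| x.nil? || x.empty? || x =~ /^\s*$/ }

# The unchanged final step then skips separator-only strings such as ' | '.
first_useful = filtered.find { |x| x !~ /^\s*[|-]?\s*$/ }
puts first_useful

# Line anchors versus string anchors on a description with an empty line.
multi_line = "Line one\n\nLine two"
puts multi_line.match?(/^\s*$/)   # line-anchored pattern from the diff
puts multi_line.match?(/\A\s*\z/) # string-anchored variant, for comparison
```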

data/lib/parsers/emails.rb CHANGED

@@ -10,7 +10,7 @@ module Emails
     return if response.nil? || response.empty?
     first_regex = /(?im)mailto:\s*([^\?"',\\<>\s]+)/
-    second_regex = %r{(?im)["'\s><\/]*([\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg)[A-Z]{2,3})["'\s><]}
+    second_regex = %r{(?im)["'\s><\/]*([\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3})["'\s><]}
     first_set = response.scan(first_regex).flatten.compact
     first_set = get_processed_emails(first_set)
     second_set = response.scan(second_regex).flatten.compact
@@ -24,7 +24,7 @@ module Emails
     unescaped_emails = email_set.map { |email| unescape_html(email) }
     return [] if unescaped_emails.empty?
-    email_match_regex = /[\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg)[A-Z]{2,3}/im
+    email_match_regex = /[\w._%-]+@(?!(?:example|e?mail|domain|company|your(?:domain|company|email)|address|emailad?dress|yyy|test)\.)[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3}/im
     unescaped_emails.select { |data| data =~ email_match_regex }
   end
 end
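
The effect of extending the extension lookahead can be sketched with a simplified pattern. The regexes below keep only the local part, the domain, and the `(?!...)` extension check from the diff; the placeholder-domain lookahead is omitted for brevity, and the constant names are illustrative. The asset-style inputs come from the new spec lines; `sales@acme.com` is an invented address.

```ruby
# Simplified versions of the 0.1 and 0.2 patterns (placeholder-domain
# lookahead omitted; only the file-extension lookahead differs).
OLD_PATTERN = /[\w._%-]+@[\w._%-]+\.(?!png|jpe?g|tif|svg)[A-Z]{2,3}/i
NEW_PATTERN = /[\w._%-]+@[\w._%-]+\.(?!png|jpe?g|tif|svg|css|js|ico|gif)[A-Z]{2,3}/i

candidates = [
  'v@201908240100.css',  # cache-busted stylesheet name
  'ajax-loader@2x.gif',  # retina image asset
  'favicon@2x.ico',      # retina favicon
  'sales@acme.com'       # an actual address (invented)
]

old_hits = candidates.select { |c| c =~ OLD_PATTERN }
new_hits = candidates.select { |c| c =~ NEW_PATTERN }

puts "0.1 pattern accepts: #{old_hits.inspect}"
puts "0.2 pattern accepts: #{new_hits.inspect}"
```

With the shorter lookahead all four strings pass; with the extended one only the real address remains.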

data/lib/parsers/facebook_profile.rb CHANGED

@@ -5,7 +5,7 @@ module FacebookProfile
   def grep_facebook_profile(response)
     return if response.nil? || response.empty?
-    facebook_url_regex = /(https?:\/\/(?:www\.)?(?:facebook|fb)\.com\/(?!tr\?|(?:[\/\w\d]*(?:photo|sharer?|like(?:box)?|offsite_event|plugins|permalink|home|search))\.php|\d+\/fbml|(?:dialog|hashtag|plugins|sharer|login|recover|security|help|v\d+\.\d+)\/|(?:privacy|#|your-profile|yourfacebookpage)\/?|home\?)[^"'<>\&\s]+)/im
+    facebook_url_regex = /(https?:\/\/(?:www\.)?(?:facebook|fb)\.com\/(?!tr\?|(?:[\/\w\d]*(?:photo|sharer?|like(?:box)?|offsite_event|plugins|permalink|home|search))\.php|\d+\/fbml|(?:dialog|hashtag|plugins|sharer|login|recover|security|help|images|v\d+\.\d+)\/|(?:privacy|#|your-profile|yourfacebookpage)\/?|home\?)[^"'<>\&\s]+)/im
     response.scan(facebook_url_regex).flatten.compact.uniq
   end
 end

data/lib/parsers/redirected_to.rb CHANGED

@@ -2,6 +2,7 @@
 # Fetch latest url of the given website
 module RedirectedTo
+  include UnescapeHtmlHelper
   def grep_redirected_to_url(response)
     return if response.nil? || response.empty?
@@ -18,7 +19,7 @@ module RedirectedTo
       url = parser(web_urls)
       break unless url.nil?
     end
-    url
+    unescape_html(url)
   end
   private
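
The gem's `unescape_html` comes from `UnescapeHtmlHelper`, whose body is not part of this diff. As a rough stand-in, the standard library's `CGI.unescapeHTML` decodes the same kind of numeric entities that the new spec case exercises. The method name `decode_url` and its nil guard are assumptions made for this sketch only.

```ruby
require 'cgi'

# Hypothetical helper: decode HTML entities in a URL, passing nil through
# untouched (the parser returns nil when no URL was found).
def decode_url(url)
  return if url.nil?

  CGI.unescapeHTML(url)
end

encoded = 'https&#x3a;&#x2f;&#x2f;www&#x2e;santanderbank&#x2e;com&#x2f;us&#x2f;personal'
puts decode_url(encoded)
```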

data/lib/parsers/twitter_profile.rb CHANGED

@@ -5,7 +5,7 @@ module TwitterProfile
   def grep_twitter_profile(response)
     return if response.nil? || response.empty?
-    twitter_regex = %r{(?im)(https?:\/\/(?:www\.)?twitter\.com\/(?!(?:share|download|search|home|login|privacy)(?:\?|\/|\b)|(?:hashtag|i|javascripts|statuses|#!|intent)\/|(?:#|'|%))[^"'&\?<>\s\\]+)}
+    twitter_regex = %r{(?im)(https?:\/\/(?:www\.)?twitter\.com\/(?!\{\{)(?!(?:share|download|search|home|login|privacy)(?:\?|\/|\b)|(?:hashtag|i|javascripts|statuses|#!|intent)\/|(?:#|'|%))[^"'&\?<>\s\\]+)}
     response.scan(twitter_regex).flatten.compact.uniq
   end
 end
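
The added `(?!\{\{)` lookahead rejects links whose path starts with a template placeholder. A reduced sketch, keeping only the host part, that lookahead, and the trailing character class (the other exclusions from the real pattern are left out, and `TWITTER_SKETCH` is an illustrative name):

```ruby
# Reduced pattern: host, the new "{{" lookahead, then the profile path.
TWITTER_SKETCH = /https?:\/\/(?:www\.)?twitter\.com\/(?!\{\{)[^"'&?<>\s\\]+/i

html = <<~HTML
  <a href="https://twitter.com/{{../user.screen_name}}/status/{{../id_str}}" target="_blank">
  <a href="https://twitter.com/Farmer_Brothers" target="_blank">
HTML

profiles = html.scan(TWITTER_SKETCH).uniq
puts profiles.inspect
```

The unrendered Handlebars-style link is skipped, while the ordinary profile URL is still captured.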

data/lib/parsers/vimeo_profile.rb CHANGED

@@ -5,7 +5,7 @@ module VimeoProfile
   def grep_vimeo_profile(response)
     return if response.nil? || response.empty?
-    vimeo_regex = %r{(?im)(https?:\/\/(?:www\.)?vimeo\.com\/(?!upgrade|features|enterprise|upload|api)\/?[^"'\&\?<>\s]+)}
+    vimeo_regex = %r{(?im)(https?:\/\/(?:www\.)?vimeo\.com\/(?!upgrade|features|enterprise|upload|api)\/?[^"'\\\&\?<>\s]+)}
     response.scan(vimeo_regex).flatten.compact.uniq
   end
 end

data/lib/scraper/scrape_helper.rb CHANGED

@@ -6,38 +6,38 @@
 # @Social Profiles
 # @Contact Details
 module ScrapeHelper
-  def perform_scrape(url, read_timeout, connection_timeout)
-    response = nil
-    request_duration = Benchmark.measure do
-      response = ScrapeRequest.new(url, read_timeout, connection_timeout)
-    end.real
-    retry_count = 0
-    begin
-      scrape_data = nil
-      scrape_duration = Benchmark.measure do
-        scrape_data = grep_data(response.body)
-      end.real
-      data_hash = {
-        web_request_duration: request_duration,
-        response_scrape_duraton: scrape_duration,
-        scrape_data: scrape_data
-      }
-    rescue ArgumentError => e
-      retry_count += 1
-      raise WebScraper::ParserError, e.message if retry_count > 1
-      response = response.encode('UTF-16be', invalid: :replace, replace: '?')
-      response = response.encode('UTF-8')
-      retry
-    rescue Encoding::CompatibilityError => e
-      raise WebScraper::ParserError, e.message
+  def perform_scrape(url, read_timeout, open_timeout)
+    timeout_in_sec = scraper_timeout(read_timeout, open_timeout)
+    Timeout::timeout(timeout_in_sec) do
+      response = ScrapeRequest.new(url, read_timeout, open_timeout)
+      retry_count = 0
+      body = response.body
+      begin
+        body = body.tr("\000", '')
+        encoding = body.detect_encoding[:encoding]
+        body = body.encode('UTF-8', encoding)
+        grep_data(body)
+      rescue Encoding::UndefinedConversionError, ArgumentError => e
+        retry_count += 1
+        raise WebScraper::ParserError, e.message if retry_count > 1
+        body = body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+        retry
+      rescue Encoding::CompatibilityError => e
+        raise WebScraper::ParserError, e.message
+      rescue StandardError => e
+        raise WebScraper::RequestError, e.message
+      end
     end
-    data_hash
+  rescue Timeout::Error => e
+    raise WebScraper::TimeoutError, e.message
   end
   private
+  def scraper_timeout(read_timeout, open_timeout)
+    ( read_timeout + open_timeout + 1 )
+  end
   def grep_data(response)
     {
       title: grep_title(response),
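
The overall time budget introduced in `perform_scrape` can be sketched in isolation. The arithmetic matches `scraper_timeout` in the hunk above; `SketchTimeoutError`, `overall_budget`, and `with_budget` are hypothetical names standing in for `WebScraper::TimeoutError` and the gem's own methods, and the block passed in replaces the request and parsing work.

```ruby
require 'timeout'

# Hypothetical stand-in for WebScraper::TimeoutError.
class SketchTimeoutError < StandardError; end

# Same arithmetic as scraper_timeout: both timeouts plus one second of slack.
def overall_budget(read_timeout, open_timeout)
  read_timeout + open_timeout + 1
end

# Run the block under the overall budget and translate Timeout::Error
# into the library-level error, as the method-level rescue in the diff does.
def with_budget(read_timeout, open_timeout)
  Timeout.timeout(overall_budget(read_timeout, open_timeout)) { yield }
rescue Timeout::Error => e
  raise SketchTimeoutError, e.message
end

puts overall_budget(10, 10) # => 21
puts with_budget(1, 1) { :finished }
```

With the README defaults of 10 s each, the whole scrape (request plus parsing) is capped at 21 s.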

data/lib/scraper/scrape_request.rb CHANGED

@@ -1,24 +1,28 @@
 # frozen_string_literal: true
-# @Makes actual scrape request, either raises exception or response
+# @Makes actual scrape request, either raises exception or serves response
 module ScrapeRequest
   extend ScrapeExceptions
   class << self
     def new(url, read_timeout, connection_timeout)
+      params_hash = {
+        method: :get,
+        url: url,
+        read_timeout: read_timeout,
+        open_timeout: connection_timeout,
+        max_redirects: 10,
+        verify_ssl: false
+      }
       begin
-        params_hash = {
-          method: :get,
-          url: url,
-          read_timeout: read_timeout,
-          connection_timeout: connection_timeout,
-          headers: { 'accept-encoding': 'identity' }
-        }
         response = RestClient::Request.execute(params_hash)
         content_type = response.headers[:content_type]
         return response if content_type =~ %r{(?i)text\s*\/\s*html}
         exception_message = "Invalid response format received: #{content_type}"
         raise WebScraper::NonHtmlError, exception_message
+      rescue Zlib::DataError
+        params_hash[:headers] = { 'accept-encoding': 'identity' }
+        retry
       rescue *TIMEOUT_EXCEPTIONS => e
         raise WebScraper::TimeoutError, e.message
       rescue *GENERAL_EXCEPTIONS => e

data/lib/version.rb CHANGED

@@ -2,5 +2,5 @@
 # Holds current version number
 module WebScraper
-  VERSION = '0.1'
+  VERSION = '0.2'
 end

data/spec/lib/parsers/emails_spec.rb CHANGED

@@ -27,6 +27,10 @@ describe 'Emails' do
       <a href="mailto:xxx@yyy.zzz">xxx@yyy.zzz</a>
       <a href="mailto:test@test.com">test@test.com</a>
       <a href="mailto:@example.com">@example.com"</a>
+      <a href="mailto:v@201908240100.css">v@201908240100.css"</a>
+      <a href="mailto:v@201908240100.js">v@201908240100.js"</a>
+      <a href="mailto:ajax-loader@2x.gif">ajax-loader@2x.gif"</a>
+      <a href="mailto:favicon@2x.ico">favicon@2x.ico"</a>
   	HTML
   	expect(dummy_object.grep_emails(html.to_s)).to eq([])
   end

data/spec/lib/parsers/facebook_profile_spec.rb CHANGED

@@ -14,6 +14,7 @@ describe 'FaceBook Profile' do
   it 'should not grep any non profile url' do
     html = <<~HTML
+      <a href="https://www.facebook.com/images/fb_icon_325x325.png" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="http://www.facebook.com/2008/fbml" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="https://www.facebook.com/v2.0/dialog/share" target="_blank" class="sqs-svg-icon--wrapper facebook">
       <a href="https://www.facebook.com/plugins/video.php?href=https%3A%2F%2Fwww.facebook.com%2FHFXMooseheads%2Fvideos" target="_blank" class="sqs-svg-icon--wrapper facebook">

data/spec/lib/parsers/redirected_to_spec.rb CHANGED

@@ -3,205 +3,223 @@ require 'spec_helper'
 describe 'Website Redirected To' do
   class DummyTestClass
-  	include RedirectedTo
-	end
+    include RedirectedTo
+  end
   let(:dummy_object) { DummyTestClass.new }
   it 'should return nil for invalid input' do
-		expect(dummy_object.grep_redirected_to_url(nil)).to be_nil
-		expect(dummy_object.grep_redirected_to_url('')).to be_nil
-	end
+    expect(dummy_object.grep_redirected_to_url(nil)).to be_nil
+    expect(dummy_object.grep_redirected_to_url('')).to be_nil
+  end
   describe 'Website grep from link tag' do
-  	describe 'rel attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link rel="canonical" href=''>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link rel="canonical" href='https://www.apple.com/'>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="" itemprop="current_url">
-	  			<link rel="canonical" href='https://www.apple.com/'
-	  			itemprop="current_url" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  end
-	  describe 'href attribute first' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<link href="" rel="canonical" >
-	  			<link href='' rel="canonical" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link rel="canonical" href="">
-	  			<link href='https://www.apple.com/' rel="canonical">
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link href="" itemprop="current_url" rel="canonical">
-	  			<link href='https://www.apple.com/' rel="canonical"
-	  			itemprop="current_url" >
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.apple.com/')
-	  	end
-	  end
+    describe 'rel attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link rel="canonical" href=''>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link rel="canonical" href='https://www.apple.com/'>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link rel="canonical" href="" itemprop="current_url">
+          <link rel="canonical" href='https://www.apple.com/'
+          itemprop="current_url" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+    end
+    describe 'href attribute first' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <link href="" rel="canonical" >
+          <link href='' rel="canonical" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link rel="canonical" href="">
+          <link href='https://www.apple.com/' rel="canonical">
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link href="" itemprop="current_url" rel="canonical">
+          <link href='https://www.apple.com/' rel="canonical"
+          itemprop="current_url" >
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.apple.com/')
+      end
+    end
   end
   describe 'Website grep from organization URL' do
-  	describe 'property attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<meta property="og:url" content="" />
-	  			<meta property="og:url" content='' />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link property="og:url" content="">
-	  			<meta property="og:url" content="https://www.dieppe.ca/fr/index.aspx" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link property="og:url" content="" calss="og-url">
-	  			<meta property="og:url" content='https://www.dieppe.ca/fr/index.aspx'
-	  			class="og-url" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  end
-	  describe 'content attribute first ' do
-	  	it 'should return nil when canonical url is empty' do
-	  		html = <<~HTML
-	  			<meta content="" property="og:url" />
-	  			<meta content='' property="og:url"/>
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to be_nil
-	  	end
-	  	it 'should grep website' do
-	  		html = <<~HTML
-	  			<link content="" property="og:url" >
-	  			<meta content="https://www.dieppe.ca/fr/index.aspx" property="og:url"  />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  	it 'should grep website even with extra attributes' do
-	  		html = <<~HTML
-	  			<link content="" calss="og-url" property="og:url">
-	  			<meta content='https://www.dieppe.ca/fr/index.aspx'
-	  			class="og-url" property="og:url" />
-	  		HTML
-	  		website = dummy_object.grep_redirected_to_url(html.to_s)
-	  		expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
-	  	end
-	  end
+    describe 'property attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <meta property="og:url" content="" />
+          <meta property="og:url" content='' />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link property="og:url" content="">
+          <meta property="og:url" content="https://www.dieppe.ca/fr/index.aspx" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link property="og:url" content="" calss="og-url">
+          <meta property="og:url" content='https://www.dieppe.ca/fr/index.aspx'
+          class="og-url" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+    end
+    describe 'content attribute first ' do
+      it 'should return nil when canonical url is empty' do
+        html = <<~HTML
+          <meta content="" property="og:url" />
+          <meta content='' property="og:url"/>
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to be_nil
+      end
+      it 'should grep website' do
+        html = <<~HTML
+          <link content="" property="og:url" >
+          <meta content="https://www.dieppe.ca/fr/index.aspx" property="og:url"  />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+      it 'should grep website even with extra attributes' do
+        html = <<~HTML
+          <link content="" calss="og-url" property="og:url">
+          <meta content='https://www.dieppe.ca/fr/index.aspx'
+          class="og-url" property="og:url" />
+        HTML
+        website = dummy_object.grep_redirected_to_url(html.to_s)
+        expect(website).to eq('https://www.dieppe.ca/fr/index.aspx')
+      end
+    end
   end
   describe 'grep website' do
-  	it 'it should return nil when link or og:url is absent' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to be_nil
-  	end
-  	it 'should grep one of canonical or og:url' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			   	<link rel="canonical" href="">
-			   	<meta property="og:url" content="" />
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			    <link rel="canonical" href="http://techmologics.com/">
-			    <meta property="og:url" content="http://techmologics.com/" />
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to eq('http://techmologics.com/')
-  	end
-  		it 'should grep one of canonical or og:url whatever it\'s position' do
-  		html = <<~HTML
-  			<head>
-			    <meta charset="utf-8">
-			    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-			    <meta http-equiv="x-ua-compatible" content="ie=edge">
-			    <title>Techmologic | index</title>
-			   	<link  href="" rel="canonical">
-			   	<meta  content="" property="og:url"/>
-			    <!-- Font Awesome -->
-			    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
-			    <!-- Bootstrap core CSS -->
-			    <link href="css/bootstrap.min.css" rel="stylesheet">
-			    <!-- Material Design Bootstrap -->
-			    <link href="css/mdb.min.css" rel="stylesheet">
-			    <!-- Your custom styles (optional) -->
-			    <link href="css/style.css" rel="stylesheet">
-			    <link href="http://techmologics.com/" rel="canonical" class="canonical">
-			    <meta content="http://techmologics.com/" property="og:url"/>
-			</head>
-  		HTML
-  		website = dummy_object.grep_redirected_to_url(html.to_s)
-  		expect(website).to eq('http://techmologics.com/')
-  	end
+    it 'it should return nil when link or og:url is absent' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to be_nil
+    end
+    it 'should grep one of canonical or og:url' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <link rel="canonical" href="">
+          <meta property="og:url" content="" />
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+          <link rel="canonical" href="http://techmologics.com/">
+          <meta property="og:url" content="http://techmologics.com/" />
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('http://techmologics.com/')
+    end
+    it 'should grep one of canonical or og:url whatever it\'s position' do
+      html = <<~HTML
+        <head>
+          <meta charset="utf-8">
+          <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+          <meta http-equiv="x-ua-compatible" content="ie=edge">
+          <title>Techmologic | index</title>
+          <link  href="" rel="canonical">
+          <meta  content="" property="og:url"/>
+          <!-- Font Awesome -->
+          <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css">
+          <!-- Bootstrap core CSS -->
+          <link href="css/bootstrap.min.css" rel="stylesheet">
+          <!-- Material Design Bootstrap -->
+          <link href="css/mdb.min.css" rel="stylesheet">
+          <!-- Your custom styles (optional) -->
+          <link href="css/style.css" rel="stylesheet">
+          <link href="http://techmologics.com/" rel="canonical" class="canonical">
+          <meta content="http://techmologics.com/" property="og:url"/>
+      </head>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('http://techmologics.com/')
+    end
+    it 'should decode html entities in the redirected_to url' do
+      html = <<~HTML
+        <meta content="https&#x3a;&#x2f;&#x2f;www&#x2e;santanderbank&#x2e;com&#x2f;us&#x2f;personal" property="og:url"/>
+      HTML
+      website = dummy_object.grep_redirected_to_url(html.to_s)
+      expect(website).to eq('https://www.santanderbank.com/us/personal')
+    end
   end
-end
+end
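
Note on the redirected_to specs above: they require that an empty canonical/og:url value is skipped, that attribute order does not matter, and that HTML entities in the URL are decoded. The gem's own parser (data/lib/parsers/redirected_to.rb) is not reproduced in this part of the diff, so the following is only a minimal sketch of one way to satisfy those expectations. The module name `RedirectedToSketch` and the helper `attr_value` are hypothetical and are not part of the gem.

```ruby
require 'cgi'

# Hypothetical stand-in, NOT the gem's implementation.
module RedirectedToSketch
  # Any <link ...> or <meta ...> tag; attributes are inspected afterwards,
  # so their order inside the tag does not matter.
  TAG = /<(?:link|meta)\b[^>]*>/i

  # Returns the quoted value of attribute `name` inside `tag`, or nil.
  def self.attr_value(tag, name)
    tag[/\b#{name}\s*=\s*["']([^"']*)["']/i, 1]
  end

  # Returns the first non-empty canonical or og:url value, entity-decoded,
  # or nil when the document declares neither.
  def self.grep_redirected_to_url(html)
    html.scan(TAG).each do |tag|
      url =
        if attr_value(tag, 'rel').to_s.casecmp?('canonical')
          attr_value(tag, 'href')
        elsif attr_value(tag, 'property').to_s.casecmp?('og:url')
          attr_value(tag, 'content')
        end
      next if url.nil? || url.strip.empty?

      # CGI.unescapeHTML decodes numeric references such as &#x3a; and &#x2f;.
      return CGI.unescapeHTML(url.strip)
    end
    nil
  end
end
```

Scanning whole tags first and reading attributes second keeps the pattern short, at the cost of ignoring unquoted attribute values; a real parser may need to handle those as well.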

data/spec/lib/parsers/twitter_profile_spec.rb CHANGED

@@ -31,6 +31,7 @@ describe 'Twitter Profile' do
       <a href=" http://twitter.com/share/" target="_blank">
       <a href="https://twitter.com/#!/Farmer_Brothers" target="_blank">
       <a href="http://twitter.com/javascripts/blogger.js" target="_blank">
+      <a href="https://twitter.com/{{../user.screen_name}}/status/{{../id_str}}" target="_blank">
     HTML
     expect(dummy_object.grep_twitter_profile(html.to_s)).to eq([])
   end

data/spec/lib/parsers/vimeo_profile_spec.rb CHANGED

@@ -30,13 +30,15 @@ describe 'Vimeo Profile' do
       <a href="https://vimeo.com/channels/332103" target="_blank">
       <a href="https://vimeo.com/talech" target="_blank">
       <a href="https://vimeo.com/292173295/fdb8634a35/" target="_blank">
+      <a href="https://vimeo.com/337614648\\" target="_blank">
     HTML
     vimeo_profiles = dummy_object.grep_vimeo_profile(html.to_s)
     expected_profiles = [
-      'https://vimeo.com/107578087',
-      'https://vimeo.com/channels/332103',
-      'https://vimeo.com/talech',
-      'https://vimeo.com/292173295/fdb8634a35/',
+      'https://vimeo.com/107578087',
+      'https://vimeo.com/channels/332103',
+      'https://vimeo.com/talech',
+      'https://vimeo.com/292173295/fdb8634a35/',
+      'https://vimeo.com/337614648'
     ]
     expect(vimeo_profiles).to eq(expected_profiles)
   end
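
Note on the vimeo_profile spec above: the new fixture ends its href with a stray backslash, and the expected profile omits it. How the gem's parser (data/lib/parsers/vimeo_profile.rb) achieves this is not visible in this hunk; the sketch below shows one plausible normalization step, assuming the fix amounts to trimming trailing backslashes. The method name `strip_trailing_backslashes` is hypothetical.

```ruby
# Hypothetical helper, NOT the gem's code: removes backslashes left at the
# end of a scraped URL (for example from escaped quotes in inline scripts).
def strip_trailing_backslashes(url)
  url.strip.sub(/\\+\z/, '')
end
```

Only trailing backslashes are removed, so a URL that legitimately ends in a forward slash, such as https://vimeo.com/292173295/fdb8634a35/, passes through unchanged.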

metadata CHANGED

@@ -1,75 +1,69 @@
 --- !ruby/object:Gem::Specification
 name: brilliant_web_scraper
 version: !ruby/object:Gem::Version
-  version: '0.1'
+  version: '0.2'
 platform: ruby
 authors:
 - Kotu Bhaskara Rao
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-08-11 00:00:00.000000000 Z
+date: 2019-08-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: nesty
+  name: charlock_holmes
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 0.7.6
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 0.7.6
 - !ruby/object:Gem::Dependency
-  name: rest-client
+  name: nesty
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '1.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 2.0.2
+        version: 1.0.1
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '1.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 2.0.2
+        version: 1.0.1
 - !ruby/object:Gem::Dependency
-  name: nesty
+  name: rest-client
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
+        version: '2.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.0.1
-  type: :development
+        version: 2.0.2
+  type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.0'
+        version: '2.0'
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.0.1
+        version: 2.0.2
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
@@ -84,26 +78,6 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.12.2
-- !ruby/object:Gem::Dependency
-  name: rest-client
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.2
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.2
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
@@ -166,16 +140,17 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '2.1'
-description: Scrapes data such as description, social profiles, contact details
+description: A decent web scraping gem.Scrapes website's title, description,social
+  profiles such as linkedin, facebook, twitter, instgram, vimeo,pinterest, youtube
+  channel and contact details such as emails, phone numbers.
 email: bkotu6717@gmail.com
 executables: []
 extensions: []
 extra_rdoc_files: []
 files:
 - Gemfile
+- Gemfile.lock
 - README.md
-- brilliant_web_scraper-1.0.0.gem
-- brilliant_web_scraper-1.0.gem
 - brilliant_web_scraper.gemspec
 - lib/brilliant_web_scraper.rb
 - lib/parsers/description_helper.rb
@@ -246,5 +221,5 @@ rubyforge_project:
 rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: A decent web scraping ruby library!
+summary: A decent web scraping ruby gem!
 test_files: []

data/brilliant_web_scraper-1.0.0.gem DELETED

Binary file

data/brilliant_web_scraper-1.0.gem DELETED

Binary file