RubyGems - linkedin-scraper - Versions diffs - 0.0.1 → 0.0.2 - Mend

linkedin-scraper 0.0.1 → 0.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

data/README.rdoc +104 -0
data/lib/linkedin-scraper.rb +1 -1
data/lib/linkedin-scraper/client.rb +125 -0
data/lib/linkedin-scraper/contact.rb +134 -0
data/lib/linkedin-scraper/profile.rb +148 -0
data/lib/linkedin-scraper/version.rb +1 -1
data/linkedin-scraper.gemspec +2 -2
metadata +14 -10
data/README.md +0 -29

data/README.rdoc ADDED

@@ -0,0 +1,104 @@
+= Linkedin-Scraper {<img src="http://travis-ci.org/jaimeiniesta/metainspector.png" />}[http://travis-ci.org/jaimeiniesta/metainspector]
+Linkedin-scraper is a gem for scraping linkedin public profiles. You give it an URL, and it lets you easily get its title,name,country,area,current_companies .
+= Installation
+Install the gem from RubyGems:
+  gem install linkedin-scraper
+This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+= Usage
+Initialize a scraper instance for an URL, like this:
+  profile = Linkedin::Scraper.get_profile('http://in.linkedin.com/pub/yatish-mehta/22/460/a86')
+Then you can see the scraped data like this:
+  profile.first_name          #the First name of the contact
+  profile.last_name           #the last name of the contact
+  profile.title               #the linkedin job title
+  profile.location            #the location of the contact
+  profile.country             #the country of the contact
+  profile.industry            #the domain for which the contact belongs
+  profile.past_companies
+  #Array of hash containing its past job companies and job profile
+  #Example
+  #  [
+  #    [0] {
+  #          :past_title => "Intern",
+  #        :past_company => "Sungard"
+  #        },
+  #    [1] {
+  #          :past_title => "Software Developer",
+  #        :past_company => "Microsoft"
+  #        }
+  #  ]
+  profile.current_companies
+  #Array of hash containing its current job companies and job profile
+  #Example
+  #  [
+  #    [0] {
+  #          :current_title => "Intern",
+  #        :current_company => "Sungard"
+  #        },
+  #    [1] {
+  #          :current_title => "Software Developer",
+  #        :current_company => "Microsoft"
+  #        }
+  #  ]
+  profile.linkedin_url        #url of the profile
+  profile.recommended_visitors
+= Examples
+When a link is given, it scrapes the profile and gets the data
+  $ irb
+  >> require 'metainspector'
+  => true
+  >> page = MetaInspector.new('http://pagerankalert.com')
+  => #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">
+  >> page.title
+  => "PageRankAlert.com :: Track your PageRank changes"
+  >> page.meta_description
+  => "Track your PageRank(TM) changes and receive alerts by email"
+  >> page.meta_keywords
+  => "pagerank, seo, optimization, google"
+  >> page.links.size
+  => 8
+  >> page.links[5]
+  => "http://pagerankalert.posterous.com"
+  >> page.document.class
+  => String
+  >> page.parsed_document.class
+  => Nokogiri::HTML::Document
+= ZOMG Fork! Thank you!
+You're welcome to fork this project and send pull requests. I want to thank specially:
+= To Do
+*
+Copyright (c) 2009-2012 Yatish Mehta, released under the MIT license
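
The README's Examples section shows a MetaInspector session rather than this gem, so the shape of `past_companies` is only documented in comments. As a minimal sketch of consuming that documented structure, the snippet below uses hand-written sample data (not scraped output) and a hypothetical helper, `describe_positions`, which is not part of the gem. It assumes the accessor can be `nil` when the page has no such section, as `get_past_companies` in this diff suggests.

```ruby
# Sample data shaped like the README's example for profile.past_companies.
# These values are illustrative only, not real scraped output.
past_companies = [
  { :past_title => "Intern",             :past_company => "Sungard" },
  { :past_title => "Software Developer", :past_company => "Microsoft" }
]

# Hypothetical helper (not part of the gem): one readable line per position.
def describe_positions(companies)
  # Treat a nil list (section missing from the page) as empty.
  (companies || []).map do |c|
    "#{c[:past_title]} at #{c[:past_company]}"
  end
end

puts describe_positions(past_companies)
```

The same loop would apply to `current_companies` with the `:current_title` and `:current_company` keys.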

data/lib/linkedin-scraper.rb CHANGED

@@ -4,7 +4,7 @@ require "mechanize"
 require "awesome_print"
 %w(client contact profile).each do |file|
-  require File.join(File.dirname(__FILE__), 'linkedin', file)
+  require File.join(File.dirname(__FILE__), 'linkedin-scraper', file)
 end
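
This hunk changes the directory segment of each `require` path from `linkedin` to `linkedin-scraper`, matching where the three new files live. A small sketch of the paths the loop now builds, assuming a hypothetical install directory (`lib_dir` stands in for `File.dirname(__FILE__)` and is made up for illustration):

```ruby
# Hypothetical install location; stands in for File.dirname(__FILE__).
lib_dir = "/gems/linkedin-scraper-0.0.2/lib"

# Same file list and join order as the loop in lib/linkedin-scraper.rb.
paths = %w(client contact profile).map do |file|
  File.join(lib_dir, "linkedin-scraper", file)
end

puts paths
```

`require` appends the `.rb` extension itself, so the paths are built without it.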

data/lib/linkedin-scraper/client.rb ADDED

@@ -0,0 +1,125 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Client
+    USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
+    attr_accessor :contacts ,:matched_tag,:probability
+    def initialize(first_name,last_name ,company,options={})
+      @first_name=first_name.downcase
+      @last_name=last_name.downcase
+      @company=company
+      @country=options[:country] || "us"
+      @search_linkedin_url="http://#{@country}.linkedin.com/pub/dir/#{@first_name}/#{@last_name}"
+      @contacts=[]
+      @links=[]
+      get_agent
+    end
+    def get_agent
+      @agent=Mechanize.new
+      @agent.user_agent_alias = USER_AGENTS.sample
+      @agent.max_history = 0
+      @agent
+    end
+    def get_contacts
+      begin
+        sleep(2+rand(4))
+        puts "===>Father:Scrapping linkedin url "+ @search_linkedin_url
+        @page=@agent.get @search_linkedin_url
+        @page.search(".vcard").each do |node|
+          @contacts<<Linkedin::Contact.new(node)
+        end
+      rescue Mechanize::ResponseCodeError=>e
+        puts "RESCUE"
+      end
+      return @contacts
+    end
+    #TODO need to refactor this function need seperate function of each case
+    def get_verified_contact
+      get_contacts
+      @contacts.each do |contact|
+        #check current company
+        contact.current_companies.each do |company|
+          if company[:current_company]
+            if company[:current_company].match(/#{@company}/i)
+              @matched_tag="CURRENT"
+              return contact
+            end
+          end
+        end if contact.current_companies
+        #title of profile
+        if contact.title.match(/#{@company}/i)
+          @matched_tag="CURRENT"
+          return contact
+        end
+        #check past companies
+        contact.past_companies.each do |company|
+          if company[:past_company]
+            if company[:past_company].match(/#{@company}/i)
+              @matched_tag="PAST"
+              return contact
+            end
+          end
+        end if contact.past_companies
+        #
+        #Going in to profile homepage and then checking
+        #
+        sleep(2+rand(4))
+        puts "===>Child:Scrapping linkedin url: "+ contact.linkedin_url
+        profile=contact.get_profile(get_agent.get(contact.linkedin_url),contact.linkedin_url)
+        #check current company
+        profile.current_companies.each do |company|
+          if company[:current_company]
+            if company[:current_company].match(/#{@company}/i)
+              @matched_tag="CURRENT"
+              return profile
+            end
+          end
+        end if profile.current_companies
+        #title of profile
+        if profile.title
+          if profile.title.match(/#{@company}/i)
+            @matched_tag="CURRENT"
+            return profile
+          end
+        end
+        #check past companies
+        profile.past_companies.each do |company|
+          if company[:past_company]
+            if company[:past_company].match(/#{@company}/i)
+              @matched_tag="PAST"
+              return profile
+            end
+          end
+        end if profile.past_companies
+        #check recommended visitors
+        if profile.recommended_visitors
+          cnt=0
+          profile.recommended_visitors.each do |visitor|
+            if visitor[:company]
+              if visitor[:company].match(/#{@company}/i)
+                cnt+=1
+              end
+            end
+          end
+          @probability=cnt/profile.recommended_visitors.length.to_f
+          @matched_tag="RECOMMENDED"
+          return profile if @probability>=0.5
+        end
+      end unless @contacts.empty?
+      return nil
+    end
+  end
+end
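
`get_verified_contact` repeats one test throughout: a case-insensitive regex match of a field against the company name. Its final fallback accepts a profile when at least half of its recommended visitors match. The sketch below isolates those two steps under stated assumptions: `company_match?` and `visitor_match_ratio` are hypothetical names, `Regexp.escape` is an addition (the diffed code interpolates the raw string into the pattern), and the empty-list guard is also an addition.

```ruby
# Hypothetical helper: case-insensitive substring match on a company name.
# Regexp.escape is added here so names like "Acme (India)" match literally.
def company_match?(text, company)
  return false if text.nil?
  !!(text =~ /#{Regexp.escape(company)}/i)
end

# Hypothetical helper: fraction of visitors whose :company matches,
# mirroring the cnt / length.to_f computation in client.rb.
def visitor_match_ratio(visitors, company)
  # Guard added for this sketch: avoid 0 / 0.0 on an empty list.
  return 0.0 if visitors.nil? || visitors.empty?
  hits = visitors.count { |v| company_match?(v[:company], company) }
  hits / visitors.length.to_f
end

# Illustrative data only.
visitors = [
  { :name => "A", :company => "Acme Corp" },
  { :name => "B", :company => "acme" },
  { :name => "C", :company => "Other Inc" },
  { :name => "D", :company => nil }
]

ratio = visitor_match_ratio(visitors, "Acme")
puts ratio          # 0.5
puts ratio >= 0.5   # true, so the profile would be accepted
```

Without escaping, a company name containing regex metacharacters would be interpreted as a pattern, which can cause missed matches or a `RegexpError`.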

data/lib/linkedin-scraper/contact.rb ADDED

@@ -0,0 +1,134 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Contact
+    #the First name of the contact
+    attr_accessor :first_name
+    #the last name of the contact
+    attr_accessor :last_name
+    #the linkedin job title
+    attr_accessor :title
+    #the location of the contact
+    attr_accessor :location
+    #the country of the contact
+    attr_accessor :country
+    #the domain for which the contact belongs
+    attr_accessor :industry
+    #the entire profile of the contact
+    attr_accessor :profile
+    #Array of hash containing its past job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :past_title => "Intern",
+    #        :past_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :past_title => "Software Developer",
+    #        :past_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :past_companies
+    #Array of hash containing its current job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :current_title => "Intern",
+    #        :current_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :current_title => "Software Developer",
+    #        :current_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :current_companies
+    attr_accessor :linkedin_url
+    attr_accessor :profile
+    def initialize(node=[])
+      unless node.class==Array
+        @first_name=get_first_name(node)
+        @last_name=get_last_name(node)
+        @title=get_title(node)
+        @location=get_location(node)
+        @country=get_country(node)
+        @industry=get_industry(node)
+        @current_companies=get_current_companies node
+        @past_companies=get_past_companies node
+        @linkedin_url=get_linkedin_url node
+      end
+    end
+    #page is a Nokogiri::XML node of the profile page
+    #returns object of Linkedin::Profile
+    def get_profile page,url
+      @profile=Linkedin::Profile.new(page,url)
+    end
+    private
+    def get_first_name node
+      return node.at(".given-name").text.strip if node.search(".given-name").first
+    end
+    def get_last_name node
+      return node.at(".family-name").text.strip if node.search(".family-name").first
+    end
+    def get_title node
+      return node.at(".title").text.gsub(/\s+/, " ").strip if node.search(".title").first
+    end
+    def get_location node
+      return node.at(".location").text.split(",").first.strip if node.search(".location").first
+    end
+    def get_country node
+      return node.at(".location").text.split(",").last.strip if node.search(".location").first
+    end
+    def get_industry node
+      return node.at(".industry").text.strip if node.search(".industry").first
+    end
+    def get_linkedin_url node
+      node.at("h2/strong/a").attributes["href"]
+    end
+    def get_current_companies node
+      current_cs=[]
+      if node.search(".current-content").first
+        node.at(".current-content").text.split(",").each do |content|
+          title,company=content.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          current_company={:current_company=>company,:current_title=> title}
+          current_cs<<current_company
+        end
+        return current_cs
+      end
+    end
+    def get_past_companies node
+      past_cs=[]
+      if node.search(".past-content").first
+        node.at(".past-content").text.split(",").each do |content|
+          title,company=content.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          past_company={:past_company=>company,:past_title=> title }
+          past_cs<<past_company
+        end
+        return past_cs
+      end
+    end
+  end
+end
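
`get_current_companies` and `get_past_companies` both turn a comma-separated summary such as "Title at Company, Title at Company" into an array of hashes. A minimal sketch of that string handling on a plain string, with no Nokogiri node involved; `parse_positions` is a hypothetical name and the sample text is invented:

```ruby
# Hypothetical helper: split a "Title at Company, ..." summary into hashes
# using the same keys as get_current_companies in contact.rb.
def parse_positions(text)
  text.split(",").map do |content|
    title, company = content.split(" at ")
    {
      # Collapse runs of whitespace and trim, as the diffed code does.
      :current_title   => title   && title.gsub(/\s+/, " ").strip,
      :current_company => company && company.gsub(/\s+/, " ").strip
    }
  end
end

# Invented sample text with irregular whitespace.
sample = "Intern at  Sungard,\n Software Developer at Microsoft"
p parse_positions(sample)
```

Two limitations follow from the approach itself: a company name that contains a comma is split into two entries, and an entry with no " at " separator yields a `nil` company.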

data/lib/linkedin-scraper/profile.rb ADDED

@@ -0,0 +1,148 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Profile
+    USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
+    #the First name of the contact
+    attr_accessor :first_name
+    #the last name of the contact
+    attr_accessor :last_name
+    #the linkedin job title
+    attr_accessor :title
+    #the location of the contact
+    attr_accessor :location
+    #the country of the contact
+    attr_accessor :country
+    #the domain for which the contact belongs
+    attr_accessor :industry
+    #the entire profile of the contact
+    attr_accessor :profile
+    #Array of hash containing its past job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :past_title => "Intern",
+    #        :past_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :past_title => "Software Developer",
+    #        :past_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :past_companies
+    #Array of hash containing its current job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :current_title => "Intern",
+    #        :current_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :current_title => "Software Developer",
+    #        :current_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :current_companies
+    #url of the profile
+    attr_accessor :linkedin_url
+    #Array of hash containing its recommended visitors which come on the
+    attr_accessor :recommended_visitors
+    def initialize(page,url)
+      @first_name=get_first_name(page)
+      @last_name=get_last_name(page)
+      @title=get_title(page)
+      @location=get_location(page)
+      @country=get_country(page)
+      @industry=get_industry(page)
+      @current_companies=get_current_companies page
+      @past_companies=get_past_companies page
+      @recommended_visitors=get_recommended_visitors page
+      @linkedin_url=url
+    end
+    #returns:nil if it gives a 404 request
+    def get_profile url
+      begin
+        @agent=Mechanize.new
+        @agent.user_agent_alias = USER_AGENTS.sample
+        @agent.max_history = 0
+        page=@agent.get url
+        return Linkedin::Profile.new(page, url)
+      rescue=>e
+        puts e
+      end
+    end
+    private
+    def get_first_name page
+      return page.at(".given-name").text.strip if page.search(".given-name").first
+    end
+    def get_last_name page
+      return page.at(".family-name").text.strip if page.search(".family-name").first
+    end
+    def get_title page
+      return page.at(".headline-title").text.gsub(/\s+/, " ").strip if page.search(".headline-title").first
+    end
+    def get_location page
+      return page.at(".locality").text.split(",").first.strip if page.search(".locality").first
+    end
+    def get_country page
+      return page.at(".locality").text.split(",").last.strip if page.search(".locality").first
+    end
+    def get_industry page
+      return page.at(".industry").text.gsub(/\s+/, " ").strip if page.search(".industry").first
+    end
+    def get_past_companies page
+      past_cs=[]
+      if page.search(".past").first
+        page.search(".past").search("li").each do |past_company|
+          title,company=past_company.text.strip.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          past_company={:past_company=>company,:past_title=> title}
+          past_cs<<past_company
+        end
+        return past_cs
+      end
+    end
+    def get_current_companies page
+      current_cs=[]
+      if page.search(".current").first
+        page.search(".current").search("li").each do |past_company|
+          title,company=past_company.text.strip.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          current_company={:current_company=>company,:current_title=> title}
+          current_cs<<current_company
+        end
+        return current_cs
+      end
+    end
+    def get_recommended_visitors  page
+      recommended_vs=[]
+      if page.search(".browsemap").first
+        page.at(".browsemap").at("ul").search("li").each do |visitor|
+          v={}
+          v[:link]=visitor.at('a').attributes["href"]
+          v[:name]=visitor.at('a').text
+          v[:title]=visitor.at('.headline').text.split(" at ").first
+          v[:company]=visitor.at('.headline').text.split(" at ").last
+          recommended_vs<<v
+        end
+        return recommended_vs
+      end
+    end
+  end
+end
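
In `profile.rb`, `get_location` and `get_country` read the same `.locality` text and take the first and last comma-separated parts respectively. The sketch below shows that split on plain strings; `split_locality` is a hypothetical name, the inputs are invented, and the empty-string guard is an addition.

```ruby
# Hypothetical helper: derive location and country from one locality string,
# mirroring get_location (first part) and get_country (last part).
def split_locality(text)
  parts = text.split(",")
  # Guard added for this sketch: an empty string has no parts.
  return { :location => nil, :country => nil } if parts.empty?
  { :location => parts.first.strip, :country => parts.last.strip }
end

p split_locality("Pune Area, India")
p split_locality("India")
```

When the text has no comma, both fields end up with the same value, so callers cannot tell a city from a country in that case.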

data/lib/linkedin-scraper/version.rb CHANGED

@@ -1,5 +1,5 @@
 module Linkedin
   module Scraper
-    VERSION = "0.0.1"
+    VERSION = "0.0.2"
   end
 end

data/linkedin-scraper.gemspec CHANGED

@@ -5,8 +5,8 @@ Gem::Specification.new do |gem|
   gem.authors       = ["Yatish Mehta"]
   gem.email         = ["yatishmehta27@gmail.com"]
   gem.description   = %q{Scrapes the linkedin profile when a url is given }
-  gem.summary       = %q{Write a gem summary}
-  gem.homepage      = ""
+  gem.summary       = %q{when a url of  public linkedin profile page is given it scrapes the entire page and converts into a accessible object}
+  gem.homepage      = "https://github.com/yatishmehta27/linkedin-scraper"
    gem.add_dependency(%q<httparty>, [">= 0"])
 gem.add_dependency(%q<mechanize>, [">= 0"])
 gem.add_dependency(%q<awesome_print>, [">= 0"])

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: linkedin-scraper
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
   prerelease:
 platform: ruby
 authors:
@@ -14,7 +14,7 @@ default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
   name: httparty
-  requirement: &10580260 !ruby/object:Gem::Requirement
+  requirement: &21059780 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -22,10 +22,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10580260
+  version_requirements: *21059780
 - !ruby/object:Gem::Dependency
   name: mechanize
-  requirement: &10577140 !ruby/object:Gem::Requirement
+  requirement: &21059160 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -33,10 +33,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10577140
+  version_requirements: *21059160
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &10576640 !ruby/object:Gem::Requirement
+  requirement: &21058580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -44,7 +44,7 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10576640
+  version_requirements: *21058580
 description: ! 'Scrapes the linkedin profile when a url is given '
 email:
 - yatishmehta27@gmail.com
@@ -55,13 +55,16 @@ files:
 - .gitignore
 - Gemfile
 - LICENSE
-- README.md
+- README.rdoc
 - Rakefile
 - lib/linkedin-scraper.rb
+- lib/linkedin-scraper/client.rb
+- lib/linkedin-scraper/contact.rb
+- lib/linkedin-scraper/profile.rb
 - lib/linkedin-scraper/version.rb
 - linkedin-scraper.gemspec
 has_rdoc: true
-homepage: ''
+homepage: https://github.com/yatishmehta27/linkedin-scraper
 licenses: []
 post_install_message:
 rdoc_options: []
@@ -84,5 +87,6 @@ rubyforge_project:
 rubygems_version: 1.6.2
 signing_key:
 specification_version: 3
-summary: Write a gem summary
+summary: when a url of  public linkedin profile page is given it scrapes the entire
+  page and converts into a accessible object
 test_files: []

data/README.md DELETED

@@ -1,29 +0,0 @@
-# Linkedin::Scraper
-TODO: Write a gem description
-## Installation
-Add this line to your application's Gemfile:
-    gem 'linkedin-scraper'
-And then execute:
-    $ bundle
-Or install it yourself as:
-    $ gem install linkedin-scraper
-## Usage
-TODO: Write usage instructions here
-## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Added some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request