linkedin-scraper 0.0.1 → 0.0.2

Files changed (9)

data/README.rdoc +104 -0
data/lib/linkedin-scraper.rb +1 -1
data/lib/linkedin-scraper/client.rb +125 -0
data/lib/linkedin-scraper/contact.rb +134 -0
data/lib/linkedin-scraper/profile.rb +148 -0
data/lib/linkedin-scraper/version.rb +1 -1
data/linkedin-scraper.gemspec +2 -2
metadata +14 -10
data/README.md +0 -29

data/README.rdoc ADDED

@@ -0,0 +1,104 @@
+= Linkedin-Scraper {<img src="http://travis-ci.org/jaimeiniesta/metainspector.png" />}[http://travis-ci.org/jaimeiniesta/metainspector]
+Linkedin-scraper is a gem for scraping public LinkedIn profiles. Give it a profile URL and it returns the title, name, country, area, and current companies.
+= Installation
+Install the gem from RubyGems:
+  gem install linkedin-scraper
+This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+= Usage
+Initialize a scraper instance for a URL, like this:
+  profile = Linkedin::Scraper.get_profile('http://in.linkedin.com/pub/yatish-mehta/22/460/a86')
+Then you can see the scraped data like this:
+  profile.first_name          #the First name of the contact
+  profile.last_name           #the last name of the contact
+  profile.title               #the linkedin job title
+  profile.location            #the location of the contact
+  profile.country             #the country of the contact
+  profile.industry            #the domain for which the contact belongs
+  profile.past_companies
+  #Array of hash containing its past job companies and job profile
+  #Example
+  #  [
+  #    [0] {
+  #          :past_title => "Intern",
+  #        :past_company => "Sungard"
+  #        },
+  #    [1] {
+  #          :past_title => "Software Developer",
+  #        :past_company => "Microsoft"
+  #        }
+  #  ]
+  profile.current_companies
+  #Array of hash containing its current job companies and job profile
+  #Example
+  #  [
+  #    [0] {
+  #          :current_title => "Intern",
+  #        :current_company => "Sungard"
+  #        },
+  #    [1] {
+  #          :current_title => "Software Developer",
+  #        :current_company => "Microsoft"
+  #        }
+  #  ]
+  profile.linkedin_url        #url of the profile
+  profile.recommended_visitors
+= Examples
+When a link is given, it scrapes the profile and gets the data
+  $ irb
+  >> require 'metainspector'
+  => true
+  >> page = MetaInspector.new('http://pagerankalert.com')
+  => #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">
+  >> page.title
+  => "PageRankAlert.com :: Track your PageRank changes"
+  >> page.meta_description
+  => "Track your PageRank(TM) changes and receive alerts by email"
+  >> page.meta_keywords
+  => "pagerank, seo, optimization, google"
+  >> page.links.size
+  => 8
+  >> page.links[5]
+  => "http://pagerankalert.posterous.com"
+  >> page.document.class
+  => String
+  >> page.parsed_document.class
+  => Nokogiri::HTML::Document
+= ZOMG Fork! Thank you!
+You're welcome to fork this project and send pull requests. I want to thank specially:
+= To Do
+*
+Copyright (c) 2009-2012 Yatish Mehta, released under the MIT license

data/lib/linkedin-scraper.rb CHANGED

@@ -4,7 +4,7 @@ require "mechanize"
 require "awesome_print"
 %w(client contact profile).each do |file|
-  require File.join(File.dirname(__FILE__), 'linkedin', file)
+  require File.join(File.dirname(__FILE__), 'linkedin-scraper', file)
 end

data/lib/linkedin-scraper/client.rb ADDED

@@ -0,0 +1,125 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Client
+    USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
+    attr_accessor :contacts ,:matched_tag,:probability
+    def initialize(first_name,last_name ,company,options={})
+      @first_name=first_name.downcase
+      @last_name=last_name.downcase
+      @company=company
+      @country=options[:country] || "us"
+      @search_linkedin_url="http://#{@country}.linkedin.com/pub/dir/#{@first_name}/#{@last_name}"
+      @contacts=[]
+      @links=[]
+      get_agent
+    end
+    def get_agent
+      @agent=Mechanize.new
+      @agent.user_agent_alias = USER_AGENTS.sample
+      @agent.max_history = 0
+      @agent
+    end
+    def get_contacts
+      begin
+        sleep(2+rand(4))
+        puts "===>Father:Scraping linkedin url "+ @search_linkedin_url
+        @page=@agent.get @search_linkedin_url
+        @page.search(".vcard").each do |node|
+          @contacts<<Linkedin::Contact.new(node)
+        end
+      rescue Mechanize::ResponseCodeError=>e
+        puts "Request to #{@search_linkedin_url} failed: #{e.message}"
+      end
+      return @contacts
+    end
+    #TODO refactor this method: each case needs a separate method
+    def get_verified_contact
+      get_contacts
+      @contacts.each do |contact|
+        #check current company
+        contact.current_companies.each do |company|
+          if company[:current_company]
+            if company[:current_company].match(/#{@company}/i)
+              @matched_tag="CURRENT"
+              return contact
+            end
+          end
+        end if contact.current_companies
+        #title of profile
+        if contact.title && contact.title.match(/#{@company}/i)
+          @matched_tag="CURRENT"
+          return contact
+        end
+        #check past companies
+        contact.past_companies.each do |company|
+          if company[:past_company]
+            if company[:past_company].match(/#{@company}/i)
+              @matched_tag="PAST"
+              return contact
+            end
+          end
+        end if contact.past_companies
+        #
+        #Going in to profile homepage and then checking
+        #
+        sleep(2+rand(4))
+        puts "===>Child:Scraping linkedin url: "+ contact.linkedin_url
+        profile=contact.get_profile(get_agent.get(contact.linkedin_url),contact.linkedin_url)
+        #check current company
+        profile.current_companies.each do |company|
+          if company[:current_company]
+            if company[:current_company].match(/#{@company}/i)
+              @matched_tag="CURRENT"
+              return profile
+            end
+          end
+        end if profile.current_companies
+        #title of profile
+        if profile.title
+          if profile.title.match(/#{@company}/i)
+            @matched_tag="CURRENT"
+            return profile
+          end
+        end
+        #check past companies
+        profile.past_companies.each do |company|
+          if company[:past_company]
+            if company[:past_company].match(/#{@company}/i)
+              @matched_tag="PAST"
+              return profile
+            end
+          end
+        end if profile.past_companies
+        #check recommended visitors
+        if profile.recommended_visitors
+          cnt=0
+          profile.recommended_visitors.each do |visitor|
+            if visitor[:company]
+              if visitor[:company].match(/#{@company}/i)
+                cnt+=1
+              end
+            end
+          end
+          @probability=cnt/profile.recommended_visitors.length.to_f
+          @matched_tag="RECOMMENDED"
+          return profile if @probability>=0.5
+        end
+      end unless @contacts.empty?
+      return nil
+    end
+  end
+end
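
The matching rules in get_verified_contact can be restated on plain strings and hashes. The following is a minimal sketch, not code from the gem: `company_match?` and `recommended_probability` are hypothetical helper names, and the use of `Regexp.escape` is an assumption added here so that a company name containing regex metacharacters is matched literally.

```ruby
# Hypothetical helpers, not part of the gem: they restate the matching rules
# from get_verified_contact on plain strings and hashes.

# True when +text+ mentions +company+, case-insensitively. Regexp.escape keeps
# names such as "C++ Corp" from being read as regex syntax.
def company_match?(text, company)
  return false if text.nil? || company.nil? || company.empty?
  !!(text =~ /#{Regexp.escape(company)}/i)
end

# Fraction of recommended visitors whose :company mentions +company+.
# Returns 0.0 for a nil or empty list instead of dividing by zero.
def recommended_probability(visitors, company)
  return 0.0 if visitors.nil? || visitors.empty?
  hits = visitors.count { |v| company_match?(v[:company], company) }
  hits / visitors.length.to_f
end

# Sample data shaped like the :company entries built in profile.rb.
visitors = [
  { :name => "A", :company => "Acme Inc" },
  { :name => "B", :company => "acme" },
  { :name => "C", :company => "Other" },
  { :name => "D", :company => nil }
]
probability = recommended_probability(visitors, "Acme")
puts probability          # 2 of the 4 visitors match
puts probability >= 0.5   # the threshold used in the diff
```

Keeping the predicate separate from the scraping calls would also make each of the CURRENT, PAST, and RECOMMENDED cases testable without network access, which is roughly what the TODO in the diff asks for.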

data/lib/linkedin-scraper/contact.rb ADDED

@@ -0,0 +1,134 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Contact
+    #the First name of the contact
+    attr_accessor :first_name
+    #the last name of the contact
+    attr_accessor :last_name
+    #the linkedin job title
+    attr_accessor :title
+    #the location of the contact
+    attr_accessor :location
+    #the country of the contact
+    attr_accessor :country
+    #the domain for which the contact belongs
+    attr_accessor :industry
+    #the entire profile of the contact
+    attr_accessor :profile
+    #Array of hash containing its past job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :past_title => "Intern",
+    #        :past_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :past_title => "Software Developer",
+    #        :past_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :past_companies
+    #Array of hash containing its current job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :current_title => "Intern",
+    #        :current_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :current_title => "Software Developer",
+    #        :current_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :current_companies
+    attr_accessor :linkedin_url
+    attr_accessor :profile
+    def initialize(node=[])
+      unless node.class==Array
+        @first_name=get_first_name(node)
+        @last_name=get_last_name(node)
+        @title=get_title(node)
+        @location=get_location(node)
+        @country=get_country(node)
+        @industry=get_industry(node)
+        @current_companies=get_current_companies node
+        @past_companies=get_past_companies node
+        @linkedin_url=get_linkedin_url node
+      end
+    end
+    #page is a Nokogiri::XML node of the profile page
+    #returns object of Linkedin::Profile
+    def get_profile page,url
+      @profile=Linkedin::Profile.new(page,url)
+    end
+    private
+    def get_first_name node
+      return node.at(".given-name").text.strip if node.search(".given-name").first
+    end
+    def get_last_name node
+      return node.at(".family-name").text.strip if node.search(".family-name").first
+    end
+    def get_title node
+      return node.at(".title").text.gsub(/\s+/, " ").strip if node.search(".title").first
+    end
+    def get_location node
+      return node.at(".location").text.split(",").first.strip if node.search(".location").first
+    end
+    def get_country node
+      return node.at(".location").text.split(",").last.strip if node.search(".location").first
+    end
+    def get_industry node
+      return node.at(".industry").text.strip if node.search(".industry").first
+    end
+    def get_linkedin_url node
+      #Node#[] returns the href as a String; attributes["href"] returns an Attr object
+      link=node.at("h2/strong/a")
+      link["href"] if link
+    end
+    def get_current_companies node
+      current_cs=[]
+      if node.search(".current-content").first
+        node.at(".current-content").text.split(",").each do |content|
+          title,company=content.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          current_company={:current_company=>company,:current_title=> title}
+          current_cs<<current_company
+        end
+        return current_cs
+      end
+    end
+    def get_past_companies node
+      past_cs=[]
+      if node.search(".past-content").first
+        node.at(".past-content").text.split(",").each do |content|
+          title,company=content.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          past_company={:past_company=>company,:past_title=> title }
+          past_cs<<past_company
+        end
+        return past_cs
+      end
+    end
+  end
+end
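
get_current_companies and get_past_companies both split the node text on commas and then on " at ". A minimal sketch of that parsing step on a plain string follows. `parse_positions` and its `prefix` parameter are hypothetical names used only for illustration, and the split limit of 2 is an assumption that differs from the diff: it keeps a company name that itself contains " at " in one piece.

```ruby
# Hypothetical helper, not part of the gem: mirrors the split-on-comma and
# split-on-" at " steps from get_current_companies on a plain string.
def parse_positions(text, prefix)
  text.split(",").map do |content|
    # Limit of 2 so only the first " at " separates title from company.
    title, company = content.split(" at ", 2)
    title   = title.gsub(/\s+/, " ").strip if title
    company = company.gsub(/\s+/, " ").strip if company
    { :"#{prefix}_company" => company, :"#{prefix}_title" => title }
  end
end

positions = parse_positions("Intern at Sungard,\n  Software Developer at Microsoft", :current)
positions.each { |pos| puts "#{pos[:current_title]} @ #{pos[:current_company]}" }
```

Because entries are separated on commas, a company name that contains a comma would be split into two entries. That limitation is shared with the code in the diff.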

data/lib/linkedin-scraper/profile.rb ADDED

@@ -0,0 +1,148 @@
+# To change this template, choose Tools | Templates
+# and open the template in the editor.
+module Linkedin
+  class Profile
+    USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
+    #the First name of the contact
+    attr_accessor :first_name
+    #the last name of the contact
+    attr_accessor :last_name
+    #the linkedin job title
+    attr_accessor :title
+    #the location of the contact
+    attr_accessor :location
+    #the country of the contact
+    attr_accessor :country
+    #the domain for which the contact belongs
+    attr_accessor :industry
+    #the entire profile of the contact
+    attr_accessor :profile
+    #Array of hash containing its past job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :past_title => "Intern",
+    #        :past_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :past_title => "Software Developer",
+    #        :past_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :past_companies
+    #Array of hash containing its current job companies and job profile
+    #Example
+    #  [
+    #    [0] {
+    #          :current_title => "Intern",
+    #        :current_company => "Sungard"
+    #        },
+    #    [1] {
+    #          :current_title => "Software Developer",
+    #        :current_company => "Microsoft"
+    #        }
+    #  ]
+    attr_accessor :current_companies
+    #url of the profile
+    attr_accessor :linkedin_url
+    #Array of hash containing its recommended visitors which come on the
+    attr_accessor :recommended_visitors
+    def initialize(page,url)
+      @first_name=get_first_name(page)
+      @last_name=get_last_name(page)
+      @title=get_title(page)
+      @location=get_location(page)
+      @country=get_country(page)
+      @industry=get_industry(page)
+      @current_companies=get_current_companies page
+      @past_companies=get_past_companies page
+      @recommended_visitors=get_recommended_visitors page
+      @linkedin_url=url
+    end
+    #returns nil if the request fails (for example, on a 404 response)
+    def get_profile url
+      begin
+        @agent=Mechanize.new
+        @agent.user_agent_alias = USER_AGENTS.sample
+        @agent.max_history = 0
+        page=@agent.get url
+        return Linkedin::Profile.new(page, url)
+      rescue=>e
+        puts e
+      end
+    end
+    private
+    def get_first_name page
+      return page.at(".given-name").text.strip if page.search(".given-name").first
+    end
+    def get_last_name page
+      return page.at(".family-name").text.strip if page.search(".family-name").first
+    end
+    def get_title page
+      return page.at(".headline-title").text.gsub(/\s+/, " ").strip if page.search(".headline-title").first
+    end
+    def get_location page
+      return page.at(".locality").text.split(",").first.strip if page.search(".locality").first
+    end
+    def get_country page
+      return page.at(".locality").text.split(",").last.strip if page.search(".locality").first
+    end
+    def get_industry page
+      return page.at(".industry").text.gsub(/\s+/, " ").strip if page.search(".industry").first
+    end
+    def get_past_companies page
+      past_cs=[]
+      if page.search(".past").first
+        page.search(".past").search("li").each do |past_company|
+          title,company=past_company.text.strip.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          past_company={:past_company=>company,:past_title=> title}
+          past_cs<<past_company
+        end
+        return past_cs
+      end
+    end
+    def get_current_companies page
+      current_cs=[]
+      if page.search(".current").first
+        page.search(".current").search("li").each do |item|
+          title,company=item.text.strip.split(" at ")
+          company=company.gsub(/\s+/, " ").strip if company
+          title=title.gsub(/\s+/, " ").strip if title
+          current_company={:current_company=>company,:current_title=> title}
+          current_cs<<current_company
+        end
+        return current_cs
+      end
+    end
+    def get_recommended_visitors  page
+      recommended_vs=[]
+      if page.search(".browsemap").first
+        page.at(".browsemap").at("ul").search("li").each do |visitor|
+          v={}
+          v[:link]=visitor.at('a')["href"]
+          v[:name]=visitor.at('a').text
+          v[:title]=visitor.at('.headline').text.split(" at ").first
+          v[:company]=visitor.at('.headline').text.split(" at ").last
+          recommended_vs<<v
+        end
+        return recommended_vs
+      end
+    end
+  end
+end
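
In get_recommended_visitors, the headline is split on " at " and the first and last pieces become :title and :company. When a headline has no " at ", both fields end up holding the whole headline. The sketch below shows one possible way to handle that case. `split_headline` is a hypothetical helper, not part of the gem, and returning nil for a missing company is an assumption about the desired behaviour.

```ruby
# Hypothetical helper, not part of the gem: splits a visitor headline into a
# title and a company, leaving :company nil when no " at " is present.
def split_headline(headline)
  # to_s turns a nil headline into "", which splits to an empty array.
  title, company = headline.to_s.split(" at ", 2).map { |part| part.strip }
  { :title => title, :company => company }
end

["Software Developer at Microsoft", "Student", nil].each do |headline|
  p split_headline(headline)
end
```

With a nil :company, the existing `if visitor[:company]` guard in client.rb would skip such visitors rather than matching the company name against a job title.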

data/lib/linkedin-scraper/version.rb CHANGED

@@ -1,5 +1,5 @@
 module Linkedin
   module Scraper
-    VERSION = "0.0.1"
+    VERSION = "0.0.2"
   end
 end

data/linkedin-scraper.gemspec CHANGED

@@ -5,8 +5,8 @@ Gem::Specification.new do |gem|
   gem.authors       = ["Yatish Mehta"]
   gem.email         = ["yatishmehta27@gmail.com"]
   gem.description   = %q{Scrapes the linkedin profile when a url is given }
-  gem.summary       = %q{Write a gem summary}
-  gem.homepage      = ""
+  gem.summary       = %q{when a url of  public linkedin profile page is given it scrapes the entire page and converts into a accessible object}
+  gem.homepage      = "https://github.com/yatishmehta27/linkedin-scraper"
    gem.add_dependency(%q<httparty>, [">= 0"])
 gem.add_dependency(%q<mechanize>, [">= 0"])
 gem.add_dependency(%q<awesome_print>, [">= 0"])

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: linkedin-scraper
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
   prerelease:
 platform: ruby
 authors:
@@ -14,7 +14,7 @@ default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
   name: httparty
-  requirement: &10580260 !ruby/object:Gem::Requirement
+  requirement: &21059780 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -22,10 +22,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10580260
+  version_requirements: *21059780
 - !ruby/object:Gem::Dependency
   name: mechanize
-  requirement: &10577140 !ruby/object:Gem::Requirement
+  requirement: &21059160 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -33,10 +33,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10577140
+  version_requirements: *21059160
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &10576640 !ruby/object:Gem::Requirement
+  requirement: &21058580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -44,7 +44,7 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *10576640
+  version_requirements: *21058580
 description: ! 'Scrapes the linkedin profile when a url is given '
 email:
 - yatishmehta27@gmail.com
@@ -55,13 +55,16 @@ files:
 - .gitignore
 - Gemfile
 - LICENSE
-- README.md
+- README.rdoc
 - Rakefile
 - lib/linkedin-scraper.rb
+- lib/linkedin-scraper/client.rb
+- lib/linkedin-scraper/contact.rb
+- lib/linkedin-scraper/profile.rb
 - lib/linkedin-scraper/version.rb
 - linkedin-scraper.gemspec
 has_rdoc: true
-homepage: ''
+homepage: https://github.com/yatishmehta27/linkedin-scraper
 licenses: []
 post_install_message:
 rdoc_options: []
@@ -84,5 +87,6 @@ rubyforge_project:
 rubygems_version: 1.6.2
 signing_key:
 specification_version: 3
-summary: Write a gem summary
+summary: when a url of  public linkedin profile page is given it scrapes the entire
+  page and converts into a accessible object
 test_files: []

data/README.md DELETED

@@ -1,29 +0,0 @@
-# Linkedin::Scraper
-TODO: Write a gem description
-## Installation
-Add this line to your application's Gemfile:
-    gem 'linkedin-scraper'
-And then execute:
-    $ bundle
-Or install it yourself as:
-    $ gem install linkedin-scraper
-## Usage
-TODO: Write usage instructions here
-## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Added some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request