linkedin-scraper 0.0.1 → 0.0.2

data/README.rdoc ADDED
@@ -0,0 +1,104 @@
1
+ = Linkedin-Scraper
2
+
3
+ Linkedin-scraper is a gem for scraping public LinkedIn profiles. You give it a URL, and it lets you easily get the profile's title, name, country, area and current companies.
4
+
5
+ = Installation
6
+
7
+ Install the gem from RubyGems:
8
+
9
+ gem install linkedin-scraper
10
+
11
+ This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
12
+
13
+ = Usage
14
+
15
+ Initialize a scraper instance for a URL, like this:
16
+
17
+ profile = Linkedin::Scraper.get_profile('http://in.linkedin.com/pub/yatish-mehta/22/460/a86')
18
+
19
+ Then you can see the scraped data like this:
20
+
21
+
22
+ profile.first_name #the First name of the contact
23
+
24
+ profile.last_name #the last name of the contact
25
+
26
+ profile.title #the linkedin job title
27
+
28
+ profile.location #the location of the contact
29
+
30
+ profile.country #the country of the contact
31
+
32
+ profile.industry #the industry to which the contact belongs
33
+
34
+ profile.past_companies
35
+ #Array of hashes containing the contact's past companies and job titles
36
+ #Example
37
+ # [
38
+ # [0] {
39
+ # :past_title => "Intern",
40
+ # :past_company => "Sungard"
41
+ # },
42
+ # [1] {
43
+ # :past_title => "Software Developer",
44
+ # :past_company => "Microsoft"
45
+ # }
46
+ # ]
47
+ profile.current_companies
48
+ #Array of hashes containing the contact's current companies and job titles
49
+ #Example
50
+ # [
51
+ # [0] {
52
+ # :current_title => "Intern",
53
+ # :current_company => "Sungard"
54
+ # },
55
+ # [1] {
56
+ # :current_title => "Software Developer",
57
+ # :current_company => "Microsoft"
58
+ # }
59
+ # ]
60
+
61
+
62
+ profile.linkedin_url #url of the profile
63
+
64
+ profile.recommended_visitors
65
+
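The `past_companies` and `current_companies` values described above are plain arrays of hashes, so they can be walked with ordinary Ruby iteration. Below is a minimal sketch, assuming the data has the shape shown in the README comments. The sample array is copied from that example rather than scraped, and `describe_positions` is a hypothetical helper name, not part of the gem.

```ruby
# Sample data copied from the README's example; not real scraped output.
current_companies = [
  { :current_title => "Intern",             :current_company => "Sungard" },
  { :current_title => "Software Developer", :current_company => "Microsoft" }
]

# Hypothetical helper (not part of the gem): turns each hash into a
# "title at company" string, skipping entries that have no company.
def describe_positions(positions, title_key, company_key)
  positions.select { |p| p[company_key] }.map do |p|
    "#{p[title_key]} at #{p[company_key]}"
  end
end

lines = describe_positions(current_companies, :current_title, :current_company)
puts lines
```

The same helper would work for `past_companies` by passing `:past_title` and `:past_company` as the keys.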
66
+ = Examples
67
+
68
+ When a link is given, it scrapes the profile and gets the data
69
+
70
+ $ irb
71
+ >> require 'metainspector'
72
+ => true
73
+
74
+ >> page = MetaInspector.new('http://pagerankalert.com')
75
+ => #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">
76
+
77
+ >> page.title
78
+ => "PageRankAlert.com :: Track your PageRank changes"
79
+
80
+ >> page.meta_description
81
+ => "Track your PageRank(TM) changes and receive alerts by email"
82
+
83
+ >> page.meta_keywords
84
+ => "pagerank, seo, optimization, google"
85
+
86
+ >> page.links.size
87
+ => 8
88
+
89
+ >> page.links[5]
90
+ => "http://pagerankalert.posterous.com"
91
+
92
+ >> page.document.class
93
+ => String
94
+
95
+ >> page.parsed_document.class
96
+ => Nokogiri::HTML::Document
97
+
98
+ = ZOMG Fork! Thank you!
99
+
100
+ You're welcome to fork this project and send pull requests.
101
+
102
+ = To Do
103
+ *
104
+ Copyright (c) 2009-2012 Yatish Mehta, released under the MIT license
@@ -4,7 +4,7 @@ require "mechanize"
4
4
  require "awesome_print"
5
5
 
6
6
  %w(client contact profile).each do |file|
7
- require File.join(File.dirname(__FILE__), 'linkedin', file)
7
+ require File.join(File.dirname(__FILE__), 'linkedin-scraper', file)
8
8
  end
9
9
 
10
10
 
@@ -0,0 +1,125 @@
1
+ # To change this template, choose Tools | Templates
2
+ # and open the template in the editor.
3
+
4
+
5
+ module Linkedin
6
+ class Client
7
+ USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
8
+ attr_accessor :contacts ,:matched_tag,:probability
9
+
10
+ def initialize(first_name,last_name ,company,options={})
11
+ @first_name=first_name.downcase
12
+ @last_name=last_name.downcase
13
+ @company=company
14
+ @country=options[:country] || "us"
15
+ @search_linkedin_url="http://#{@country}.linkedin.com/pub/dir/#{@first_name}/#{@last_name}"
16
+ @contacts=[]
17
+ @links=[]
18
+ get_agent
19
+ end
20
+
21
+ def get_agent
22
+ @agent=Mechanize.new
23
+ @agent.user_agent_alias = USER_AGENTS.sample
24
+ @agent.max_history = 0
25
+ @agent
26
+ end
27
+
28
+ def get_contacts
29
+ begin
30
+ sleep(2+rand(4))
31
+ puts "===>Father:Scraping linkedin url "+ @search_linkedin_url
32
+ @page=@agent.get @search_linkedin_url
33
+ @page.search(".vcard").each do |node|
34
+ @contacts<<Linkedin::Contact.new(node)
35
+ end
36
+ rescue Mechanize::ResponseCodeError=>e
37
+ puts "RESCUE"
38
+ end
39
+ return @contacts
40
+ end
41
+
42
+
43
+ #TODO refactor this method; each case needs a separate method
44
+ def get_verified_contact
45
+ get_contacts
46
+ @contacts.each do |contact|
47
+ #check current company
48
+ contact.current_companies.each do |company|
49
+ if company[:current_company]
50
+ if company[:current_company].match(/#{@company}/i)
51
+ @matched_tag="CURRENT"
52
+ return contact
53
+ end
54
+ end
55
+ end if contact.current_companies
56
+
57
+ #title of profile
58
+ if contact.title && contact.title.match(/#{@company}/i)
59
+ @matched_tag="CURRENT"
60
+ return contact
61
+ end
62
+
63
+ #check past companies
64
+ contact.past_companies.each do |company|
65
+ if company[:past_company]
66
+ if company[:past_company].match(/#{@company}/i)
67
+ @matched_tag="PAST"
68
+ return contact
69
+ end
70
+ end
71
+ end if contact.past_companies
72
+ #
73
+ #Going in to profile homepage and then checking
74
+ #
75
+ sleep(2+rand(4))
76
+ puts "===>Child:Scraping linkedin url: "+ contact.linkedin_url
77
+ profile=contact.get_profile(get_agent.get(contact.linkedin_url),contact.linkedin_url)
78
+ #check current company
79
+ profile.current_companies.each do |company|
80
+ if company[:current_company]
81
+ if company[:current_company].match(/#{@company}/i)
82
+ @matched_tag="CURRENT"
83
+ return profile
84
+ end
85
+ end
86
+ end if profile.current_companies
87
+
88
+ #title of profile
89
+ if profile.title
90
+ if profile.title.match(/#{@company}/i)
91
+ @matched_tag="CURRENT"
92
+ return profile
93
+ end
94
+ end
95
+ #check past companies
96
+ profile.past_companies.each do |company|
97
+ if company[:past_company]
98
+ if company[:past_company].match(/#{@company}/i)
99
+ @matched_tag="PAST"
100
+ return profile
101
+ end
102
+ end
103
+ end if profile.past_companies
104
+ #check recommended visitors
105
+ if profile.recommended_visitors
106
+ cnt=0
107
+ profile.recommended_visitors.each do |visitor|
108
+ if visitor[:company]
109
+ if visitor[:company].match(/#{@company}/i)
110
+ cnt+=1
111
+ end
112
+ end
113
+ end
114
+ @probability=cnt/profile.recommended_visitors.length.to_f
115
+ @matched_tag="RECOMMENDED"
116
+ return profile if @probability>=0.5
117
+ end
118
+
119
+ end unless @contacts.empty?
120
+ return nil
121
+ end
122
+
123
+
124
+ end
125
+ end
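The last fallback in `get_verified_contact` counts how many recommended visitors work at the target company and accepts the profile when that fraction is at least 0.5. The sketch below isolates that calculation under stated assumptions: `company_match_ratio` is a hypothetical helper name, the visitor hashes are invented sample data, and `Regexp.escape` is an addition that the code above does not use (it interpolates the company string into the pattern as-is).

```ruby
# Hypothetical helper, not part of the gem: fraction of visitors whose
# :company matches the target company, case-insensitively.
def company_match_ratio(visitors, company)
  return 0.0 if visitors.nil? || visitors.empty?
  # Regexp.escape is an addition here; the gem interpolates the raw string,
  # so a name such as "C++ Ltd" would be read as regex syntax there.
  pattern = /#{Regexp.escape(company)}/i
  matches = visitors.count { |v| v[:company] && v[:company] =~ pattern }
  matches / visitors.length.to_f
end

# Invented sample data in the shape built by get_recommended_visitors.
visitors = [
  { :name => "A", :company => "Microsoft India" },
  { :name => "B", :company => "Sungard" },
  { :name => "C", :company => nil },
  { :name => "D", :company => "microsoft" }
]

ratio = company_match_ratio(visitors, "Microsoft")
puts(ratio >= 0.5 ? "RECOMMENDED" : "no match")
```

Returning 0.0 for an empty list is a choice made in this sketch; it avoids the 0/0.0 division that would otherwise yield NaN.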
@@ -0,0 +1,134 @@
1
+ # To change this template, choose Tools | Templates
2
+ # and open the template in the editor.
3
+ module Linkedin
4
+
5
+ class Contact
6
+ #the First name of the contact
7
+ attr_accessor :first_name
8
+ #the last name of the contact
9
+ attr_accessor :last_name
10
+ #the linkedin job title
11
+ attr_accessor :title
12
+ #the location of the contact
13
+ attr_accessor :location
14
+ #the country of the contact
15
+ attr_accessor :country
16
+ #the industry to which the contact belongs
17
+ attr_accessor :industry
18
+ #the entire profile of the contact
19
+ attr_accessor :profile
20
+
21
+ #Array of hashes containing the contact's past companies and job titles
22
+ #Example
23
+ # [
24
+ # [0] {
25
+ # :past_title => "Intern",
26
+ # :past_company => "Sungard"
27
+ # },
28
+ # [1] {
29
+ # :past_title => "Software Developer",
30
+ # :past_company => "Microsoft"
31
+ # }
32
+ # ]
33
+
34
+ attr_accessor :past_companies
35
+ #Array of hashes containing the contact's current companies and job titles
36
+ #Example
37
+ # [
38
+ # [0] {
39
+ # :current_title => "Intern",
40
+ # :current_company => "Sungard"
41
+ # },
42
+ # [1] {
43
+ # :current_title => "Software Developer",
44
+ # :current_company => "Microsoft"
45
+ # }
46
+ # ]
47
+ attr_accessor :current_companies
48
+
49
+ attr_accessor :linkedin_url
50
+
51
+ attr_accessor :profile
52
+
53
+ def initialize(node=[])
54
+ unless node.class==Array
55
+ @first_name=get_first_name(node)
56
+ @last_name=get_last_name(node)
57
+ @title=get_title(node)
58
+ @location=get_location(node)
59
+ @country=get_country(node)
60
+ @industry=get_industry(node)
61
+ @current_companies=get_current_companies node
62
+ @past_companies=get_past_companies node
63
+ @linkedin_url=get_linkedin_url node
64
+ end
65
+ end
66
+ #page is a Nokogiri::XML node of the profile page
67
+ #returns object of Linkedin::Profile
68
+ def get_profile page,url
69
+ @profile=Linkedin::Profile.new(page,url)
70
+ end
71
+
72
+ private
73
+
74
+ def get_first_name node
75
+ return node.at(".given-name").text.strip if node.search(".given-name").first
76
+ end
77
+
78
+ def get_last_name node
79
+ return node.at(".family-name").text.strip if node.search(".family-name").first
80
+ end
81
+
82
+ def get_title node
83
+ return node.at(".title").text.gsub(/\s+/, " ").strip if node.search(".title").first
84
+ end
85
+
86
+ def get_location node
87
+ return node.at(".location").text.split(",").first.strip if node.search(".location").first
88
+
89
+ end
90
+
91
+ def get_country node
92
+ return node.at(".location").text.split(",").last.strip if node.search(".location").first
93
+
94
+ end
95
+
96
+ def get_industry node
97
+ return node.at(".industry").text.strip if node.search(".industry").first
98
+ end
99
+
100
+ def get_linkedin_url node
101
+ node.at("h2/strong/a").attributes["href"]
102
+ end
103
+
104
+ def get_current_companies node
105
+ current_cs=[]
106
+ if node.search(".current-content").first
107
+ node.at(".current-content").text.split(",").each do |content|
108
+ title,company=content.split(" at ")
109
+ company=company.gsub(/\s+/, " ").strip if company
110
+ title=title.gsub(/\s+/, " ").strip if title
111
+ current_company={:current_company=>company,:current_title=> title}
112
+ current_cs<<current_company
113
+ end
114
+ return current_cs
115
+ end
116
+ end
117
+
118
+ def get_past_companies node
119
+ past_cs=[]
120
+ if node.search(".past-content").first
121
+ node.at(".past-content").text.split(",").each do |content|
122
+ title,company=content.split(" at ")
123
+ company=company.gsub(/\s+/, " ").strip if company
124
+ title=title.gsub(/\s+/, " ").strip if title
125
+ past_company={:past_company=>company,:past_title=> title }
126
+ past_cs<<past_company
127
+ end
128
+ return past_cs
129
+ end
130
+ end
131
+
132
+ end
133
+
134
+ end
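Both `get_current_companies` and `get_past_companies` rely on the same idea: split each text fragment on `" at "`, then collapse runs of whitespace in the two halves. A minimal sketch of that step on plain strings follows. `parse_position` is a hypothetical name used only for illustration, and the input strings are invented.

```ruby
# Hypothetical helper, not part of the gem: splits "Title at Company" text
# into its two parts and normalizes whitespace, as the scraper methods do.
def parse_position(text)
  title, company = text.split(" at ")
  title   = title.gsub(/\s+/, " ").strip if title
  company = company.gsub(/\s+/, " ").strip if company
  { :title => title, :company => company }
end

full  = parse_position("Software   Developer at  Microsoft ")
alone = parse_position("Freelancer")

puts full.inspect
puts alone.inspect
```

When the text contains no `" at "`, the company comes back as `nil`, which is why the matching code in `Linkedin::Client` checks that the company value is present before calling `match` on it.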
@@ -0,0 +1,148 @@
1
+ # To change this template, choose Tools | Templates
2
+ # and open the template in the editor.
3
+ module Linkedin
4
+ class Profile
5
+ USER_AGENTS = ["Windows IE 6", "Windows IE 7", "Windows Mozilla", "Mac Safari", "Mac FireFox", "Mac Mozilla", "Linux Mozilla", "Linux Firefox", "Linux Konqueror"]
6
+ #the First name of the contact
7
+ attr_accessor :first_name
8
+ #the last name of the contact
9
+ attr_accessor :last_name
10
+ #the linkedin job title
11
+ attr_accessor :title
12
+ #the location of the contact
13
+ attr_accessor :location
14
+ #the country of the contact
15
+ attr_accessor :country
16
+ #the industry to which the contact belongs
17
+ attr_accessor :industry
18
+ #the entire profile of the contact
19
+ attr_accessor :profile
20
+
21
+ #Array of hashes containing the contact's past companies and job titles
22
+ #Example
23
+ # [
24
+ # [0] {
25
+ # :past_title => "Intern",
26
+ # :past_company => "Sungard"
27
+ # },
28
+ # [1] {
29
+ # :past_title => "Software Developer",
30
+ # :past_company => "Microsoft"
31
+ # }
32
+ # ]
33
+
34
+ attr_accessor :past_companies
35
+ #Array of hashes containing the contact's current companies and job titles
36
+ #Example
37
+ # [
38
+ # [0] {
39
+ # :current_title => "Intern",
40
+ # :current_company => "Sungard"
41
+ # },
42
+ # [1] {
43
+ # :current_title => "Software Developer",
44
+ # :current_company => "Microsoft"
45
+ # }
46
+ # ]
47
+ attr_accessor :current_companies
48
+ #url of the profile
49
+ attr_accessor :linkedin_url
50
+ #Array of hash containing its recommended visitors which come on the
51
+ attr_accessor :recommended_visitors
52
+
53
+ def initialize(page,url)
54
+ @first_name=get_first_name(page)
55
+ @last_name=get_last_name(page)
56
+ @title=get_title(page)
57
+ @location=get_location(page)
58
+ @country=get_country(page)
59
+ @industry=get_industry(page)
60
+ @current_companies=get_current_companies page
61
+ @past_companies=get_past_companies page
62
+ @recommended_visitors=get_recommended_visitors page
63
+ @linkedin_url=url
64
+ end
65
+ #returns nil if the request fails (for example, with a 404 response)
66
+ def get_profile url
67
+ begin
68
+ @agent=Mechanize.new
69
+ @agent.user_agent_alias = USER_AGENTS.sample
70
+ @agent.max_history = 0
71
+ page=@agent.get url
72
+ return Linkedin::Profile.new(page, url)
73
+ rescue=>e
74
+ puts e
75
+ end
76
+ end
77
+
78
+ private
79
+
80
+ def get_first_name page
81
+ return page.at(".given-name").text.strip if page.search(".given-name").first
82
+ end
83
+
84
+ def get_last_name page
85
+ return page.at(".family-name").text.strip if page.search(".family-name").first
86
+ end
87
+
88
+ def get_title page
89
+ return page.at(".headline-title").text.gsub(/\s+/, " ").strip if page.search(".headline-title").first
90
+ end
91
+
92
+ def get_location page
93
+ return page.at(".locality").text.split(",").first.strip if page.search(".locality").first
94
+ end
95
+
96
+ def get_country page
97
+ return page.at(".locality").text.split(",").last.strip if page.search(".locality").first
98
+ end
99
+
100
+ def get_industry page
101
+ return page.at(".industry").text.gsub(/\s+/, " ").strip if page.search(".industry").first
102
+ end
103
+
104
+ def get_past_companies page
105
+ past_cs=[]
106
+ if page.search(".past").first
107
+ page.search(".past").search("li").each do |past_company|
108
+ title,company=past_company.text.strip.split(" at ")
109
+ company=company.gsub(/\s+/, " ").strip if company
110
+ title=title.gsub(/\s+/, " ").strip if title
111
+ past_company={:past_company=>company,:past_title=> title}
112
+ past_cs<<past_company
113
+ end
114
+ return past_cs
115
+ end
116
+ end
117
+
118
+ def get_current_companies page
119
+ current_cs=[]
120
+ if page.search(".current").first
121
+ page.search(".current").search("li").each do |position|
122
+ title,company=position.text.strip.split(" at ")
123
+ company=company.gsub(/\s+/, " ").strip if company
124
+ title=title.gsub(/\s+/, " ").strip if title
125
+ current_company={:current_company=>company,:current_title=> title}
126
+ current_cs<<current_company
127
+ end
128
+ return current_cs
129
+ end
130
+ end
131
+
132
+ def get_recommended_visitors page
133
+ recommended_vs=[]
134
+ if page.search(".browsemap").first
135
+ page.at(".browsemap").at("ul").search("li").each do |visitor|
136
+ v={}
137
+ v[:link]=visitor.at('a').attributes["href"]
138
+ v[:name]=visitor.at('a').text
139
+ v[:title]=visitor.at('.headline').text.split(" at ").first
140
+ v[:company]=visitor.at('.headline').text.split(" at ").last
141
+ recommended_vs<<v
142
+ end
143
+ return recommended_vs
144
+ end
145
+
146
+ end
147
+ end
148
+ end
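`get_location` and `get_country` both read the same `.locality` text and take the first and last comma-separated parts. The sketch below shows that split on plain strings, assuming a locality of the form "Area, Country". `split_locality` is a hypothetical helper name, and the sample strings are invented.

```ruby
# Hypothetical helper, not part of the gem: derives location and country
# from one locality string by taking the first and last comma-separated parts.
def split_locality(text)
  parts = text.split(",").map { |part| part.strip }
  { :location => parts.first, :country => parts.last }
end

two_parts = split_locality("Pune Area, India")
one_part  = split_locality("India")

puts two_parts.inspect
puts one_part.inspect
```

One consequence of this approach is that a locality without a comma yields the same value for both location and country.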
@@ -1,5 +1,5 @@
1
1
  module Linkedin
2
2
  module Scraper
3
- VERSION = "0.0.1"
3
+ VERSION = "0.0.2"
4
4
  end
5
5
  end
@@ -5,8 +5,8 @@ Gem::Specification.new do |gem|
5
5
  gem.authors = ["Yatish Mehta"]
6
6
  gem.email = ["yatishmehta27@gmail.com"]
7
7
  gem.description = %q{Scrapes the linkedin profile when a url is given }
8
- gem.summary = %q{Write a gem summary}
9
- gem.homepage = ""
8
+ gem.summary = %q{when a url of public linkedin profile page is given it scrapes the entire page and converts into a accessible object}
9
+ gem.homepage = "https://github.com/yatishmehta27/linkedin-scraper"
10
10
  gem.add_dependency(%q<httparty>, [">= 0"])
11
11
  gem.add_dependency(%q<mechanize>, [">= 0"])
12
12
  gem.add_dependency(%q<awesome_print>, [">= 0"])
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: linkedin-scraper
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.2
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -14,7 +14,7 @@ default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: httparty
17
- requirement: &10580260 !ruby/object:Gem::Requirement
17
+ requirement: &21059780 !ruby/object:Gem::Requirement
18
18
  none: false
19
19
  requirements:
20
20
  - - ! '>='
@@ -22,10 +22,10 @@ dependencies:
22
22
  version: '0'
23
23
  type: :runtime
24
24
  prerelease: false
25
- version_requirements: *10580260
25
+ version_requirements: *21059780
26
26
  - !ruby/object:Gem::Dependency
27
27
  name: mechanize
28
- requirement: &10577140 !ruby/object:Gem::Requirement
28
+ requirement: &21059160 !ruby/object:Gem::Requirement
29
29
  none: false
30
30
  requirements:
31
31
  - - ! '>='
@@ -33,10 +33,10 @@ dependencies:
33
33
  version: '0'
34
34
  type: :runtime
35
35
  prerelease: false
36
- version_requirements: *10577140
36
+ version_requirements: *21059160
37
37
  - !ruby/object:Gem::Dependency
38
38
  name: awesome_print
39
- requirement: &10576640 !ruby/object:Gem::Requirement
39
+ requirement: &21058580 !ruby/object:Gem::Requirement
40
40
  none: false
41
41
  requirements:
42
42
  - - ! '>='
@@ -44,7 +44,7 @@ dependencies:
44
44
  version: '0'
45
45
  type: :runtime
46
46
  prerelease: false
47
- version_requirements: *10576640
47
+ version_requirements: *21058580
48
48
  description: ! 'Scrapes the linkedin profile when a url is given '
49
49
  email:
50
50
  - yatishmehta27@gmail.com
@@ -55,13 +55,16 @@ files:
55
55
  - .gitignore
56
56
  - Gemfile
57
57
  - LICENSE
58
- - README.md
58
+ - README.rdoc
59
59
  - Rakefile
60
60
  - lib/linkedin-scraper.rb
61
+ - lib/linkedin-scraper/client.rb
62
+ - lib/linkedin-scraper/contact.rb
63
+ - lib/linkedin-scraper/profile.rb
61
64
  - lib/linkedin-scraper/version.rb
62
65
  - linkedin-scraper.gemspec
63
66
  has_rdoc: true
64
- homepage: ''
67
+ homepage: https://github.com/yatishmehta27/linkedin-scraper
65
68
  licenses: []
66
69
  post_install_message:
67
70
  rdoc_options: []
@@ -84,5 +87,6 @@ rubyforge_project:
84
87
  rubygems_version: 1.6.2
85
88
  signing_key:
86
89
  specification_version: 3
87
- summary: Write a gem summary
90
+ summary: when a url of public linkedin profile page is given it scrapes the entire
91
+ page and converts into a accessible object
88
92
  test_files: []
data/README.md DELETED
@@ -1,29 +0,0 @@
1
- # Linkedin::Scraper
2
-
3
- TODO: Write a gem description
4
-
5
- ## Installation
6
-
7
- Add this line to your application's Gemfile:
8
-
9
- gem 'linkedin-scraper'
10
-
11
- And then execute:
12
-
13
- $ bundle
14
-
15
- Or install it yourself as:
16
-
17
- $ gem install linkedin-scraper
18
-
19
- ## Usage
20
-
21
- TODO: Write usage instructions here
22
-
23
- ## Contributing
24
-
25
- 1. Fork it
26
- 2. Create your feature branch (`git checkout -b my-new-feature`)
27
- 3. Commit your changes (`git commit -am 'Added some feature'`)
28
- 4. Push to the branch (`git push origin my-new-feature`)
29
- 5. Create new Pull Request