rubyscholar 0.0.2
- data/.gitignore +18 -0
- data/README.md +40 -0
- data/bin/scrape.rb +22 -0
- data/config.yml +49 -0
- data/lib/rubyscholar.rb +141 -0
- metadata +85 -0
data/.gitignore
ADDED
data/README.md
ADDED
@@ -0,0 +1,40 @@
+# Synopsis
+
+Here is a small script to "scrape" your Google Scholar citations and reformat them (the way I need it for my website).
+Not super flexible - but should be easily customizable.
+
+Some features:
+
+* if registered on Crossref, retrieves corresponding DOIs and can add altmetric.org links.
+  If Crossref doesn't think your email is valid, no DOIs will be retrieved.
+* adds "Cited by N" for popular papers
+
+# How to use:
+
+1. Configure "config.yml"
+   If you want DOI retrieval to work (including Altmetrics), you need to be
+   registered at Crossref (it's free).
+2. Run `ruby bin/scrape.rb > mypublications.html`
+3. That's it.
+
+
+# Potential for improvement:
+
+* uses author list as visible on your main Google Scholar page. Sometimes this
+  means names are chopped in two or just a single author is missing. This could
+  be made smarter.
+* flexible output
+* flexible use of DOIs
+
+# Technologies
+
+Ruby, Nokogiri. Thanks to Google Scholar and Crossref. I hope none of this infringes on anything.
+
+# Contact
+
+RubyScholar was developed by Yannick Wurm (http://yannick.poulet.org). Pull requests, patches and bug reports are welcome. The source code is available on GitHub. Bug reports and feature requests may also be made there.
+
+# Copyright
+
+RubyScholar © 2013 by Yannick Wurm. Licensed under the MIT license.
+
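The DOI feature in the README depends on a Crossref OpenURL query made with the registered email, the first author's last name and the article title. The following is a minimal sketch of building such a query URL with Ruby's standard library. It assumes the parameter names that appear in this package's own files (`redirect`, `pid`, `aulast`, `atitle`); `crossref_query_url` is a hypothetical helper name and is not part of the gem.

```ruby
require 'uri'

# Hypothetical helper (not part of rubyscholar): builds a Crossref OpenURL
# query URL from a registered email, a last name and an article title.
def crossref_query_url(email, last_name, title)
  # encode_www_form percent-encodes reserved characters and joins pairs with "&"
  query = URI.encode_www_form(
    'redirect' => 'false',
    'pid'      => email,
    'aulast'   => last_name,
    'atitle'   => title
  )
  "http://www.crossref.org/openurl?#{query}"
end

puts crossref_query_url('your@email.com', 'Wurm',
                        'Behavioral Genomics: A, Bee, C, G, T')
```

The library file in this package builds the same kind of URL by string concatenation with `URI.escape`, which is no longer available in Ruby 3.0 and later, so a form-encoding call like the one above may be a reasonable substitute on current Ruby versions.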
data/bin/scrape.rb
ADDED
@@ -0,0 +1,22 @@
+require_relative '../lib/rubyscholar'
+require 'yaml'
+
+def scrape()
+  config = YAML.load_file('config.yml')
+  parsed = RubyScholar::Parser.new(config["url"],
+                                   config["email"])
+  formatter = RubyScholar::Formatter.new(parsed,
+                                         config["highlight"],
+                                         config["pdfs"],
+                                         config["altmetricDOIs"],
+                                         config["minCitations"].to_i)
+
+  html = formatter.to_html
+  config["italicize"].each do |term|
+    html.gsub!( term , '<em>' + term + '</em>')
+  end
+
+  f= File.open('scholar.html','w')
+  f.write html
+  f.close
+end
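The loop near the end of `scrape` wraps every configured term in `<em>` tags by modifying the generated HTML string in place. As a rough sketch of the same step written as a function that returns a new string, assuming plain substring matching is what is wanted; `italicize` here is a hypothetical helper, not a method of the gem:

```ruby
# Hypothetical helper (not part of rubyscholar): wraps every occurrence of
# each term in <em> tags and returns a new string, leaving the input untouched.
def italicize(html, terms)
  terms.reduce(html) do |result, term|
    # The block form inserts the replacement literally, so a backslash in a
    # term is never interpreted as a backreference.
    result.gsub(term) { "<em>#{term}</em>" }
  end
end

sample = '<b>The genome of the fire ant Solenopsis invicta.</b>'
puts italicize(sample, ['Solenopsis invicta', 'de novo'])
```

Like the loop in the listing, this would also wrap a term that happens to occur inside an attribute value or a URL. Note too that the file as listed only defines `scrape` and contains no call to it, so a caller would presumably still need to invoke it.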
data/config.yml
ADDED
@@ -0,0 +1,49 @@
+# Google Scholar page (you can choose how you sort it)
+url: "http://scholar.google.com/citations?sortby=pubdate&hl=en&user=k6y0EGsAAAAJ&view_op=list_works"
+
+# Name to highlight
+highlight: "Y Wurm"
+
+
+# Needs an email address that has been registered with Crossref to obtain DOIs
+# using their OpenURL service.
+# e.g. the following should provide an XML file:
+# http://www.crossref.org/openurl?redirect=false&pid=YOUR@EMAIL.COM&aulast=Wurm&atitle=Behavioral%20Genomics:%20A,%20Bee,%20C,%20G,%20T
+email: your@email.com
+
+
+# Show "[Cited Nx]" if N > the following number
+minCitations: 5
+
+# Words to italicize (emphasize). These will have "<em>" around them.
+italicize:
+  - Solenopsis invicta
+  - Acromyrmex echinatior
+  - de novo
+
+# DOIs of articles for which we should show altmetric.org badges.
+altmetricDOIs:
+  - "10.1038/nature11832"
+  - "10.1101/gr.121392.111"
+  - "10.1073/pnas.1009690108"
+  - "10.1073/pnas.1104825108"
+
+# Article titles for which we have URLs to PDFs
+pdfs:
+  "A Y-like social chromosome causes alternative colony organization in fire ants" : "/publications/wangwurm2013socialChromosome.pdf"
+  "Duplication and concerted evolution in a master sex determiner under balancing selection" : "/publications/procb2013.pdf"
+  "Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates" : "/publications/hdy2012122a.pdf"
+  "The Molecular Clockwork of the Fire Ant Solenopsis invicta" : "/publications/ingram2012-fireAntClockGenes.pdf"
+  "Epigenetics: The Making of Ant Castes" : "/publications/2012CurrBiolAntepigenetics.pdf"
+  "Visualization and quality assessment of de novo genome assemblies" : "/publications/Bioinformatics-2011-Riba-Grognuz-3425-6"
+  "The genomic impact of 100 million years of social evolution in seven ant species" : "/publications/TiG2011.pdf"
+  "Relaxed selection is a precursor to the evolution of phenotypic plasticity" : "/publications/hunt2011phenotypicPlasticity.pdf"
+  "The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming" : "/publications/nygaard2011-acromyrmex-genome.pdf"
+  "Behind the Scenes of an Ant Genome Project" : "/publications/wurm2011antGenomeBehindTheScenes.pdf"
+  "The genome of the fire ant Solenopsis invicta" : "/publications/wurm2011fireAntGenome.pdf"
+  "Odorant Binding Proteins of the Red Imported Fire Ant, Solenopsis invicta: An Example of the Problems Facing the Analysis of Widely Divergent Proteins" : "/publications/gotzek2011obps.pdf"
+  "Parasitoid Wasps: From Natural History to Genomic Studies" : "/publications/wurm2010wasps.pdf"
+  "Changes in reproductive roles are associated with changes in gene expression in fire ant queens" : "/publications/wurm2010fireAntQueenDealationExpression.pdf"
+  "Fourmidable: a database for ant genomics" : "/publications/wurm2009antDatabase.pdf"
+  "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
+  "An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta" : "/publications/wang2007fireAntMicroarrays.pdf"
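Since `bin/scrape.rb` reads seven keys from this file (`url`, `email`, `highlight`, `pdfs`, `altmetricDOIs`, `minCitations`, `italicize`), a quick check for missing keys before scraping may save a confusing error later. A minimal sketch using only the standard library; `missing_config_keys` is a hypothetical helper and the example URL is a placeholder, neither comes from the package:

```ruby
require 'yaml'

# Hypothetical helper (not part of rubyscholar): parses YAML text and returns
# the names of the keys that bin/scrape.rb reads but the text does not define.
def missing_config_keys(yaml_text)
  expected = %w[url email highlight pdfs altmetricDOIs minCitations italicize]
  config = YAML.safe_load(yaml_text) || {} # empty input parses to nil/false
  expected.reject { |key| config.key?(key) }
end

example = <<~CONFIG
  url: "http://scholar.google.com/citations?user=PLACEHOLDER"
  highlight: "Y Wurm"
  email: your@email.com
  minCitations: 5
  italicize:
    - de novo
CONFIG

p missing_config_keys(example)
```

With the example text, the two keys left out (`pdfs` and `altmetricDOIs`) are the ones reported.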
data/lib/rubyscholar.rb
ADDED
@@ -0,0 +1,141 @@
+require "nokogiri"
+require "open-uri"
+
+class String
+  def clean
+    # removes leading and trailing whitespace, commas
+    self.gsub!(/(^[\s,]+)|([\s,]+$)/, '')
+    return self
+  end
+end
+
+module RubyScholar
+  class Paper < Struct.new(:title, :url, :authors, :journalName, :journalDetails, :year, :citationCount, :citingPapers, :doi)
+  end
+
+  class Parser
+    attr_accessor :parsedPapers, :crossRefEmail
+
+    def initialize(url, crossRefEmail = "")
+      @parsedPapers = []
+      @crossRefEmail = crossRefEmail # if nil doesn't return any DOI
+      parse(url)
+    end
+
+    def parse(url)
+      papers = Nokogiri::HTML(open(url)).css(".cit-table .item")
+      STDOUT << "Found #{papers.length} papers.\n"
+      papers.each do |paper|
+        paperDetails = paper.css("#col-title")
+        title = paperDetails[0].children[0].content.clean
+        googleUrl = paperDetails[0].children[0].attribute('href')
+        authors = paperDetails[0].children[2].content.clean
+        authors.gsub!("...", "et al")
+
+        journal = paperDetails[0].children[4].content
+        journalName = journal.split(/,|\d/).first.clean
+        journalDetails = journal.gsub(journalName, '').clean
+
+        year = paper.css("#col-year").text # is the last thing we get
+
+        #citations
+        citeInfo = paper.css(".cit-dark-link")
+        citationCount = citeInfo.text
+        citationUrl = citationCount.empty? ? nil : citeInfo.attribute('href').to_s
+
+        # get DOI: needs last name of first author, no funny chars
+        lastNameFirstAuthor = ((authors.split(',').first ).split(' ').last ).gsub(/[^A-Za-z\-]/, '')
+        doi = getDoi( lastNameFirstAuthor, title, @crossRefEmail)
+
+        @parsedPapers.push(Paper.new( title, googleUrl, authors, journalName, journalDetails, year, citationCount, citationUrl, doi))
+      end
+      STDOUT << "Scraped #{parsedPapers.length} from Google Scholar.\n"
+    end
+
+    # Scholar doesn't provide DOI.
+    # But if registered at Crossref (it's free), DOI can be retrieved.
+    def getDoi(lastNameFirstAuthor, title, crossRefEmail)
+      return '' if @crossRefEmail.nil?
+      sleep(1) # to reduce risk
+      STDERR << "Getting DOI for paper by #{lastNameFirstAuthor}: #{title}.\n"
+      url = 'http://www.crossref.org/openurl?redirect=false' +
+        '&pid=' + crossRefEmail +
+        '&aulast=' + lastNameFirstAuthor +
+        '&atitle=' + URI.escape(title)
+      crossRefXML = Nokogiri::XML(open(url))
+      crossRefXML.search("doi").children.first.content rescue ''
+    end
+  end
+
+  class Formatter
+    attr_accessor :parser, :nameToHighlight, :pdfLinks, :altmetricDOIs
+
+    def initialize(parser, nameToHighlight = nil, pdfLinks = {}, altmetricDOIs = [], minCitationCount = 1)
+      @parser = parser
+      @nameToHighlight = nameToHighlight
+      @pdfLinks = pdfLinks
+      @altmetricDOIs = altmetricDOIs
+      @minCitations = minCitationCount
+    end
+
+    def to_html
+      ##@doc = Nokogiri::HTML::DocumentFragment.parse ""
+      builder = Nokogiri::HTML::Builder.new do |doc|
+        doc.html {
+          doc.body {
+            @parser.parsedPapers.each_with_index { |paper, index|
+              doc.div( :class => "publication") {
+                doc.p {
+                  doc.text ((@parser.parsedPapers).length - index).to_s + '. '
+
+                  if paper[:authors].include?(@nameToHighlight)
+                    doc.text( paper[:authors].sub(Regexp.new(@nameToHighlight + '.*'), '') )
+                    doc.span( :class => "me") { doc.text @nameToHighlight }
+                    doc.text( paper[:authors].sub(Regexp.new('.*' + @nameToHighlight), '') )
+                  else
+                    doc.text( paper[:authors])
+                  end
+
+                  doc.text ' ' + paper[:year] + '. '
+                  doc.b paper[:title] + '.'
+                  doc.br
+                  doc.em paper[:journalName]
+                  doc.text ' '
+                  doc.text paper[:journalDetails]
+
+                  unless paper[ :doi].empty?
+                    doc.text(' ')
+                    doc.a( :href => URI.join("http://dx.doi.org/", paper[ :doi])) {
+                      doc.text "[DOI]"
+                    }
+                  end
+                  if @pdfLinks.keys.include?(paper[:title])
+                    doc.text(' ')
+                    doc.a( :href => @pdfLinks[paper[:title]]) {
+                      doc.text "[PDF]"
+                    }
+                  end
+                  if paper[ :citationCount].to_i > @minCitations
+                    doc.text(' ')
+                    doc.a( :href => paper[ :citingPapers]) {
+                      doc.text("[Cited #{paper[ :citationCount]}x]")
+                    }
+                  end
+                  if altmetricDOIs.include?( paper[ :doi])
+                    doc.text(' ')
+                    doc.span( :class => 'altmetric-embed',
+                              :'data-badge-popover' => 'bottom',
+                              :'data-doi' => paper[ :doi] )
+                  end
+                }
+              }
+            }
+          }
+        }
+      end
+      return builder.to_html
+    end
+  end
+end
+
+
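In `Formatter#to_html`, the author string is split around the highlighted name with two regular expressions built from the configured name. A possible alternative, sketched here under the assumption that highlighting only the first occurrence is acceptable, is `String#partition`, which treats the name as a literal string rather than a pattern. `split_around_name` is a hypothetical helper and the author list is made up for illustration:

```ruby
# Hypothetical helper (not part of rubyscholar): splits an author list into
# the text before a name, the name itself, and the text after it.
# Returns [authors, nil, ''] when the name does not occur.
def split_around_name(authors, name)
  before, match, after = authors.partition(name)
  return [authors, nil, ''] if match.empty?
  [before, match, after]
end

before, me, after = split_around_name('A Author, Y Wurm, B Author', 'Y Wurm')
puts "#{before}<span class=\"me\">#{me}</span>#{after}"
```

One difference from the listing: when the name occurs more than once, the regular expressions keep the text before the first occurrence and after the last one, while `partition` splits at the first occurrence only.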
metadata
ADDED
@@ -0,0 +1,85 @@
+--- !ruby/object:Gem::Specification
+name: rubyscholar
+version: !ruby/object:Gem::Version
+  version: 0.0.2
+  prerelease:
+platform: ruby
+authors:
+- Yannick Wurm
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-08-18 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 2.5.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 2.5.0
+description: A small script to "scrape" your Google Scholar citations and reformat
+  them. It doesn't do a whole lot, but it's still useful.
+email:
+- y.wurm@qmul.ac.uk
+executables:
+- scrape.rb
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- README.md
+- bin/scrape.rb
+- config.yml
+- lib/rubyscholar.rb
+homepage: ''
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.23
+signing_key:
+specification_version: 3
+summary: RubyScholar - Scrape your Google Scholar citations.
+test_files: []
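The metadata pins nokogiri with the pessimistic operator `~> 1.6.0`, which accepts 1.6.x releases but not 1.7.0 or later. As a small illustration, the constraint can be checked against a few version strings with RubyGems' own classes; the version numbers tried here are arbitrary examples, not a statement about which releases exist:

```ruby
require 'rubygems'

# The runtime constraint recorded in the metadata for nokogiri.
nokogiri_requirement = Gem::Requirement.new('~> 1.6.0')

# Arbitrary example versions to test against the constraint.
%w[1.5.9 1.6.0 1.6.8 1.7.0].each do |version|
  accepted = nokogiri_requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{accepted ? 'accepted' : 'rejected'}"
end
```

Only the two 1.6.x versions in the example list satisfy the requirement.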