rubyscholar 0.0.2
- data/.gitignore +18 -0
- data/README.md +40 -0
- data/bin/scrape.rb +22 -0
- data/config.yml +49 -0
- data/lib/rubyscholar.rb +141 -0
- metadata +85 -0
data/.gitignore
ADDED
data/README.md
ADDED
@@ -0,0 +1,40 @@
+# Synopsis
+
+Here is a small script to "scrape" your Google Scholar citations and reformat them (the way I need it for my website).
+Not super flexible - but should be easily customizable.
+
+Some features:
+
+* if registered on Crossref, retrieves corresponding DOIs and can add altmetric.org links.
+  If Crossref doesn't think your email is valid, no DOIs will be retrieved.
+* adds "Cited by N" for popular papers
+
+# How to use:
+
+1. Configure "config.yml"
+   If you want DOI retrieval to work (including Altmetrics), you need to be
+   registered at Crossref (it's free).
+2. Run `ruby bin/scrape.rb > mypublications.html`
+3. That's it.
+
+
+# Potential for improvement:
+
+* uses author list as visible on your main Google Scholar page. Sometimes this
+  means names are chopped in two or just a single author is missing. This could
+  be made smarter.
+* flexible output
+* flexible use of DOIs
+
+# Technologies
+
+Ruby, Nokogiri. Thanks to Google Scholar and Crossref. I hope none of this infringes on anything.
+
+# Contact
+
+RubyScholar was developed by Yannick Wurm (http://yannick.poulet.org). Pull requests, patches and bug reports are welcome. The source code is available on GitHub. Bug reports and feature requests may also be made there.
+
+# Copyright
+
+RubyScholar © 2013 by Yannick Wurm. Licensed under the MIT license.
+
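The "Cited by N" feature in the README is controlled by `minCitations` in config.yml, and the comparison in `Formatter#to_html` (lib/rubyscholar.rb) is strict. The sketch below isolates that rule; `citation_label` is a hypothetical helper name used only for illustration and is not part of the gem.

```ruby
# Hypothetical helper (not in rubyscholar): mirrors the check
# `paper[:citationCount].to_i > @minCitations` from Formatter#to_html.
def citation_label(citation_count, min_citations)
  count = citation_count.to_s.to_i   # an empty scraped string becomes 0
  count > min_citations ? "[Cited #{count}x]" : nil
end

puts citation_label("6", 5).inspect  # label is produced
puts citation_label("5", 5).inspect  # equal to the threshold: no label
puts citation_label("", 5).inspect   # nothing scraped: no label
```

With `minCitations: 5`, a paper therefore needs at least 6 citations before the link is shown.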
data/bin/scrape.rb
ADDED
@@ -0,0 +1,22 @@
+require_relative '../lib/rubyscholar'
+require 'yaml'
+
+def scrape()
+  config    = YAML.load_file('config.yml')
+  parsed    = RubyScholar::Parser.new(config["url"],
+                                      config["email"])
+  formatter = RubyScholar::Formatter.new(parsed,
+                                         config["highlight"],
+                                         config["pdfs"],
+                                         config["altmetricDOIs"],
+                                         config["minCitations"].to_i)
+
+  html = formatter.to_html
+  config["italicize"].each do |term|
+    html.gsub!( term , '<em>' + term + '</em>')
+  end
+
+  f= File.open('scholar.html','w')
+  f.write html
+  f.close
+end
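Two things stand out in bin/scrape.rb as shown: it defines `scrape` but never calls it, and it writes to `scholar.html` rather than standard output. The italicizing step itself is a plain string substitution, sketched below in isolation. `italicize_terms` is a hypothetical helper name; unlike the loop in the script it returns a new string instead of mutating its input, and it tolerates a missing `italicize` key.

```ruby
# Hypothetical helper (not in rubyscholar): wraps every occurrence of each
# term in <em> tags, as the loop in bin/scrape.rb does with gsub!.
def italicize_terms(html, terms)
  # Array(nil) is [], so a config without an "italicize" key is harmless.
  Array(terms).reduce(html) do |result, term|
    result.gsub(term, "<em>#{term}</em>")
  end
end

sample = "<b>The genome of the fire ant Solenopsis invicta.</b>"
puts italicize_terms(sample, ["Solenopsis invicta", "de novo"])
```

Because the pattern is a plain string, each term is matched literally and case-sensitively, so "De novo" at the start of a title would not be wrapped.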
data/config.yml
ADDED
@@ -0,0 +1,49 @@
+# Google Scholar page (you can choose how you sort it)
+url: "http://scholar.google.com/citations?sortby=pubdate&hl=en&user=k6y0EGsAAAAJ&view_op=list_works"
+
+# Name to highlight
+highlight: "Y Wurm"
+
+
+# Need an email address that has been registered with CrossRef to obtain DOIs
+# using their OpenURL service.
+# e.g. the following should provide an XML file:
+# http://www.crossref.org/openurl?redirect=false&pid=YOUR@EMAIL.COM&aulast=Wurm&atitle=Behavioral%20Genomics:%20A,%20Bee,%20C,%20G,%20T
+email: your@email.com
+
+
+# Show "[Cited Nx]" if N > the following number
+minCitations: 5
+
+# Words to italicize (emphasize). These will have "<em>" around them.
+italicize:
+  - Solenopsis invicta
+  - Acromyrmex echinatior
+  - de novo
+
+# DOIs of articles for which we should show altmetric.org badges.
+altmetricDOIs:
+  - "10.1038/nature11832"
+  - "10.1101/gr.121392.111"
+  - "10.1073/pnas.1009690108"
+  - "10.1073/pnas.1104825108"
+
+# Article titles for which we have urls to PDFs
+pdfs:
+  "A Y-like social chromosome causes alternative colony organization in fire ants" : "/publications/wangwurm2013socialChromosome.pdf"
+  "Duplication and concerted evolution in a master sex determiner under balancing selection" : "/publications/procb2013.pdf"
+  "Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates" : "/publications/hdy2012122a.pdf"
+  "The Molecular Clockwork of the Fire Ant Solenopsis invicta" : "/publications/ingram2012-fireAntClockGenes.pdf"
+  "Epigenetics: The Making of Ant Castes" : "/publications/2012CurrBiolAntepigenetics.pdf"
+  "Visualization and quality assessment of de novo genome assemblies" : "/publications/Bioinformatics-2011-Riba-Grognuz-3425-6"
+  "The genomic impact of 100 million years of social evolution in seven ant species" : "/publications/TiG2011.pdf"
+  "Relaxed selection is a precursor to the evolution of phenotypic plasticity" : "/publications/hunt2011phenotypicPlasticity.pdf"
+  "The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming" : "/publications/nygaard2011-acromyrmex-genome.pdf"
+  "Behind the Scenes of an Ant Genome Project" : "/publications/wurm2011antGenomeBehindTheScenes.pdf"
+  "The genome of the fire ant Solenopsis invicta" : "/publications/wurm2011fireAntGenome.pdf"
+  "Odorant Binding Proteins of the Red Imported Fire Ant, Solenopsis invicta: An Example of the Problems Facing the Analysis of Widely Divergent Proteins" : "/publications/gotzek2011obps.pdf"
+  "Parasitoid Wasps: From Natural History to Genomic Studies" : "/publications/wurm2010wasps.pdf"
+  "Changes in reproductive roles are associated with changes in gene expression in fire ant queens" : "/publications/wurm2010fireAntQueenDealationExpression.pdf"
+  "Fourmidable: a database for ant genomics" : "/publications/wurm2009antDatabase.pdf"
+  "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
+  "An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta" : "/publications/wang2007fireAntMicroarrays.pdf"
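To make the shape of config.yml concrete, here is a minimal sketch of how its keys come out after parsing. It parses a trimmed-down copy of the file from an inline string rather than from disk; the `fetch` defaults are an assumption added for illustration, since bin/scrape.rb reads the keys directly without defaults.

```ruby
require 'yaml'

# Trimmed-down stand-in for config.yml; values are taken from the file above.
SAMPLE_CONFIG = <<~CONFIG
  highlight: "Y Wurm"
  email: your@email.com
  minCitations: 5
  italicize:
    - Solenopsis invicta
    - de novo
  altmetricDOIs:
    - "10.1038/nature11832"
  pdfs:
    "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
CONFIG

config = YAML.safe_load(SAMPLE_CONFIG)

# Assumed defaults (not in the original script) for keys a user might omit.
min_citations = config.fetch("minCitations", 1).to_i
italicize     = config.fetch("italicize", [])
pdf_links     = config.fetch("pdfs", {})

puts "highlight:     #{config['highlight']}"
puts "min citations: #{min_citations}"
puts "italicize:     #{italicize.join(', ')}"
puts "PDF for bees:  #{pdf_links['Behavioral Genomics: A, Bee, C, G, T']}"
```

`pdfs` parses to a Hash keyed by the exact article title, so a PDF link is only attached when the title scraped from Google Scholar matches the key character for character.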
data/lib/rubyscholar.rb
ADDED
@@ -0,0 +1,141 @@
+require "nokogiri"
+require "open-uri"
+
+class String
+  def clean
+    # removes leading and trailing whitespace, commas
+    self.gsub!(/(^[\s,]+)|([\s,]+$)/, '')
+    return self
+  end
+end
+
+module RubyScholar
+  class Paper < Struct.new(:title, :url, :authors, :journalName, :journalDetails, :year, :citationCount, :citingPapers, :doi)
+  end
+
+  class Parser
+    attr_accessor :parsedPapers, :crossRefEmail
+
+    def initialize(url, crossRefEmail = "")
+      @parsedPapers  = []
+      @crossRefEmail = crossRefEmail # if nil doesn't return any DOI
+      parse(url)
+    end
+
+    def parse(url)
+      papers = Nokogiri::HTML(open(url)).css(".cit-table .item")
+      STDOUT << "Found #{papers.length} papers.\n"
+      papers.each do |paper|
+        paperDetails = paper.css("#col-title")
+        title        = paperDetails[0].children[0].content.clean
+        googleUrl    = paperDetails[0].children[0].attribute('href')
+        authors      = paperDetails[0].children[2].content.clean
+        authors.gsub!("...", "et al")
+
+        journal        = paperDetails[0].children[4].content
+        journalName    = journal.split(/,|\d/).first.clean
+        journalDetails = journal.gsub(journalName, '').clean
+
+        year = paper.css("#col-year").text # is the last thing we get
+
+        #citations
+        citeInfo      = paper.css(".cit-dark-link")
+        citationCount = citeInfo.text
+        citationUrl   = citationCount.empty? ? nil : citeInfo.attribute('href').to_s
+
+        # get DOI: needs last name of first author, no funny chars
+        lastNameFirstAuthor = ((authors.split(',').first ).split(' ').last ).gsub(/[^A-Za-z\-]/, '')
+        doi = getDoi( lastNameFirstAuthor, title, @crossRefEmail)
+
+        @parsedPapers.push(Paper.new( title, googleUrl, authors, journalName, journalDetails, year, citationCount, citationUrl, doi))
+      end
+      STDOUT << "Scraped #{parsedPapers.length} from Google Scholar.\n"
+    end
+
+    # Scholar doesn't provide DOI.
+    # But if registered at Crossref (it's free), DOI can be retrieved.
+    def getDoi(lastNameFirstAuthor, title, crossRefEmail)
+      return '' if @crossRefEmail.nil?
+      sleep(1) # to reduce risk
+      STDERR << "Getting DOI for paper by #{lastNameFirstAuthor}: #{title}.\n"
+      url = 'http://www.crossref.org/openurl?redirect=false' +
+        '&pid=' + crossRefEmail +
+        '&aulast=' + lastNameFirstAuthor +
+        '&atitle=' + URI.escape(title)
+      crossRefXML = Nokogiri::XML(open(url))
+      crossRefXML.search("doi").children.first.content rescue ''
+    end
+  end
+
+  class Formatter
+    attr_accessor :parser, :nameToHighlight, :pdfLinks, :altmetricDOIs
+
+    def initialize(parser, nameToHighlight = nil, pdfLinks = {}, altmetricDOIs = [], minCitationCount = 1)
+      @parser          = parser
+      @nameToHighlight = nameToHighlight
+      @pdfLinks        = pdfLinks
+      @altmetricDOIs   = altmetricDOIs
+      @minCitations    = minCitationCount
+    end
+
+    def to_html
+      ##@doc = Nokogiri::HTML::DocumentFragment.parse ""
+      builder = Nokogiri::HTML::Builder.new do |doc|
+        doc.html {
+          doc.body {
+            @parser.parsedPapers.each_with_index { |paper, index|
+              doc.div( :class => "publication") {
+                doc.p {
+                  doc.text ((@parser.parsedPapers).length - index).to_s + '. '
+
+                  if paper[:authors].include?(@nameToHighlight)
+                    doc.text( paper[:authors].sub(Regexp.new(@nameToHighlight + '.*'), '') )
+                    doc.span( :class => "me") { doc.text @nameToHighlight }
+                    doc.text( paper[:authors].sub(Regexp.new('.*' + @nameToHighlight), '') )
+                  else
+                    doc.text( paper[:authors])
+                  end
+
+                  doc.text ' ' + paper[:year] + '. '
+                  doc.b paper[:title] + '.'
+                  doc.br
+                  doc.em paper[:journalName]
+                  doc.text ' '
+                  doc.text paper[:journalDetails]
+
+                  unless paper[ :doi].empty?
+                    doc.text(' ')
+                    doc.a( :href => URI.join("http://dx.doi.org/", paper[ :doi])) {
+                      doc.text "[DOI]"
+                    }
+                  end
+                  if @pdfLinks.keys.include?(paper[:title])
+                    doc.text(' ')
+                    doc.a( :href => @pdfLinks[paper[:title]]) {
+                      doc.text "[PDF]"
+                    }
+                  end
+                  if paper[ :citationCount].to_i > @minCitations
+                    doc.text(' ')
+                    doc.a( :href => paper[ :citingPapers]) {
+                      doc.text("[Cited #{paper[ :citationCount]}x]")
+                    }
+                  end
+                  if altmetricDOIs.include?( paper[ :doi])
+                    doc.text(' ')
+                    doc.span( :class => 'altmetric-embed',
+                              :'data-badge-popover' => 'bottom',
+                              :'data-doi' => paper[ :doi] )
+                  end
+                }
+              }
+            }
+          }
+        }
+      end
+      return builder.to_html
+    end
+  end
+end
+
+
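`Parser#getDoi` assembles the Crossref OpenURL query by string concatenation and escapes only the title with `URI.escape`, a method that was removed in Ruby 3.0. The sketch below shows one way to build the same query on current Rubies. The helper names `last_name_of_first_author` and `crossref_openurl` are hypothetical, and using `URI.encode_www_form` is a substitution for the original escaping rather than the gem's own approach: it encodes every parameter and writes spaces as `+`, which is assumed (not verified here) to be acceptable to the Crossref endpoint.

```ruby
require 'uri'

# Hypothetical helper: last name of the first author, keeping only ASCII
# letters and hyphens (same regex as in Parser#parse). Returns "" when
# the author string is empty or nil instead of raising.
def last_name_of_first_author(authors)
  first_author = authors.to_s.split(',').first.to_s
  first_author.split(' ').last.to_s.gsub(/[^A-Za-z\-]/, '')
end

# Hypothetical helper: builds the OpenURL query string that getDoi
# concatenates by hand, with every parameter form-encoded.
def crossref_openurl(last_name, title, email)
  query = URI.encode_www_form(
    'redirect' => 'false',
    'pid'      => email,
    'aulast'   => last_name,
    'atitle'   => title
  )
  "http://www.crossref.org/openurl?#{query}"
end

authors = "Y Wurm, J Wang, L Keller"
puts crossref_openurl(last_name_of_first_author(authors),
                      "Behavioral Genomics: A, Bee, C, G, T",
                      "your@email.com")
```

Two related details in the file as shown: `getDoi` returns early only when the email is `nil`, while the constructor default is an empty string, and since Ruby 3.0 fetching a URL through open-uri requires `URI.open(url)` instead of `open(url)`.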
metadata
ADDED
@@ -0,0 +1,85 @@
+--- !ruby/object:Gem::Specification
+name: rubyscholar
+version: !ruby/object:Gem::Version
+  version: 0.0.2
+prerelease:
+platform: ruby
+authors:
+- Yannick Wurm
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-08-18 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 2.5.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 2.5.0
+description: A small script to "scrape" your Google Scholar citations and reformat
+  them. It doesn't do a whole lot, but it's still useful.
+email:
+- y.wurm@qmul.ac.uk
+executables:
+- scrape.rb
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- README.md
+- bin/scrape.rb
+- config.yml
+- lib/rubyscholar.rb
+homepage: ''
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.23
+signing_key:
+specification_version: 3
+summary: RubyScholar - Scrape your Google Scholar citations.
+test_files: []
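The metadata pins both dependencies with the pessimistic operator `~>`. As a quick illustration of what `~> 1.6.0` admits for nokogiri, the sketch below checks a few version strings with RubyGems' own `Gem::Requirement`; the sample versions are arbitrary and chosen only to show the boundary.

```ruby
require 'rubygems'

# "~> 1.6.0" means: at least 1.6.0, but below 1.7.
nokogiri_requirement = Gem::Requirement.new('~> 1.6.0')

%w[1.5.11 1.6.0 1.6.8 1.7.0].each do |version|
  allowed = nokogiri_requirement.satisfied_by?(Gem::Version.new(version))
  puts "nokogiri #{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

The same reading applies to the development dependency `rspec ~> 2.5.0`, which accepts 2.5.x releases only.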