rubyscholar 0.0.2

Files changed (6)
  1. data/.gitignore +18 -0
  2. data/README.md +40 -0
  3. data/bin/scrape.rb +22 -0
  4. data/config.yml +49 -0
  5. data/lib/rubyscholar.rb +141 -0
  6. metadata +85 -0
data/.gitignore ADDED
@@ -0,0 +1,18 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ coverage
+ InstalledFiles
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
+
+ # YARD artifacts
+ .yardoc
+ _yardoc
+ doc/
data/README.md ADDED
@@ -0,0 +1,40 @@
+ # Synopsis
+
+ Here is a small script to "scrape" your Google Scholar citations and reformat them (the way I need it for my website).
+ Not super flexible - but should be easily customizable.
+
+ Some features:
+
+ * if registered on Crossref, retrieves corresponding DOIs and can add altmetric.org links.
+   If Crossref doesn't think your email is valid, no DOIs will be retrieved.
+ * adds "Cited by N" for popular papers
+
+ # How to use:
+
+ 1. Configure "config.yml"
+    If you want DOI retrieval to work (including Altmetrics), you need to be
+    registered at Crossref (it's free).
+ 2. Run `ruby bin/scrape.rb > mypublications.html`
+ 3. That's it.
+
+
+ # Potential for improvement:
+
+ * uses author list as visible on your main Google Scholar page. Sometimes this
+   means names are chopped in two or just a single author is missing. This could
+   be made smarter.
+ * flexible output
+ * flexible use of DOIs
+
+ # Technologies
+
+ Ruby, Nokogiri. Thanks to Google Scholar and Crossref. I hope none of this infringes on anything.
+
+ # Contact
+
+ RubyScholar was developed by Yannick Wurm (http://yannick.poulet.org). Pull requests, patches and bug reports are welcome. The source code is available on GitHub. Bug reports and feature requests may also be made there.
+
+ # Copyright
+
+ RubyScholar © 2013 by Yannick Wurm. Licensed under the MIT license.
+
data/bin/scrape.rb ADDED
@@ -0,0 +1,22 @@
+ require_relative '../lib/rubyscholar'
+ require 'yaml'
+
+ def scrape()
+   config = YAML.load_file('config.yml')
+   parsed = RubyScholar::Parser.new(config["url"],
+                                    config["email"])
+   formatter = RubyScholar::Formatter.new(parsed,
+                                          config["highlight"],
+                                          config["pdfs"],
+                                          config["altmetricDOIs"],
+                                          config["minCitations"].to_i)
+
+   html = formatter.to_html
+   config["italicize"].each do |term|
+     html.gsub!( term , '<em>' + term + '</em>')
+   end
+
+   f= File.open('scholar.html','w')
+   f.write html
+   f.close
+ end
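Note that, as shown in this diff, the file only defines `scrape` and never calls it, so running it on its own would not be expected to produce output. The italicize step can be sketched in isolation as follows. The `italicize` helper and the sample HTML string are hypothetical and are not part of the gem; they only mirror the `gsub!` loop above.

```ruby
# Hypothetical helper (not part of rubyscholar): wraps every occurrence of
# each term in <em> tags, mirroring the gsub! loop in bin/scrape.rb.
# It works on a copy, so the caller's string is left untouched.
def italicize(html, terms)
  terms.reduce(html.dup) do |out, term|
    out.gsub(term, "<em>#{term}</em>")
  end
end

# Made-up sample input in the style of the formatter's output.
sample = '<b>The genome of the fire ant Solenopsis invicta.</b>'
result = italicize(sample, ['Solenopsis invicta', 'de novo'])
puts result
```

Because the match is a plain substring replacement on the finished HTML, a term that also occurs inside an attribute value (a PDF path, for instance) would be wrapped as well, which may be worth keeping in mind when choosing terms.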
data/config.yml ADDED
@@ -0,0 +1,49 @@
+ # Google Scholar page (you can choose how you sort it)
+ url: "http://scholar.google.com/citations?sortby=pubdate&hl=en&user=k6y0EGsAAAAJ&view_op=list_works"
+
+ # Name to highlight
+ highlight: "Y Wurm"
+
+
+ # Need an Email address that has been registered with CrossRef to obtain DOIs
+ # using their OpenURL service.
+ # e.g. the following should provide an XML file:
+ # http://www.crossref.org/openurl?redirect=false&pid=YOUR@EMAIL.COM&aulast=Wurm&atitle=Behavioral%20Genomics:%20A,%20Bee,%20C,%20G,%20T
+ email: your@email.com
+
+
+ # Show "[Cited Nx]" if N > the following number
+ minCitations: 5
+
+ # Words to italicize (emphasize). These will have "<em>" around them.
+ italicize:
+   - Solenopsis invicta
+   - Acromyrmex echinatior
+   - de novo
+
+ # DOIs of articles for which we should show altmetric.org badges.
+ altmetricDOIs:
+   - "10.1038/nature11832"
+   - "10.1101/gr.121392.111"
+   - "10.1073/pnas.1009690108"
+   - "10.1073/pnas.1104825108"
+
+ # Article titles for which we have urls to PDFs
+ pdfs:
+   "A Y-like social chromosome causes alternative colony organization in fire ants" : "/publications/wangwurm2013socialChromosome.pdf"
+   "Duplication and concerted evolution in a master sex determiner under balancing selection" : "/publications/procb2013.pdf"
+   "Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates" : "/publications/hdy2012122a.pdf"
+   "The Molecular Clockwork of the Fire Ant Solenopsis invicta" : "/publications/ingram2012-fireAntClockGenes.pdf"
+   "Epigenetics: The Making of Ant Castes" : "/publications/2012CurrBiolAntepigenetics.pdf"
+   "Visualization and quality assessment of de novo genome assemblies" : "/publications/Bioinformatics-2011-Riba-Grognuz-3425-6"
+   "The genomic impact of 100 million years of social evolution in seven ant species" : "/publications/TiG2011.pdf"
+   "Relaxed selection is a precursor to the evolution of phenotypic plasticity" : "/publications/hunt2011phenotypicPlasticity.pdf"
+   "The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming" : "/publications/nygaard2011-acromyrmex-genome.pdf"
+   "Behind the Scenes of an Ant Genome Project" : "/publications/wurm2011antGenomeBehindTheScenes.pdf"
+   "The genome of the fire ant Solenopsis invicta" : "/publications/wurm2011fireAntGenome.pdf"
+   "Odorant Binding Proteins of the Red Imported Fire Ant, Solenopsis invicta: An Example of the Problems Facing the Analysis of Widely Divergent Proteins" : "/publications/gotzek2011obps.pdf"
+   "Parasitoid Wasps: From Natural History to Genomic Studies" : "/publications/wurm2010wasps.pdf"
+   "Changes in reproductive roles are associated with changes in gene expression in fire ant queens" : "/publications/wurm2010fireAntQueenDealationExpression.pdf"
+   "Fourmidable: a database for ant genomics" : "/publications/wurm2009antDatabase.pdf"
+   "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
+   "An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta" : "/publications/wang2007fireAntMicroarrays.pdf"
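The `email` key feeds the CrossRef OpenURL query described in the comments above. A minimal sketch of assembling that query from the config follows. The `crossref_url` helper is hypothetical and not part of the gem, and it swaps in `URI.encode_www_form` for escaping (`URI.escape` was removed in Ruby 3.0). How the CrossRef endpoint treats this encoding, which writes spaces as `+`, is an assumption that is not verified here.

```ruby
require 'uri'
require 'yaml'

# Inline stand-in for config.yml; only the key used below is included.
config = YAML.load(<<~YML)
  email: your@email.com
YML

# Hypothetical helper (not part of rubyscholar): builds the OpenURL query
# string with every parameter form-encoded.
def crossref_url(email, last_name, title)
  query = URI.encode_www_form(
    'redirect' => 'false',
    'pid'      => email,
    'aulast'   => last_name,
    'atitle'   => title
  )
  "http://www.crossref.org/openurl?#{query}"
end

url = crossref_url(config['email'], 'Wurm', 'Behavioral Genomics: A, Bee, C, G, T')
puts url
```

Encoding every parameter, including the email address, avoids malformed requests when a title contains characters such as `&` or `#`.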
data/lib/rubyscholar.rb ADDED
@@ -0,0 +1,141 @@
+ require "nokogiri"
+ require "open-uri"
+
+ class String
+   def clean
+     # removes leading and trailing whitespace, commas
+     self.gsub!(/(^[\s,]+)|([\s,]+$)/, '')
+     return self
+   end
+ end
+
+ module RubyScholar
+   class Paper < Struct.new(:title, :url, :authors, :journalName, :journalDetails, :year, :citationCount, :citingPapers, :doi)
+   end
+
+   class Parser
+     attr_accessor :parsedPapers, :crossRefEmail
+
+     def initialize(url, crossRefEmail = "")
+       @parsedPapers = []
+       @crossRefEmail = crossRefEmail # if nil doesn't return any DOI
+       parse(url)
+     end
+
+     def parse(url)
+       papers = Nokogiri::HTML(open(url)).css(".cit-table .item")
+       STDOUT << "Found #{papers.length} papers.\n"
+       papers.each do |paper|
+         paperDetails = paper.css("#col-title")
+         title = paperDetails[0].children[0].content.clean
+         googleUrl = paperDetails[0].children[0].attribute('href')
+         authors = paperDetails[0].children[2].content.clean
+         authors.gsub!("...", "et al")
+
+         journal = paperDetails[0].children[4].content
+         journalName = journal.split(/,|\d/).first.clean
+         journalDetails = journal.gsub(journalName, '').clean
+
+         year = paper.css("#col-year").text # is the last thing we get
+
+         # citations
+         citeInfo = paper.css(".cit-dark-link")
+         citationCount = citeInfo.text
+         citationUrl = citationCount.empty? ? nil : citeInfo.attribute('href').to_s
+
+         # get DOI: needs last name of first author, no funny chars
+         lastNameFirstAuthor = ((authors.split(',').first ).split(' ').last ).gsub(/[^A-Za-z\-]/, '')
+         doi = getDoi( lastNameFirstAuthor, title, @crossRefEmail)
+
+         @parsedPapers.push(Paper.new( title, googleUrl, authors, journalName, journalDetails, year, citationCount, citationUrl, doi))
+       end
+       STDOUT << "Scraped #{parsedPapers.length} from Google Scholar.\n"
+     end
+
+     # Scholar doesn't provide DOI.
+     # But if registered at Crossref (it's free), DOI can be retrieved.
+     def getDoi(lastNameFirstAuthor, title, crossRefEmail)
+       return '' if @crossRefEmail.nil?
+       sleep(1) # to reduce risk
+       STDERR << "Getting DOI for paper by #{lastNameFirstAuthor}: #{title}.\n"
+       url = 'http://www.crossref.org/openurl?redirect=false' +
+             '&pid=' + crossRefEmail +
+             '&aulast=' + lastNameFirstAuthor +
+             '&atitle=' + URI.escape(title)
+       crossRefXML = Nokogiri::XML(open(url))
+       crossRefXML.search("doi").children.first.content rescue ''
+     end
+   end
+
+   class Formatter
+     attr_accessor :parser, :nameToHighlight, :pdfLinks, :altmetricDOIs
+
+     def initialize(parser, nameToHighlight = nil, pdfLinks = {}, altmetricDOIs = [], minCitationCount = 1)
+       @parser = parser
+       @nameToHighlight = nameToHighlight
+       @pdfLinks = pdfLinks
+       @altmetricDOIs = altmetricDOIs
+       @minCitations = minCitationCount
+     end
+
+     def to_html
+       ##@doc = Nokogiri::HTML::DocumentFragment.parse ""
+       builder = Nokogiri::HTML::Builder.new do |doc|
+         doc.html {
+           doc.body {
+             @parser.parsedPapers.each_with_index { |paper, index|
+               doc.div( :class => "publication") {
+                 doc.p {
+                   doc.text ((@parser.parsedPapers).length - index).to_s + '. '
+
+                   if paper[:authors].include?(@nameToHighlight)
+                     doc.text( paper[:authors].sub(Regexp.new(@nameToHighlight + '.*'), '') )
+                     doc.span( :class => "me") { doc.text @nameToHighlight }
+                     doc.text( paper[:authors].sub(Regexp.new('.*' + @nameToHighlight), '') )
+                   else
+                     doc.text( paper[:authors])
+                   end
+
+                   doc.text ' ' + paper[:year] + '. '
+                   doc.b paper[:title] + '.'
+                   doc.br
+                   doc.em paper[:journalName]
+                   doc.text ' '
+                   doc.text paper[:journalDetails]
+
+                   unless paper[ :doi].empty?
+                     doc.text(' ')
+                     doc.a( :href => URI.join("http://dx.doi.org/", paper[ :doi])) {
+                       doc.text "[DOI]"
+                     }
+                   end
+                   if @pdfLinks.keys.include?(paper[:title])
+                     doc.text(' ')
+                     doc.a( :href => @pdfLinks[paper[:title]]) {
+                       doc.text "[PDF]"
+                     }
+                   end
+                   if paper[ :citationCount].to_i > @minCitations
+                     doc.text(' ')
+                     doc.a( :href => paper[ :citingPapers]) {
+                       doc.text("[Cited #{paper[ :citationCount]}x]")
+                     }
+                   end
+                   if altmetricDOIs.include?( paper[ :doi])
+                     doc.text(' ')
+                     doc.span( :class => 'altmetric-embed',
+                               :'data-badge-popover' => 'bottom',
+                               :'data-doi' => paper[ :doi] )
+                   end
+                 }
+               }
+             }
+           }
+         }
+       end
+       return builder.to_html
+     end
+   end
+ end
+
+
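The journal handling in `Parser#parse` splits the Scholar citation line at the first comma or digit and treats everything before it as the journal name. A minimal standalone sketch of that step follows. The helper names `strip_edges` and `split_journal` and the sample citation string are hypothetical. Unlike `String#clean` above, the sketch does not monkey-patch `String`, does not mutate its argument, and anchors on `\A`/`\z` (whole string) rather than `^`/`$` (each line).

```ruby
# Hypothetical non-mutating counterpart to String#clean: strips leading and
# trailing whitespace and commas from the whole string.
def strip_edges(text)
  text.gsub(/\A[\s,]+|[\s,]+\z/, '')
end

# Hypothetical helper mirroring the journalName / journalDetails logic:
# the name is everything before the first comma or digit, the details are
# whatever remains once the name is removed.
def split_journal(journal)
  name    = strip_edges(journal.split(/,|\d/).first.to_s)
  details = strip_edges(journal.sub(name, ''))
  [name, details]
end

# Illustrative input in the style of a Scholar citation line.
name, details = split_journal('Nature 493 (7434), 664-668')
puts "name:    #{name}"
puts "details: #{details}"
```

A journal whose name itself contains a digit or a comma would be cut short by this rule, which is the kind of limitation the README already flags under potential improvements.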
metadata ADDED
@@ -0,0 +1,85 @@
+ --- !ruby/object:Gem::Specification
+ name: rubyscholar
+ version: !ruby/object:Gem::Version
+   version: 0.0.2
+   prerelease:
+ platform: ruby
+ authors:
+ - Yannick Wurm
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2013-08-18 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.6.0
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.6.0
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 2.5.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 2.5.0
+ description: A small script to "scrape" your Google Scholar citations and reformat
+   them. It doesn't do a whole lot, but it's still useful.
+ email:
+ - y.wurm@qmul.ac.uk
+ executables:
+ - scrape.rb
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .gitignore
+ - README.md
+ - bin/scrape.rb
+ - config.yml
+ - lib/rubyscholar.rb
+ homepage: ''
+ licenses:
+ - MIT
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 1.8.23
+ signing_key:
+ specification_version: 3
+ summary: RubyScholar - Scrape your Google Scholar citations.
+ test_files: []
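The `~>` entries in the dependency list are RubyGems "pessimistic" version constraints. A small sketch of what `~> 1.6.0` admits, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems; the candidate version numbers are arbitrary examples chosen for illustration.

```ruby
require 'rubygems'

# "~> 1.6.0" means: at least 1.6.0, but below 1.7.0.
requirement = Gem::Requirement.new('~> 1.6.0')

%w[1.5.9 1.6.0 1.6.8 1.7.0].each do |candidate|
  allowed = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{allowed ? 'allowed' : 'rejected'}"
end
```

So the nokogiri constraint in this metadata accepts patch releases within the 1.6 series and rejects anything from 1.7.0 upward.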