rubyscholar 0.0.2

Files changed (6)
  1. data/.gitignore +18 -0
  2. data/README.md +40 -0
  3. data/bin/scrape.rb +22 -0
  4. data/config.yml +49 -0
  5. data/lib/rubyscholar.rb +141 -0
  6. metadata +85 -0
data/.gitignore ADDED
@@ -0,0 +1,18 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ coverage
+ InstalledFiles
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
+
+ # YARD artifacts
+ .yardoc
+ _yardoc
+ doc/
data/README.md ADDED
@@ -0,0 +1,40 @@
+ # Synopsis
+
+ Here is a small script to "scrape" your Google Scholar citations and reformat them (the way I need it for my website).
+ Not super flexible - but should be easily customizable.
+
+ Some features:
+
+ * if registered on Crossref, retrieves corresponding DOIs and can add altmetric.org links.
+   If Crossref doesn't think your email is valid, no DOIs will be retrieved.
+ * adds "Cited by N" for popular papers
+
+ # How to use:
+
+ 1. Configure "config.yml"
+    If you want DOI retrieval to work (including Altmetrics), you need to be
+    registered at Crossref (it's free).
+ 2. Run `ruby bin/scrape.rb > mypublications.html`
+ 3. That's it.
+
+
+ # Potential for improvement:
+
+ * uses author list as visible on your main Google Scholar page. Sometimes this
+   means names are chopped in two or just a single author is missing. This could
+   be made smarter.
+ * flexible output
+ * flexible use of DOIs
+
+ # Technologies
+
+ Ruby, Nokogiri. Thanks to Google Scholar and Crossref. I hope none of this infringes on anything.
+
+ # Contact
+
+ RubyScholar was developed by Yannick Wurm (http://yannick.poulet.org). Pull requests, patches and bug reports are welcome. The source code is available on GitHub. Bug reports and feature requests may also be made there.
+
+ # Copyright
+
+ RubyScholar © 2013 by Yannick Wurm. Licensed under the MIT license.
+
data/bin/scrape.rb ADDED
@@ -0,0 +1,22 @@
+ require_relative '../lib/rubyscholar'
+ require 'yaml'
+
+ def scrape()
+   config = YAML.load_file('config.yml')
+   parsed = RubyScholar::Parser.new(config["url"],
+                                    config["email"])
+   formatter = RubyScholar::Formatter.new(parsed,
+                                          config["highlight"],
+                                          config["pdfs"],
+                                          config["altmetricDOIs"],
+                                          config["minCitations"].to_i)
+
+   html = formatter.to_html
+   config["italicize"].each do |term|
+     html.gsub!( term , '<em>' + term + '</em>')
+   end
+
+   f= File.open('scholar.html','w')
+   f.write html
+   f.close
+ end
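The post-processing at the end of `scrape` (wrapping configured terms in `<em>` and writing the result to disk) can be sketched on its own. This is a minimal illustration, not the package's code: the helper names `italicize` and `write_html` are hypothetical, and the sketch returns a new string instead of mutating the HTML in place.

```ruby
# Hypothetical helpers illustrating the last two steps of scrape().

# Wrap every literal occurrence of each term in <em>...</em>.
# A String pattern is matched literally by gsub, and the block form
# avoids any special handling of backslashes in the replacement.
def italicize(html, terms)
  terms.reduce(html) do |text, term|
    text.gsub(term) { |match| "<em>#{match}</em>" }
  end
end

# Write the HTML to a file; the block form closes the file even if
# the write raises.
def write_html(path, html)
  File.open(path, "w") { |file| file.write(html) }
end
```

For example, `italicize("The genome of the fire ant Solenopsis invicta", ["Solenopsis invicta"])` wraps only the species name. Note that the hunk above only defines `scrape`; it contains no call to it, so a caller is assumed.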
data/config.yml ADDED
@@ -0,0 +1,49 @@
+ # Google Scholar page (you can choose how you sort it)
+ url: "http://scholar.google.com/citations?sortby=pubdate&hl=en&user=k6y0EGsAAAAJ&view_op=list_works"
+
+ # Name to highlight
+ highlight: "Y Wurm"
+
+
+ # Need an email address that has been registered with CrossRef to obtain DOIs
+ # using their OpenURL service.
+ # e.g. the following should provide an XML file:
+ # http://www.crossref.org/openurl?redirect=false&pid=YOUR@EMAIL.COM&aulast=Wurm&atitle=Behavioral%20Genomics:%20A,%20Bee,%20C,%20G,%20T
+ email: your@email.com
+
+
+ # Show "[Cited Nx]" if N > the following number
+ minCitations: 5
+
+ # Words to italicize (emphasize). These will have "<em>" around them.
+ italicize:
+   - Solenopsis invicta
+   - Acromyrmex echinatior
+   - de novo
+
+ # DOIs of articles for which we should show altmetric.org badges.
+ altmetricDOIs:
+   - "10.1038/nature11832"
+   - "10.1101/gr.121392.111"
+   - "10.1073/pnas.1009690108"
+   - "10.1073/pnas.1104825108"
+
+ # Article titles for which we have urls to PDFs
+ pdfs:
+   "A Y-like social chromosome causes alternative colony organization in fire ants" : "/publications/wangwurm2013socialChromosome.pdf"
+   "Duplication and concerted evolution in a master sex determiner under balancing selection" : "/publications/procb2013.pdf"
+   "Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates" : "/publications/hdy2012122a.pdf"
+   "The Molecular Clockwork of the Fire Ant Solenopsis invicta" : "/publications/ingram2012-fireAntClockGenes.pdf"
+   "Epigenetics: The Making of Ant Castes" : "/publications/2012CurrBiolAntepigenetics.pdf"
+   "Visualization and quality assessment of de novo genome assemblies" : "/publications/Bioinformatics-2011-Riba-Grognuz-3425-6"
+   "The genomic impact of 100 million years of social evolution in seven ant species" : "/publications/TiG2011.pdf"
+   "Relaxed selection is a precursor to the evolution of phenotypic plasticity" : "/publications/hunt2011phenotypicPlasticity.pdf"
+   "The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming" : "/publications/nygaard2011-acromyrmex-genome.pdf"
+   "Behind the Scenes of an Ant Genome Project" : "/publications/wurm2011antGenomeBehindTheScenes.pdf"
+   "The genome of the fire ant Solenopsis invicta" : "/publications/wurm2011fireAntGenome.pdf"
+   "Odorant Binding Proteins of the Red Imported Fire Ant, Solenopsis invicta: An Example of the Problems Facing the Analysis of Widely Divergent Proteins" : "/publications/gotzek2011obps.pdf"
+   "Parasitoid Wasps: From Natural History to Genomic Studies" : "/publications/wurm2010wasps.pdf"
+   "Changes in reproductive roles are associated with changes in gene expression in fire ant queens" : "/publications/wurm2010fireAntQueenDealationExpression.pdf"
+   "Fourmidable: a database for ant genomics" : "/publications/wurm2009antDatabase.pdf"
+   "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
+   "An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta" : "/publications/wang2007fireAntMicroarrays.pdf"
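To show how the keys in `config.yml` map onto values a script can use, here is a small sketch that parses a trimmed-down copy of the configuration from a string. The helper `load_settings`, the symbol keys it returns, and the fallback defaults are assumptions made for illustration; the sample URL is a placeholder, and the package itself simply calls `YAML.load_file('config.yml')`.

```ruby
require "yaml"

# Trimmed-down sample of the configuration above (placeholder URL).
SAMPLE_CONFIG = <<~YAML
  url: "http://scholar.google.com/citations?user=EXAMPLE"
  highlight: "Y Wurm"
  email: your@email.com
  minCitations: 5
  italicize:
    - Solenopsis invicta
    - de novo
  altmetricDOIs:
    - "10.1038/nature11832"
  pdfs:
    "Behavioral Genomics: A, Bee, C, G, T" : "/publications/wurm2007bees.pdf"
YAML

# Hypothetical helper: parse the YAML text and fill in defaults for
# optional keys so that later code never has to deal with nil.
def load_settings(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  {
    url: config.fetch("url"), # required; raises KeyError if missing
    email: config["email"],
    highlight: config["highlight"],
    min_citations: config.fetch("minCitations", 1).to_i,
    italicize: Array(config["italicize"]),
    altmetric_dois: Array(config["altmetricDOIs"]),
    pdfs: config["pdfs"] || {}
  }
end

settings = load_settings(SAMPLE_CONFIG)
```

The `pdfs` mapping is keyed by the exact article title, so a lookup such as `settings[:pdfs]["Behavioral Genomics: A, Bee, C, G, T"]` only succeeds when the scraped title matches character for character.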
data/lib/rubyscholar.rb ADDED
@@ -0,0 +1,141 @@
+ require "nokogiri"
+ require "open-uri"
+
+ class String
+   def clean
+     # removes leading and trailing whitespace, commas
+     self.gsub!(/(^[\s,]+)|([\s,]+$)/, '')
+     return self
+   end
+ end
+
+ module RubyScholar
+   class Paper < Struct.new(:title, :url, :authors, :journalName, :journalDetails, :year, :citationCount, :citingPapers, :doi)
+   end
+
+   class Parser
+     attr_accessor :parsedPapers, :crossRefEmail
+
+     def initialize(url, crossRefEmail = "")
+       @parsedPapers = []
+       @crossRefEmail = crossRefEmail # if nil doesn't return any DOI
+       parse(url)
+     end
+
+     def parse(url)
+       papers = Nokogiri::HTML(open(url)).css(".cit-table .item")
+       STDOUT << "Found #{papers.length} papers.\n"
+       papers.each do |paper|
+         paperDetails = paper.css("#col-title")
+         title = paperDetails[0].children[0].content.clean
+         googleUrl = paperDetails[0].children[0].attribute('href')
+         authors = paperDetails[0].children[2].content.clean
+         authors.gsub!("...", "et al")
+
+         journal = paperDetails[0].children[4].content
+         journalName = journal.split(/,|\d/).first.clean
+         journalDetails = journal.gsub(journalName, '').clean
+
+         year = paper.css("#col-year").text # is the last thing we get
+
+         #citations
+         citeInfo = paper.css(".cit-dark-link")
+         citationCount = citeInfo.text
+         citationUrl = citationCount.empty? ? nil : citeInfo.attribute('href').to_s
+
+         # get DOI: needs last name of first author, no funny chars
+         lastNameFirstAuthor = ((authors.split(',').first ).split(' ').last ).gsub(/[^A-Za-z\-]/, '')
+         doi = getDoi( lastNameFirstAuthor, title, @crossRefEmail)
+
+         @parsedPapers.push(Paper.new( title, googleUrl, authors, journalName, journalDetails, year, citationCount, citationUrl, doi))
+       end
+       STDOUT << "Scraped #{parsedPapers.length} from Google Scholar.\n"
+     end
+
+     # Scholar doesn't provide DOI.
+     # But if registered at Crossref (it's free), DOI can be retrieved.
+     def getDoi(lastNameFirstAuthor, title, crossRefEmail)
+       return '' if @crossRefEmail.nil?
+       sleep(1) # to reduce risk
+       STDERR << "Getting DOI for paper by #{lastNameFirstAuthor}: #{title}.\n"
+       url = 'http://www.crossref.org/openurl?redirect=false' +
+         '&pid=' + crossRefEmail +
+         '&aulast=' + lastNameFirstAuthor +
+         '&atitle=' + URI.escape(title)
+       crossRefXML = Nokogiri::XML(open(url))
+       crossRefXML.search("doi").children.first.content rescue ''
+     end
+   end
+
+   class Formatter
+     attr_accessor :parser, :nameToHighlight, :pdfLinks, :altmetricDOIs
+
+     def initialize(parser, nameToHighlight = nil, pdfLinks = {}, altmetricDOIs = [], minCitationCount = 1)
+       @parser = parser
+       @nameToHighlight = nameToHighlight
+       @pdfLinks = pdfLinks
+       @altmetricDOIs = altmetricDOIs
+       @minCitations = minCitationCount
+     end
+
+     def to_html
+       ##@doc = Nokogiri::HTML::DocumentFragment.parse ""
+       builder = Nokogiri::HTML::Builder.new do |doc|
+         doc.html {
+           doc.body {
+             @parser.parsedPapers.each_with_index { |paper, index|
+               doc.div( :class => "publication") {
+                 doc.p {
+                   doc.text ((@parser.parsedPapers).length - index).to_s + '. '
+
+                   if paper[:authors].include?(@nameToHighlight)
+                     doc.text( paper[:authors].sub(Regexp.new(@nameToHighlight + '.*'), '') )
+                     doc.span( :class => "me") { doc.text @nameToHighlight }
+                     doc.text( paper[:authors].sub(Regexp.new('.*' + @nameToHighlight), '') )
+                   else
+                     doc.text( paper[:authors])
+                   end
+
+                   doc.text ' ' + paper[:year] + '. '
+                   doc.b paper[:title] + '.'
+                   doc.br
+                   doc.em paper[:journalName]
+                   doc.text ' '
+                   doc.text paper[:journalDetails]
+
+                   unless paper[ :doi].empty?
+                     doc.text(' ')
+                     doc.a( :href => URI.join("http://dx.doi.org/", paper[ :doi])) {
+                       doc.text "[DOI]"
+                     }
+                   end
+                   if @pdfLinks.keys.include?(paper[:title])
+                     doc.text(' ')
+                     doc.a( :href => @pdfLinks[paper[:title]]) {
+                       doc.text "[PDF]"
+                     }
+                   end
+                   if paper[ :citationCount].to_i > @minCitations
+                     doc.text(' ')
+                     doc.a( :href => paper[ :citingPapers]) {
+                       doc.text("[Cited #{paper[ :citationCount]}x]")
+                     }
+                   end
+                   if altmetricDOIs.include?( paper[ :doi])
+                     doc.text(' ')
+                     doc.span( :class => 'altmetric-embed',
+                               :'data-badge-popover' => 'bottom',
+                               :'data-doi' => paper[ :doi] )
+                   end
+                 }
+               }
+             }
+           }
+         }
+       end
+       return builder.to_html
+     end
+   end
+ end
+
+
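The string handling in `Parser` and `Formatter` (trimming, splitting the journal line, extracting the first author's last name, building the CrossRef query, splitting the author list around the highlighted name) can be sketched as standalone functions that need no network access. All function names below are hypothetical. Two deliberate departures from the code above are assumptions of this sketch: the query string is built with `URI.encode_www_form` because `URI.escape` was removed in Ruby 3.0, and the author split uses `String#partition` instead of regular expressions built from the name. Whether the CrossRef OpenURL service treats `+`-encoded spaces the same as `%20` is not confirmed here.

```ruby
require "uri"

# Strip leading/trailing whitespace and commas; returns a new string.
def clean(text)
  text.gsub(/\A[\s,]+|[\s,]+\z/, "")
end

# Split a journal line into [name, details], cutting the name at the
# first comma or digit, as Parser#parse does.
def split_journal(journal)
  name = clean(journal.split(/,|\d/).first.to_s)
  details = clean(journal.sub(name, "")) # first occurrence only
  [name, details]
end

# Last name of the first author, reduced to letters and hyphens.
# Assumes a non-empty, comma-separated author list.
def last_name_of_first_author(authors)
  authors.split(",").first.split(" ").last.gsub(/[^A-Za-z\-]/, "")
end

# Build the CrossRef OpenURL query with every parameter form-encoded.
def crossref_query_url(last_name, title, email)
  query = URI.encode_www_form(
    "redirect" => "false",
    "pid" => email,
    "aulast" => last_name,
    "atitle" => title
  )
  "http://www.crossref.org/openurl?#{query}"
end

# Split the author list into [before, name, after] around the first
# literal occurrence of the name; name is "" when it does not occur.
def split_around_name(authors, name)
  authors.partition(name)
end
```

For instance, `split_journal("Nature 493 (7434), 664-668")` yields the name `"Nature"` and the details `"493 (7434), 664-668"`. On Ruby 3.0 and later, fetching the resulting URL through open-uri would also need `URI.open` in place of the bare `open` call shown in the diff.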
metadata ADDED
@@ -0,0 +1,85 @@
+ --- !ruby/object:Gem::Specification
+ name: rubyscholar
+ version: !ruby/object:Gem::Version
+   version: 0.0.2
+ prerelease:
+ platform: ruby
+ authors:
+ - Yannick Wurm
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2013-08-18 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.6.0
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.6.0
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 2.5.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 2.5.0
+ description: A small script to "scrape" your Google Scholar citations and reformat
+   them. It doesn't do a whole lot, but it's still useful.
+ email:
+ - y.wurm@qmul.ac.uk
+ executables:
+ - scrape.rb
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .gitignore
+ - README.md
+ - bin/scrape.rb
+ - config.yml
+ - lib/rubyscholar.rb
+ homepage: ''
+ licenses:
+ - MIT
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 1.8.23
+ signing_key:
+ specification_version: 3
+ summary: RubyScholar - Scrape your Google Scholar citations.
+ test_files: []
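The `~>` and `>=` entries in the metadata are RubyGems version requirements. A short sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows which versions each constraint admits; the helper name `allowed?` and the version numbers being checked are illustrative choices, not part of the package.

```ruby
require "rubygems"

# Hypothetical helper: does the version satisfy the constraint string?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 1.6.0" (nokogiri) admits 1.6.x releases but not 1.7.
nokogiri_patch = allowed?("~> 1.6.0", "1.6.8")
nokogiri_minor = allowed?("~> 1.6.0", "1.7.0")

# "~> 2.5.0" (rspec, development only) likewise stays within 2.5.x.
rspec_minor = allowed?("~> 2.5.0", "2.6.0")

# ">= 0" (required_ruby_version) places no lower bound in practice.
any_ruby = allowed?(">= 0", "1.9.3")
```

Here `nokogiri_patch` and `any_ruby` are true while `nokogiri_minor` and `rspec_minor` are false, which is why the three-part pessimistic constraint is often read as "patch releases only".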