whitepaper 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in whitepaper.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 wilkie (Dave Wilkinson II)
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,107 @@
1
+ # Whitepaper
2
+
3
+ This gem performs a whitepaper lookup on major scholarly databases. Its purpose is to make it easy to find
4
+ related papers and organize your paper collection. With this application, you can easily download pdfs
5
+ or use it as a library to automatically assign metadata.
6
+
7
+ Currently, CiteSeerX is the only database it queries, with a Google PDF search as a fallback when CiteSeerX does not
8
+ list a pdf to download.
9
+
10
+ ## Installation
11
+
12
+ Add this line to your application's Gemfile:
13
+
14
+ gem 'whitepaper'
15
+
16
+ And then execute:
17
+
18
+ $ bundle
19
+
20
+ Or install it yourself as:
21
+
22
+ $ gem install whitepaper
23
+
24
+ ## Usage
25
+
26
+ Display usage:
27
+
28
+ whitepaper -h
29
+
30
+ Finding article metadata by title keyword search:
31
+
32
+ whitepaper -t "corey multicore"
33
+
34
+ With no other options, this will print out all information about the paper.
35
+
36
+ If you want to limit what the program prints, then add one or more output flags:
37
+
38
+ Printing the article's proper title:
39
+
40
+ whitepaper -n -t "xomb multicore"
41
+
42
+ Printing the article's list of authors:
43
+
44
+ whitepaper -a -t "ubiquitous computing"
45
+
46
+ Printing the article's pdf url:
47
+
48
+ whitepaper -p -t "exokernel"
49
+
50
+ Finally, you can simply have the app download an article and place it in the
51
+ current directory. It will name the file as closely to the title as it can.
52
+
53
+ Download a pdf by any means necessary by title keyword search:
54
+
55
+ whitepaper -d -t "The Design and Implementation of a Log-Structured File System"
56
+
57
+ ### Programmable
58
+
59
+ To get paper metadata, add whitepaper to your Gemfile:
60
+
61
+ gem 'whitepaper'
62
+
63
+ And require it if necessary (your project may automatically require the libraries in your Gemfile):
64
+
65
+ require 'whitepaper'
66
+
67
+ Invoke with this simple command to look up a paper with the given terms in the title:
68
+
69
+ paper = Whitepaper.find_by_title("hierarchical filesystems are dead")
70
+
71
+ This will give you back a Whitepaper::Paper object! To get a pdf url, just go:
72
+
73
+ paper.pdf_urls.first unless paper.pdf_urls.empty?
74
+
75
+ As you can see, you can get a list of pdf links, so you can try each until you find one
76
+ that actually exists, or as a mirror if the server is down. If postscript is your thing, then check
77
+ out ps_urls and follow the same steps.
78
+
79
+ To get other metadata, use any of the following accessors:
80
+
81
+ title = paper.title
82
+ authors = paper.authors
83
+ description = paper.description
84
+ year = paper.year
85
+ conference = paper.conference
86
+ keywords = paper.keywords
87
+
88
+ A field that does not have a value will be a blank string or an empty array. Use the
89
+ empty? method on what is returned in either case to check
90
+ for a field that is, well, empty.
91
+
92
+ ## Contributing
93
+
94
+ 1. Fork it
95
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
96
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
97
+ 4. Push to the branch (`git push origin my-new-feature`)
98
+ 5. Create new Pull Request
99
+
100
+ Please refer to the LICENSE file in this repo for distribution information.
101
+ Whitepaper uses an unaltered MIT license, which is a permissive open source
102
+ license. If this is unacceptable for you, please defer to the copyright holder.
103
+
104
+ ### TODO
105
+
106
+ 1. Add new output options (JSON, YAML, etc) for better metadata usage by other programs.
107
+ 2. Add new engines (Google Scholar, IEEE, ACM, etc)
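Note on the README's "blank string or empty array" convention: because both `String` and `Array` respond to `empty?`, one check covers every field. Below is a minimal sketch of that idea. `FakePaper` and `missing_fields` are hypothetical names used only for illustration; the struct stands in for `Whitepaper::Paper` so the snippet runs without network access.

```ruby
# FakePaper is a hypothetical stand-in for Whitepaper::Paper (no network needed).
FakePaper = Struct.new(:title, :authors, :year, :keywords)

# Return the names of the fields that carry no value.
# Works because both String and Array respond to empty?.
def missing_fields(paper)
  paper.each_pair.select { |_name, value| value.empty? }.map { |name, _value| name }
end

paper = FakePaper.new("Exokernel", ["D. Engler"], "", [])
puts missing_fields(paper).inspect
```

The same pattern should apply to a real paper object by checking `paper.year.empty?`, `paper.keywords.empty?`, and so on.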
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
data/bin/whitepaper ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ $: << File.dirname(__FILE__) + '/../lib'
4
+
5
+ require 'whitepaper'
6
+ require 'whitepaper/cli'
7
+
8
+ Whitepaper::CLI.run
data/lib/whitepaper/cli.rb ADDED
@@ -0,0 +1,99 @@
1
+ require 'optparse'
2
+
3
+ require 'whitepaper'
4
+
5
+ module Whitepaper
6
+ class CLI
7
+ BANNER = <<-USAGE
8
+ USAGE
9
+
10
+ class << self
11
+ def parse_options
12
+ options = {}
13
+ @opts = OptionParser.new do |opts|
14
+ opts.banner = BANNER.gsub(/^ /, '')
15
+
16
+ opts.separator ''
17
+ opts.separator 'Options:'
18
+
19
+ opts.on('-h', '--help', 'Display this help') do
20
+ puts opts
21
+ exit 0
22
+ end
23
+
24
+ opts.on('-t', '--by-title KEYWORDS', 'Display the data for the paper with the given KEYWORDS in title') do |title|
25
+ options[:by_title] = title
26
+ end
27
+
28
+ opts.on('-d', '--download', 'Download a pdf of the paper found') do
29
+ options[:download] = true
30
+ end
31
+
32
+ opts.on('-p', '--pdf', 'Display a link to the pdf of the paper found') do
33
+ options[:print_pdf_url] = true
34
+ end
35
+
36
+ opts.on('-n', '--name', 'Display the title of the paper found') do
37
+ options[:print_title] = true
38
+ end
39
+
40
+ opts.on('-a', '--authors', 'Display the authors of the paper found') do
41
+ options[:print_authors] = true
42
+ end
43
+
44
+ end
45
+
46
+ @opts.parse!
47
+
48
+ if options[:by_title]
49
+ paper = Whitepaper.find_by_title(options[:by_title])
50
+
51
+ if options[:print_title]
52
+ puts paper.title
53
+ end
54
+
55
+ if options[:print_authors]
56
+ puts paper.authors
57
+ end
58
+
59
+ if options[:print_pdf_url]
60
+ unless paper.pdf_urls.empty?
61
+ puts paper.pdf_urls.first
62
+ end
63
+ end
64
+
65
+ unless options[:print_title] or
66
+ options[:print_authors] or
67
+ options[:print_pdf_url]
68
+ puts paper
69
+ end
70
+
71
+ if options[:download] and not paper.pdf_urls.empty?
72
+ puts "Downloading: " + paper.pdf_urls.first
73
+ paper.download
74
+ end
75
+ else
76
+ puts @opts
77
+ end
78
+ end
79
+
80
+ def CLI.run
81
+ begin
82
+ parse_options
83
+ rescue OptionParser::InvalidOption => e
84
+ warn e
85
+ exit -1
86
+ end
87
+
88
+ def fail
89
+ puts @opts
90
+ exit -1
91
+ end
92
+
93
+ # Default
94
+ puts BANNER
95
+ exit 0
96
+ end
97
+ end
98
+ end
99
+ end
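Note on the option handling in this file: the flags can be exercised in isolation by parsing an explicit argument array instead of the process arguments. The sketch below assumes only Ruby's standard `optparse`. `parse_cli` is a hypothetical helper name, and the help text is omitted for brevity.

```ruby
require 'optparse'

# Parse an explicit argv array into an options hash, mirroring the
# flags defined in the CLI. parse_cli is a hypothetical helper name.
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('-t', '--by-title KEYWORDS') { |title| options[:by_title] = title }
    opts.on('-n', '--name')     { options[:print_title] = true }
    opts.on('-a', '--authors')  { options[:print_authors] = true }
    opts.on('-p', '--pdf')      { options[:print_pdf_url] = true }
    opts.on('-d', '--download') { options[:download] = true }
  end
  # parse (without the bang) leaves argv untouched and raises
  # OptionParser::InvalidOption on unknown flags.
  parser.parse(argv)
  options
end

p parse_cli(['-n', '-t', 'xomb multicore'])
```

Keeping the parser free of side effects like `exit` makes it possible to test flag combinations without spawning the executable.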
data/lib/whitepaper/engine/citeseerx.rb ADDED
@@ -0,0 +1,87 @@
1
+ require 'mechanize'
2
+
3
+ require 'whitepaper/paper'
4
+
5
+ module Whitepaper
6
+ module Engine
7
+ module CiteSeerX
8
+ DOMAIN = "http://citeseerx.ist.psu.edu"
9
+ SEARCH_BY_TITLE_URL = "search?q=title%3A{title}&t=doc&sort=cite"
10
+
11
+ class << self
12
+ def find_by_title_url(title)
13
+ "#{DOMAIN}/#{SEARCH_BY_TITLE_URL.gsub(/\{title\}/, title)}"
14
+ end
15
+
16
+ def find_by_title(title)
17
+ @agent = Mechanize.new
18
+ page = @agent.get "#{find_by_title_url(title)}"
19
+
20
+ # get the first link
21
+ paper = page.search '//div[@id="result_list"]/div[@class="result"]/h3/a'
22
+
23
+ paper_link = "#{DOMAIN}#{paper.first.attribute("href")}"
24
+
25
+ retrieve_details paper_link
26
+ end
27
+
28
+ def retrieve_details(url)
29
+ @agent = Mechanize.new
30
+
31
+ page = @agent.get url
32
+
33
+ def get_meta(name, page)
34
+ meta = page.search "//meta[@name=\"#{name}\"]"
35
+ if meta.nil? or meta.first.nil?
36
+ return ""
37
+ end
38
+ meta.first.attribute "content"
39
+ end
40
+
41
+ description = get_meta("description", page)
42
+ keywords_raw = get_meta("keywords", page)
43
+ title = get_meta("citation_title", page)
44
+ authors_raw = get_meta("citation_authors", page)
45
+ year = get_meta("citation_year", page)
46
+ conference = get_meta("citation_conference", page)
47
+
48
+ authors = authors_raw.to_s.split(',').map(&:strip)
49
+ keywords = keywords_raw.to_s.split(',').map(&:strip)
50
+
51
+ links = []
52
+ ps_links = []
53
+
54
+ link_url = page.search '//ul[@id="clinks"]/li/a'
55
+ link_url.each do |l|
56
+ url = "#{DOMAIN}#{l.attribute("href").to_s}"
57
+ if url.end_with? "pdf"
58
+ links << url
59
+ end
60
+ if url.end_with? "ps"
61
+ ps_links << url
62
+ end
63
+ end
64
+
65
+ link_url = page.search '//ul[@id="dlinks"]/li/a'
66
+ link_url.each do |l|
67
+ url = l.attribute("href").to_s
68
+ if url.end_with? "pdf"
69
+ links << url
70
+ end
71
+ if url.end_with? "ps"
72
+ ps_links << url
73
+ end
74
+ end
75
+
76
+ Paper.new title, authors, {:description => description,
77
+ :keywords => keywords,
78
+ :year => year,
79
+ :conference => conference,
80
+ :pdf_urls => links,
81
+ :ps_urls => ps_links}
82
+ end
83
+ end
84
+ end
85
+ end
86
+ end
87
+
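Note on building the search URL: a title containing spaces, `&`, or `#` can change the meaning of the query string unless it is escaped first. Below is a minimal sketch of an escaped variant. The domain and path are copied from the constants in this file. `search_url` is a hypothetical helper name, and the use of `URI.encode_www_form_component` is an assumption about how one might escape the title, not what this release does.

```ruby
require 'uri'

# Build the CiteSeerX title-search URL with the query escaped.
# search_url is a hypothetical helper name.
def search_url(title)
  # Escapes ":" as %3A, spaces as "+", and "&" as %26.
  query = URI.encode_www_form_component("title:#{title}")
  "http://citeseerx.ist.psu.edu/search?q=#{query}&t=doc&sort=cite"
end

puts search_url("corey multicore")
```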
data/lib/whitepaper/engine/google.rb ADDED
@@ -0,0 +1,45 @@
1
+ require 'mechanize'
2
+
3
+ require 'whitepaper/paper'
4
+
5
+ module Whitepaper
6
+ module Engine
7
+ module Google
8
+ class << self
9
+ def find_by_title(title)
10
+ @agent = Mechanize.new
11
+
12
+ page = @agent.get "https://www.google.com/search?q=#{URI.encode_www_form_component(title)}+filetype%3Apdf"
13
+
14
+ results = page.search '//h3[@class="r"]'
15
+
16
+ urls = results.map do |r|
17
+ a = r.search './a'
18
+
19
+ # sanitize
20
+ url = a.attribute "href"
21
+
22
+ url = url.to_s.match(/\/url\?q=([^&]+)&/)[1]
23
+
24
+ title = a.first.content
25
+
26
+ author = r.search '../div[@class="s"]/span[@class="f"]'
27
+
28
+ authors = author.map do |e|
29
+ e.content.to_s
30
+ end
31
+
32
+ {:url => url, :title => title, :authors => authors}
33
+ end
34
+
35
+ if urls.length > 0
36
+ Paper.new(urls[0][:title], urls[0][:authors], {:pdf_urls => [urls[0][:url]]})
37
+ else
38
+ nil
39
+ end
40
+ end
41
+ end
42
+ end
43
+ end
44
+ end
45
+
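Note on the redirect-link sanitizing step: indexing the result of `match` with `[1]` raises when an href does not have the `/url?q=...&` shape. The sketch below shows a nil-tolerant variant. `extract_target` is a hypothetical helper name, and the decoding step assumes the target is percent-encoded inside the `q` parameter, which this release does not handle.

```ruby
require 'uri'

# Pull the target out of a redirect href such as
# "/url?q=http%3A%2F%2Fexample.com%2Fa.pdf&sa=U". Returns nil when the
# href does not have that shape. extract_target is a hypothetical name.
def extract_target(href)
  # String#[] with a regex and capture index returns nil on no match.
  encoded = href.to_s[/\/url\?q=([^&]+)&/, 1]
  # Assumption: the captured value is percent-encoded and needs decoding.
  encoded && URI.decode_www_form_component(encoded)
end

p extract_target("/url?q=http%3A%2F%2Fexample.com%2Fa.pdf&sa=U")
p extract_target("/search?q=unrelated")
```

Results for which the helper returns nil could then be dropped with `compact` before picking the first hit.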
data/lib/whitepaper/finder.rb ADDED
@@ -0,0 +1,4 @@
1
+ module Whitepaper
2
+ class Finder
3
+ end
4
+ end
data/lib/whitepaper/paper.rb ADDED
@@ -0,0 +1,65 @@
1
+ module Whitepaper
2
+ class Paper
3
+ attr_reader :title
4
+ attr_reader :authors
5
+ attr_reader :description
6
+ attr_reader :keywords
7
+ attr_reader :year
8
+ attr_reader :conference
9
+
10
+ attr_reader :pdf_urls
11
+ attr_reader :ps_urls
12
+
13
+ def initialize(title, authors, options = {})
14
+ @title = title
15
+ @authors = authors
16
+ @description = options[:description] || ""
17
+ @keywords = options[:keywords] || []
18
+ @year = options[:year] || ""
19
+ @conference = options[:conference] || ""
20
+
21
+ @pdf_urls = options[:pdf_urls] || []
22
+ @ps_urls = options[:ps_urls] || []
23
+ end
24
+
25
+ def download(filename = nil)
26
+ if filename.nil?
27
+ filename = title.to_s
28
+ end
29
+ escaped_filename = filename.gsub(/[\t:\?\<\>\*\"\\\/]/, "") + ".pdf"
30
+
31
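+ # Bail out before creating the file so an empty .pdf is not left behind.
32
+ if pdf_urls.empty?
33
+ return false
34
+ end
35
+
36
+ uri = URI.parse(pdf_urls.first)
37
+ f = open(escaped_filename, "wb")
38
+ begin
39
+ Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == "https") do |http|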
40
+ http.request_get(uri.request_uri) do |resp|
41
+ resp.read_body do |segment|
42
+ f.write(segment)
43
+ end
44
+ end
45
+ end
46
+ ensure
47
+ f.close()
48
+ end
49
+
50
+ true
51
+ end
52
+
53
+ def to_s
54
+ "Title: #{@title}\n" +
55
+ "Authors: #{@authors}\n" +
56
+ "Description: #{@description}\n" +
57
+ "Keywords: #{@keywords}\n" +
58
+ "Year: #{@year}\n" +
59
+ "Conference: #{@conference}\n" +
60
+
61
+ "Pdf Available: #{@pdf_urls}\n" +
62
+ "Ps Available: #{@ps_urls}"
63
+ end
64
+ end
65
+ end
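Note on how `download` derives a filename: the title is stripped of characters that are unsafe in file names and `.pdf` is appended. The sketch below isolates that step using the same character class. `safe_pdf_filename` is a hypothetical helper name.

```ruby
# Same character class as Paper#download: drops tab : ? < > * " \ /
# and appends ".pdf". safe_pdf_filename is a hypothetical helper name.
def safe_pdf_filename(title)
  title.to_s.gsub(/[\t:?<>*"\\\/]/, "") + ".pdf"
end

puts safe_pdf_filename('What? A "Log": File/System')
```

Note that removed characters are not replaced with anything, so `File/System` collapses to `FileSystem`.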
data/lib/whitepaper/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Whitepaper
2
+ VERSION = "0.0.1"
3
+ end
data/lib/whitepaper.rb ADDED
@@ -0,0 +1,56 @@
1
+ require "whitepaper/version"
2
+
3
+ require 'whitepaper/engine/citeseerx'
4
+ require 'whitepaper/engine/google'
5
+
6
+ module Whitepaper
7
+ class << self
8
+ def find_by_title(title)
9
+ paper = Engine::CiteSeerX.find_by_title(title)
10
+
11
+ if paper.pdf_urls.empty?
12
+ g = Engine::Google.find_by_title(title)
13
+
14
+ paper = Paper.new(paper.title,
15
+ paper.authors,
16
+ {:description => paper.description,
17
+ :keywords => paper.keywords,
18
+ :year => paper.year,
19
+ :conference => paper.conference,
20
+ :pdf_urls => g ? g.pdf_urls : [],
21
+ :ps_urls => paper.ps_urls})
22
+ end
23
+
24
+ paper
25
+ end
26
+
27
+ def find_authors_by_title(title)
28
+ paper = find_by_title(title)
29
+
30
+ if paper
31
+ paper.authors
32
+ end
33
+ end
34
+
35
+ def find_title_by_title(title)
36
+ paper = find_by_title(title)
37
+
38
+ if paper
39
+ paper.title
40
+ end
41
+ end
42
+
43
+ def find_pdfs_by_title(title)
44
+ paper = find_by_title(title)
45
+
46
+ if paper
47
+ paper.pdf_urls
48
+ end
49
+ end
50
+
51
+ def download_pdf_by_title(title)
52
+ paper = find_by_title(title)
53
+ paper.download
54
+ end
55
+ end
56
+ end
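Note on the fallback in `find_by_title`: the Google engine returns `nil` when it finds nothing, so the merge step needs to tolerate a missing secondary result. Below is a minimal sketch of that selection logic. `StubPaper` and `merged_pdf_urls` are hypothetical names, and the struct stands in for `Whitepaper::Paper` so no network access is needed.

```ruby
# Hypothetical stand-in for Whitepaper::Paper with only the fields used here.
StubPaper = Struct.new(:title, :pdf_urls)

# Prefer the primary result's pdf links; fall back to the secondary
# result only when the primary has none, tolerating a nil secondary.
def merged_pdf_urls(primary, secondary)
  return primary.pdf_urls unless primary.pdf_urls.empty?
  secondary ? secondary.pdf_urls : []
end

primary  = StubPaper.new("Exokernel", [])
fallback = StubPaper.new("Exokernel", ["http://example.com/exokernel.pdf"])
p merged_pdf_urls(primary, fallback)
p merged_pdf_urls(primary, nil)
```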
data/whitepaper.gemspec ADDED
@@ -0,0 +1,22 @@
1
+ # -*- encoding: utf-8 -*-
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'whitepaper/version'
5
+
6
+ Gem::Specification.new do |gem|
7
+ gem.name = "whitepaper"
8
+ gem.version = Whitepaper::VERSION
9
+ gem.authors = ["wilkie"]
10
+ gem.email = ["wilkie05@gmail.com"]
11
+ gem.description = %q{Finds metadata on scholarly works and is able to download pdfs of whitepapers.}
12
+ gem.summary = %q{Finds whitepaper metadata and pdf download links with a basic keyword query using web-based databases such as Google and CiteSeerX.}
13
+ gem.homepage = ""
14
+
15
+ gem.files = `git ls-files`.split($/)
16
+ gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
17
+ gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
18
+ gem.require_paths = ["lib"]
19
+
20
+ gem.add_dependency "nokogiri"
21
+ gem.add_dependency "mechanize"
22
+ end
metadata ADDED
@@ -0,0 +1,93 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: whitepaper
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - wilkie
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2013-01-27 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: nokogiri
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: mechanize
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :runtime
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ description: Finds metadata on scholarly works and is able to download pdfs of whitepapers.
47
+ email:
48
+ - wilkie05@gmail.com
49
+ executables:
50
+ - whitepaper
51
+ extensions: []
52
+ extra_rdoc_files: []
53
+ files:
54
+ - .gitignore
55
+ - Gemfile
56
+ - LICENSE
57
+ - README.md
58
+ - Rakefile
59
+ - bin/whitepaper
60
+ - lib/whitepaper.rb
61
+ - lib/whitepaper/cli.rb
62
+ - lib/whitepaper/engine/citeseerx.rb
63
+ - lib/whitepaper/engine/google.rb
64
+ - lib/whitepaper/finder.rb
65
+ - lib/whitepaper/paper.rb
66
+ - lib/whitepaper/version.rb
67
+ - whitepaper.gemspec
68
+ homepage: ''
69
+ licenses: []
70
+ post_install_message:
71
+ rdoc_options: []
72
+ require_paths:
73
+ - lib
74
+ required_ruby_version: !ruby/object:Gem::Requirement
75
+ none: false
76
+ requirements:
77
+ - - ! '>='
78
+ - !ruby/object:Gem::Version
79
+ version: '0'
80
+ required_rubygems_version: !ruby/object:Gem::Requirement
81
+ none: false
82
+ requirements:
83
+ - - ! '>='
84
+ - !ruby/object:Gem::Version
85
+ version: '0'
86
+ requirements: []
87
+ rubyforge_project:
88
+ rubygems_version: 1.8.24
89
+ signing_key:
90
+ specification_version: 3
91
+ summary: Finds whitepaper metadata and pdf download links with a basic keyword query
92
+ using web-based databases such as Google and CiteSeerX.
93
+ test_files: []