textminer 0.1.0 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 5e954930af35bca6c9752b9f9d660eb675cd4bea
- data.tar.gz: c451255b116b2eae5a52d66adc3619e39c7f10c4
+ metadata.gz: c6c80a22022bb38bc141dc50e8da5d913db03946
+ data.tar.gz: 957cf24214f95f1b2d8309f2fd1a2e2aa7b6ca69
  SHA512:
- metadata.gz: 0ff5aacaf4be3b3a797f6cc8435c9c4a2bec7be98d695986f2b27a044b98492f5729cd58979952dacf96777d2306c2c8b9d9abda8dc7bfa94e9080a4f4ae8f6c
- data.tar.gz: 64a5fd5ebb268403c12350d444794efde329e97fb14e7ad47959d4af5aa3465306f135174e183b4ce11ab1009ba24fc29ebc47d7b605e90a4e4e308274264671
+ metadata.gz: 9837bd893866ef35e420d928bf02f3151783b345d39f758ed5ddce8b98c6df92147ff518b889d5cab33f84aa62ad795a3e7e1e2c6ad18cfd7a9a3060589293eb
+ data.tar.gz: 1151759369e8007f85ad73f24872f409ffcb70e99ad114e7a48e623c48a53ea118c7ed13a4ded171fb64823a2e21e342000f77766aaac93b12493335ace58f1d
@@ -1,5 +1,4 @@
  language: ruby
  rvm:
- - 1.9.3
  - 2.1.7
  - 2.2.3
@@ -0,0 +1,9 @@
+ ## 0.1.5 (2015-12-04)
+
+ * Now using `serrano` gem for interacting with the Crossref API
+ * Changed `links` method to `search`
+ * Changed `fetch` method to accept a URL for a full text article instead of a DOI
+
+ ## 0.1.0 (2015-08-24)
+
+ * First version released to RubyGems
@@ -1,12 +1,18 @@
  PATH
  remote: .
  specs:
- textminer (0.1.0)
+ textminer (0.1.5)
+ faraday (~> 0.9.1)
+ faraday_middleware (~> 0.10.0)
  httparty (~> 0.13)
  json (~> 1.8)
- launchy (~> 2.4, >= 2.4.2)
+ launchy (~> 2.4, >= 2.4.3)
+ multi_json (~> 1.0)
+ nokogiri (~> 1.6, >= 1.6.6.2)
  pdf-reader (~> 1.3)
+ serrano (~> 0.1.4.1)
  thor (~> 0.19)
+ uuidtools (~> 2.1, >= 2.1.5)
 
  GEM
  remote: https://rubygems.org/
@@ -21,14 +27,23 @@ GEM
  simplecov
  url
  docile (1.1.5)
+ faraday (0.9.1)
+ multipart-post (>= 1.2, < 3)
+ faraday_middleware (0.10.0)
+ faraday (>= 0.7.4, < 0.10)
  hashery (2.1.1)
- httparty (0.13.5)
+ httparty (0.13.7)
  json (~> 1.8)
  multi_xml (>= 0.5.2)
  json (1.8.3)
  launchy (2.4.3)
  addressable (~> 2.3)
+ mini_portile (0.6.2)
+ multi_json (1.11.2)
  multi_xml (0.5.5)
+ multipart-post (2.0.0)
+ nokogiri (1.6.6.2)
+ mini_portile (~> 0.6.0)
  oga (1.2.3)
  ast
  ruby-ll (~> 2.1)
@@ -44,6 +59,11 @@ GEM
  ansi
  ast
  ruby-rc4 (0.1.5)
+ serrano (0.1.4.1)
+ faraday (~> 0.9.1)
+ faraday_middleware (~> 0.10.0)
+ multi_json (~> 1.0)
+ thor (~> 0.19)
  simplecov (0.10.0)
  docile (~> 1.1.0)
  json (~> 1.8)
@@ -54,6 +74,7 @@ GEM
  thor (0.19.1)
  ttfunk (1.4.0)
  url (0.3.2)
+ uuidtools (2.1.5)
 
  PLATFORMS
  ruby
@@ -66,3 +87,6 @@ DEPENDENCIES
  simplecov (~> 0.10)
  test-unit (~> 3.1)
  textminer!
+
+ BUNDLED WITH
+ 1.10.6
data/README.md CHANGED
@@ -1,24 +1,29 @@
  textminer
  =========
 
- [![Build Status](https://api.travis-ci.org/sckott/textminer.png)](https://travis-ci.org/sckott/textminer)
+ [![gem version](https://img.shields.io/gem/v/textminer.svg)](https://rubygems.org/gems/textminer)
+ [![Build Status](https://travis-ci.org/sckott/textminer.svg?branch=master)](https://travis-ci.org/sckott/textminer)
  [![codecov.io](http://codecov.io/github/sckott/textminer/coverage.svg?branch=master)](http://codecov.io/github/sckott/textminer?branch=master)
 
- __This is alpha software, so expect changes__
-
- ## What is it?
-
  __`textminer` helps you text mine through Crossref's TDM (Text & Data Mining) services:__
 
  ## Changes
 
- For changes see the [NEWS file](https://github.com/sckott/textminer/blob/master/NEWS.md).
+ For changes see the [CHANGELOG][changelog]
+
+ ## gem API
+
+ * `Textminer.search` - Search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full text links. This method essentially wraps `Serrano.works()`, but exposes only a subset of its params; this interface may change depending on feedback.
+ * `Textminer.fetch` - Fetch full text given a URL; supports Crossref's Text and Data Mining service
+ * `Textminer.extract` - Extract text from a PDF
 
  ## Install
 
  ### Release version
 
- Not on RubyGems yet
+ ```
+ gem install textminer
+ ```
 
  ### Development version
 
@@ -28,89 +33,87 @@ cd textminer
  rake install
  ```
 
- ## Within Ruby
+ ## Examples
+
+ ### Within Ruby
+
+ #### Search
 
  Search by DOI
 
  ```ruby
  require 'textminer'
- out = textminer.links("10.5555/515151")
+ # link to full text available
+ Textminer.search(doi: '10.7554/elife.06430')
+ # no link to full text available
+ Textminer.search(doi: "10.1371/journal.pone.0000308")
  ```
 
- Get the pdf link
-
- ```ruby
- out.pdf
- ```
+ Many DOIs at once
 
  ```ruby
- "http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf"
+ require 'serrano'
+ dois = Serrano.random_dois(sample: 6)
+ Textminer.search(doi: dois)
  ```
 
- Get the xml link
+ Search with filters
 
  ```ruby
- out.xml
+ Textminer.search(filter: {has_full_text: true})
  ```
 
- ```ruby
- "http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml"
- ```
+ #### Get full text links
 
- Fetch XML
+ The object returned from `Textminer.search` has methods for pulling out all links, or only the XML, PDF, or plain text links
 
  ```ruby
- Textminer.fetch("10.3897/phytokeys.42.7604", "xml")
+ x = Textminer.search(filter: {has_full_text: true})
+ x.links_xml
+ x.links_pdf
+ x.links_plain
  ```
 
- ```ruby
- => {"article"=>
- {"front"=>
- {"journal_meta"=>
- {"journal_id"=>
- {"__content__"=>"PhytoKeys", "journal_id_type"=>"publisher-id"},
- "journal_title_group"=>
- {"journal_title"=>{"__content__"=>"PhytoKeys", "lang"=>"en"},
- "abbrev_journal_title"=>{"__content__"=>"PhytoKeys", "lang"=>"en"}},
- "issn"=>
- [{"__content__"=>"1314-2011", "pub_type"=>"ppub"},
- {"__content__"=>"1314-2003", "pub_type"=>"epub"}],
- "publisher"=>{"publisher_name"=>"Pensoft Publishers"}},
- "article_meta"=>
-
- ...
- ```
+ #### Fetch full text
 
- Fetch PDF
+ `Textminer.fetch()` gets full text given a URL. We determine how to pull down and parse the content based on its content type.
 
  ```ruby
- Textminer.fetch("10.3897/phytokeys.42.7604", "pdf")
+ # get some metadata
+ res = Textminer.search(member: 2258, filter: {has_full_text: true});
+ # get links
+ links = res.links_xml(true);
+ # Get full text for an article
+ res = Textminer.fetch(url: links[0]);
+ # url
+ res.url
+ # file path
+ res.path
+ # content type
+ res.type
+ # parse content
+ res.parse
  ```
 
- > pdf written to disk
-
- ## On the CLI
+ #### Extract text from PDF
 
- Get links
+ `Textminer.extract()` extracts text from a PDF, given a path to the file
 
- ```sh
- tm links 10.3897/phytokeys.42.7604
+ ```ruby
+ res = Textminer.search(member: 2258, filter: {has_full_text: true});
+ links = res.links_pdf(true);
+ res = Textminer.fetch(url: links[0]);
+ Textminer.extract(res.path)
  ```
 
- ```sh
- http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=4190
- http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_pdf&item_id=4190
- ```
+ ### On the CLI
 
- More than one DOI:
-
- ```sh
- tm links '10.3897/phytokeys.42.7604,10.3897/zookeys.516.9439'
- ```
+ Coming soon...
 
  ## To do
 
  * CLI executable
- * get actual full text
  * better test suite
- * documentation
+ * better documentation
+
+ [changelog]: https://github.com/sckott/textminer/blob/master/CHANGELOG.md
data/Rakefile CHANGED
@@ -3,20 +3,35 @@ require 'rake/testtask'
  Rake::TestTask.new do |t|
  t.libs << "test"
- t.test_files = FileList['test/test*.rb']
+ t.test_files = FileList['test/test-*.rb']
  t.verbose = true
  end
 
  desc "Run tests"
  task :default => :test
 
+ desc "Build textminer docs"
+ task :docs do
+ system "yardoc"
+ end
+
+ desc "bundle install"
+ task :bundle do
+ system "bundle install"
+ end
+
+ desc "clean out builds"
+ task :clean do
+ system "ls | grep [0-9].gem | xargs rm"
+ end
+
  desc "Build textminer"
  task :build do
  system "gem build textminer.gemspec"
  end
 
  desc "Install textminer"
- task :install => :build do
+ task :install => [:bundle, :build] do
  system "gem install textminer-#{Textminer::VERSION}.gem"
  end
 
data/bin/tm CHANGED
@@ -14,7 +14,7 @@ class Tm < Thor
  def links(tt)
  tt = "#{tt}"
  tt = tt.to_s.split(',')
- out = Textminer.links(tt).all
+ out = Textminer.search(doi: tt).links(true)
  puts out
  end
  end
@@ -0,0 +1,17 @@
+ ##
+ # Thin layer around pdf-reader gem's PDF::Reader
+ #
+ # @param doi [Array] A DOI, digital object identifier
+ # @param type [Array] One of two options to download: xml (default) or pdf
+ #
+ # @example
+ # require 'textminer'
+ # # fetch full text by DOI - xml by default
+ # Textminer.fetch("10.3897/phytokeys.42.7604")
+ # # many DOIs - xml output
+ # res = Textminer.fetch(["10.3897/phytokeys.42.7604", "10.3897/zookeys.516.9439"])
+ # # fetch full text - pdf
+ # Textminer.fetch("10.3897/phytokeys.42.7604", "pdf")
+ def self.fetch(doi, type = 'xml')
+ Fetch.new(doi, type).fetchtext
+ end
@@ -1,49 +1,124 @@
  require 'httparty'
  require 'json'
  require 'pdf-reader'
+ require 'serrano'
+ require "textminer/miner"
  require "textminer/version"
  require "textminer/request"
  require "textminer/response"
- require "textminer/fetch"
 
  module Textminer
+ extend Configuration
+
+ define_setting :tdm_key
+
  ##
- # Get links meant for text mining
+ # Search for papers and get full text links
  #
  # @param doi [Array] A DOI, digital object identifier
+ # @param options [Array] Curl request options
  # @return [Array] the output
  #
  # @example
  # require 'textminer'
  # # link to full text available
- # Textminer.links("10.5555/515151")
+ # Textminer.search(doi: '10.3897/phytokeys.42.7604')
  # # no link to full text available
- # Textminer.links("10.1371/journal.pone.0000308")
+ # Textminer.search(doi: "10.1371/journal.pone.0000308")
  # # many DOIs at once
- # res = Textminer.links(["10.3897/phytokeys.42.7604", "10.3897/zookeys.516.9439"])
+ # require 'serrano'
+ # dois = Serrano.random_dois(sample: 6)
+ # res = Textminer.search(doi: dois)
+ # res = Textminer.search(doi: ["10.3897/phytokeys.42.7604", "10.3897/zookeys.516.9439"])
  # res.links
- # res.pdf
- # res.xml
- def self.links(doi)
- Request.new(doi).perform
+ # res.links_pdf
+ # res.links_xml
+ # res.links_plain
+ # # only full text available
+ # x = Textminer.search(doi: '10.3816/clm.2001.n.006')
+ # x.links_xml
+ # x.links_plain
+ # x.links_pdf
+ # # no dois
+ # x = Textminer.search(filter: {has_full_text: true})
+ # x.links_xml
+ # x.links_plain
+ # x = Textminer.search(member: 311, filter: {has_full_text: true})
+ # x.links_pdf
+ def self.search(doi: nil, member: nil, filter: nil, limit: nil, options: nil)
+ Request.new(doi, member, filter, limit, options).perform
  end
 
  ##
- # Thin layer around pdf-reader gem's PDF::Reader
+ # Get full text
  #
- # @param doi [Array] A DOI, digital object identifier
- # @param type [Array] One of two options to download: xml (default) or pdf
+ # Works easily for open access papers, but not for closed ones. For non-OA papers, use
+ # Crossref's Text and Data Mining service, which requires authentication and
+ # a pre-authorized IP address. Go to https://apps.crossref.org/clickthrough/researchers
+ # to sign up for the TDM service and get your key. The only publishers
+ # taking part at this time are Elsevier and Wiley.
+ #
+ # @param url [String] A URL for full text
+ # @return [Mined] An object of class Mined, with methods for getting
+ # the url requested, the file path, and parsing the plain text or XML, or extracting
+ # text from the pdf.
  #
  # @example
- # require 'textminer'
- # # fetch full text by DOI - xml by default
- # Textminer.fetch("10.3897/phytokeys.42.7604")
- # # many DOIs - xml output
- # res = Textminer.fetch(["10.3897/phytokeys.42.7604", "10.3897/zookeys.516.9439"])
- # # fetch full text - pdf
- # Textminer.fetch("10.3897/phytokeys.42.7604", "pdf")
- def self.fetch(doi, type = 'xml')
- Fetch.new(doi, type).fetchtext
+ # require 'textminer'
+ # # Set authorization
+ # Textminer.configuration do |config|
+ # config.tdm_key = "<your key>"
+ # end
+ # # Get some elsevier works
+ # res = Textminer.search(member: 78, filter: {has_full_text: true});
+ # links = res.links_xml(true);
+ # # Get full text for an article
+ # out = Textminer.fetch(url: links[0]);
+ # out.url
+ # out.path
+ # out.type
+ # xml = out.parse()
+ # puts xml
+ # xml.xpath('//xocs:cover-date-text', xml.root.namespaces).text
+ # # Get lots of articles
+ # links = links[1..3]
+ # out = links.collect{ |x| Textminer.fetch(url: x) }
+ # out.collect{ |z| z.path }
+ # out.collect{ |z| z.parse }
+ # zz = out[0].parse
+ # zz.xpath('//xocs:cover-date-text', zz.root.namespaces).text
+ #
+ # ## plain text
+ # # get full text links, here plain text
+ # links = res.links_plain(true);
+ # # Get full text for an article
+ # res = Textminer.fetch(url: links[0]);
+ # res.url
+ # res.parse
+ #
+ # # With open access content - using Pensoft
+ # res = Textminer.search(member: 2258, filter: {has_full_text: true});
+ # links = res.links_xml(true);
+ # # Get full text for an article
+ # res = Textminer.fetch(url: links[0]);
+ # res.url
+ # res.parse
+ #
+ # # OA content - pdfs, using pensoft again
+ # res = Textminer.search(member: 2258, filter: {has_full_text: true});
+ # links = res.links_pdf(true);
+ # # Get full text for an article
+ # res = Textminer.fetch(url: links[0]);
+ # # url used
+ # res.url
+ # # document type
+ # res.type
+ # # document path on your machine
+ # res.path
+ # # get text
+ # res.parse
+ def self.fetch(url:)
+ Miner.new(url).perform
  end
 
  ##
@@ -52,15 +127,34 @@ module Textminer
  # @param path [String] Path to a pdf file downloaded via {fetch}, or
  # another way.
  #
+ # This method is used internally within fetch to parse PDFs.
+ #
  # @example
- # require 'textminer'
- # # fetch full text - pdf
- # res = Textminer.fetch("10.3897/phytokeys.42.7604", "pdf")
- # # extract pdf to text
- # Textminer.extract(res)
+ # require 'textminer'
+ # res = Textminer.search(member: 2258, filter: {has_full_text: true});
+ # links = res.links_pdf(true);
+ # # Get full text for an article
+ # out = Textminer.fetch(url: links[0]);
+ # # extract pdf to text
+ # Textminer.extract(out.path)
  def self.extract(path)
  rr = PDF::Reader.new(path)
  rr.pages.map { |page| page.text }.join("\n")
  end
 
+ protected
+
+ def self.link_switch(x, y)
+ case y
+ when nil
+ x.links
+ when 'xml'
+ x.links_xml
+ when 'pdf'
+ x.links_pdf
+ when 'plain'
+ x.links_plain
+ end
+ end
+
  end
@@ -0,0 +1,26 @@
+ # taken from: https://viget.com/extend/easy-gem-configuration-variables-with-defaults
+ module Configuration
+
+ def configuration
+ yield self
+ end
+
+ def define_setting(name, default = nil)
+ class_variable_set("@@#{name}", default)
+ define_class_method "#{name}=" do |value|
+ class_variable_set("@@#{name}", value)
+ end
+ define_class_method name do
+ class_variable_get("@@#{name}")
+ end
+ end
+
+ private
+
+ def define_class_method(name, &block)
+ (class << self; self; end).instance_eval do
+ define_method name, &block
+ end
+ end
+
+ end
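The `Configuration` module added above generates class-level accessors backed by class variables, so the gem can be configured with `Textminer.configuration { |c| c.tdm_key = ... }`. A minimal standalone sketch of the same pattern; `DemoClient` and `tdm_key`'s value here are invented for illustration:

```ruby
# Sketch of the Configuration pattern: define_setting creates a
# class-level getter/setter pair backed by a class variable, and
# `configuration` yields the extending module for block-style assignment.
module Configuration
  def configuration
    yield self
  end

  def define_setting(name, default = nil)
    class_variable_set("@@#{name}", default)
    define_class_method("#{name}=") { |value| class_variable_set("@@#{name}", value) }
    define_class_method(name) { class_variable_get("@@#{name}") }
  end

  private

  def define_class_method(name, &block)
    # Reopen the singleton class so the method becomes a class-level
    # method of whatever module extends Configuration.
    (class << self; self; end).instance_eval do
      define_method(name, &block)
    end
  end
end

# DemoClient is an invented stand-in for the Textminer module.
module DemoClient
  extend Configuration
  define_setting :tdm_key
end

DemoClient.configuration { |c| c.tdm_key = "my-secret-key" }
puts DemoClient.tdm_key   # prints: my-secret-key
```

The singleton-class trick is what makes `DemoClient.tdm_key` callable without instantiating anything, which suits module-level configuration.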
@@ -0,0 +1,54 @@
+ # Array methods
+ class Array
+ def links(just_urls = true)
+ return self.collect{ |x| x.links(just_urls) }.flatten
+ # if temp.length == 1
+ # return tmp[0]
+ # else
+ # return tmp
+ # end
+ # tmp = self.collect{ |x| x['message']['link'] }
+ # return parse_link(tmp, just_urls)
+ end
+ end
+
+ class Array
+ def links_xml(just_urls = true)
+ self.collect { |z| z.links_xml(just_urls) }[0]
+ # return parse_link(self.collect { |z| z.links_xml }[0], just_urls)
+ # return parse_link(pull_link(self, '^application\/xml$|^text\/xml$'), just_urls)
+ end
+ end
+
+ class Array
+ def links_pdf(just_urls = true)
+ self.collect { |z| z.links_pdf(just_urls) }[0]
+ # return parse_link(self.collect { |z| z.links_pdf }[0], just_urls)
+ # return parse_link(pull_link(self, '^application\/pdf$'), just_urls)
+ end
+ end
+
+ class Array
+ def links_plain(just_urls = true)
+ self.collect { |z| z.links_plain(just_urls) }[0]
+ # return parse_link(self.collect { |z| z.links_plain }[0], just_urls)
+ # return parse_link(pull_link(self, '^application\/plain$|^text\/plain$'), just_urls)
+ end
+ end
+
+ # def pull_link(x, y)
+ # return x.collect { |z| z.links_xml }[0]
+ # # return x.collect { |z| z['message']['link'] }.compact.collect { |z| z.compact.select { |w| w['content-type'].match(/#{y}/) } }
+ # end
+
+ # def parse_link(x, just_urls)
+ # if x.nil?
+ # return x
+ # else
+ # if just_urls
+ # return x.compact.collect { |z| z.collect{ |y| y['URL'] }}.flatten
+ # else
+ # return x
+ # end
+ # end
+ # end
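The `Array` patches above delegate each `links*` call to the elements and flatten the result, so one call works for one or many responses. A tiny illustration of that delegation; `FakeResult` is invented for this sketch and stands in for a parsed Crossref response:

```ruby
# Re-creation of the Array#links delegation: each element is expected to
# respond to #links; the patch maps the call over the elements and
# flattens the per-element URL arrays into one list.
class Array
  def links(just_urls = true)
    collect { |x| x.links(just_urls) }.flatten
  end
end

# Invented stand-in for a parsed Crossref response object.
class FakeResult
  def initialize(urls)
    @urls = urls
  end

  def links(_just_urls = true)
    @urls
  end
end

results = [FakeResult.new(["http://a/1.xml"]),
           FakeResult.new(["http://b/2.pdf", "http://b/2.xml"])]
results.links
# => ["http://a/1.xml", "http://b/2.pdf", "http://b/2.xml"]
```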
@@ -0,0 +1,71 @@
+ # Hash methods
+ class Hash
+ def links(just_urls = true)
+ if self['message']['items'].nil?
+ tmp = self['message']['link']
+ if tmp.nil?
+ tmp = nil
+ else
+ tmp = tmp.reject { |c| c.empty? }
+ end
+ else
+ tmp = self['message']['items'].collect { |x| x['link'] }.reject { |c| c.empty? }
+ end
+
+ return parse_links(tmp, just_urls)
+ end
+ end
+
+ class Hash
+ def links_xml(just_urls = true)
+ return parse_links(pull_links(self, '^application\/xml$|^text\/xml$'), just_urls)
+ end
+ end
+
+ class Hash
+ def links_pdf(just_urls = true)
+ return parse_links(pull_links(self, '^application\/pdf$'), just_urls)
+ end
+ end
+
+ class Hash
+ def links_plain(just_urls = true)
+ return parse_links(pull_links(self, '^application\/plain$|^text\/plain$'), just_urls)
+ end
+ end
+
+ def pull_links(x, y)
+ if x['message']['items'].nil?
+ tmp = x['message']['link']
+ if tmp.nil?
+ return nil
+ else
+ return tmp.select { |z| z['content-type'].match(/#{y}/) }.reject { |c| c.empty? }
+ end
+ else
+ return x['message']['items'].collect { |x| x['link'].select { |z| z['content-type'].match(/#{y}/) } }.reject { |c| c.empty? }
+ end
+ end
+
+ def parse_links(x, just_urls)
+ if x.nil?
+ return nil
+ else
+ if x.empty?
+ return x
+ else
+ if just_urls
+ if x[0].class != Array
+ # return x[0]['URL']
+ return x.collect { |x| x['URL'] }.flatten
+ else
+ return x.collect { |x| x.collect { |z| z['URL'] }}.flatten
+ # return x.collect { |x| x['URL'] }.flatten.compact
+ # return x.collect { |x| x.collect { |z| z['URL'] }}.flatten
+ end
+ else
+ return x
+ end
+ end
+ end
+ end
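The heart of `pull_links`/`parse_links` is selecting link entries whose `content-type` matches a pattern, then optionally reducing them to bare URLs. A self-contained sketch of that filtering step; the `message` hash and `pull_links_demo` name are invented here, though the hash mirrors the Crossref `works` message shape:

```ruby
# A message hash in the shape Crossref returns: each item carries a
# 'link' array of {URL, content-type} entries.
message = {
  'message' => {
    'items' => [
      { 'link' => [
        { 'URL' => 'http://example.com/a.xml', 'content-type' => 'application/xml' },
        { 'URL' => 'http://example.com/a.pdf', 'content-type' => 'application/pdf' }
      ] },
      { 'link' => [
        { 'URL' => 'http://example.com/b.pdf', 'content-type' => 'application/pdf' }
      ] }
    ]
  }
}

# Same select/match logic as pull_links, for one content-type pattern.
def pull_links_demo(msg, pattern)
  msg['message']['items']
    .collect { |it| it['link'].select { |l| l['content-type'].match(/#{pattern}/) } }
    .reject(&:empty?)
end

pdfs = pull_links_demo(message, '^application\/pdf$')
pdf_urls = pdfs.collect { |grp| grp.collect { |l| l['URL'] } }.flatten
# pdf_urls == ["http://example.com/a.pdf", "http://example.com/b.pdf"]
```

Anchoring the pattern (`^...$`) is what keeps `application/pdf` from also matching longer types.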
@@ -0,0 +1,65 @@
+ require 'nokogiri'
+ require 'uuidtools'
+
+ def detect_type(x)
+ ctype = x.headers['content-type']
+ case ctype
+ when 'text/xml'
+ 'xml'
+ when 'text/plain'
+ 'plain'
+ when 'application/pdf'
+ 'pdf'
+ end
+ end
+
+ def make_ext(x)
+ case x
+ when 'xml'
+ 'xml'
+ when 'plain'
+ 'txt'
+ when 'pdf'
+ 'pdf'
+ end
+ end
+
+ def make_path(type)
+ # id = x.split('article/')[1].split('?')[0]
+ # path = id + '.' + type
+ # return path
+ type = make_ext(type)
+ uuid = UUIDTools::UUID.random_create.to_s
+ path = uuid + '.' + type
+ return path
+ end
+
+ def write_disk(res, path)
+ f = File.new(path, "wb")
+ f.write(res.body)
+ f.close()
+ end
+
+ def read_disk(path)
+ return File.read(path)
+ end
+
+ def parse_xml(x)
+ text = read_disk(x)
+ xml = Nokogiri.parse(text)
+ return xml
+ end
+
+ def parse_plain(x)
+ text = read_disk(x)
+ return text
+ end
+
+ def parse_pdf(x)
+ return Textminer.extract(x)
+ end
+
+ def is_elsevier_wiley(x)
+ tmp = x.match 'elsevier|wiley'
+ !tmp.nil?
+ end
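The `detect_type`/`make_ext`/`make_path` helpers above map a response's content type to a short tag, then to a file extension, and finally to a unique local filename. A sketch of that pipeline; the `type_for`/`ext_for`/`path_for` names are invented, and stdlib `SecureRandom.uuid` stands in for the `uuidtools` gem used by the real code:

```ruby
require 'securerandom'

# Content type -> short type tag, as in detect_type (which reads the
# HTTP response's content-type header).
def type_for(ctype)
  { 'text/xml' => 'xml', 'text/plain' => 'plain', 'application/pdf' => 'pdf' }[ctype]
end

# Type tag -> file extension, as in make_ext.
def ext_for(type)
  { 'xml' => 'xml', 'plain' => 'txt', 'pdf' => 'pdf' }[type]
end

# Unique on-disk filename; the real make_path uses UUIDTools::UUID,
# SecureRandom.uuid here is a stdlib stand-in.
def path_for(type)
  "#{SecureRandom.uuid}.#{ext_for(type)}"
end

type_for('text/plain')  # => "plain"
path_for('plain')       # e.g. "9b2e5c1a-....txt"
```

A random UUID avoids collisions when many articles are fetched into the same directory, at the cost of filenames that say nothing about their contents.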
@@ -0,0 +1,31 @@
+ require "nokogiri"
+
+ ##
+ # Textminer::Mined
+ #
+ # Class to give back text mining object
+ module Textminer
+ class Mined #:nodoc:
+ attr_accessor :url
+ attr_accessor :path
+ attr_accessor :type
+
+ def initialize(url, path, type)
+ self.url = url
+ self.path = path
+ self.type = type
+ end
+
+ def parse
+ case self.type
+ when 'xml'
+ parse_xml(self.path)
+ when 'plain'
+ parse_plain(self.path)
+ when 'pdf'
+ parse_pdf(self.path)
+ end
+ end
+
+ end
+ end
@@ -0,0 +1,42 @@
+ require "faraday"
+ require "faraday_middleware"
+ require "multi_json"
+ require 'textminer/helpers/configuration'
+ require 'textminer/mined'
+ require 'textminer/mine_utils'
+
+ ##
+ # Textminer::Miner
+ #
+ # Class to give back text mining object
+ module Textminer
+ class Miner #:nodoc:
+ attr_accessor :url
+
+ def initialize(url)
+ self.url = url
+ end
+
+ def perform
+ conn = Faraday.new self.url do |c|
+ c.use FaradayMiddleware::FollowRedirects
+ c.adapter :net_http
+ end
+
+ if is_elsevier_wiley(self.url)
+ res = conn.get do |req|
+ req.headers['CR-Clickthrough-Client-Token'] = Textminer.tdm_key
+ end
+ else
+ res = conn.get
+ end
+
+ type = detect_type(res)
+ path = make_path(type)
+ write_disk(res, path)
+
+ return Mined.new(self.url, path, type)
+ end
+
+ end
+ end
@@ -1,19 +1,36 @@
  module Textminer
  class Request #:nodoc:
  attr_accessor :doi
+ attr_accessor :member
+ attr_accessor :filter
+ attr_accessor :limit
+ attr_accessor :options
 
- def initialize(doi)
+ def initialize(doi, member, filter, limit, options)
  self.doi = doi
+ self.member = member
+ self.filter = filter
+ self.limit = limit
+ self.options = options
  end
 
  def perform
- url = "http://api.crossref.org/works/"
- coll = []
- Array(self.doi).each do |x|
- coll << HTTParty.get(url + x)
+ fac = nil
+
+ if member.nil?
+ res = Serrano.works(ids: doi, filter: filter, limit: limit, options: options)
+ if doi.nil?
+ fac = Serrano.works(ids: doi, filter: filter, options: options, facet: 'license:*', limit: 0)
+ fac = fac['message']['facets']['license']['value-count'].to_s
+ end
+ else
+ res = Serrano.members(ids: member, filter: filter, works: true, limit: limit, options: options)
+ if doi.nil?
+ fac = Serrano.members(ids: member, filter: filter, options: options, facet: 'license:*', limit: 0)
+ fac = fac['message']['facets']['license']['value-count'].to_s
+ end
  end
- # res = HTTParty.get(url + self.doi)
- Response.new(self.doi, coll)
+ Response.new(self.doi, self.member, res, fac)
  end
  end
  end
@@ -1,52 +1,76 @@
+ require 'launchy'
+ require "textminer/link_methods_hash"
+ require "textminer/link_methods_array"
+
  module Textminer
  class Response #:nodoc:
- attr_reader :doi, :response
+ attr_reader :doi, :member, :response, :facet
 
- def initialize(doi, res)
+ def initialize(doi, member, response, facet)
  @doi = doi
- @res = res
+ @member = member
+ @response = response
+ @facet = facet
  end
 
- def raw_body
- # @res
- @res.collect { |x| x.body }
+ def to_s
+ if !@doi.nil?
+ if Array(@doi).length > 3
+ ending = '...'
+ else
+ ending = ''
+ end
+ tt = sprintf('dois: %s %s', Array(@doi)[0..2].join(', '), ending)
+ end
+ if !@member.nil?
+ tt = 'member: ' + @member.to_s
+ end
+ if @doi.nil? && @member.nil?
+ tt = ''
+ end
+ sprintf("<textminer>: \n search: %s\n no. licenses: %s", tt, @facet)
  end
 
- def parsed
- # JSON.parse(@res.body)
- @res.collect { |x| JSON.parse(x.body) }
+ def inspect
+ to_s
  end
 
- def links
- # @res['message']['link']
- @res.collect { |x| x['message']['link'] }
+ def body
+ @response
  end
 
- def pdf
- tmp = links
- if !tmp.nil?
- tmp.collect { |z|
- z.select{ |x| x['content-type'] == "application/pdf" }[0]['URL']
- }
- end
+ def links(just_urls = true)
+ tmp = @response.links(just_urls)
+ compactif(tmp)
  end
 
- def xml
- tmp = links
- if !tmp.nil?
- tmp.collect { |z|
- z.select{ |x| x['content-type'] == "application/xml" }[0]['URL']
- }
- end
+ def links_xml(just_urls = true)
+ tmp = @response.links_xml(just_urls)
+ compactif(tmp)
  end
 
- def all
- [xml, pdf]
+ def links_pdf(just_urls = true)
+ tmp = @response.links_pdf(just_urls)
+ compactif(tmp)
  end
 
- # def browse
+ def links_plain(just_urls = true)
+ tmp = @response.links_plain(just_urls)
+ compactif(tmp)
+ end
 
- # end
+ protected
 
+ def compactif(z)
+ if z.nil?
+ return z
+ else
+ return z.compact
+ end
+ end
+ # def browse
+ # url = 'http://doi.org/' + @doi
+ # Launchy.open(url)
+ # end
  end
  end
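`Response#to_s` above previews at most three DOIs and appends `...` when the search had more. That formatting logic in isolation; the `doi_preview` name and sample DOIs are invented for the sketch:

```ruby
# Same sprintf-based preview as Response#to_s: normalize to an array,
# show up to three DOIs, and mark longer lists with '...'.
def doi_preview(dois)
  dois = Array(dois)
  ending = dois.length > 3 ? '...' : ''
  sprintf('dois: %s %s', dois[0..2].join(', '), ending)
end

puts doi_preview(['10.1/a', '10.1/b'])
# prints: dois: 10.1/a, 10.1/b
puts doi_preview(['10.1/a', '10.1/b', '10.1/c', '10.1/d'])
# prints: dois: 10.1/a, 10.1/b, 10.1/c ...
```

Wrapping the input in `Array()` first is what lets the same code handle a single DOI string and a list of DOIs.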
@@ -0,0 +1,7 @@
+ def singlearray2hash(x)
+ if x.length == 1 && x.class == Array
+ return x[0]
+ else
+ return x
+ end
+ end
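The helper above just unwraps a one-element array and passes everything else (including hashes) through unchanged, which is why a single-result search can be treated like its bare message hash:

```ruby
# Unwrap a one-element array; leave other values untouched.
def singlearray2hash(x)
  if x.length == 1 && x.class == Array
    x[0]
  else
    x
  end
end

singlearray2hash([{ 'a' => 1 }])  # => {"a"=>1}
singlearray2hash([1, 2])          # => [1, 2]
```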
@@ -1,3 +1,3 @@
  module Textminer
- VERSION = "0.1.0"
+ VERSION = "0.1.5"
  end
@@ -6,7 +6,7 @@ require 'textminer/version'
  Gem::Specification.new do |s|
  s.name = 'textminer'
  s.version = Textminer::VERSION
- s.date = '2015-08-24'
+ s.date = '2015-12-04'
  s.summary = "Interact with Crossref's Text and Data mining API"
  s.description = "Search Crossref's search API for full text content, and get full text content."
  s.authors = "Scott Chamberlain"
@@ -15,7 +15,6 @@ Gem::Specification.new do |s|
  s.licenses = 'MIT'
 
  s.files = `git ls-files -z`.split("\x0").reject {|f| f.match(%r{^(test|spec|features)/}) }
- s.test_files = ["test/test_tdm.rb"]
  s.require_paths = ["lib"]
 
  s.bindir = 'bin'
@@ -27,9 +26,16 @@ Gem::Specification.new do |s|
  s.add_development_dependency "oga", '~> 1.2'
  s.add_development_dependency "simplecov", '~> 0.10'
  s.add_development_dependency "codecov", '~> 0.1'
+
+ s.add_runtime_dependency 'serrano', '~> 0.1.4.1'
  s.add_runtime_dependency 'httparty', '~> 0.13'
  s.add_runtime_dependency 'thor', '~> 0.19'
  s.add_runtime_dependency 'json', '~> 1.8'
- s.add_runtime_dependency 'launchy', '~> 2.4', '>= 2.4.2'
+ s.add_runtime_dependency 'multi_json', '~> 1.0'
+ s.add_runtime_dependency 'faraday', '~> 0.9.1'
+ s.add_runtime_dependency 'faraday_middleware', '~> 0.10.0'
+ s.add_runtime_dependency 'launchy', '~> 2.4', '>= 2.4.3'
  s.add_runtime_dependency 'pdf-reader','~> 1.3'
+ s.add_runtime_dependency 'nokogiri', '~> 1.6', '>= 1.6.6.2'
+ s.add_runtime_dependency 'uuidtools', '~> 2.1', '>= 2.1.5'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: textminer
  version: !ruby/object:Gem::Version
- version: 0.1.0
+ version: 0.1.5
  platform: ruby
  authors:
  - Scott Chamberlain
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-08-24 00:00:00.000000000 Z
+ date: 2015-12-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bundler
@@ -94,6 +94,20 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '0.1'
+ - !ruby/object:Gem::Dependency
+ name: serrano
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.1.4.1
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.1.4.1
  - !ruby/object:Gem::Dependency
  name: httparty
  requirement: !ruby/object:Gem::Requirement
@@ -136,6 +150,48 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.8'
+ - !ruby/object:Gem::Dependency
+ name: multi_json
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.0'
+ - !ruby/object:Gem::Dependency
+ name: faraday
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.9.1
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.9.1
+ - !ruby/object:Gem::Dependency
+ name: faraday_middleware
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.10.0
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: 0.10.0
  - !ruby/object:Gem::Dependency
  name: launchy
  requirement: !ruby/object:Gem::Requirement
@@ -145,7 +201,7 @@ dependencies:
  version: '2.4'
  - - ">="
  - !ruby/object:Gem::Version
- version: 2.4.2
+ version: 2.4.3
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
@@ -155,7 +211,7 @@ dependencies:
  version: '2.4'
  - - ">="
  - !ruby/object:Gem::Version
- version: 2.4.2
+ version: 2.4.3
  - !ruby/object:Gem::Dependency
  name: pdf-reader
  requirement: !ruby/object:Gem::Requirement
@@ -170,6 +226,46 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.3'
+ - !ruby/object:Gem::Dependency
+ name: nokogiri
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.6'
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 1.6.6.2
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.6'
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 1.6.6.2
+ - !ruby/object:Gem::Dependency
+ name: uuidtools
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.1'
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 2.1.5
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.1'
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 2.1.5
  description: Search Crossref's search API for full text content, and get full text
  content.
  email: myrmecocystus@gmail.com
@@ -180,18 +276,25 @@ extra_rdoc_files: []
 files:
 - ".gitignore"
 - ".travis.yml"
+- CHANGELOG.md
 - Gemfile
 - Gemfile.lock
-- NEWS.md
 - README.md
 - Rakefile
 - bin/tm
+- extra/fetch.rb
+- extra/fetch_method.rb
 - lib/textminer.rb
-- lib/textminer/fetch.rb
+- lib/textminer/helpers/configuration.rb
+- lib/textminer/link_methods_array.rb
+- lib/textminer/link_methods_hash.rb
+- lib/textminer/mine_utils.rb
+- lib/textminer/mined.rb
+- lib/textminer/miner.rb
 - lib/textminer/request.rb
 - lib/textminer/response.rb
+- lib/textminer/tmutils.rb
 - lib/textminer/version.rb
-- test/test_tdm.rb
 - textminer.gemspec
 homepage: http://github.com/sckott/textminer
 licenses:
@@ -213,10 +316,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.4.5.1
 signing_key:
 specification_version: 4
 summary: Interact with Crossref's Text and Data mining API
-test_files:
-- test/test_tdm.rb
+test_files: []
 has_rdoc:
data/NEWS.md DELETED
@@ -1,3 +0,0 @@
-## 0.0.1 (2015-08-22)
-
-* First version
data/test/test_tdm.rb DELETED
@@ -1,52 +0,0 @@
-require 'simplecov'
-SimpleCov.start
-if ENV['CI']=='true'
-  require 'codecov'
-  SimpleCov.formatter = SimpleCov::Formatter::Codecov
-end
-
-require "textminer"
-require 'fileutils'
-require "test/unit"
-require "oga"
-
-class TestResponse < Test::Unit::TestCase
-
-  def setup
-    @doi = '10.5555/515151'
-    @doi2 = "10.3897/phytokeys.42.7604"
-    @pdf = ["http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf"]
-    @xml = ["http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml"]
-  end
-
-  def test_links_endpoint
-    assert_equal(Textminer::Response, Textminer.links(@doi).class)
-  end
-
-  def test_doi
-    assert_equal(@doi, Textminer.links(@doi).doi)
-  end
-
-  def test_pdf
-    assert_equal(@pdf, Textminer.links(@doi).pdf)
-  end
-
-  def test_xml
-    assert_equal(@xml, Textminer.links(@doi).xml)
-  end
-
-  def test_fetch_xml
-    res = Textminer.fetch(@doi2, "xml")
-    assert_equal(HTTParty::Response, res[0].class)
-    assert_true(res[0].ok?)
-    assert_equal(String, res[0].body.class)
-    assert_equal("PhytoKeys", Oga.parse_xml(res[0].body).xpath('//journal-meta//journal-id').text)
-  end
-
-  # def test_fetch_pdf
-  #   res = Textminer.fetch(@doi2, "pdf")
-  #   assert_equal(HTTParty::Response, res.class)
-  #   assert_true(res.ok?)
-  # end
-
-end