spn2 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 6ec6088e94bed925212bc5aa5abf3211cae5fd0bf5402c49f316d7e4d65f051d
-   data.tar.gz: e00cad0b9921f36052ec81271e69f553979dec1e0b6cd540f1bee1c0f712f337
+   metadata.gz: d73702950316690d62ce910059788093af0ba31b094c1208eb7c7ef72cbaa06e
+   data.tar.gz: 87ba2a61a9f76d86d46043da74418b2a67f6ac584dac2c3b50b2aa7baa8c7aad
  SHA512:
-   metadata.gz: e618df35304407811745df2e7c92106669a01564d4c61c1639fa06a109ffb995fb05e145e12d13efc7cf30fd4c899a959138e0640894bc68c3942efd1035e3d0
-   data.tar.gz: cc5616e4848cd9f5bfceb58e163095f9adfc86e584857d3c5c9674bfbbe8a182262d7c779c7d24a87134e7959e7f3a6fcbfc4588f039efc5184919e193d70ca2
+   metadata.gz: 0ad8a32dc48bd5dbaf5b24428f6a3522c4af9ce3a5a2297df39fb4dd9b36a627565a844cf841ca50e3c3e0200b569c88d73bd27717468ed8b8e2f5444114d316
+   data.tar.gz: b08014dddc62d499a30e5bc8ba1ce20c96a5f01ad8d0d465759897904f297339a9b79ecff1fdd56561c194072e24ba83542b5d0e39f6d6ab8e57a3b55bae8f6d
data/.rubocop.yml CHANGED
@@ -8,6 +8,3 @@ AllCops:
  
  Style/HashSyntax:
    Enabled: false # yuk Ruby 3.1
-
- Style/NumericLiterals:
-   Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,17 @@
- ## [Unreleased]
-
  ## [0.1.0] - 2022-06-29
  
  - Initial release
+
+ ## [0.1.1] - 2022-06-30
+
+ - Add error handling
+ - Add ability to add opts to Spn2.save
+
+ ## [0.1.2] - 2022-07-02
+
+ - Add user_status
+ - Add status calls for multiple job_ids and outlinks
+
+ ## [0.2.0] - 2022-07-03
+
+ - Breaking change: Single method 'status' with kwarg :job_ids now used for status of job(s)
data/Guardfile CHANGED
@@ -4,4 +4,5 @@ guard :minitest do
    watch(%r{^test/(.*)/?test_(.*)\.rb$})
    watch(%r{^test/test_helper\.rb$}) { 'test' }
    watch(%r{^lib/(.*/)?([^/]+)\.rb$}) { |m| "test/lib/#{m[1]}test_#{m[2]}.rb" }
+   watch(%r{^lib/spn2\.rb}) { 'test/lib/spn2' }
  end
data/README.md CHANGED
@@ -1,3 +1,4 @@
+ [![Ruby Style Guide](https://img.shields.io/badge/code_style-rubocop-brightgreen.svg)](https://github.com/rubocop/rubocop)
  [![Gem Version](https://badge.fury.io/rb/spn2.svg)](https://badge.fury.io/rb/spn2)
  
  # Spn2
@@ -21,25 +22,44 @@ For the Spn2 namespace do:
  ```rb
  require 'spn2'
  ```
-
  ### Authentication
  
  The API requires authentication, so you will need an account at [archive.org](https://archive.org). There are two methods of authentication; cookies and API key. Presently only the latter is implemented. API keys may be generated at https://archive.org/account/s3.php. Ensure your access key and secret key are set in environment variables SPN2_ACCESS_KEY and SPN2_SECRET_KEY respectively.
  
+ ```rb
+ > Spn2.access_key
+ => <your access key>
+ > Spn2.secret_key
+ => <your secret key>
+ ```
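The `data/lib/spn2.rb` hunk further down changes both accessors from `ENV.fetch('SPN2_ACCESS_KEY', nil)` to `ENV.fetch('SPN2_ACCESS_KEY')`, so in 0.2.0 an unset variable should raise `KeyError` instead of returning `nil`. A minimal sketch of that standard-library behaviour follows; `SPN2_DEMO_KEY` is a made-up variable name used here so that real credentials are left alone.

```ruby
# SPN2_DEMO_KEY is a hypothetical variable name, used only for this illustration.
ENV.delete('SPN2_DEMO_KEY')

# 0.1.0 style: the nil default silently hides a missing variable.
old_style = ENV.fetch('SPN2_DEMO_KEY', nil)

# 0.2.0 style: no default, so Ruby raises KeyError when the variable is unset.
new_style = begin
  ENV.fetch('SPN2_DEMO_KEY')
rescue KeyError => e
  e.class
end

puts old_style.inspect
puts new_style
```

Failing early like this presumably surfaces a configuration problem before any request is sent, rather than as an authentication error from the API.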
  ### Save a page
  
- Save a url in the Wayback Machine. This method returns the job_id in a hash.
+ Save (capture) a url in the Wayback Machine. This method returns the job_id in a hash.
  ```rb
  > Spn2.save(url: 'example.com') # returns a job_id
  
- => {job_id: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14'}
+ => {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"} # may include a "message" key too
+ ```
+ Various options are available, as detailed in the [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit) in the section "Capture request". These may be passed like so:
+ ```rb
+ > Spn2.save(url: 'example.com', opts: { capture_all: 1, capture_outlinks: 1 })
+
+ => {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"}
+ ```
+ Page save errors will raise an error and look like this:
+ ```rb
+ => {"status"=>"error", "status_ext"=>"error:too-many-daily-captures", "message"=>"This URL has been already captured 10 times today.
+ Please try again tomorrow. Please email us at \"info@archive.org\" if you would like to discuss this more."} (Spn2::Spn2ErrorFailedCapture)
  ```
+ The key "status_ext" contains an explanatory message - see the API [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit).
+
+
  
  ### View the status of a job
  
  Use the job_id.
  ```rb
- > Spn2.status(job_id: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14')
+ > Spn2.status(job_ids: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14')
  
  => {"counters"=>{"outlinks"=>1, "embeds"=>2}, "job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14",
  "original_url"=>"http://example.com/", "resources"=>["http://example.com/", "http://example.com/favicon.ico"],
@@ -48,13 +68,56 @@ Use the job_id.
  ```
  "status" => "success" is what you are looking for.
  
+ Care is advised for domains/urls which are frequently saved into the Wayback Machine as the job_id is merely "spn2-" followed by a hash of the url\*. A status request will show the status of _the most recent capture by anyone_ of the url in question.
+
+ \* Usually an sha1 hash of the url in the form http://\<domain\>/\<path\>/ e.g:
+ ```sh
+ $ echo "http://example.com/"|tr -d "\n"|shasum
+ 9c17e047f58f9220a7008d4f18152fee4d111d14 -
+ ```
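The same derivation can be sketched in Ruby with the standard `digest` library. This assumes, as the README line above says, that the job_id is usually "spn2-" plus the SHA-1 of the normalised url; `guess_job_id` is a hypothetical helper and not part of the gem.

```ruby
require 'digest/sha1'

# Hypothetical helper (not in the gem): build the job_id the README describes
# from an already-normalised url such as "http://example.com/".
def guess_job_id(normalised_url)
  "spn2-#{Digest::SHA1.hexdigest(normalised_url)}"
end

job_id = guess_job_id('http://example.com/')
puts job_id

# The result has the shape expected by JOB_ID_REGEXP in lib/spn2.rb.
puts job_id.match?(/^(spn2-([a-f]|\d){40})$/)
```

Since the README hedges the rule with "usually", a guessed id is best treated as a hint; the job_id returned by `Spn2.save` remains the reliable value.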
+
+ The status of an array of job_id's can be obtained with:
+ ```rb
+ > Spn2.status(job_ids: ['spn2-9c17e047f58f9220a7008d4f18152fee4d111d14', 'spn2-...'])
+
+ => [.. # an array of status hashes
+ ```
+
+ Finally, the status of any outlinks captured by using the save option `capture_outlinks: 1` is available by supplying the parent job_id to:
+ ```rb
+ > Spn2.status(job_ids: 'spn2-cce034d987e1d72d8cbf1770bcf99024fe20dddf', outlinks: true)
+
+ => [.. # an array of outlink job status hashes
+ ```
+ ### User status
+
+ Information about the user is available via:
+ ```rb
+ > Spn2.user_status
+ => {"daily_captures_limit"=>100000, "available"=>8, "processing"=>0, "daily_captures"=>10}
+ ```
+
  ### System status
  
  The status of Wayback Machine itself is available.
  ```rb
  > Spn2.system_status
- => {"status"=>"ok"}
+ => {"status"=>"ok"} # if not "ok" captures may be delayed
+ ```
+ ### Error handling
+
+ To facilitate graceful error handling, a full list of all error classes is provided by:
+ ```rb
+ > Spn2.error_classes
+ => [Spn2::Spn2Error, Spn2::Spn2ErrorBadAuth,.. ..]
  ```
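Every error class added in `lib/spn2_errors.rb` (shown later in this diff) inherits from `Spn2::Spn2Error`, so rescuing the base class should catch any of them. The sketch below shows that pattern with a stand-in module, `DemoSpn2`, and a fake `save` that always fails, so it runs without the gem or network access. Both names are illustrative assumptions.

```ruby
# DemoSpn2 is a stand-in for illustration only. It mimics the shape of the
# gem's error hierarchy but is not the gem itself.
module DemoSpn2
  class Spn2Error < StandardError; end
  class Spn2ErrorBadAuth < Spn2Error; end
  class Spn2ErrorFailedCapture < Spn2Error; end

  # Hypothetical save that always fails, to exercise the rescue path.
  def self.save(url:)
    raise Spn2ErrorFailedCapture, "capture failed for #{url}"
  end
end

def try_save(url)
  DemoSpn2.save(url: url)
rescue DemoSpn2::Spn2ErrorBadAuth
  :check_credentials # a specific subclass can be handled first
rescue DemoSpn2::Spn2Error => e
  warn "#{e.class}: #{e.message}" # the base class catches every other subclass
  :failed
end

result = try_save('example.com')
puts result
```

With the real gem the same structure would presumably apply by swapping `DemoSpn2` for `Spn2`.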
+ ## Testing
+
+ Just run `bundle exec rake` to run the test suite.
+
+ Valid API keys must be held in SPN2_ACCESS_KEY and SPN2_SECRET_KEY for testing. Go to https://archive.org/account/s3.php to set up API keys if you need them. If you have your live keys stored in these env vars just do:
+
+ `export SPN2_ACCESS_KEY=<valid access test key> && export SPN2_SECRET_KEY=<valid secret test key>` immediately before the above command.
  
  ## Development
  
data/lib/curlable.rb CHANGED
@@ -12,7 +12,7 @@ module Curlable
    end
  
    def post(url:, headers: {}, params: {})
-     Curl::Easy.http_post("#{url}?#{Curl.postalize(params)}", params) do |http|
+     Curl::Easy.http_post("#{url}?#{Curl.postalize(params)}", Curl.postalize(params)) do |http|
        http.follow_location = true
        headers.each { |k, v| http.headers[k] = v }
      end.body_str
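The changed line sends an encoded string as the POST body where 0.1.0 passed the raw `params` value. As a rough illustration of what a form-encoded body looks like, the sketch below uses the standard library's `URI.encode_www_form` as a stand-in. That `Curl.postalize` produces the same `key=value&key=value` shape is an assumption here, and its exact escaping may differ.

```ruby
require 'uri'

# Params in the shape Spn2.save builds them: { url: url }.merge(opts)
params = { url: 'example.com', capture_all: 1 }

# Stand-in for Curl.postalize (assumption): standard form encoding.
body = URI.encode_www_form(params)
puts body
```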
data/lib/spn2/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
  
  module Spn2
-   VERSION = '0.1.0'
+   VERSION = '0.2.0'
  end
data/lib/spn2.rb CHANGED
@@ -1,61 +1,134 @@
  # frozen_string_literal: true
  
+ require 'date'
  require 'json'
  require 'nokogiri'
  
  require_relative 'curlable'
+ require_relative 'spn2_errors'
  
- # namespace
+ # Design decison to not use a class as only 'state' is in 2 env vars
  module Spn2
-   extend Curlable # for .system_status
-   include Curlable
+   extend Curlable
  
-   class Spn2Error < StandardError; end
-
-   FIND_JOB_ID_REGEXP = /(spn2-([a-f]|\d){40})/
+   ESSENTIAL_STATUS_KEYS = %w[job_id resources status].freeze
+   JOB_ID_REGEXP = /^(spn2-([a-f]|\d){40})$/
    WEB_ARCHIVE = 'https://web.archive.org'
  
-   def self.access_key
-     ENV.fetch('SPN2_ACCESS_KEY', nil)
-   end
+   BINARY_OPTS = %w[capture_all capture_outlinks capture_screenshot delay_wb_availabilty force_get skip_first_archive
+                    outlinks_availability email_result].freeze
+   OTHER_OPTS = %w[if_not_archived_within js_behavior_timeout capture_cookie target_username target_password].freeze
  
-   def self.secret_key
-     ENV.fetch('SPN2_SECRET_KEY', nil)
-   end
+   class << self
+     def error_classes
+       Spn2.constants.map { |e| Spn2.const_get(e) }.select { |e| e.is_a?(Class) && e < Exception }
+     end
  
-   def self.system_status
-     JSON.parse get(url: "#{WEB_ARCHIVE}/save/status/system")
-   end
+     def access_key
+       ENV.fetch('SPN2_ACCESS_KEY')
+     end
  
-   def self.save(url:)
-     job_id(json(auth_post(url: "#{WEB_ARCHIVE}/save/#{url}", params: "url=#{url}")))
-   end
+     def secret_key
+       ENV.fetch('SPN2_SECRET_KEY')
+     end
  
-   def self.status(job_id:)
-     json auth_get(url: "#{WEB_ARCHIVE}/save/status/#{job_id}?_t=#{Time.now.to_i}")
-   end
+     def system_status
+       json get(url: "#{WEB_ARCHIVE}/save/status/system") # no auth
+     end
  
-   def self.auth_get(url:)
-     get(url: url, headers: accept_header.merge(auth_header))
-   end
+     def user_status
+       json auth_get(url: "#{WEB_ARCHIVE}/save/status/user?t=#{DateTime.now.strftime('%Q').to_i}")
+     end
  
-   def self.auth_post(url:, params:)
-     post(url: url, headers: accept_header.merge(auth_header), params:)
-   end
+     def save(url:, opts: {})
+       raise Spn2ErrorInvalidOption, "One or more invalid options: #{opts}" unless options_valid?(opts)
  
-   def self.accept_header
-     { 'Accept' => 'application/json' }
-   end
+       json = json(auth_post(url: "#{WEB_ARCHIVE}/save/#{url}", params: { url: url }.merge(opts)))
+       raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
  
-   def self.auth_header
-     { 'Authorization' => "LOW #{Spn2.access_key}:#{Spn2.secret_key}" }
-   end
+       raise Spn2ErrorFailedCapture, json.inspect unless json['job_id']
  
-   def self.job_id(hash)
-     { job_id: hash['job_id'] }
-   end
+       json
+     end
+     alias capture save
+
+     def status(job_ids:, outlinks: false)
+       params = status_params(job_ids: job_ids, outlinks: outlinks)
+       json = json(auth_post(url: "#{WEB_ARCHIVE}/save/status", params: params))
+       return json if json.is_a? Array # must be valid response
+
+       handle_status_errors(job_ids: job_ids, json: json, outlinks: outlinks)
+       json
+     end
+
+     private
+
+     def status_params(job_ids:, outlinks:)
+       return { job_ids: job_ids.join(',') } if job_ids.is_a?(Array)
+       return { job_id_outlinks: job_ids } if outlinks
+
+       { job_id: job_ids } # single job_id
+     end
+
+     def handle_status_errors(job_ids:, json:, outlinks:)
+       raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
+       raise Spn2ErrorNoOutlinks, json.inspect if outlinks
+       raise Spn2ErrorMissingKeys, json.inspect unless (ESSENTIAL_STATUS_KEYS - json.keys).empty?
+       raise Spn2Error, json.inspect if job_ids.is_a?(Array)
+     end
+
+     def auth_get(url:)
+       get(url: url, headers: accept_header.merge(auth_header))
+     end
+
+     def auth_post(url:, params: {})
+       post(url: url, headers: accept_header.merge(auth_header), params: params)
+     end
+
+     def accept_header
+       { Accept: 'application/json' }
+     end
+
+     def auth_header
+       { Authorization: "LOW #{Spn2.access_key}:#{Spn2.secret_key}" }
+     end
+
+     def doc(html_string)
+       Nokogiri::HTML html_string
+     end
+
+     def json(html_string)
+       JSON.parse(doc = doc(html_string))
+     rescue JSON::ParserError # an html response
+       parse_error_code_from_page_title(doc.title) if doc.title
+       parse_error_from_page_body(html_string)
+     end
+
+     def parse_error_code_from_page_title(title_string)
+       raise_code_response_error_if_code_in_string(title_string)
+       raise Spn2ErrorUnknownResponseCode, title_string # code found but doesn't match any known error classes
+     end
+
+     def parse_error_from_page_body(html_string)
+       h1_tag_text = h1_tag_text(html_string)
+       raise_code_response_error_if_code_in_string h1_tag_text
+       raise Spn2ErrorTooManyRequests if h1_tag_text == TOO_MANY_REQUESTS
+
+       raise Spn2ErrorUnknownResponse, html_string # fall through
+     end
+
+     def h1_tag_text(html_string)
+       doc(html_string).xpath('//h1')&.text || ''
+     end
+
+     def raise_code_response_error_if_code_in_string(string)
+       return unless ERROR_CODES.include? code = string.to_i
+
+       raise Spn2.const_get("Spn2Error#{code}")
+     end
  
-   def self.json(json_string)
-     JSON.parse Nokogiri::HTML(json_string)
+     def options_valid?(opts)
+       opts.keys.all? { |k| (BINARY_OPTS + OTHER_OPTS).include? k.to_s }
+     end
    end
  end
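To make the new `status` dispatch and the option check easier to follow, here is a stand-alone sketch that copies the logic of `status_params` and `options_valid?` from the hunk above into a hypothetical module, `StatusSketch`. The option lists are shortened for brevity and the job ids are placeholders, so treat it as an illustration and not the gem's code.

```ruby
# StatusSketch is a hypothetical container; the method bodies mirror the diff.
module StatusSketch
  BINARY_OPTS = %w[capture_all capture_outlinks].freeze # shortened list
  OTHER_OPTS  = %w[if_not_archived_within].freeze       # shortened list

  module_function

  # Chooses the request parameter from the shape of job_ids.
  def status_params(job_ids:, outlinks: false)
    return { job_ids: job_ids.join(',') } if job_ids.is_a?(Array)
    return { job_id_outlinks: job_ids } if outlinks

    { job_id: job_ids } # single job_id
  end

  # Accepts symbol or string keys, because each key is compared via to_s.
  def options_valid?(opts)
    opts.keys.all? { |k| (BINARY_OPTS + OTHER_OPTS).include?(k.to_s) }
  end
end

p StatusSketch.status_params(job_ids: %w[spn2-aaa spn2-bbb])
p StatusSketch.status_params(job_ids: 'spn2-aaa', outlinks: true)
p StatusSketch.status_params(job_ids: 'spn2-aaa')
p StatusSketch.options_valid?({ capture_all: 1 })
p StatusSketch.options_valid?({ bogus: 1 })
```

One consequence of the branch order is that an Array of job ids takes the first branch, so `outlinks: true` has no effect on the parameters in that case.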
data/lib/spn2_errors.rb ADDED
@@ -0,0 +1,20 @@
+ # frozen_string_literal: true
+
+ # namespace
+ module Spn2
+   BAD_AUTH_MSG = 'You need to be logged in to use Save Page Now.'
+   ERROR_CODES = [400, 502].freeze
+   TOO_MANY_REQUESTS = 'Too Many Requests'
+
+   class Spn2Error < StandardError; end
+   class Spn2ErrorBadAuth < Spn2Error; end
+   class Spn2ErrorBadParams < Spn2Error; end
+   class Spn2ErrorFailedCapture < Spn2Error; end
+   class Spn2ErrorInvalidOption < Spn2Error; end
+   class Spn2ErrorMissingKeys < Spn2Error; end
+   class Spn2ErrorNoOutlinks < Spn2Error; end
+   class Spn2ErrorTooManyRequests < Spn2Error; end
+   class Spn2ErrorUnknownResponse < Spn2Error; end
+   class Spn2ErrorUnknownResponseCode < Spn2Error; end
+   ERROR_CODES.each { |i| const_set("Spn2Error#{i}", Class.new(Spn2Error)) }
+ end
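The last line of the new file generates one error class per HTTP code with `const_set`, and `lib/spn2.rb` later looks the class up from the leading digits of a page title or `<h1>` text. A minimal sketch of that technique follows, using a hypothetical module `DemoErrors` with shortened class names; the lookup helper `error_for` is an assumption for illustration, not a method of the gem.

```ruby
# DemoErrors is a hypothetical module mirroring the pattern in spn2_errors.rb.
module DemoErrors
  ERROR_CODES = [400, 502].freeze

  class BaseError < StandardError; end

  # Defines DemoErrors::Error400 and DemoErrors::Error502 at load time.
  ERROR_CODES.each { |code| const_set("Error#{code}", Class.new(BaseError)) }

  # Returns the matching class, or nil when the text has no known leading code.
  def self.error_for(text)
    code = text.to_i # String#to_i reads leading digits only; 0 if there are none
    return unless ERROR_CODES.include?(code)

    const_get("Error#{code}")
  end
end

p DemoErrors.error_for('502 Bad Gateway')
p DemoErrors.error_for('Too Many Requests')
p DemoErrors::Error400.ancestors.include?(DemoErrors::BaseError)
```

Generating the classes from `ERROR_CODES` means supporting a further status code would presumably only require adding it to that array.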
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: spn2
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.0
  platform: ruby
  authors:
  - MatzFan
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-06-29 00:00:00.000000000 Z
+ date: 2022-07-03 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: curb
@@ -80,6 +80,20 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '5.16'
+ - !ruby/object:Gem::Dependency
+   name: minitest-parallel_fork
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.2'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.2'
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
@@ -136,7 +150,7 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '0.6'
- description: Atomate the process of saving web pages to archive.org
+ description: Automate the process of saving web pages to archive.org
  email:
  executables: []
  extensions: []
@@ -152,8 +166,9 @@ files:
  - lib/curlable.rb
  - lib/spn2.rb
  - lib/spn2/version.rb
+ - lib/spn2_errors.rb
  - sig/spn2.rbs
- homepage: https://gitlab.com/matxfan/spn2
+ homepage: https://gitlab.com/matzfan/spn2
  licenses:
  - MIT
  metadata:
@@ -177,8 +192,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.3.16
+ rubygems_version: 3.3.17
  signing_key:
  specification_version: 4
- summary: Gem for the Save Page Now API of the Wayback Machine
+ summary: Gem for the Save Page Now 2 API of the Wayback Machine
  test_files: []