spn2 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.rubocop.yml +0 -3
- data/CHANGELOG.md +14 -2
- data/Guardfile +1 -0
- data/README.md +68 -5
- data/lib/curlable.rb +1 -1
- data/lib/spn2/version.rb +1 -1
- data/lib/spn2.rb +111 -38
- data/lib/spn2_errors.rb +20 -0
- metadata +21 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d73702950316690d62ce910059788093af0ba31b094c1208eb7c7ef72cbaa06e
+  data.tar.gz: 87ba2a61a9f76d86d46043da74418b2a67f6ac584dac2c3b50b2aa7baa8c7aad
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0ad8a32dc48bd5dbaf5b24428f6a3522c4af9ce3a5a2297df39fb4dd9b36a627565a844cf841ca50e3c3e0200b569c88d73bd27717468ed8b8e2f5444114d316
+  data.tar.gz: b08014dddc62d499a30e5bc8ba1ce20c96a5f01ad8d0d465759897904f297339a9b79ecff1fdd56561c194072e24ba83542b5d0e39f6d6ab8e57a3b55bae8f6d
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,17 @@
-## [Unreleased]
-
 ## [0.1.0] - 2022-06-29
 
 - Initial release
+
+## [0.1.1] - 2022-06-30
+
+- Add error handling
+- Add ability to add opts to Spn2.save
+
+## [0.1.2] - 2022-07-02
+
+- Add user_status
+- Add status calls for multiple job_ids and outlinks
+
+## [0.2.0] - 2022-07-03
+
+- Breaking change: Single method 'status' with kwarg :job_ids now used for status of job(s)
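The breaking change in 0.2.0 routes every job status query through a single `status` method. The following is a minimal sketch of how the `job_ids:` and `outlinks:` arguments appear to select the request parameters, mirroring the private `status_params` helper in the `lib/spn2.rb` diff. The standalone method and the shortened job ids are for illustration only, and nothing here contacts the API.

```ruby
# Standalone copy of the parameter selection used by Spn2.status in 0.2.0.
# Illustration only: in the gem this is a private method on Spn2.
def status_params(job_ids:, outlinks: false)
  return { job_ids: job_ids.join(',') } if job_ids.is_a?(Array) # several jobs at once
  return { job_id_outlinks: job_ids } if outlinks                # outlinks of a parent job

  { job_id: job_ids } # single job_id
end

# 'spn2-aaa' and 'spn2-bbb' are shortened placeholder ids.
p status_params(job_ids: 'spn2-aaa')
p status_params(job_ids: %w[spn2-aaa spn2-bbb])
p status_params(job_ids: 'spn2-aaa', outlinks: true)
```

Note that the Array branch is checked first, so in this sketch `outlinks: true` has no effect when an array of job ids is passed.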
data/Guardfile
CHANGED
data/README.md
CHANGED
@@ -1,3 +1,4 @@
+[](https://github.com/rubocop/rubocop)
 [](https://badge.fury.io/rb/spn2)
 
 # Spn2
@@ -21,25 +22,44 @@ For the Spn2 namespace do:
 ```rb
 require 'spn2'
 ```
-
 ### Authentication
 
 The API requires authentication, so you will need an account at [archive.org](https://archive.org). There are two methods of authentication; cookies and API key. Presently only the latter is implemented. API keys may be generated at https://archive.org/account/s3.php. Ensure your access key and secret key are set in environment variables SPN2_ACCESS_KEY and SPN2_SECRET_KEY respectively.
 
+```rb
+> Spn2.access_key
+=> <your access key>
+> Spn2.secret_key
+=> <your secret key>
+```
 ### Save a page
 
-Save a url in the Wayback Machine. This method returns the job_id in a hash.
+Save (capture) a url in the Wayback Machine. This method returns the job_id in a hash.
 ```rb
 > Spn2.save(url: 'example.com') # returns a job_id
 
-=> {job_id
+=> {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"} # may include a "message" key too
+```
+Various options are available, as detailed in the [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit) in the section "Capture request". These may be passed like so:
+```rb
+> Spn2.save(url: 'example.com', opts: { capture_all: 1, capture_outlinks: 1 })
+
+=> {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"}
+```
+Page save errors will raise an error and look like this:
+```rb
+=> {"status"=>"error", "status_ext"=>"error:too-many-daily-captures", "message"=>"This URL has been already captured 10 times today.
+Please try again tomorrow. Please email us at \"info@archive.org\" if you would like to discuss this more."} (Spn2::Spn2ErrorFailedCapture)
 ```
+The key "status_ext" contains an explanatory message - see the API [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit).
+
+
 
 ### View the status of a job
 
 Use the job_id.
 ```rb
-> Spn2.status(
+> Spn2.status(job_ids: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14')
 
 => {"counters"=>{"outlinks"=>1, "embeds"=>2}, "job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14",
 "original_url"=>"http://example.com/", "resources"=>["http://example.com/", "http://example.com/favicon.ico"],
@@ -48,13 +68,56 @@ Use the job_id.
 ```
 "status" => "success" is what you are looking for.
 
+Care is advised for domains/urls which are frequently saved into the Wayback Machine as the job_id is merely "spn2-" followed by a hash of the url\*. A status request will show the status of _the most recent capture by anyone_ of the url in question.
+
+\* Usually an sha1 hash of the url in the form http://\<domain\>/\<path\>/ e.g:
+```sh
+$ echo "http://example.com/"|tr -d "\n"|shasum
+9c17e047f58f9220a7008d4f18152fee4d111d14 -
+```
+
+The status of an array of job_id's can be obtained with:
+```rb
+> Spn2.status(job_ids: ['spn2-9c17e047f58f9220a7008d4f18152fee4d111d14', 'spn2-...'])
+
+=> [.. # an array of status hashes
+```
+
+Finally, the status of any outlinks captured by using the save option `capture_outlinks: 1` is available by supplying the parent job_id to:
+```rb
+> Spn2.status(job_ids: 'spn2-cce034d987e1d72d8cbf1770bcf99024fe20dddf', outlinks: true)
+
+=> [.. # an array of outlink job status hashes
+```
+### User status
+
+Information about the user is available via:
+```rb
+> Spn2.user_status
+=> {"daily_captures_limit"=>100000, "available"=>8, "processing"=>0, "daily_captures"=>10}
+```
+
 ### System status
 
 The status of Wayback Machine itself is available.
 ```rb
 > Spn2.system_status
-=> {"status"=>"ok"}
+=> {"status"=>"ok"} # if not "ok" captures may be delayed
+```
+### Error handling
+
+To facilitate graceful error handling, a full list of all error classes is provided by:
+```rb
+> Spn2.error_classes
+=> [Spn2::Spn2Error, Spn2::Spn2ErrorBadAuth,.. ..]
 ```
+## Testing
+
+Just run `bundle exec rake` to run the test suite.
+
+Valid API keys must be held in SPN2_ACCESS_KEY and SPN2_SECRET_KEY for testing. Go to https://archive.org/account/s3.php to set up API keys if you need them. If you have your live keys stored in these env vars just do:
+
+`export SPN2_ACCESS_KEY=<valid access test key> && export SPN2_SECRET_KEY=<valid secret test key>` immediately before the above command.
 
 ## Development
 
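The README text added in this version says a job_id is usually "spn2-" followed by a SHA-1 hash of the normalised url. A minimal Ruby sketch of that derivation, assuming the url is already in the form `http://<domain>/<path>/`, is given here. `predicted_job_id` is a hypothetical helper and not part of the gem; the pattern is copied from `JOB_ID_REGEXP` in `lib/spn2.rb`.

```ruby
require 'digest'

# Pattern copied from JOB_ID_REGEXP in lib/spn2.rb (version 0.2.0).
JOB_ID_REGEXP = /^(spn2-([a-f]|\d){40})$/

# Hypothetical helper, not part of the gem: predicts the job_id for a url
# that is already normalised to http://<domain>/<path>/.
def predicted_job_id(normalised_url)
  "spn2-#{Digest::SHA1.hexdigest(normalised_url)}"
end

job_id = predicted_job_id('http://example.com/')
puts job_id
puts job_id.match?(JOB_ID_REGEXP) # checks the shape only, not that the job exists
```

Since the README says "usually", a predicted id should be treated as a guess until the API returns the same value.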
data/lib/curlable.rb
CHANGED
@@ -12,7 +12,7 @@ module Curlable
   end
 
   def post(url:, headers: {}, params: {})
-    Curl::Easy.http_post("#{url}?#{Curl.postalize(params)}", params) do |http|
+    Curl::Easy.http_post("#{url}?#{Curl.postalize(params)}", Curl.postalize(params)) do |http|
       http.follow_location = true
       headers.each { |k, v| http.headers[k] = v }
     end.body_str
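The only change in this hunk is the POST body argument: it is now the url-encoded string rather than the raw params hash. As a rough illustration of what such a string looks like, this sketch uses the standard library's `URI.encode_www_form` as a stand-in. That substitution is an assumption made for the example; curb's `Curl.postalize` may escape some values differently.

```ruby
require 'uri'

# Example params as they might be passed to Spn2.save.
params = { url: 'example.com', capture_all: 1 }

# Standard-library stand-in for Curl.postalize: a single url-encoded string.
body = URI.encode_www_form(params)
puts body # → url=example.com&capture_all=1
```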
data/lib/spn2/version.rb
CHANGED
data/lib/spn2.rb
CHANGED
@@ -1,61 +1,134 @@
 # frozen_string_literal: true
 
+require 'date'
 require 'json'
 require 'nokogiri'
 
 require_relative 'curlable'
+require_relative 'spn2_errors'
 
-#
+# Design decison to not use a class as only 'state' is in 2 env vars
 module Spn2
-  extend Curlable
-  include Curlable
+  extend Curlable
 
-
-
-  FIND_JOB_ID_REGEXP = /(spn2-([a-f]|\d){40})/
+  ESSENTIAL_STATUS_KEYS = %w[job_id resources status].freeze
+  JOB_ID_REGEXP = /^(spn2-([a-f]|\d){40})$/
   WEB_ARCHIVE = 'https://web.archive.org'
 
-
-
-
+  BINARY_OPTS = %w[capture_all capture_outlinks capture_screenshot delay_wb_availabilty force_get skip_first_archive
+                   outlinks_availability email_result].freeze
+  OTHER_OPTS = %w[if_not_archived_within js_behavior_timeout capture_cookie target_username target_password].freeze
 
-
-
-
+  class << self
+    def error_classes
+      Spn2.constants.map { |e| Spn2.const_get(e) }.select { |e| e.is_a?(Class) && e < Exception }
+    end
 
-
-
-
+    def access_key
+      ENV.fetch('SPN2_ACCESS_KEY')
+    end
 
-
-
-
+    def secret_key
+      ENV.fetch('SPN2_SECRET_KEY')
+    end
 
-
-
-
+    def system_status
+      json get(url: "#{WEB_ARCHIVE}/save/status/system") # no auth
+    end
 
-
-
-
+    def user_status
+      json auth_get(url: "#{WEB_ARCHIVE}/save/status/user?t=#{DateTime.now.strftime('%Q').to_i}")
+    end
 
-
-
-    end
+    def save(url:, opts: {})
+      raise Spn2ErrorInvalidOption, "One or more invalid options: #{opts}" unless options_valid?(opts)
 
-
-
-    end
+      json = json(auth_post(url: "#{WEB_ARCHIVE}/save/#{url}", params: { url: url }.merge(opts)))
+      raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
 
-
-      { 'Authorization' => "LOW #{Spn2.access_key}:#{Spn2.secret_key}" }
-    end
+      raise Spn2ErrorFailedCapture, json.inspect unless json['job_id']
 
-
-
-
+      json
+    end
+    alias capture save
+
+    def status(job_ids:, outlinks: false)
+      params = status_params(job_ids: job_ids, outlinks: outlinks)
+      json = json(auth_post(url: "#{WEB_ARCHIVE}/save/status", params: params))
+      return json if json.is_a? Array # must be valid response
+
+      handle_status_errors(job_ids: job_ids, json: json, outlinks: outlinks)
+      json
+    end
+
+    private
+
+    def status_params(job_ids:, outlinks:)
+      return { job_ids: job_ids.join(',') } if job_ids.is_a?(Array)
+      return { job_id_outlinks: job_ids } if outlinks
+
+      { job_id: job_ids } # single job_id
+    end
+
+    def handle_status_errors(job_ids:, json:, outlinks:)
+      raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
+      raise Spn2ErrorNoOutlinks, json.inspect if outlinks
+      raise Spn2ErrorMissingKeys, json.inspect unless (ESSENTIAL_STATUS_KEYS - json.keys).empty?
+      raise Spn2Error, json.inspect if job_ids.is_a?(Array)
+    end
+
+    def auth_get(url:)
+      get(url: url, headers: accept_header.merge(auth_header))
+    end
+
+    def auth_post(url:, params: {})
+      post(url: url, headers: accept_header.merge(auth_header), params: params)
+    end
+
+    def accept_header
+      { Accept: 'application/json' }
+    end
+
+    def auth_header
+      { Authorization: "LOW #{Spn2.access_key}:#{Spn2.secret_key}" }
+    end
+
+    def doc(html_string)
+      Nokogiri::HTML html_string
+    end
+
+    def json(html_string)
+      JSON.parse(doc = doc(html_string))
+    rescue JSON::ParserError # an html response
+      parse_error_code_from_page_title(doc.title) if doc.title
+      parse_error_from_page_body(html_string)
+    end
+
+    def parse_error_code_from_page_title(title_string)
+      raise_code_response_error_if_code_in_string(title_string)
+      raise Spn2ErrorUnknownResponseCode, title_string # code found but doesn't match any known error classes
+    end
+
+    def parse_error_from_page_body(html_string)
+      h1_tag_text = h1_tag_text(html_string)
+      raise_code_response_error_if_code_in_string h1_tag_text
+      raise Spn2ErrorTooManyRequests if h1_tag_text == TOO_MANY_REQUESTS
+
+      raise Spn2ErrorUnknownResponse, html_string # fall through
+    end
+
+    def h1_tag_text(html_string)
+      doc(html_string).xpath('//h1')&.text || ''
+    end
+
+    def raise_code_response_error_if_code_in_string(string)
+      return unless ERROR_CODES.include? code = string.to_i
+
+      raise Spn2.const_get("Spn2Error#{code}")
+    end
 
-
-
+    def options_valid?(opts)
+      opts.keys.all? { |k| (BINARY_OPTS + OTHER_OPTS).include? k.to_s }
+    end
   end
 end
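Two small pieces of the new `lib/spn2.rb` can be tried without any network access: the option whitelist behind `save` and the `LOW access:secret` authorization header. The sketch below copies the option lists verbatim from the diff (including the spelling `delay_wb_availabilty` as released) and re-creates both helpers as top-level methods. The top-level placement and the demo key values are assumptions for illustration.

```ruby
# Option names copied verbatim from BINARY_OPTS and OTHER_OPTS in lib/spn2.rb (0.2.0).
BINARY_OPTS = %w[capture_all capture_outlinks capture_screenshot delay_wb_availabilty force_get skip_first_archive
                 outlinks_availability email_result].freeze
OTHER_OPTS = %w[if_not_archived_within js_behavior_timeout capture_cookie target_username target_password].freeze

# Same check as the private options_valid? helper: every key must be a known option.
def options_valid?(opts)
  opts.keys.all? { |k| (BINARY_OPTS + OTHER_OPTS).include?(k.to_s) }
end

# Same shape as the private auth_header helper.
# ENV.fetch raises KeyError when a variable is unset.
def auth_header
  { Authorization: "LOW #{ENV.fetch('SPN2_ACCESS_KEY')}:#{ENV.fetch('SPN2_SECRET_KEY')}" }
end

puts options_valid?({ capture_all: 1, capture_outlinks: 1 }) # known keys
puts options_valid?({ capture_everything: 1 })               # unknown key

ENV['SPN2_ACCESS_KEY'] = 'demo-access' # placeholder values for this sketch only
ENV['SPN2_SECRET_KEY'] = 'demo-secret'
p auth_header
```

In the gem itself, a failed option check raises `Spn2ErrorInvalidOption` before any request is sent.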
data/lib/spn2_errors.rb
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
# namespace
|
4
|
+
module Spn2
|
5
|
+
BAD_AUTH_MSG = 'You need to be logged in to use Save Page Now.'
|
6
|
+
ERROR_CODES = [400, 502].freeze
|
7
|
+
TOO_MANY_REQUESTS = 'Too Many Requests'
|
8
|
+
|
9
|
+
class Spn2Error < StandardError; end
|
10
|
+
class Spn2ErrorBadAuth < Spn2Error; end
|
11
|
+
class Spn2ErrorBadParams < Spn2Error; end
|
12
|
+
class Spn2ErrorFailedCapture < Spn2Error; end
|
13
|
+
class Spn2ErrorInvalidOption < Spn2Error; end
|
14
|
+
class Spn2ErrorMissingKeys < Spn2Error; end
|
15
|
+
class Spn2ErrorNoOutlinks < Spn2Error; end
|
16
|
+
class Spn2ErrorTooManyRequests < Spn2Error; end
|
17
|
+
class Spn2ErrorUnknownResponse < Spn2Error; end
|
18
|
+
class Spn2ErrorUnknownResponseCode < Spn2Error; end
|
19
|
+
ERROR_CODES.each { |i| const_set("Spn2Error#{i}", Class.new(Spn2Error)) }
|
20
|
+
end
|
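The new errors file creates one error class per HTTP code with `const_set`, and `Spn2.error_classes` collects every exception class defined in the namespace. The self-contained sketch below reproduces that pattern in a hypothetical `Demo` module, so it runs without the gem, and shows how such a list can be splatted into a `rescue` clause.

```ruby
# Hypothetical module mirroring the pattern in lib/spn2_errors.rb.
module Demo
  ERROR_CODES = [400, 502].freeze

  class DemoError < StandardError; end
  class DemoErrorBadAuth < DemoError; end

  # One subclass per code, e.g. Demo::DemoError400 and Demo::DemoError502.
  ERROR_CODES.each { |i| const_set("DemoError#{i}", Class.new(DemoError)) }

  # Same selection as Spn2.error_classes: constants that are Exception subclasses.
  def self.error_classes
    constants.map { |c| const_get(c) }.select { |c| c.is_a?(Class) && c < Exception }
  end
end

begin
  raise Demo.const_get('DemoError502'), 'bad gateway'
rescue *Demo.error_classes => e
  puts "rescued #{e.class}: #{e.message}" # → rescued Demo::DemoError502: bad gateway
end
```

Because every class in the diff inherits from `Spn2Error`, rescuing that base class alone should catch the same set; the splat form is useful mainly when only some of the classes are wanted.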
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: spn2
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - MatzFan
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-
+date: 2022-07-03 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: curb
@@ -80,6 +80,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '5.16'
+- !ruby/object:Gem::Dependency
+  name: minitest-parallel_fork
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -136,7 +150,7 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.6'
-description:
+description: Automate the process of saving web pages to archive.org
 email:
 executables: []
 extensions: []
@@ -152,8 +166,9 @@ files:
 - lib/curlable.rb
 - lib/spn2.rb
 - lib/spn2/version.rb
+- lib/spn2_errors.rb
 - sig/spn2.rbs
-homepage: https://gitlab.com/
+homepage: https://gitlab.com/matzfan/spn2
 licenses:
 - MIT
 metadata:
@@ -177,8 +192,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.3.
+rubygems_version: 3.3.17
 signing_key:
 specification_version: 4
-summary: Gem for the Save Page Now API of the Wayback Machine
+summary: Gem for the Save Page Now 2 API of the Wayback Machine
 test_files: []