spn2 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/.rubocop.yml +0 -3
- data/CHANGELOG.md +5 -2
- data/Guardfile +1 -0
- data/README.md +51 -4
- data/lib/spn2/version.rb +1 -1
- data/lib/spn2.rb +58 -24
- metadata +19 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: e968c75da93882e48ac210e17bb599497971ee1f065035758279529b23c53f1b
|
4
|
+
data.tar.gz: 9fa1b6f6125d9347d2418254b8e8838572fd8a4efcf0c6bbabc258f3784c097b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: bacfeda95f8a40e132496cb69078e8229767de20c99615db7da4bb1d5f35a9a4c7e3e688a1fe3d31d6dd9e5f026d0d6fc8e328aae90915de55e6810db675732f
|
7
|
+
data.tar.gz: e62fe104f074cb9ab5e4a85b89050e8dc0f198613dd9ff62fd56d3c645428e080699e14d57b8b6a655b68ed3be00704ff67ba5e9a58508083fe0f961cdc1ac55
|
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,3 @@
|
|
1
|
-
## [Unreleased]
|
2
|
-
|
3
1
|
## [0.1.0] - 2022-06-29
|
4
2
|
|
5
3
|
- Initial release
|
@@ -8,3 +6,8 @@
|
|
8
6
|
|
9
7
|
- Add error handling
|
10
8
|
- Add ability to add opts to Spn2.save
|
9
|
+
|
10
|
+
## [0.1.2] - 2022-07-02
|
11
|
+
|
12
|
+
- Add user_status
|
13
|
+
- Add status calls for multiple job_ids and outlinks
|
data/Guardfile
CHANGED
data/README.md
CHANGED
@@ -1,3 +1,4 @@
|
|
1
|
+
[](https://github.com/rubocop/rubocop)
|
1
2
|
[](https://badge.fury.io/rb/spn2)
|
2
3
|
|
3
4
|
# Spn2
|
@@ -37,19 +38,28 @@ Save (capture) a url in the Wayback Machine. This method returns the job_id in a
|
|
37
38
|
```rb
|
38
39
|
> Spn2.save(url: 'example.com') # returns a job_id
|
39
40
|
|
40
|
-
=> {job_id
|
41
|
+
=> {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"} # may include a "message" key too
|
41
42
|
```
|
42
43
|
Various options are available, as detailed in the [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit) in the section "Capture request". These may be passed like so:
|
43
44
|
```rb
|
44
45
|
> Spn2.save(url: 'example.com', opts: { capture_all: 1, capture_outlinks: 1 })
|
45
46
|
|
46
|
-
=> {url
|
47
|
+
=> {"url"=>"http://example.com","job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14"}
|
47
48
|
```
|
49
|
+
Page save failures raise an error that looks like this:
|
50
|
+
```rb
|
51
|
+
=> {"status"=>"error", "status_ext"=>"error:too-many-daily-captures", "message"=>"This URL has been already captured 10 times today.
|
52
|
+
Please try again tomorrow. Please email us at \"info@archive.org\" if you would like to discuss this more."} (Spn2::Spn2ErrorFailedCapture)
|
53
|
+
```
|
54
|
+
The key "status_ext" contains an explanatory message - see the API [specification](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit).
|
55
|
+
|
56
|
+
|
57
|
+
|
48
58
|
### View the status of a job
|
49
59
|
|
50
60
|
Use the job_id.
|
51
61
|
```rb
|
52
|
-
> Spn2.
|
62
|
+
> Spn2.status_job_id(job_id: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14')
|
53
63
|
|
54
64
|
=> {"counters"=>{"outlinks"=>1, "embeds"=>2}, "job_id"=>"spn2-9c17e047f58f9220a7008d4f18152fee4d111d14",
|
55
65
|
"original_url"=>"http://example.com/", "resources"=>["http://example.com/", "http://example.com/favicon.ico"],
|
@@ -58,6 +68,35 @@ Use the job_id.
|
|
58
68
|
```
|
59
69
|
"status" => "success" is what you are looking for.
|
60
70
|
|
71
|
+
Care is advised for domains/urls which are frequently saved into the Wayback Machine as the job_id is merely "spn2-" followed by a hash of the url\*. A status request will show the status of _the most recent capture by anyone_ of the url in question.
|
72
|
+
|
73
|
+
\* Usually an sha1 hash of the url in the form http://\<domain\>/\<path\>/ e.g:
|
74
|
+
```sh
|
75
|
+
$ echo "http://example.com/"|tr -d "\n"|shasum
|
76
|
+
9c17e047f58f9220a7008d4f18152fee4d111d14 -
|
77
|
+
```
|
78
|
+
|
79
|
+
The status of a comma-separated list of job_id's can be obtained with:
|
80
|
+
```rb
|
81
|
+
> Spn2.status_job_ids(job_ids: 'spn2-9c17e047f58f9220a7008d4f18152fee4d111d14,spn2-...')
|
82
|
+
|
83
|
+
=> [.. # an array of status hashes
|
84
|
+
```
|
85
|
+
|
86
|
+
Finally, the status of any outlinks captured by using the save option `capture_outlinks: 1` is available by supplying the parent job_id to:
|
87
|
+
```rb
|
88
|
+
> Spn2.status_job_id_outlinks(job_id: 'spn2-cce034d987e1d72d8cbf1770bcf99024fe20dddf')
|
89
|
+
|
90
|
+
=> [.. # an array of outlink job status hashes
|
91
|
+
```
|
92
|
+
### User status
|
93
|
+
|
94
|
+
Information about the user is available via:
|
95
|
+
```rb
|
96
|
+
> Spn2.user_status
|
97
|
+
=> {"daily_captures_limit"=>100000, "available"=>8, "processing"=>0, "daily_captures"=>10}
|
98
|
+
```
|
99
|
+
|
61
100
|
### System status
|
62
101
|
|
63
102
|
The status of Wayback Machine itself is available.
|
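Note on the job_id caveat in the README hunk above: the text says a job_id is usually "spn2-" followed by a SHA-1 hash of the normalised url. A minimal Ruby sketch of that derivation follows. `predicted_job_id` is a hypothetical helper and not part of the gem, and the pattern is copied from `JOB_ID_REGEXP` in lib/spn2.rb. For `http://example.com/` the README's own `shasum` example reports `9c17e047f58f9220a7008d4f18152fee4d111d14`.

```ruby
require 'digest'

# Same pattern as JOB_ID_REGEXP in lib/spn2.rb
JOB_ID_PATTERN = /^(spn2-([a-f]|\d){40})$/

# Hypothetical helper (not part of the gem): predicts the job_id that the
# README says is *usually* assigned to an already-normalised url.
def predicted_job_id(normalised_url)
  "spn2-#{Digest::SHA1.hexdigest(normalised_url)}"
end

job_id = predicted_job_id('http://example.com/')
puts job_id
puts job_id.match?(JOB_ID_PATTERN) # 'spn2-' plus 40 hex characters
```

Because the id depends only on the url, two users saving the same page would get the same job_id, which is why a status request reflects the most recent capture by anyone.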
@@ -67,11 +106,19 @@ The status of Wayback Machine itself is available.
|
|
67
106
|
```
|
68
107
|
### Error handling
|
69
108
|
|
70
|
-
To
|
109
|
+
To facilitate graceful error handling, a full list of all error classes is provided by:
|
71
110
|
```rb
|
72
111
|
> Spn2.error_classes
|
73
112
|
=> [Spn2::Spn2Error, Spn2::Spn2ErrorBadAuth,.. ..]
|
74
113
|
```
|
114
|
+
## Testing
|
115
|
+
|
116
|
+
Just run `bundle exec rake` to run the test suite.
|
117
|
+
|
118
|
+
Valid API keys must be held in SPN2_ACCESS_KEY and SPN2_SECRET_KEY for testing. Go to https://archive.org/account/s3.php to set up API keys if you need them. If you have your live keys stored in these env vars just do:
|
119
|
+
|
120
|
+
`export SPN2_ACCESS_KEY=<valid access test key> && export SPN2_SECRET_KEY=<valid secret test key>` immediately before the above command.
|
121
|
+
|
75
122
|
## Development
|
76
123
|
|
77
124
|
~~After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.~~
|
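Note on the "Error handling" section of the README above: one way `Spn2.error_classes` might be used is with a splatted `rescue`. The sketch below is self-contained, so it uses a stand-in module, `DemoSpn2`, which is an assumption for illustration. Only the error class names mirror lib/spn2.rb; the failing `save` is invented so the example runs without API keys or network access.

```ruby
# Stand-in for the gem so the sketch runs offline. Only the class names
# mirror lib/spn2.rb; the behaviour here is invented for illustration.
module DemoSpn2
  class Spn2Error < StandardError; end
  class Spn2ErrorBadAuth < Spn2Error; end
  class Spn2ErrorFailedCapture < Spn2Error; end

  def self.error_classes
    [Spn2Error, Spn2ErrorBadAuth, Spn2ErrorFailedCapture]
  end

  # Pretends every capture hits the daily limit.
  def self.save(url:)
    raise Spn2ErrorFailedCapture,
          %({"status"=>"error", "status_ext"=>"error:too-many-daily-captures", "url"=>"#{url}"})
  end
end

# Returns the save result, or nil when any listed error class is raised.
def try_save(url)
  DemoSpn2.save(url: url)
rescue *DemoSpn2.error_classes => e
  warn "capture failed (#{e.class}): #{e.message}"
  nil
end

result = try_save('example.com')
puts result.inspect
```

With the real gem the same shape would presumably be `rescue *Spn2.error_classes => e`. Errors outside that list still propagate.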
data/lib/spn2/version.rb
CHANGED
data/lib/spn2.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
|
+
require 'date'
|
3
4
|
require 'json'
|
4
5
|
require 'nokogiri'
|
5
6
|
|
@@ -7,20 +8,22 @@ require_relative 'curlable'
|
|
7
8
|
|
8
9
|
# Design decision to not use a class as only 'state' is in 2 env vars
|
9
10
|
module Spn2
|
10
|
-
extend Curlable
|
11
|
-
include Curlable
|
11
|
+
extend Curlable
|
12
12
|
|
13
|
+
BAD_AUTH_MSG = 'You need to be logged in to use Save Page Now.'
|
13
14
|
ERROR_CODES = [502].freeze
|
14
15
|
|
15
16
|
class Spn2Error < StandardError; end
|
16
17
|
class Spn2ErrorBadAuth < Spn2Error; end
|
17
|
-
class
|
18
|
-
class Spn2ErrorBadResponse < Spn2Error; end
|
18
|
+
class Spn2ErrorFailedCapture < Spn2Error; end
|
19
19
|
class Spn2ErrorInvalidOption < Spn2Error; end
|
20
|
+
class Spn2ErrorMissingKeys < Spn2Error; end
|
21
|
+
class Spn2ErrorNoOutlinks < Spn2Error; end
|
22
|
+
class Spn2ErrorTooManyRequests < Spn2Error; end
|
23
|
+
class Spn2ErrorUnknownResponse < Spn2Error; end
|
20
24
|
class Spn2ErrorUnknownResponseCode < Spn2Error; end
|
21
25
|
ERROR_CODES.each { |i| Spn2.const_set("Spn2Error#{i}", Class.new(Spn2Error)) }
|
22
26
|
|
23
|
-
BAD_AUTH_MSG = 'You need to be logged in to use Save Page Now.'
|
24
27
|
ESSENTIAL_STATUS_KEYS = %w[job_id resources status].freeze
|
25
28
|
JOB_ID_REGEXP = /^(spn2-([a-f]|\d){40})$/
|
26
29
|
WEB_ARCHIVE = 'https://web.archive.org'
|
@@ -35,37 +38,58 @@ module Spn2
|
|
35
38
|
end
|
36
39
|
|
37
40
|
def access_key
|
38
|
-
ENV.fetch('SPN2_ACCESS_KEY'
|
41
|
+
ENV.fetch('SPN2_ACCESS_KEY')
|
39
42
|
end
|
40
43
|
|
41
44
|
def secret_key
|
42
|
-
ENV.fetch('SPN2_SECRET_KEY'
|
45
|
+
ENV.fetch('SPN2_SECRET_KEY')
|
43
46
|
end
|
44
47
|
|
45
48
|
def system_status
|
46
49
|
json get(url: "#{WEB_ARCHIVE}/save/status/system") # no auth
|
47
50
|
end
|
48
51
|
|
52
|
+
def user_status
|
53
|
+
json auth_get(url: "#{WEB_ARCHIVE}/save/status/user?t=#{DateTime.now.strftime('%Q').to_i}")
|
54
|
+
end
|
55
|
+
|
49
56
|
def save(url:, opts: {})
|
50
57
|
raise Spn2ErrorInvalidOption, "One or more invalid options: #{opts}" unless options_valid?(opts)
|
51
58
|
|
52
|
-
|
53
|
-
raise Spn2ErrorBadAuth,
|
59
|
+
json = json(auth_post(url: "#{WEB_ARCHIVE}/save/#{url}", params: { url: url }.merge(opts)))
|
60
|
+
raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
|
54
61
|
|
55
|
-
raise
|
62
|
+
raise Spn2ErrorFailedCapture, json.inspect unless json['job_id']
|
56
63
|
|
57
|
-
|
64
|
+
json
|
58
65
|
end
|
59
66
|
alias capture save
|
60
67
|
|
61
|
-
def
|
62
|
-
|
63
|
-
raise Spn2ErrorBadAuth,
|
68
|
+
def status_job_id(job_id:)
|
69
|
+
json = json(auth_post(url: "#{WEB_ARCHIVE}/save/status", params: { job_id: job_id }))
|
70
|
+
raise Spn2ErrorBadAuth, json.inspect if json['message']&.== BAD_AUTH_MSG
|
64
71
|
|
65
|
-
raise
|
72
|
+
raise Spn2ErrorMissingKeys, json.inspect unless (ESSENTIAL_STATUS_KEYS - json.keys).empty?
|
66
73
|
|
67
|
-
|
74
|
+
json
|
68
75
|
end
|
76
|
+
alias status status_job_id
|
77
|
+
|
78
|
+
def status_job_ids(job_ids:)
|
79
|
+
json = json(auth_post(url: "#{WEB_ARCHIVE}/save/status", params: { job_ids: job_ids }))
|
80
|
+
raise Spn2Error, json.inspect unless json.is_a? Array
|
81
|
+
|
82
|
+
json
|
83
|
+
end
|
84
|
+
alias statuses status_job_ids
|
85
|
+
|
86
|
+
def status_job_id_outlinks(job_id:)
|
87
|
+
json = json(auth_post(url: "#{WEB_ARCHIVE}/save/status", params: { job_id_outlinks: job_id }))
|
88
|
+
raise Spn2ErrorNoOutlinks, json.inspect unless json.is_a? Array
|
89
|
+
|
90
|
+
json
|
91
|
+
end
|
92
|
+
alias status_outlinks status_job_id_outlinks
|
69
93
|
|
70
94
|
private
|
71
95
|
|
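Note on `user_status` in the hunk above: the `t` query parameter is built with `DateTime.now.strftime('%Q').to_i`, which yields milliseconds since the Unix epoch. The diff does not state what `t` is for (it may be a cache-buster, but that is an assumption). The sketch below shows the value and a `Time`-based equivalent, which is an alternative and not what the gem does.

```ruby
require 'date'

# Milliseconds since the Unix epoch, as built in Spn2.user_status
from_datetime = DateTime.now.strftime('%Q').to_i

# An equivalent using Time only (an alternative, not what the gem does)
from_time = (Time.now.to_f * 1000).to_i

close_enough = (from_time - from_datetime).abs < 5_000
puts from_datetime
puts close_enough
```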
@@ -85,19 +109,29 @@ module Spn2
|
|
85
109
|
{ Authorization: "LOW #{Spn2.access_key}:#{Spn2.secret_key}" }
|
86
110
|
end
|
87
111
|
|
88
|
-
def
|
89
|
-
|
90
|
-
|
91
|
-
raise Spn2ErrorBadResponse, "No title in: #{html_string}" unless (title = doc.title)
|
112
|
+
def doc(html_string)
|
113
|
+
Nokogiri::HTML html_string
|
114
|
+
end
|
92
115
|
|
93
|
-
|
116
|
+
def json(html_string)
|
117
|
+
JSON.parse(doc = doc(html_string))
|
118
|
+
rescue JSON::ParserError # an html response & therefore an error
|
119
|
+
parse_error_code_from_page_title(doc.title) if doc.title
|
120
|
+
parse_error_from_page_body(html_string) # if no title parse body
|
94
121
|
end
|
95
122
|
|
96
|
-
def parse_error_code_from_page_title(
|
97
|
-
code =
|
123
|
+
def parse_error_code_from_page_title(title_string)
|
124
|
+
code = title_string.to_i
|
98
125
|
raise Spn2.const_get("Spn2Error#{code}") if ERROR_CODES.include? code
|
99
126
|
|
100
|
-
raise Spn2ErrorUnknownResponseCode
|
127
|
+
raise Spn2ErrorUnknownResponseCode
|
128
|
+
end
|
129
|
+
|
130
|
+
def parse_error_from_page_body(html_string)
|
131
|
+
h1 = doc(html_string).xpath('//h1')
|
132
|
+
raise Spn2ErrorTooManyRequests if !h1.empty? && h1.text == 'Too Many Requests'
|
133
|
+
|
134
|
+
raise Spn2ErrorUnknownResponse, html_string # fall through
|
101
135
|
end
|
102
136
|
|
103
137
|
def options_valid?(opts)
|
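Note on the error helpers in the hunk above: `ERROR_CODES.each { ... const_set ... }` defines one error class per known HTTP code, and `parse_error_code_from_page_title` relies on `String#to_i` reading only the leading digits of the page title. A self-contained sketch of that pattern follows. The `DemoErrors` module, its class names, and the title text `'502 Bad Gateway'` are assumptions for illustration, not the gem's code.

```ruby
module DemoErrors
  ERROR_CODES = [502].freeze

  class BaseError < StandardError; end
  class UnknownResponseCode < BaseError; end

  # One subclass per known code, e.g. DemoErrors::Error502
  ERROR_CODES.each { |i| const_set("Error#{i}", Class.new(BaseError)) }

  def self.raise_for_title(title_string)
    code = title_string.to_i # leading digits only: '502 Bad Gateway' => 502
    raise const_get("Error#{code}") if ERROR_CODES.include? code

    raise UnknownResponseCode, title_string
  end
end

begin
  DemoErrors.raise_for_title('502 Bad Gateway')
rescue DemoErrors::BaseError => e
  puts e.class # DemoErrors::Error502
end
```

A title with no leading digits gives `to_i == 0`, which is not in `ERROR_CODES`, so it falls through to the unknown-code error.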
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: spn2
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- MatzFan
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2022-
|
11
|
+
date: 2022-07-02 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: curb
|
@@ -80,6 +80,20 @@ dependencies:
|
|
80
80
|
- - "~>"
|
81
81
|
- !ruby/object:Gem::Version
|
82
82
|
version: '5.16'
|
83
|
+
- !ruby/object:Gem::Dependency
|
84
|
+
name: minitest-parallel_fork
|
85
|
+
requirement: !ruby/object:Gem::Requirement
|
86
|
+
requirements:
|
87
|
+
- - "~>"
|
88
|
+
- !ruby/object:Gem::Version
|
89
|
+
version: '1.2'
|
90
|
+
type: :development
|
91
|
+
prerelease: false
|
92
|
+
version_requirements: !ruby/object:Gem::Requirement
|
93
|
+
requirements:
|
94
|
+
- - "~>"
|
95
|
+
- !ruby/object:Gem::Version
|
96
|
+
version: '1.2'
|
83
97
|
- !ruby/object:Gem::Dependency
|
84
98
|
name: rake
|
85
99
|
requirement: !ruby/object:Gem::Requirement
|
@@ -136,7 +150,7 @@ dependencies:
|
|
136
150
|
- - "~>"
|
137
151
|
- !ruby/object:Gem::Version
|
138
152
|
version: '0.6'
|
139
|
-
description:
|
153
|
+
description: Automate the process of saving web pages to archive.org
|
140
154
|
email:
|
141
155
|
executables: []
|
142
156
|
extensions: []
|
@@ -153,7 +167,7 @@ files:
|
|
153
167
|
- lib/spn2.rb
|
154
168
|
- lib/spn2/version.rb
|
155
169
|
- sig/spn2.rbs
|
156
|
-
homepage: https://gitlab.com/
|
170
|
+
homepage: https://gitlab.com/matzfan/spn2
|
157
171
|
licenses:
|
158
172
|
- MIT
|
159
173
|
metadata:
|
@@ -180,5 +194,5 @@ requirements: []
|
|
180
194
|
rubygems_version: 3.3.17
|
181
195
|
signing_key:
|
182
196
|
specification_version: 4
|
183
|
-
summary: Gem for the Save Page Now API of the Wayback Machine
|
197
|
+
summary: Gem for the Save Page Now 2 API of the Wayback Machine
|
184
198
|
test_files: []
|