httpdisk 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 0c47ec4fda68047f57e8348746cf7e08467151a4b3d193cce16652980b5b8a47
+   data.tar.gz: 158e71dc98ba8a954eb3e744140a6e1f468de06e54a482690e7347d4b8750153
+ SHA512:
+   metadata.gz: 4e261b58f3c1246dec8ab9ab5732c06c595af86b27c14866045fb9992983b1f8c89d93f788272c2331ded4f5424e424ebdf5ae0bdc886e95537e5696ad2ea04b
+   data.tar.gz: 9947c8fd27c4f7dbb98b9eb9c3aca086d7c0041ec51353b8163371d89b98c745592f764cfb05074daff0c8e9a6e9500863bf16d5f200b54e92e61ccc9006eaca
@@ -0,0 +1,26 @@
+ name: test
+
+ on:
+   push:
+     paths-ignore:
+       - '**.md'
+   pull_request:
+     paths-ignore:
+       - '**.md'
+   workflow_dispatch:
+
+ jobs:
+   test:
+     strategy:
+       max-parallel: 3
+       matrix:
+         os: [ubuntu, macos]
+         ruby-version: [head, 3.0, 2.7]
+     runs-on: ${{ matrix.os }}-latest
+     steps:
+       - uses: actions/checkout@v2
+       - uses: ruby/setup-ruby@v1
+         with:
+           ruby-version: ${{ matrix.ruby-version }}
+       - run: bundle install
+       - run: bundle exec rake test
data/.gitignore ADDED
@@ -0,0 +1,3 @@
+ .ruby-version
+ .vscode
+ *.gem
data/Gemfile ADDED
@@ -0,0 +1,10 @@
+ source 'https://rubygems.org'
+ gemspec
+
+ group :development, :test do
+   gem 'minitest'
+   gem 'mocha'
+   gem 'pry'
+   gem 'rake'
+   gem 'webmock'
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,69 @@
+ PATH
+   remote: .
+   specs:
+     httpdisk (0.1.0)
+       faraday (~> 1.4)
+       faraday-cookie_jar (~> 0.0)
+       faraday_middleware (~> 1.0)
+       slop (~> 4.8)
+
+ GEM
+   remote: https://rubygems.org/
+   specs:
+     addressable (2.7.0)
+       public_suffix (>= 2.0.2, < 5.0)
+     coderay (1.1.3)
+     crack (0.4.5)
+       rexml
+     domain_name (0.5.20190701)
+       unf (>= 0.0.5, < 1.0.0)
+     faraday (1.4.1)
+       faraday-excon (~> 1.1)
+       faraday-net_http (~> 1.0)
+       faraday-net_http_persistent (~> 1.1)
+       multipart-post (>= 1.2, < 3)
+       ruby2_keywords (>= 0.0.4)
+     faraday-cookie_jar (0.0.7)
+       faraday (>= 0.8.0)
+       http-cookie (~> 1.0.0)
+     faraday-excon (1.1.0)
+     faraday-net_http (1.0.1)
+     faraday-net_http_persistent (1.1.0)
+     faraday_middleware (1.0.0)
+       faraday (~> 1.0)
+     hashdiff (1.0.1)
+     http-cookie (1.0.3)
+       domain_name (~> 0.5)
+     method_source (1.0.0)
+     minitest (5.14.4)
+     mocha (1.11.2)
+     multipart-post (2.1.1)
+     pry (0.13.1)
+       coderay (~> 1.1)
+       method_source (~> 1.0)
+     public_suffix (4.0.6)
+     rake (13.0.3)
+     rexml (3.2.5)
+     ruby2_keywords (0.0.4)
+     slop (4.8.2)
+     unf (0.1.4)
+       unf_ext
+     unf_ext (0.0.7.7)
+     webmock (3.12.2)
+       addressable (>= 2.3.6)
+       crack (>= 0.3.2)
+       hashdiff (>= 0.4.0, < 2.0.0)
+
+ PLATFORMS
+   ruby
+
+ DEPENDENCIES
+   httpdisk!
+   minitest
+   mocha
+   pry
+   rake
+   webmock
+
+ BUNDLED WITH
+    2.1.4
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2021 gurgeous
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,179 @@
+ [![Build Status](https://github.com/gurgeous/httpdisk/workflows/test/badge.svg?branch=main)](https://github.com/gurgeous/httpdisk/actions)
+
+ ![logo](logo.svg)
+
+ # httpdisk
+
+ httpdisk is an aggressive disk cache built on top of [Faraday](https://lostisland.github.io/faraday/). It's primarily used for crawling, and will aggressively cache all requests including POSTs and transient errors.
+
+ ## Installation
+
+ ```sh
+ # install gem
+ $ gem install httpdisk
+
+ # or add to your Gemfile
+ gem 'httpdisk'
+ ```
+
+ ## Quick Start
+
+ ```ruby
+ require 'httpdisk'
+
+ # create a new Faraday client
+ faraday = Faraday.new do
+   _1.use :httpdisk
+ end
+
+ response = faraday.get('https://google.com') # read from network
+ response = faraday.get('https://google.com') # read from ~/httpdisk/google.com/...
+ ```
+
+ httpdisk includes a handy command that works like `curl`:
+
+ ```sh
+ # cache miss, read from network
+ $ httpdisk google.com
+
+ # cache hit, read from ~/httpdisk/google.com/...
+ $ httpdisk google.com
+
+ # supports many curl flags
+ $ httpdisk -A test-agent --proxy localhost:8080 --output tmp.html twitter.com
+ ```
+
+ ## Faraday & httpdisk
+
+ [Faraday](https://lostisland.github.io/faraday/) is a popular Ruby HTTP client. Faraday uses a stack of middleware to process each request, similar to the way Rack works deep inside Rails or Sinatra. httpdisk is Faraday middleware - it processes requests to look for cached responses on disk. Faraday's [usage page](https://lostisland.github.io/faraday/usage/) is a good place to learn more about Faraday.
+
+ The simplest possible setup for httpdisk looks like this:
+
+ ```ruby
+ faraday = Faraday.new do
+   _1.use :httpdisk
+ end
+ faraday.get(...)
+ ```
+
+ For serious crawling, you probably want a more robust middleware stack:
+
+ ```ruby
+ faraday = Faraday.new do
+   _1.options.timeout = 10           # lower the timeout
+   _1.use :cookie_jar                # cookie support
+   _1.request :url_encoded           # auto-encode form bodies
+   _1.response :json                 # auto-decode JSON responses
+   _1.response :follow_redirects     # follow redirects (should be above httpdisk)
+   _1.use :httpdisk
+   _1.request :retry                 # retry failed responses (should be below httpdisk)
+ end
+ faraday.get(...)
+ ```
+
+ You may want to experiment with the options for [:retry](https://lostisland.github.io/faraday/middleware/retry), to retry a
+ broader set of transient errors. See [examples.rb](https://github.com/gurgeous/httpdisk/blob/main/examples.rb) for more ideas.
+
+ ## Disk Cache
+
+ httpdisk calculates a canonical cache key for each request. The key consists of the HTTP method, URL, sorted query, and sorted body if possible. The MD5 digest of the key determines the path of each file in the cache. Try `httpdisk --status` to see it in action:
+
+ ```sh
+ $ httpdisk --status "google.com?q=ruby"
+ url: "http://google.com/?q=ruby"
+ status: "miss"
+ key: "GET http://google.com?q=ruby"
+ digest: "0e37f96800a55958fa6029283c78f672"
+ path: "httpdisk/google.com/0e3/7f96800a55958fa6029283c78f672"
+ ```
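A minimal sketch of how a key and path of this shape could be derived, using only Ruby's standard library. The helper names `sketch_cache_key` and `sketch_cache_path` are hypothetical and are not part of httpdisk; the real canonicalization (body handling in particular) may differ. The path layout (host, first three digest characters, remaining characters) is inferred from the `--status` output above.

```ruby
require 'digest'
require 'uri'

# Hypothetical helper (not httpdisk's API): build "METHOD url?sorted_query".
def sketch_cache_key(method, url)
  uri = URI.parse(url)
  # Sort query pairs so that ?b=2&a=1 and ?a=1&b=2 map to the same key.
  query = URI.decode_www_form(uri.query.to_s).sort.map { |k, v| "#{k}=#{v}" }.join('&')
  path = uri.path == '/' ? '' : uri.path
  key = "#{method.to_s.upcase} #{uri.scheme}://#{uri.host}#{path}"
  key += "?#{query}" unless query.empty?
  key
end

# Hypothetical helper: host/first-3-hex-chars/remaining-29-hex-chars,
# relative to the cache directory.
def sketch_cache_path(key, host)
  digest = Digest::MD5.hexdigest(key)
  File.join(host, digest[0, 3], digest[3..])
end

key = sketch_cache_key(:get, 'http://google.com/?q=ruby')
puts key
puts sketch_cache_path(key, 'google.com')
```

Splitting the digest after three characters keeps any single directory from accumulating every cached file for a host.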
+
+ EVERY response will be cached on disk, including POSTs. By default, the cache will be placed at `~/httpdisk` and cached responses never expire. Some examples:
+
+ ```ruby
+ faraday.get("http://www.google.com", nil, { "User-Agent": "test-agent" })
+ faraday.get("http://www.google.com", { "q": "ruby" })
+ faraday.post("http://httpbin.org/post", "name=hello")
+ ```
+
+ This will populate the cache:
+
+ ```sh
+ $ cd ~/httpdisk
+ $ find . -type f
+ ./google.com/5eb/fc70198242876f5e83a67253663e9
+ ./google.com/6d0/52ac9a33d25065fc9f405100f3741
+ ./httpbin.org/88f/7b2bc35cc3759c9905c4de1dbf981
+
+ $ gzcat google.com/5eb/fc70198242876f5e83a67253663e9
+ # GET http://www.google.com
+ HTTPDISK 200 OK
+ date: Mon, 19 Apr 2021 18:40:01 GMT
+ expires: -1
+ cache-control: private, max-age=0
+ ...
+ ```
+
+ ## Aggressive Caching
+
+ httpdisk caches all responses. POST responses are cached, along with 500 responses and other HTTP errors. HTTP response headers that typically control caching are completely ignored. We also cache many exceptions like connection refused, timeout, ssl error, etc. These are returned as responses with HTTP status code 999.
+
+ In general, if you make a request it will be cached regardless of the outcome.
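Because failures come back as ordinary responses, calling code may want to tell the three outcomes apart by status. The sketch below is one way to do that; `describe_outcome` and `ERROR_STATUS` are hypothetical names for illustration, and only the 999 value comes from the README text above.

```ruby
# Status the README says is used for cached exceptions
# (connection refused, timeout, SSL error, ...).
ERROR_STATUS = 999

# Hypothetical helper, not part of httpdisk: classify a response status.
def describe_outcome(status)
  if status == ERROR_STATUS
    :request_failed # the request never produced a real HTTP response
  elsif status >= 400
    :http_error     # the server answered with an error
  else
    :ok
  end
end

[200, 404, 500, 999].each do |status|
  puts "#{status}: #{describe_outcome(status)}"
end
```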
+
+ ## Configuration
+
+ httpdisk supports a few options:
+
+ - `dir:` location for disk cache, defaults to `~/httpdisk`
+ - `expires_in:` when to expire cached requests, default is nil (never expire)
+ - `force:` don't read anything from cache (but still write)
+ - `force_errors:` don't read errors from cache (but still write)
+
+ Pass these in when setting up Faraday:
+
+ ```ruby
+ faraday = Faraday.new do
+   _1.use :httpdisk, expires_in: 7*24*60*60, force: true
+ end
+ ```
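The way these options interact on the read side can be sketched as a single predicate. This is an illustration of the documented behavior, not httpdisk's code: the name `read_from_cache?`, the `age` argument (seconds since the entry was written), and the rule that a status of 400 or above counts as an error are all assumptions.

```ruby
# Hypothetical predicate: should a cached entry be served for this request?
# Writing to the cache is unaffected by any of these options.
def read_from_cache?(status:, age:, expires_in: nil, force: false, force_errors: false)
  return false if force                          # force: never read
  return false if force_errors && status >= 400  # force_errors: skip cached errors
  return false if expires_in && age > expires_in # expired entry
  true                                           # expires_in nil means never expire
end

one_week = 7 * 24 * 60 * 60
puts read_from_cache?(status: 200, age: 60, expires_in: one_week)
puts read_from_cache?(status: 500, age: 60, force_errors: true)
```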
+
+ ## Command Line
+
+ The `httpdisk` command works like `curl` and supports some of curl's popular flags. Exit code 1 indicates an HTTP response code >= 400 or a failed request.
+
+ ```
+ $ httpdisk --help
+ httpdisk [options] [url]
+ Similar to curl:
+     -d, --data          HTTP POST data
+     -H, --header        pass custom header(s) to server
+     -i, --include       include response headers in the output
+     -m, --max-time      maximum time allowed for the transfer
+     -o, --output        write to file instead of stdout
+     -x, --proxy         use host[:port] as proxy
+     -X, --request       HTTP method to use
+     --retry             retry request if problems occur
+     -s, --silent        silent mode (don't print errors)
+     -A, --user-agent    send User-Agent to server
+ Specific to httpdisk:
+     --dir               httpdisk cache directory (defaults to ~/httpdisk)
+     --expires           when to expire cached requests (ex: 1h, 2d, 3w)
+     --force             don't read anything from cache (but still write)
+     --force-errors      don't read errors from cache (but still write)
+     --status            show status for a url in the cache
+     --version           show version
+     --help              show this help
+ ```
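The `--expires` flag takes short durations such as `1h`, `2d`, or `3w`, while the `expires_in:` option takes seconds. A small converter between the two might look like the sketch below. `parse_expires` and `EXPIRES_UNITS` are hypothetical names, and the `s` and `m` suffixes are assumptions; the help text only shows `h`, `d`, and `w`.

```ruby
# Seconds per unit. Only h, d and w appear in the help text;
# s and m are assumed here for completeness.
EXPIRES_UNITS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400, 'w' => 604_800 }.freeze

# Hypothetical helper: convert "1h", "2d", "3w" into a number of seconds.
def parse_expires(str)
  match = /\A(\d+)([smhdw])\z/.match(str)
  raise ArgumentError, "invalid duration: #{str.inspect}" unless match

  match[1].to_i * EXPIRES_UNITS.fetch(match[2])
end

puts parse_expires('1h') # 3600
puts parse_expires('2d') # 172800
puts parse_expires('3w') # 1814400
```

Anchoring the pattern with `\A` and `\z` rejects inputs like `1h30m` instead of silently reading only part of them.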
+
+ ## Limitations & Gotchas
+
+ - Transient errors are cached. This is appropriate for many use cases (like crawling) but can be confusing. Use `httpdisk --status` to debug.
+ - There are no built-in mechanisms to clean up or limit the size of the cache. Use `rm`.
+ - For best results the `:follow_redirects` middleware should be listed _above_ httpdisk. That way each redirect request will be cached.
+ - For best results the `:retry` middleware should be listed _below_ httpdisk. That way retries will complete before we cache.
+ - httpdisk does not work with Faraday's parallel mode or `on_complete`.
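The ordering advice for `:retry` can be illustrated with a toy middleware stack. This is not Faraday and not httpdisk: `retry_layer`, `cache_layer`, and `flaky_network` are hypothetical stand-ins, statuses are plain integers, and a layer listed "above" another simply wraps it.

```ruby
# Toy retry layer: call the wrapped app up to `attempts` times
# and stop as soon as a status below 400 comes back.
def retry_layer(app, attempts: 3)
  lambda do |request|
    status = nil
    attempts.times do
      status = app.call(request)
      break if status < 400
    end
    status
  end
end

# Toy cache layer: remember the first result for each request.
def cache_layer(app, store)
  ->(request) { store[request] ||= app.call(request) }
end

# Fake network that fails twice, then succeeds.
def flaky_network
  calls = 0
  ->(_request) { (calls += 1) < 3 ? 500 : 200 }
end

request = 'GET http://example.com'

# Cache above retry: retries finish first, the final result is cached.
good = cache_layer(retry_layer(flaky_network), {})
puts good.call(request)

# Retry above cache: the first failure is cached and every retry hits it.
bad = retry_layer(cache_layer(flaky_network, {}))
puts bad.call(request)
```

In the second stack every retry is answered by the cache, so the transient failure becomes the stored result, which is the situation the bullet above warns about.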
+
+ ## Changelog
+
+ #### 0.1 - April 2021
+
+ - Original release
data/Rakefile ADDED
@@ -0,0 +1,47 @@
+ require 'bundler/setup'
+ require 'rake/testtask'
+
+ # load the spec, we use it below
+ spec = Gem::Specification.load('httpdisk.gemspec')
+
+ #
+ # testing
+ # don't forget about TESTOPTS="--verbose" rake
+ #
+
+ # test (default)
+ Rake::TestTask.new { _1.libs << 'test' }
+ task default: :test
+
+ # Watch files, run tests whenever something changes
+ task :watch do
+   system('find . | entr -c rake test')
+ end
+
+ #
+ # pry
+ #
+
+ task :pry do
+   system 'pry -I lib -r httpdisk.rb'
+ end
+
+ #
+ # gem
+ #
+
+ task :build do
+   system('gem build --quiet httpdisk.gemspec', exception: true)
+ end
+
+ task install: :build do
+   system("gem install --quiet httpdisk-#{spec.version}.gem", exception: true)
+ end
+
+ task release: :build do
+   raise "looks like git isn't clean" unless `git status --porcelain`.empty?
+
+   system("git tag -a #{spec.version} -m 'Tagging #{spec.version}'", exception: true)
+   system('git push --tags', exception: true)
+   system("gem push httpdisk-#{spec.version}.gem", exception: true)
+ end
data/bin/httpdisk ADDED
@@ -0,0 +1,41 @@
+ #!/usr/bin/env ruby
+
+ #
+ # Main bin. Most of the interesting stuff is in HTTPDisk, for ease of testing.
+ #
+
+ $LOAD_PATH.unshift(File.join(__dir__, '../lib'))
+
+ def puts_error(s, exit: false)
+   $stderr.puts "httpdisk: #{s}"
+ end
+
+ #
+ # Load the bare minimum and parse args with slop. We do this separately for speed.
+ #
+
+ require 'httpdisk/cli_slop'
+ begin
+   slop = HTTPDisk::CliSlop.slop(ARGV)
+ rescue Slop::Error => e
+   puts_error(e) if e.message != ''
+   puts_error("try 'httpdisk --help' for more information")
+   exit 1
+ end
+
+ #
+ # now load everything and run
+ #
+
+ require 'httpdisk'
+ cli = HTTPDisk::Cli.new(slop)
+ begin
+   cli.run
+ rescue StandardError => e
+   puts_error(e) if !cli.options[:silent]
+   if ENV['HTTPDISK_DEBUG']
+     $stderr.puts
+     $stderr.puts e.backtrace.join("\n")
+   end
+   exit 1
+ end
data/examples.rb ADDED
@@ -0,0 +1,117 @@
+ #!/usr/bin/env ruby
+
+ $LOAD_PATH.unshift(File.join(__dir__, 'lib'))
+
+ require 'httpdisk'
+ require 'json'
+
+ class Examples
+   #
+   # Very simple example. The only middleware is httpdisk.
+   #
+
+   def simple
+     faraday = Faraday.new do
+       _1.use :httpdisk, force: true
+     end
+
+     faraday.get('http://www.google.com', nil, { "User-Agent": 'test-agent' })
+     faraday.get('http://www.google.com', { q: 'ruby' })
+     faraday.post('http://httpbin.org/post', 'name=hello')
+     exit
+
+     3.times { puts }
+     response = faraday.get('http://httpbingo.org/get')
+     puts response.env.url
+     puts JSON.pretty_generate(JSON.parse(response.body))
+   end
+
+   #
+   # Complete Faraday stack with cookies, redirects, retries, form encoding &
+   # JSON response parsing.
+   #
+
+   def better
+     faraday = Faraday.new do
+       # options
+       _1.headers['User-Agent'] = 'HTTPDisk'
+       _1.params.update(hello: 'world')
+       _1.options.timeout = 10
+
+       # middleware
+       _1.use :cookie_jar
+       _1.request :url_encoded
+       _1.response :json
+       _1.response :follow_redirects # must come before httpdisk
+
+       # httpdisk
+       _1.use :httpdisk
+
+       # retries (must come after httpdisk)
+       retry_options = {
+         methods: %w[delete get head options patch post put trace],
+         retry_statuses: (400..600).to_a,
+         retry_if: ->(_env, _err) { true },
+       }.freeze
+       _1.request :retry, retry_options
+     end
+
+     # get w/ params
+     3.times { puts }
+     response = faraday.get('http://httpbingo.org/get', { q: 'query' })
+     puts response.env.url
+     puts JSON.pretty_generate(response.body)
+
+     # post w/ encoded form body
+     3.times { puts }
+     response = faraday.post('http://httpbingo.org/post', 'a=1&b=2')
+     puts response.env.url
+     puts JSON.pretty_generate(response.body)
+
+     # post w/ auto-encoded form hash
+     3.times { puts }
+     response = faraday.post('http://httpbingo.org/post', { input: 'body' })
+     puts response.env.url
+     puts JSON.pretty_generate(response.body)
+   end
+
+   #
+   # Complete Faraday stack with cookies, redirects, retries, JSON encoding &
+   # JSON response parsing.
+   #
+
+   def json
+     faraday = Faraday.new do
+       # options
+       _1.headers['User-Agent'] = 'HTTPDisk'
+       _1.params.update(hello: 'world')
+       _1.options.timeout = 10
+
+       # middleware
+       _1.use :cookie_jar
+       _1.request :json
+       _1.response :json
+       _1.response :follow_redirects # must come before httpdisk
+
+       # httpdisk
+       _1.use :httpdisk
+
+       # retries (must come after httpdisk)
+       retry_options = {
+         methods: %w[delete get head options patch post put trace],
+         retry_statuses: (400..600).to_a,
+         retry_if: ->(_env, _err) { true },
+       }.freeze
+       _1.request :retry, retry_options
+     end
+
+     3.times { puts }
+     response = faraday.post('http://httpbingo.org/post', { this_is: [ 'json' ] })
+     puts response.env.url
+     puts JSON.pretty_generate(response.body)
+   end
+ end
+
+ Examples.new.simple
+ Examples.new.better
+ Examples.new.json