sinew 2.0.4 → 2.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 7e7426a91f427a3c97969eb501ce0bd55a658ece54af0dfea994f8faffd479f7
-   data.tar.gz: baeb12b6af0fa2c5c11390de16ed1837458f52b4b00393dce8f9e66d4eb898a3
+   metadata.gz: b383fb9d0a1d57acfafd78d8e3ff0185b81acb5bcd368d4c0cca9a8999aa0a52
+   data.tar.gz: 0dede0f01c7a53056a38705c6fc134cd33a2f15868a9b4fee1b6f9fa85361d31
  SHA512:
-   metadata.gz: 1698acfc26dbab92c390cde0956a72011f22f8f9bb4c5ebb194d131ae0a8dfe6c58e2224d36e840167210c2bf472efd18fd8e0cc92b1c9640df590f1faf71473
-   data.tar.gz: e9a2688616dd866792cd286a1ebff094fe32ad9d8552ce77e7465648022fd65db83149ee558f88140f907d7a2da422c41312e8a32f73f0d1930f865862eab90f
+   metadata.gz: e867516dd43bed9f6dd524475c70b933b30029788521224eed08d269ea1264f5d52a3740e967a2cc61ec96861e1875ecaf2f43af403bd917cac05ce2fd394119
+   data.tar.gz: 719ab64ac523e9cf553171bf318a59938655e4a40fd4c6544ec5c5188c7a81227a30b9ba7c70a6ace90a41b476a62081f2ac4cc88a490d0be2e05ade7e8b3dce
data/README.md CHANGED
@@ -20,24 +20,24 @@ gem 'sinew'

  <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->

- * [Sinew 2](#sinew-2-may-2018)
- * [Quick Example](#quick-example)
- * [How it Works](#how-it-works)
- * [DSL Reference](#dsl-reference)
- * [Hints](#hints)
- * [Limitations](#limitations)
- * [Changelog](#changelog)
- * [License](#license)
+ - [Sinew 2](#sinew-2-may-2018)
+ - [Quick Example](#quick-example)
+ - [How it Works](#how-it-works)
+ - [DSL Reference](#dsl-reference)
+ - [Hints](#hints)
+ - [Limitations](#limitations)
+ - [Changelog](#changelog)
+ - [License](#license)

  ## Sinew 2 (May 2018)

  I am pleased to announce the release of Sinew 2.0, a complete rewrite of Sinew for the modern era. Enhancements include:

- * Remove dependencies on active_support, curl and tidy. We use HTTParty now.
- * Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
- * More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
- * New end-of-run report.
- * Tests, rubocop, vscode settings, travis, etc.
+ - Remove dependencies on active_support, curl and tidy. We use HTTParty now.
+ - Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
+ - More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
+ - New end-of-run report.
+ - Tests, rubocop, vscode settings, travis, etc.

  **Breaking change**

@@ -124,72 +124,82 @@ Because all requests are cached, you can run Sinew repeatedly with confidence. R

  #### Making requests

- * `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query.
- * `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
- * `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
- * `http(method, url, options = {})` - use this for more complex requests
+ - `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query.
+ - `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
+ - `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
+ - `http(method, url, options = {})` - use this for more complex requests

  #### Parsing the response

  These variables are set after each HTTP request.

- * `raw` - the raw response from the last request
- * `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
- * `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
- * `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
- * `json` - parse the response as JSON, with symbolized keys
- * `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
- * `uri` - the URI of the last request. This is useful for resolving relative URLs.
+ - `raw` - the raw response from the last request
+ - `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
+ - `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
+ - `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
+ - `json` - parse the response as JSON, with symbolized keys
+ - `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
+ - `uri` - the URI of the last request. This is useful for resolving relative URLs.

  #### Writing CSV

- * `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
- * `csv_emit(hash)` - append a row to the CSV file
+ - `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
+ - `csv_emit(hash)` - append a row to the CSV file

  ## Hints

  Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:

- * Sinew doesn't (yet) check robots.txt - please check it manually.
- * Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
- * In Chrome, `$` in the console is your friend.
- * Fallback to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
- * Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
- * Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
- * Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
+ - Sinew doesn't (yet) check robots.txt - please check it manually.
+ - Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
+ - In Chrome, `$` in the console is your friend.
+ - Fallback to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
+ - Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
+ - Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
+ - Don't be afraid to mix CSS selectors, regular expressions, and Ruby:

  ```ruby
  noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
  ```

- * Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
- * Run `sinew -v` to get a report on every `csv_emit`. Very handy.
- * Add the CSV files to your git repo. That way you can version them and get diffs!
+ - Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
+ - Run `sinew -v` to get a report on every `csv_emit`. Very handy.
+ - Add the CSV files to your git repo. That way you can version them and get diffs!

  ## Limitations

- * Caching is based on URL, so use caution with cookies and other forms of authentication
- * Almost no support for international (non-english) characters
+ - Caching is based on URL, so use caution with cookies and other forms of authentication
+ - Almost no support for international (non-english) characters

  ## Changelog

+ #### 2.0.5 (unreleased)
+
+ - Supports multiple proxies (`--proxy host1,host2,...`)
+
+ #### 2.0.4 (May 2018)
+
+ - Handle and cache more errors (too many redirects, connection failures, etc.)
+ - Support for adding uri.scheme in generate_cache_key
+ - Added status `code`, a peer to `uri`, `raw`, etc.
+
  #### 2.0.3 (May 2018)

- * &amp; now normalizes to & (not and)
+ - &amp; now normalizes to & (not and)

  #### 2.0.2 (May 2018)

- * Support for `--limit`, `--proxy` and the `xml` variable
- * Dedup - warn and ignore if row[:url] has already been emitted
- * Auto gunzip if contents are compressed
+ - Support for `--limit`, `--proxy` and the `xml` variable
+ - Dedup - warn and ignore if row[:url] has already been emitted
+ - Auto gunzip if contents are compressed

  #### 2.0.1 (May 2018)

- * Support for legacy cached `head` files from Sinew 1
+ - Support for legacy cached `head` files from Sinew 1

  #### 2.0.0 (May 2018)

- * Complete rewrite. See above.
+ - Complete rewrite. See above.

  #### 1.0.3 (June 2012)

@@ -15,12 +15,6 @@ module Sinew
    @runtime_options = RuntimeOptions.new
    @request_tm = Time.at(0)
    @request_count = 0
-
-   if options[:proxy]
-     addr, port = options[:proxy].split(':')
-     runtime_options.httparty_options[:http_proxyaddr] = addr
-     runtime_options.httparty_options[:http_proxyport] = port || 80
-   end
  end

  def run
26
20
  def run
@@ -105,6 +99,7 @@ module Sinew
    else
      "req #{request.uri}"
    end
+   msg = "#{msg} => #{request.proxy}" if request.proxy
    $stderr.puts msg
  end

@@ -24,19 +24,36 @@ module Sinew
    @cache_key = calculate_cache_key
  end

+ def proxy
+   @proxy ||= begin
+     if proxies = sinew.options[:proxy]
+       proxies.split(',').sample
+     end
+   end
+ end
+
  # run the request, return the result
  def perform
    validate!

-   # merge optons
-   options = self.options.merge(sinew.runtime_options.httparty_options)
+   party_options = options.dup
+
+   # merge proxy
+   if proxy = self.proxy
+     addr, port = proxy.split(':')
+     party_options[:http_proxyaddr] = addr
+     party_options[:http_proxyport] = port || 80
+   end
+
+   # now merge runtime_options
+   party_options = party_options.merge(sinew.runtime_options.httparty_options)

    # merge headers
    headers = sinew.runtime_options.headers
-   headers = headers.merge(options[:headers]) if options[:headers]
-   options[:headers] = headers
+   headers = headers.merge(party_options[:headers]) if party_options[:headers]
+   party_options[:headers] = headers

-   party_response = HTTParty.send(method, uri, options)
+   party_response = HTTParty.send(method, uri, party_options)
    Response.from_network(self, party_response)
  end

@@ -1,4 +1,4 @@
  module Sinew
    # Gem version
-   VERSION = '2.0.4'.freeze
+   VERSION = '2.0.5'.freeze
  end
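
Reviewer note on the proxy change in the request hunks above: the new behavior appears to be "pick one entry at random from the comma-separated `--proxy` value, then split it into host and port, defaulting the port to 80". The standalone sketch below restates that logic outside of Sinew so it can be tried in isolation. `pick_proxy` and its `rng:` keyword are hypothetical names introduced here for illustration; they are not part of Sinew or HTTParty. Only the two option keys (`:http_proxyaddr`, `:http_proxyport`) are taken from the diff.

```ruby
# Hypothetical helper (not Sinew's API): mirrors the proxy handling shown
# in the diff, as a pure function that is easy to test.
def pick_proxy(proxy_option, rng: Random.new)
  # No --proxy given (or an empty string): no proxy options at all.
  return nil if proxy_option.nil? || proxy_option.empty?

  # One entry is chosen at random from the comma-separated list.
  entry = proxy_option.split(',').sample(random: rng)

  # "host:port" => ["host", "port"]; a bare "host" leaves port as nil.
  addr, port = entry.split(':')

  # As in the diff, an explicit port stays a String while the default
  # is the Integer 80.
  { http_proxyaddr: addr, http_proxyport: port || 80 }
end

# Example call with made-up hosts; the chosen entry varies per run.
p pick_proxy('proxy1.example.com:8080,proxy2.example.com')
```

In the diff itself the choice is memoized per request (`@proxy ||= ...`), so the log line and the actual HTTParty call report the same proxy. Because the proxy options are merged before `runtime_options.httparty_options`, any proxy settings a recipe puts in the runtime options would take precedence over the sampled one.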
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: sinew
  version: !ruby/object:Gem::Version
-   version: 2.0.4
+   version: 2.0.5
  platform: ruby
  authors:
  - Adam Doppelt
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-05-24 00:00:00.000000000 Z
+ date: 2019-03-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print