sinew 2.0.4 → 2.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 7e7426a91f427a3c97969eb501ce0bd55a658ece54af0dfea994f8faffd479f7
-   data.tar.gz: baeb12b6af0fa2c5c11390de16ed1837458f52b4b00393dce8f9e66d4eb898a3
+   metadata.gz: b383fb9d0a1d57acfafd78d8e3ff0185b81acb5bcd368d4c0cca9a8999aa0a52
+   data.tar.gz: 0dede0f01c7a53056a38705c6fc134cd33a2f15868a9b4fee1b6f9fa85361d31
  SHA512:
-   metadata.gz: 1698acfc26dbab92c390cde0956a72011f22f8f9bb4c5ebb194d131ae0a8dfe6c58e2224d36e840167210c2bf472efd18fd8e0cc92b1c9640df590f1faf71473
-   data.tar.gz: e9a2688616dd866792cd286a1ebff094fe32ad9d8552ce77e7465648022fd65db83149ee558f88140f907d7a2da422c41312e8a32f73f0d1930f865862eab90f
+   metadata.gz: e867516dd43bed9f6dd524475c70b933b30029788521224eed08d269ea1264f5d52a3740e967a2cc61ec96861e1875ecaf2f43af403bd917cac05ce2fd394119
+   data.tar.gz: 719ab64ac523e9cf553171bf318a59938655e4a40fd4c6544ec5c5188c7a81227a30b9ba7c70a6ace90a41b476a62081f2ac4cc88a490d0be2e05ade7e8b3dce
data/README.md CHANGED
@@ -20,24 +20,24 @@ gem 'sinew'
 
  <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->
 
- * [Sinew 2](#sinew-2-may-2018)
- * [Quick Example](#quick-example)
- * [How it Works](#how-it-works)
- * [DSL Reference](#dsl-reference)
- * [Hints](#hints)
- * [Limitations](#limitations)
- * [Changelog](#changelog)
- * [License](#license)
+ - [Sinew 2](#sinew-2-may-2018)
+ - [Quick Example](#quick-example)
+ - [How it Works](#how-it-works)
+ - [DSL Reference](#dsl-reference)
+ - [Hints](#hints)
+ - [Limitations](#limitations)
+ - [Changelog](#changelog)
+ - [License](#license)
 
  ## Sinew 2 (May 2018)
 
  I am pleased to announce the release of Sinew 2.0, a complete rewrite of Sinew for the modern era. Enhancements include:
 
- * Remove dependencies on active_support, curl and tidy. We use HTTParty now.
- * Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
- * More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
- * New end-of-run report.
- * Tests, rubocop, vscode settings, travis, etc.
+ - Remove dependencies on active_support, curl and tidy. We use HTTParty now.
+ - Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
+ - More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
+ - New end-of-run report.
+ - Tests, rubocop, vscode settings, travis, etc.
 
  **Breaking change**
 
@@ -124,72 +124,82 @@ Because all requests are cached, you can run Sinew repeatedly with confidence. R
 
  #### Making requests
 
- * `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query.
- * `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
- * `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
- * `http(method, url, options = {})` - use this for more complex requests
+ - `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query.
+ - `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
+ - `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
+ - `http(method, url, options = {})` - use this for more complex requests
 
  #### Parsing the response
 
  These variables are set after each HTTP request.
 
- * `raw` - the raw response from the last request
- * `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
- * `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
- * `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
- * `json` - parse the response as JSON, with symbolized keys
- * `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
- * `uri` - the URI of the last request. This is useful for resolving relative URLs.
+ - `raw` - the raw response from the last request
+ - `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
+ - `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
+ - `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
+ - `json` - parse the response as JSON, with symbolized keys
+ - `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
+ - `uri` - the URI of the last request. This is useful for resolving relative URLs.
 
  #### Writing CSV
 
- * `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
- * `csv_emit(hash)` - append a row to the CSV file
+ - `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
+ - `csv_emit(hash)` - append a row to the CSV file
 
  ## Hints
 
  Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
 
- * Sinew doesn't (yet) check robots.txt - please check it manually.
- * Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
- * In Chrome, `$` in the console is your friend.
- * Fallback to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
- * Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
- * Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
- * Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
+ - Sinew doesn't (yet) check robots.txt - please check it manually.
+ - Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
+ - In Chrome, `$` in the console is your friend.
+ - Fallback to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
+ - Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
+ - Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
+ - Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
 
  ```ruby
  noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
  ```
 
- * Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
- * Run `sinew -v` to get a report on every `csv_emit`. Very handy.
- * Add the CSV files to your git repo. That way you can version them and get diffs!
+ - Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
+ - Run `sinew -v` to get a report on every `csv_emit`. Very handy.
+ - Add the CSV files to your git repo. That way you can version them and get diffs!
 
  ## Limitations
 
- * Caching is based on URL, so use caution with cookies and other forms of authentication
- * Almost no support for international (non-english) characters
+ - Caching is based on URL, so use caution with cookies and other forms of authentication
+ - Almost no support for international (non-english) characters
 
  ## Changelog
 
+ #### 2.0.5 (unreleased)
+
+ - Supports multiple proxies (`--proxy host1,host2,...`)
+
+ #### 2.0.4 (May 2018)
+
+ - Handle and cache more errors (too many redirects, connection failures, etc.)
+ - Support for adding uri.scheme in generate_cache_key
+ - Added status `code`, a peer to `uri`, `raw`, etc.
+
  #### 2.0.3 (May 2018)
 
- * &amp; now normalizes to & (not and)
+ - &amp; now normalizes to & (not and)
 
  #### 2.0.2 (May 2018)
 
- * Support for `--limit`, `--proxy` and the `xml` variable
- * Dedup - warn and ignore if row[:url] has already been emitted
- * Auto gunzip if contents are compressed
+ - Support for `--limit`, `--proxy` and the `xml` variable
+ - Dedup - warn and ignore if row[:url] has already been emitted
+ - Auto gunzip if contents are compressed
 
  #### 2.0.1 (May 2018)
 
- * Support for legacy cached `head` files from Sinew 1
+ - Support for legacy cached `head` files from Sinew 1
 
  #### 2.0.0 (May 2018)
 
- * Complete rewrite. See above.
+ - Complete rewrite. See above.
 
  #### 1.0.3 (June 2012)
 
@@ -15,12 +15,6 @@ module Sinew
    @runtime_options = RuntimeOptions.new
    @request_tm = Time.at(0)
    @request_count = 0
-
-   if options[:proxy]
-     addr, port = options[:proxy].split(':')
-     runtime_options.httparty_options[:http_proxyaddr] = addr
-     runtime_options.httparty_options[:http_proxyport] = port || 80
-   end
  end
 
  def run
@@ -105,6 +99,7 @@ module Sinew
    else
      "req #{request.uri}"
    end
+   msg = "#{msg} => #{request.proxy}" if request.proxy
    $stderr.puts msg
  end
 
@@ -24,19 +24,36 @@ module Sinew
    @cache_key = calculate_cache_key
  end
 
+ def proxy
+   @proxy ||= begin
+     if proxies = sinew.options[:proxy]
+       proxies.split(',').sample
+     end
+   end
+ end
+
  # run the request, return the result
  def perform
    validate!
 
-   # merge optons
-   options = self.options.merge(sinew.runtime_options.httparty_options)
+   party_options = options.dup
+
+   # merge proxy
+   if proxy = self.proxy
+     addr, port = proxy.split(':')
+     party_options[:http_proxyaddr] = addr
+     party_options[:http_proxyport] = port || 80
+   end
+
+   # now merge runtime_options
+   party_options = party_options.merge(sinew.runtime_options.httparty_options)
 
    # merge headers
    headers = sinew.runtime_options.headers
-   headers = headers.merge(options[:headers]) if options[:headers]
-   options[:headers] = headers
+   headers = headers.merge(party_options[:headers]) if party_options[:headers]
+   party_options[:headers] = headers
 
-   party_response = HTTParty.send(method, uri, options)
+   party_response = HTTParty.send(method, uri, party_options)
    Response.from_network(self, party_response)
  end
 
@@ -1,4 +1,4 @@
  module Sinew
    # Gem version
-   VERSION = '2.0.4'.freeze
+   VERSION = '2.0.5'.freeze
  end
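Note on the proxy change in the Ruby hunks above: proxy handling moves from a one-time setup step into the request itself, where one entry of a comma-separated `--proxy` value is picked at random and split into address and port. The following is a minimal standalone sketch of that logic, not code from the gem. The helper names `pick_proxy` and `proxy_options` are hypothetical, and the integer conversion of the port is an assumption added here (the diff passes the port through as-is).

```ruby
# Hypothetical helpers illustrating the proxy logic in this diff.
# Neither method exists in Sinew; only the split/sample/default-port
# steps mirror the changed code.

# Pick one proxy at random from a comma-separated list such as
# "host1:8080,host2". Returns nil when no proxy is configured.
def pick_proxy(proxy_option)
  return nil if proxy_option.nil? || proxy_option.empty?

  proxy_option.split(',').sample
end

# Turn "host" or "host:port" into the two HTTParty proxy options,
# defaulting to port 80 as the diff does. The to_i conversion is an
# assumption of this sketch, not something the diff performs.
def proxy_options(proxy)
  addr, port = proxy.split(':')
  { http_proxyaddr: addr, http_proxyport: (port || 80).to_i }
end

if (proxy = pick_proxy('host1:8080,host2'))
  puts proxy_options(proxy).inspect
end
```

Because the diff merges `sinew.runtime_options.httparty_options` after setting the proxy keys, any proxy settings placed in the runtime options would take precedence over the per-request choice.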
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: sinew
  version: !ruby/object:Gem::Version
-   version: 2.0.4
+   version: 2.0.5
  platform: ruby
  authors:
  - Adam Doppelt
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-05-24 00:00:00.000000000 Z
+ date: 2019-03-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print