sinew 2.0.4 → 2.0.5
- checksums.yaml +4 -4
- data/README.md +54 -44
- data/lib/sinew/main.rb +1 -6
- data/lib/sinew/request.rb +22 -5
- data/lib/sinew/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b383fb9d0a1d57acfafd78d8e3ff0185b81acb5bcd368d4c0cca9a8999aa0a52
+  data.tar.gz: 0dede0f01c7a53056a38705c6fc134cd33a2f15868a9b4fee1b6f9fa85361d31
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e867516dd43bed9f6dd524475c70b933b30029788521224eed08d269ea1264f5d52a3740e967a2cc61ec96861e1875ecaf2f43af403bd917cac05ce2fd394119
+  data.tar.gz: 719ab64ac523e9cf553171bf318a59938655e4a40fd4c6544ec5c5188c7a81227a30b9ba7c70a6ace90a41b476a62081f2ac4cc88a490d0be2e05ade7e8b3dce
data/README.md
CHANGED
@@ -20,24 +20,24 @@ gem 'sinew'
 
 <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->
 
-
-
-
-
-
-
-
-
+- [Sinew 2](#sinew-2-may-2018)
+- [Quick Example](#quick-example)
+- [How it Works](#how-it-works)
+- [DSL Reference](#dsl-reference)
+- [Hints](#hints)
+- [Limitations](#limitations)
+- [Changelog](#changelog)
+- [License](#license)
 
 ## Sinew 2 (May 2018)
 
 I am pleased to announce the release of Sinew 2.0, a complete rewrite of Sinew for the modern era. Enhancements include:
 
-
-
-
-
-
+- Remove dependencies on active_support, curl and tidy. We use HTTParty now.
+- Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
+- More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
+- New end-of-run report.
+- Tests, rubocop, vscode settings, travis, etc.
 
 **Breaking change**
 
@@ -124,72 +124,82 @@ Because all requests are cached, you can run Sinew repeatedly with confidence. R
 
 #### Making requests
 
-
-
-
-
+- `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query`.
+- `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
+- `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
+- `http(method, url, options = {})` - use this for more complex requests
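The request helpers listed above differ mainly in how their second argument is encoded. The following is a minimal sketch of those three encodings using only Ruby's standard library, not Sinew or HTTParty. The URL and parameter names are made up for illustration, and since Sinew hands the real encoding to HTTParty, details may differ.

```ruby
require 'json'
require 'uri'

# Hypothetical values, for illustration only.
base = 'https://example.com/search'
params = { q: 'ruby gems', page: 2 }

# get(url, query): the parameters end up in the URL's query string.
get_url = "#{base}?#{URI.encode_www_form(params)}"

# post(url, form): the parameters become a URL-encoded request body.
form_body = URI.encode_www_form(params)

# post_json(url, json): the parameters become a JSON request body.
json_body = JSON.generate(params)

puts get_url
puts form_body
puts json_body
```

For anything these three shapes do not cover, the README points to the generic `http(method, url, options = {})`.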
 
 #### Parsing the response
 
 These variables are set after each HTTP request.
 
-
-
-
-
-
-
-
+- `raw` - the raw response from the last request
+- `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
+- `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
+- `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
+- `json` - parse the response as JSON, with symbolized keys
+- `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
+- `uri` - the URI of the last request. This is useful for resolving relative URLs.
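Two of the variables above are easy to try in plain Ruby: JSON parsed with symbolized keys, and a URI used to resolve a relative link. This is a rough sketch with the standard library only. The response body and URLs are invented for illustration, and it does not show how Sinew itself populates `json` or `uri`.

```ruby
require 'json'
require 'uri'

# Hypothetical response body and request URI, for illustration only.
body = '{"title": "Widget", "links": {"next": "page/2?sort=asc"}}'
uri = URI.parse('https://example.com/catalog/index.html')

# Comparable to the `json` variable: keys come back as symbols.
data = JSON.parse(body, symbolize_names: true)
title = data[:title]

# Comparable to using `uri` to resolve a relative link from the response.
next_url = uri.merge(data[:links][:next]).to_s

puts title
puts next_url
```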
 
 #### Writing CSV
 
-
-
+- `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
+- `csv_emit(hash)` - append a row to the CSV file
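The header rule described above (explicit columns, otherwise the keys of the first emitted row) can be sketched with Ruby's `csv` library. `TinyCsv` is a hypothetical stand-in written for this example. It is not Sinew's implementation and leaves out everything else Sinew does when emitting rows.

```ruby
require 'csv'
require 'stringio'

# Hypothetical stand-in for Sinew's CSV output, for illustration only.
class TinyCsv
  def initialize(io)
    @io = io
    @header = nil
  end

  # Fix the column order explicitly and write the header row.
  def csv_header(*keys)
    @header = keys.flatten
    @io << @header.map(&:to_s).to_csv
  end

  # Append a row; if no header was set, use this row's keys as the columns.
  def csv_emit(row)
    csv_header(row.keys) if @header.nil?
    @io << @header.map { |key| row[key] }.to_csv
  end
end

out = StringIO.new
csv = TinyCsv.new(out)
csv.csv_emit(url: 'https://example.com/a', name: 'Alpha')
csv.csv_emit(name: 'Beta', url: 'https://example.com/b')
puts out.string
```

The second row passes its keys in a different order, but the columns still follow the header taken from the first row.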
 
 ## Hints
 
 Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
 
-
-
-
-
-
-
-
+- Sinew doesn't (yet) check robots.txt - please check it manually.
+- Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
+- In Chrome, `$` in the console is your friend.
+- Fallback to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
+- Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
+- Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
+- Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
 
 ```ruby
 noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
 ```
 
-
-
-
+- Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
+- Run `sinew -v` to get a report on every `csv_emit`. Very handy.
+- Add the CSV files to your git repo. That way you can version them and get diffs!
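The `String#[regexp]` hint above is easiest to see on a concrete string. This is a small plain-Ruby sketch. The markup is invented for illustration, and in a recipe the string would typically be `html` or `raw`.

```ruby
# Hypothetical snippet of page markup, for illustration only.
page = '<span class="price">Price: $1,299.00</span> <a href="/item/42">details</a>'

# String#[regexp] returns the first match, or nil if nothing matches.
price = page[/\$[\d,.]+/]

# With a second argument it returns just that capture group.
item_id = page[%r{/item/(\d+)}, 1]

# A miss is nil rather than an exception, which keeps recipes short.
missing = page[/no such text/]

puts price
puts item_id
puts missing.inspect
```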
 
 ## Limitations
 
-
-
+- Caching is based on URL, so use caution with cookies and other forms of authentication
+- Almost no support for international (non-english) characters
 
 ## Changelog
 
+#### 2.0.5 (unreleased)
+
+- Supports multiple proxies (`--proxy host1,host2,...`)
+
+#### 2.0.4 (May 2018)
+
+- Handle and cache more errors (too many redirects, connection failures, etc.)
+- Support for adding uri.scheme in generate_cache_key
+- Added status `code`, a peer to `uri`, `raw`, etc.
+
 #### 2.0.3 (May 2018)
 
-
+- &amp; now normalizes to & (not and)
 
 #### 2.0.2 (May 2018)
 
-
-
-
+- Support for `--limit`, `--proxy` and the `xml` variable
+- Dedup - warn and ignore if row[:url] has already been emitted
+- Auto gunzip if contents are compressed
 
 #### 2.0.1 (May 2018)
 
-
+- Support for legacy cached `head` files from Sinew 1
 
 #### 2.0.0 (May 2018)
 
-
+- Complete rewrite. See above.
 
 #### 1.0.3 (June 2012)
 
data/lib/sinew/main.rb
CHANGED
@@ -15,12 +15,6 @@ module Sinew
       @runtime_options = RuntimeOptions.new
       @request_tm = Time.at(0)
       @request_count = 0
-
-      if options[:proxy]
-        addr, port = options[:proxy].split(':')
-        runtime_options.httparty_options[:http_proxyaddr] = addr
-        runtime_options.httparty_options[:http_proxyport] = port || 80
-      end
     end
 
     def run
@@ -105,6 +99,7 @@ module Sinew
             else
               "req #{request.uri}"
             end
+      msg = "#{msg} => #{request.proxy}" if request.proxy
       $stderr.puts msg
     end
 
data/lib/sinew/request.rb
CHANGED
@@ -24,19 +24,36 @@ module Sinew
       @cache_key = calculate_cache_key
     end
 
+    def proxy
+      @proxy ||= begin
+        if proxies = sinew.options[:proxy]
+          proxies.split(',').sample
+        end
+      end
+    end
+
     # run the request, return the result
     def perform
       validate!
 
-
-
+      party_options = options.dup
+
+      # merge proxy
+      if proxy = self.proxy
+        addr, port = proxy.split(':')
+        party_options[:http_proxyaddr] = addr
+        party_options[:http_proxyport] = port || 80
+      end
+
+      # now merge runtime_options
+      party_options = party_options.merge(sinew.runtime_options.httparty_options)
 
       # merge headers
       headers = sinew.runtime_options.headers
-      headers = headers.merge(
-
+      headers = headers.merge(party_options[:headers]) if party_options[:headers]
+      party_options[:headers] = headers
 
-      party_response = HTTParty.send(method, uri,
+      party_response = HTTParty.send(method, uri, party_options)
       Response.from_network(self, party_response)
     end
 
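The proxy handling in this hunk picks one entry at random from the comma-separated `--proxy` value, splits it into address and port, and then merges the runtime options on top. Below is a standalone sketch of that flow. `proxy_options` is a hypothetical helper written for this example and is not part of Sinew, the host names are made up, and the sketch assumes a non-empty proxy string.

```ruby
# Hypothetical helper mirroring the logic in the diff; not part of Sinew.
def proxy_options(proxy_flag, request_options = {}, runtime_options = {})
  party_options = request_options.dup

  # Pick one proxy at random from the comma-separated list.
  if proxy_flag
    addr, port = proxy_flag.split(',').sample.split(':')
    party_options[:http_proxyaddr] = addr
    # Integer() is this sketch's choice; the diff passes the port through as-is.
    party_options[:http_proxyport] = port ? Integer(port) : 80
  end

  # Runtime options are merged last, so they win on conflicting keys.
  party_options.merge(runtime_options)
end

single = proxy_options('proxy1.example.com:8080')
default_port = proxy_options('proxy2.example.com')
overridden = proxy_options('proxy1.example.com:8080', { timeout: 5 }, { timeout: 30 })
picked = proxy_options('a.example.com:1,b.example.com:2')

p single
p default_port
p overridden
p picked
```

In the diff, the choice is memoized with `@proxy ||=`, so a given request keeps the same proxy for its log line and for the HTTParty call, while different requests may land on different proxies.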
data/lib/sinew/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sinew
 version: !ruby/object:Gem::Version
-  version: 2.0.
+  version: 2.0.5
 platform: ruby
 authors:
 - Adam Doppelt
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-03-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print