sinew 2.0.4 → 2.0.5
- checksums.yaml +4 -4
- data/README.md +54 -44
- data/lib/sinew/main.rb +1 -6
- data/lib/sinew/request.rb +22 -5
- data/lib/sinew/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b383fb9d0a1d57acfafd78d8e3ff0185b81acb5bcd368d4c0cca9a8999aa0a52
+  data.tar.gz: 0dede0f01c7a53056a38705c6fc134cd33a2f15868a9b4fee1b6f9fa85361d31
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e867516dd43bed9f6dd524475c70b933b30029788521224eed08d269ea1264f5d52a3740e967a2cc61ec96861e1875ecaf2f43af403bd917cac05ce2fd394119
+  data.tar.gz: 719ab64ac523e9cf553171bf318a59938655e4a40fd4c6544ec5c5188c7a81227a30b9ba7c70a6ace90a41b476a62081f2ac4cc88a490d0be2e05ade7e8b3dce
data/README.md
CHANGED
@@ -20,24 +20,24 @@ gem 'sinew'
 
 <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->
 
-
-
-
-
-
-
-
-
+- [Sinew 2](#sinew-2-may-2018)
+- [Quick Example](#quick-example)
+- [How it Works](#how-it-works)
+- [DSL Reference](#dsl-reference)
+- [Hints](#hints)
+- [Limitations](#limitations)
+- [Changelog](#changelog)
+- [License](#license)
 
 ## Sinew 2 (May 2018)
 
 I am pleased to announce the release of Sinew 2.0, a complete rewrite of Sinew for the modern era. Enhancements include:
 
-
-
-
-
-
+- Remove dependencies on active_support, curl and tidy. We use HTTParty now.
+- Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
+- More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
+- New end-of-run report.
+- Tests, rubocop, vscode settings, travis, etc.
 
 **Breaking change**
 
@@ -124,72 +124,82 @@ Because all requests are cached, you can run Sinew repeatedly with confidence. R
 
 #### Making requests
 
-
-
-
-
+- `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query`.
+- `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
+- `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
+- `http(method, url, options = {})` - use this for more complex requests
 
 #### Parsing the response
 
 These variables are set after each HTTP request.
 
-
-
-
-
-
-
-
+- `raw` - the raw response from the last request
+- `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
+- `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
+- `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
+- `json` - parse the response as JSON, with symbolized keys
+- `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
+- `uri` - the URI of the last request. This is useful for resolving relative URLs.
 
 #### Writing CSV
 
-
-
+- `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
+- `csv_emit(hash)` - append a row to the CSV file
 
 ## Hints
 
 Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
 
-
-
-
-
-
-
-
+- Sinew doesn't (yet) check robots.txt - please check it manually.
+- Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
+- In Chrome, `$` in the console is your friend.
+- Fall back to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling JavaScript, but it's fragile if the site changes.
+- Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
+- Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
+- Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
 
 ```ruby
 noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
 ```
 
-
-
-
+- Debug your recipes using plain old `puts`, or better yet use `ap` from [awesome_print](https://github.com/michaeldv/awesome_print).
+- Run `sinew -v` to get a report on every `csv_emit`. Very handy.
+- Add the CSV files to your git repo. That way you can version them and get diffs!
 
 ## Limitations
 
-
-
+- Caching is based on URL, so use caution with cookies and other forms of authentication
+- Almost no support for international (non-English) characters
 
 ## Changelog
 
+#### 2.0.5 (unreleased)
+
+- Supports multiple proxies (`--proxy host1,host2,...`)
+
+#### 2.0.4 (May 2018)
+
+- Handle and cache more errors (too many redirects, connection failures, etc.)
+- Support for adding uri.scheme in generate_cache_key
+- Added status `code`, a peer to `uri`, `raw`, etc.
+
 #### 2.0.3 (May 2018)
 
-
+- `&amp;` now normalizes to `&` (not `and`)
 
 #### 2.0.2 (May 2018)
 
-
-
-
+- Support for `--limit`, `--proxy` and the `xml` variable
+- Dedup - warn and ignore if row[:url] has already been emitted
+- Auto gunzip if contents are compressed
 
 #### 2.0.1 (May 2018)
 
-
+- Support for legacy cached `head` files from Sinew 1
 
 #### 2.0.0 (May 2018)
 
-
+- Complete rewrite. See above.
 
 #### 1.0.3 (June 2012)
 
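The README's DSL reference above describes `json` (symbolized keys), `uri` (resolving relative URLs) and the `String#[regexp]` hint in prose only. The sketch below reproduces those three behaviors with the Ruby standard library. It is not Sinew's implementation, and the response body, page URL and variable names are hypothetical values chosen for illustration.

```ruby
require 'json'
require 'uri'

# Hypothetical response body and request URI (assumptions, not from Sinew).
body = '{"title": "Example", "links": [{"href": "/page/2"}]}'
page = URI.parse('http://example.com/list?page=1')

# Like the `json` variable: parse JSON with symbolized keys.
data = JSON.parse(body, symbolize_names: true)
title = data[:title]

# Like using `uri`: resolve a relative link against the request URI.
next_url = URI.join(page, data[:links].first[:href]).to_s

# String#[regexp, capture]: return the first capture group, or nil if no match.
page_number = next_url[%r{/page/(\d+)}, 1]
```

Inside a real recipe, `json` and `uri` would already be set by Sinew after each request, so only the last two steps would normally appear in a `.sinew` file.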
data/lib/sinew/main.rb
CHANGED
@@ -15,12 +15,6 @@ module Sinew
       @runtime_options = RuntimeOptions.new
       @request_tm = Time.at(0)
       @request_count = 0
-
-      if options[:proxy]
-        addr, port = options[:proxy].split(':')
-        runtime_options.httparty_options[:http_proxyaddr] = addr
-        runtime_options.httparty_options[:http_proxyport] = port || 80
-      end
     end
 
     def run
@@ -105,6 +99,7 @@ module Sinew
       else
         "req #{request.uri}"
       end
+      msg = "#{msg} => #{request.proxy}" if request.proxy
       $stderr.puts msg
     end
 
data/lib/sinew/request.rb
CHANGED
@@ -24,19 +24,36 @@ module Sinew
       @cache_key = calculate_cache_key
     end
 
+    def proxy
+      @proxy ||= begin
+        if proxies = sinew.options[:proxy]
+          proxies.split(',').sample
+        end
+      end
+    end
+
     # run the request, return the result
     def perform
       validate!
 
-
-
+      party_options = options.dup
+
+      # merge proxy
+      if proxy = self.proxy
+        addr, port = proxy.split(':')
+        party_options[:http_proxyaddr] = addr
+        party_options[:http_proxyport] = port || 80
+      end
+
+      # now merge runtime_options
+      party_options = party_options.merge(sinew.runtime_options.httparty_options)
 
       # merge headers
       headers = sinew.runtime_options.headers
-      headers = headers.merge(
-
+      headers = headers.merge(party_options[:headers]) if party_options[:headers]
+      party_options[:headers] = headers
 
-      party_response = HTTParty.send(method, uri,
+      party_response = HTTParty.send(method, uri, party_options)
       Response.from_network(self, party_response)
     end
 
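The per-request proxy logic added to `request.rb` above can be tried in isolation. The following is a minimal sketch, assuming a standalone helper named `proxy_options` (hypothetical, not part of Sinew) that takes the raw `--proxy` string. Unlike the diff, it converts the port to an Integer and accepts an injectable random generator so the choice can be made reproducible.

```ruby
# Hypothetical helper mirroring Request#proxy and the "merge proxy" step
# in Request#perform. Not part of Sinew's API.
def proxy_options(proxy_list, rng: Random.new)
  # No --proxy given: contribute no HTTParty options.
  return {} if proxy_list.nil? || proxy_list.empty?

  # Pick one "host[:port]" entry at random from the comma-separated list.
  proxy = proxy_list.split(',').sample(random: rng)
  addr, port = proxy.split(':')

  # Default to port 80, as the diff does. The Integer conversion is an
  # assumption of this sketch; the diff passes the port through as given.
  { http_proxyaddr: addr, http_proxyport: (port || 80).to_i }
end
```

In the diff, the chosen proxy is memoized per request with `@proxy ||=`, so the log line in `main.rb` and the actual request report the same host. The runtime `httparty_options` are merged after the proxy keys, so on a key conflict the runtime options take precedence.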
data/lib/sinew/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sinew
 version: !ruby/object:Gem::Version
-  version: 2.0.
+  version: 2.0.5
 platform: ruby
 authors:
 - Adam Doppelt
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-03-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print