sinew 3.0.1 → 4.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ba5558019816540d71e1bb44029f2733aff53649a3f4da72a9905e0a67d06ad9
-   data.tar.gz: 446c245782cad55f1caa36e01b0e8c98748295b47eb5b1911175fff2f79589b2
+   metadata.gz: c16f880ce1bf6454b10c34dd1f071daf1a758eebb52e598262d6357cebb2e9f2
+   data.tar.gz: 4e57acae70775805a96fd5e6bd7ed00ebd5d74dbc0e6daa3348fd9161118d00d
  SHA512:
-   metadata.gz: cef8c1145a21e84f560b44821071ffc7b57ef965167b633e1c837f7b7d9dbfce340b14d1afb2a35891c7b1ed4aa4f08e47ca7405cf382acca7eae855a47d3a71
-   data.tar.gz: 8dc7b67511fc541cccef23b69463abd9a8081f9c75583c1d6c6756dda0561ce7a91be35ee06f7d886ea7488b9005c969e14fa10794cac92475a3063c8968abec
+   metadata.gz: 99bf3da2db47a04dbd6f18dfb7aa3f2bc5f706bbe460633f3f8e3589c85377ae14b35d43c750d384fbdeac0883247b5dbab696127700c6fee71731818df57e74
+   data.tar.gz: 1ecda4412fc9f2384bf01aa38ed40327ac5855c37eb3c82ce65d21779a8a582880c6afaffa5d68954759ddccaa04cec9a6f36f8460c96849722ecafe4ac2ba6e
data/.gitignore CHANGED
@@ -1,7 +1,5 @@
+ .ruby-version
+ .yardoc
  *.gem
- .bundle
- Gemfile.lock
- pkg/*
- rdoc
+ doc/
  sample.csv
- TAGS
data/.rubocop.yml CHANGED
@@ -1,52 +1,34 @@
  AllCops:
-   TargetRubyVersion: 2.7
    NewCops: enable
+   SuggestExtensions: false
+   TargetRubyVersion: 2.7
 
- # amd: customizations
- Layout/SpaceInsideArrayLiteralBrackets:
-   EnforcedStyle: compact
- Layout/CaseIndentation:
-   EnforcedStyle: end
- Layout/EndAlignment:
-   EnforcedStyleAlignWith: variable
- Style/CollectionMethods:
-   Enabled: true
-   PreferredMethods:
-     reduce: inject
- Style/EmptyMethod:
-   Enabled: false
- Style/TrailingCommaInArrayLiteral:
-   EnforcedStyleForMultiline: consistent_comma
- Style/TrailingCommaInHashLiteral:
-   EnforcedStyleForMultiline: consistent_comma
-
- # amd: these seem extreme
- Lint/AssignmentInCondition: { Enabled: false } # I do this all the time
- Lint/SuppressedException: { Enabled: false } # blank rescues are useful
- Naming/BinaryOperatorParameterName: { Enabled: false } # silly
- Naming/HeredocDelimiterNaming: { Enabled: false } # silly
- Naming/MethodParameterName: { Enabled: false } # silly
- Style/AccessorGrouping: { Enabled: false } # silly
- Style/AsciiComments: { Enabled: false } # silly
- Style/ClassAndModuleChildren: { Enabled: false } # silly
- Style/Documentation: { Enabled: false } # we don't need this
- Style/DoubleNegation: { Enabled: false } # silly
- Style/FormatStringToken: { Enabled: false } # we like printf here
- Style/FrozenStringLiteralComment: { Enabled: false } # seems excessive
- Style/GuardClause: { Enabled: false } # confusing
- Style/HashTransformValues: { Enabled: false } # breaks code by trying to apply to an array
- Style/IfUnlessModifier: { Enabled: false } # personally I hate unless
- Style/NegatedIf: { Enabled: false } # these are fine
- Style/Next: { Enabled: false } # these are fine
- Style/NumericPredicate: { Enabled: false } # silly
- Style/ParallelAssignment: { Enabled: false } # these are fine
- Style/PerlBackrefs: { Enabled: false } # these are fine
- Style/RaiseArgs: { Enabled: false } # silly
- Style/RedundantAssignment: { Enabled: false } # these are usually on purpose
- Style/RegexpLiteral: { Enabled: false } # these are fine
- Style/SoleNestedConditional: { Enabled: false } # these are fine
- Style/StderrPuts: { Enabled: false } # this is awful
+ # this is buggy in 2.7.0
+ Style/HashTransformValues: { Enabled: false }
 
- # amd: these Metric rules are annoying, disable
- Metrics:
-   Enabled: false
+ # minimal personal preference
+ Layout/CaseIndentation: { Enabled: false }
+ Layout/EndAlignment: { EnforcedStyleAlignWith: variable }
+ Lint/AssignmentInCondition: { Enabled: false }
+ Lint/NonLocalExitFromIterator: { Enabled: false }
+ Metrics: { Enabled: false }
+ Naming/HeredocDelimiterNaming: { Enabled: false }
+ Naming/MethodParameterName: { Enabled: false }
+ Naming/VariableNumber: { Enabled: false }
+ Style/AsciiComments: { Enabled: false }
+ Style/ClassVars: { Enabled: false }
+ Style/CommentAnnotation: { Enabled: false }
+ Style/Documentation: { Enabled: false }
+ Style/DoubleNegation: { Enabled: false }
+ Style/EmptyCaseCondition: { Enabled: false }
+ Style/FormatStringToken: { Enabled: false }
+ Style/FrozenStringLiteralComment: { Enabled: false }
+ Style/GuardClause: { Enabled: false }
+ Style/IfUnlessModifier: { Enabled: false }
+ Style/NegatedIf: { Enabled: false }
+ Style/NumericPredicate: { Enabled: false }
+ Style/ParallelAssignment: { Enabled: false }
+ Style/StderrPuts: { Enabled: false }
+ Style/StringConcatenation: { Enabled: false }
+ Style/TrailingCommaInArrayLiteral: { EnforcedStyleForMultiline: consistent_comma }
+ Style/TrailingCommaInHashLiteral: { EnforcedStyleForMultiline: consistent_comma }
data/Gemfile CHANGED
@@ -1,11 +1,11 @@
  source 'http://rubygems.org'
+ gemspec
 
- group :development do
+ group :development, :test do
    gem 'minitest'
    gem 'mocha'
+   gem 'pry'
    gem 'rake'
-   gem 'rubocop', '~> 0.91.0', require: false
+   gem 'rubocop', '~> 1.18'
    gem 'webmock'
  end
-
- gemspec
data/Gemfile.lock ADDED
@@ -0,0 +1,124 @@
+ PATH
+   remote: .
+   specs:
+     sinew (4.0.0)
+       amazing_print (~> 1.3)
+       faraday (~> 1.4)
+       faraday-encoding (~> 0)
+       faraday-rate_limiter (~> 0.0)
+       hashie (~> 4.1)
+       httpdisk (~> 0.5)
+       nokogiri (~> 1.11)
+       slop (~> 4.8)
+       sterile (~> 1.0)
+
+ GEM
+   remote: http://rubygems.org/
+   specs:
+     addressable (2.8.0)
+       public_suffix (>= 2.0.2, < 5.0)
+     amazing_print (1.3.0)
+     ast (2.4.2)
+     coderay (1.1.3)
+     content-type (0.0.1)
+       parslet (~> 1.5)
+     crack (0.4.5)
+       rexml
+     domain_name (0.5.20190701)
+       unf (>= 0.0.5, < 1.0.0)
+     faraday (1.5.0)
+       faraday-em_http (~> 1.0)
+       faraday-em_synchrony (~> 1.0)
+       faraday-excon (~> 1.1)
+       faraday-httpclient (~> 1.0.1)
+       faraday-net_http (~> 1.0)
+       faraday-net_http_persistent (~> 1.1)
+       faraday-patron (~> 1.0)
+       multipart-post (>= 1.2, < 3)
+       ruby2_keywords (>= 0.0.4)
+     faraday-cookie_jar (0.0.7)
+       faraday (>= 0.8.0)
+       http-cookie (~> 1.0.0)
+     faraday-em_http (1.0.0)
+     faraday-em_synchrony (1.0.0)
+     faraday-encoding (0.0.5)
+       faraday
+     faraday-excon (1.1.0)
+     faraday-httpclient (1.0.1)
+     faraday-net_http (1.0.1)
+     faraday-net_http_persistent (1.1.0)
+     faraday-patron (1.0.0)
+     faraday-rate_limiter (0.0.4)
+       faraday
+     faraday_middleware (1.0.0)
+       faraday (~> 1.0)
+     hashdiff (1.0.1)
+     hashie (4.1.0)
+     http-cookie (1.0.4)
+       domain_name (~> 0.5)
+     httpdisk (0.5.2)
+       content-type (~> 0.0)
+       faraday (~> 1.4)
+       faraday-cookie_jar (~> 0.0)
+       faraday_middleware (~> 1.0)
+       slop (~> 4.8)
+     method_source (1.0.0)
+     mini_portile2 (2.5.3)
+     minitest (5.14.4)
+     mocha (1.13.0)
+     multipart-post (2.1.1)
+     nokogiri (1.11.7)
+       mini_portile2 (~> 2.5.0)
+       racc (~> 1.4)
+     parallel (1.20.1)
+     parser (3.0.2.0)
+       ast (~> 2.4.1)
+     parslet (1.8.2)
+     pry (0.14.1)
+       coderay (~> 1.1)
+       method_source (~> 1.0)
+     public_suffix (4.0.6)
+     racc (1.5.2)
+     rainbow (3.0.0)
+     rake (13.0.6)
+     regexp_parser (2.1.1)
+     rexml (3.2.5)
+     rubocop (1.18.3)
+       parallel (~> 1.10)
+       parser (>= 3.0.0.0)
+       rainbow (>= 2.2.2, < 4.0)
+       regexp_parser (>= 1.8, < 3.0)
+       rexml
+       rubocop-ast (>= 1.7.0, < 2.0)
+       ruby-progressbar (~> 1.7)
+       unicode-display_width (>= 1.4.0, < 3.0)
+     rubocop-ast (1.7.0)
+       parser (>= 3.0.1.1)
+     ruby-progressbar (1.11.0)
+     ruby2_keywords (0.0.4)
+     slop (4.9.1)
+     sterile (1.0.23)
+       nokogiri (>= 1.11.7)
+     unf (0.1.4)
+       unf_ext
+     unf_ext (0.0.7.7)
+     unicode-display_width (2.0.0)
+     webmock (3.13.0)
+       addressable (>= 2.3.6)
+       crack (>= 0.3.2)
+       hashdiff (>= 0.4.0, < 2.0.0)
+
+ PLATFORMS
+   ruby
+
+ DEPENDENCIES
+   minitest
+   mocha
+   pry
+   rake
+   rubocop (~> 1.18)
+   sinew!
+   webmock
+
+ BUNDLED WITH
+    2.1.4
data/README.md CHANGED
@@ -2,17 +2,22 @@
 
  ## Welcome to Sinew
 
- Sinew collects structured data from web sites (screen scraping). It provides a Ruby DSL built for crawling, a robust caching system, and integration with [Nokogiri](http://nokogiri.org). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies.
+ Sinew is a Ruby library for collecting data from web sites (scraping). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies. Sinew has been used to crawl millions of websites.
 
- Sinew is distributed as a ruby gem:
+ ## Key Features
 
- ```sh
- $ gem install sinew
- ```
+ - Robust crawling with the [Faraday](https://lostisland.github.io/faraday/) HTTP client
+ - Aggressive caching with [httpdisk](https://github.com/gurgeous/httpdisk/)
+ - Easy parsing with HTML cleanup, Nokogiri, JSON, etc.
+ - CSV generation for crawled data
 
- or in your Gemfile:
+ ## Installation
 
  ```ruby
+ # install gem
+ $ gem install sinew
+
+ # or add to your Gemfile:
  gem 'sinew'
  ```
 
@@ -20,22 +25,22 @@ gem 'sinew'
 
  <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->
 
- - [Sinew 3](#sinew-3-may-2021)
+ - [Sinew 4](#sinew-4-june-2021)
  - [Quick Example](#quick-example)
  - [How it Works](#how-it-works)
- - [DSL Reference](#dsl-reference)
+ - [Reference](#reference)
  - [Hints](#hints)
  - [Limitations](#limitations)
  - [Changelog](#changelog)
  - [License](#license)
 
- ## Sinew 3 (May 2021)
-
- I am pleased to announce the release of Sinew 3.0. Sinew has been streamlined and updated to use the [Faraday](https://lostisland.github.io/faraday/) HTTP client with [httpdisk](https://github.com/gurgeous/httpdisk/) middleware for caching.
+ ## Sinew 4 (June 2021)
 
  **Breaking change**
 
- Sinew 3 uses a new format for cached responses. Old Sinew 2 cache directories should be removed before running Sinew again.
+ We are pleased to announce the release of Sinew 4. The Sinew DSL exposes a single `sinew` method in lieu of the many methods exposed in Sinew 3. Because of this single entry point, Sinew is now much easier to embed in other applications. Also, each Sinew 4 request returns a full Response object to facilitate parallelism.
+
+ Sinew uses the [Faraday](https://lostisland.github.io/faraday/) HTTP client with the [httpdisk](https://github.com/gurgeous/httpdisk/) middleware for aggressive caching of responses.
 
  ## Quick Example
 
@@ -43,16 +48,16 @@ Here's an example for collecting the links from httpbingo.org. Paste this into a
 
  ```ruby
  # get the url
- get "http://httpbingo.org"
+ response = sinew.get "https://httpbingo.org"
 
  # use nokogiri to collect links
- noko.css("ul li a").each do |a|
+ response.noko.css("ul li a").each do |a|
    row = { }
    row[:url] = a[:href]
    row[:title] = a.text
 
    # append a row to the csv
-   csv_emit(row)
+   sinew.csv_emit(row)
  end
  ```
 
@@ -60,26 +65,26 @@ end
 
  There are three main features provided by Sinew.
 
- #### The Sinew DSL
+ #### Recipes
 
- Sinew uses recipe files to crawl web sites. Recipes have the `.sinew` extension, but they are plain old Ruby. The [Sinew DSL](#dsl) makes crawling easy. Use `get` to make an HTTP GET:
+ Sinew uses recipe files to crawl web sites. Recipes have the .sinew extension, but they are plain old Ruby. Here's a trivial example that calls `get` to make an HTTP GET request:
 
  ```ruby
- get "https://www.google.com/search?q=darwin"
- get "https://www.google.com/search", q: "charles darwin"
+ response = sinew.get "https://www.google.com/search?q=darwin"
+ response = sinew.get "https://www.google.com/search", q: "charles darwin"
  ```
 
- Once you've done a `get`, you have access to the document in a few different formats. In general, it's easiest to use `noko` to automatically parse and interact with the results. If Nokogiri isn't appropriate, you can fall back to regular expressions run against `raw` or `html`. Use `json` if you are expecting a JSON response.
+ Once you've done a `get`, you can access the document in a few different formats. In general, it's easiest to use `noko` to automatically parse and interact with HTML results. If Nokogiri isn't appropriate, fall back to regular expressions run against `body` or `html`. Use `json` if you are expecting a JSON response.
 
  ```ruby
- get "https://www.google.com/search?q=darwin"
+ response = sinew.get "https://www.google.com/search?q=darwin"
 
  # pull out the links with nokogiri
- links = noko.css("a").map { |i| i[:href] }
+ links = response.noko.css("a").map { _1[:href] }
  puts links.inspect
 
  # or, use a regex
- links = html[/<a[^>]+href="([^"]+)/, 1]
+ links = response.html[/<a[^>]+href="([^"]+)/, 1]
  puts links.inspect
  ```
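One detail worth knowing about the regex form used in the README example: `String#[regexp, 1]` returns only the first capture of the first match, so it yields a single link rather than a list. The sketch below uses only core Ruby and a made-up HTML string (an assumption for illustration, not Sinew output) to contrast it with `String#scan`, which collects every match.

```ruby
# A small HTML fragment, invented for this illustration.
html = '<a href="/one">1</a> <a href="/two">2</a>'

# String#[regexp, 1] returns capture group 1 of the FIRST match (or nil).
first = html[/<a[^>]+href="([^"]+)/, 1]

# String#scan returns every match. With one capture group, each element is a
# one-item array, so flatten gives a plain list of hrefs.
all = html.scan(/<a[^>]+href="([^"]+)/).flatten

puts first.inspect
puts all.inspect
```

In a recipe the same two calls could be run against `response.html` or `response.body`.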
 
@@ -88,16 +93,16 @@ puts links.inspect
  Recipes output CSV files. To continue the example above:
 
  ```ruby
- get "https://www.google.com/search?q=darwin"
- noko.css("a").each do |i|
+ response = sinew.get "https://www.google.com/search?q=darwin"
+ response.noko.css("a").each do |i|
    row = { }
    row[:href] = i[:href]
    row[:text] = i.text
-   csv_emit row
+   sinew.csv_emit row
  end
  ```
 
- Sinew creates a CSV file with the same name as the recipe, and `csv_emit(hash)` appends a row. The values of your hash are converted to strings:
+ Sinew creates a CSV file with the same name as the recipe, and `csv_emit(hash)` appends a row. The values of your hash are cleaned up and converted to strings:
 
  1. Nokogiri nodes are converted to text
  1. Arrays are joined with "|", so you can separate them later
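The two conversion rules visible in this hunk can be sketched in plain Ruby. This is a minimal illustration, not Sinew's implementation: `normalize_value`, `normalize_row` and `FakeNode` are hypothetical names, and `FakeNode` merely stands in for a Nokogiri node by responding to `#text`. Any further cleanup rules that the README lists outside this hunk are not modeled.

```ruby
# Hypothetical stand-in for a Nokogiri node: anything that responds to #text.
FakeNode = Struct.new(:text)

# Hypothetical helper (not part of Sinew) applying the two rules shown above.
def normalize_value(value)
  if value.respond_to?(:text)
    value.text.to_s                                  # nodes become their text
  elsif value.is_a?(Array)
    value.map { |v| normalize_value(v) }.join("|")   # arrays are joined with "|"
  else
    value.to_s                                       # everything else is stringified
  end
end

# Apply the conversion to every value of a row hash, keeping the keys.
def normalize_row(row)
  row.transform_values { |v| normalize_value(v) }
end

row = normalize_row({ title: FakeNode.new("Sinew"), tags: ["ruby", "scraping"], count: 3 })
puts row.inspect
```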
@@ -108,35 +113,84 @@ Sinew creates a CSV file with the same name as the recipe, and `csv_emit(hash)`
 
  Sinew uses [httpdisk](https://github.com/gurgeous/httpdisk/) to aggressively cache all HTTP responses to disk in `~/.sinew`. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.
 
- Sinew never deletes files from the cache - that's up to you!
+ Sinew never deletes files from the cache - that's up to you! Sinew has various command line options to refresh the cache. See `--expires`, `--force` and `--force-errors`.
 
- Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you build up your recipe.
+ Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you work on your recipe.
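The caching behavior described above (each URL fetched once, real fetches spaced out) can be pictured with a small standard-library sketch. This is not how httpdisk is implemented: `TinyCache` is a hypothetical class, the cache key is simply a SHA-256 of the URL, and the "network" is a block supplied by the caller so the example runs offline.

```ruby
require "digest"
require "tmpdir"

# Hypothetical sketch, not httpdisk: cache bodies on disk keyed by a digest of
# the URL, and keep a minimum interval between real fetches.
class TinyCache
  def initialize(dir, min_interval: 1.0)
    @dir = dir
    @min_interval = min_interval
    @last_fetch = nil
  end

  # Returns the cached body if present. Otherwise waits out the rate limit,
  # yields so the caller can do the real fetch, and stores the result.
  def fetch(url)
    path = File.join(@dir, Digest::SHA256.hexdigest(url))
    return File.read(path) if File.exist?(path)

    throttle
    body = yield(url)
    File.write(path, body)
    body
  end

  private

  # Sleep just long enough to keep @min_interval seconds between fetches.
  def throttle
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    wait = @last_fetch ? @min_interval - (now - @last_fetch) : 0
    sleep(wait) if wait > 0
    @last_fetch = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

calls = 0
cache = TinyCache.new(Dir.mktmpdir, min_interval: 0.05)
first = cache.fetch("https://example.com/a") { calls += 1; "hello" }
second = cache.fetch("https://example.com/a") { calls += 1; "changed" }
puts [first, second, calls].inspect
```

The second call never reaches the block, which is why a recipe can be rerun many times without hitting the site again.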
 
- ## DSL Reference
+ ## Running Sinew
 
- #### Making requests
+ The `sinew` command line has many useful options. You will be using this command many times as you iterate on your recipe:
 
- - `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query`.
- - `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
- - `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
- - `http(method, url, options = {})` - use this for more complex requests
+ ```sh
+ $ bin/sinew --help
+ Usage: sinew [options] [recipe]
+ -l, --limit quit after emitting this many rows
+ --proxy use host[:port] as HTTP proxy
+ --timeout maximum time allowed for the transfer
+ -s, --silent suppress some output
+ -v, --verbose dump emitted rows while running
+ From httpdisk:
+ --dir set custom cache directory
+ --expires when to expire cached requests (ex: 1h, 2d, 3w)
+ --force don't read anything from cache (but still write)
+ --force-errors don't read errors from cache (but still write)
+ ```
+
+ `Sinew` also has many runtime options that can be set in your recipe. For example:
+
+ ```ruby
+ sinew.options[:headers] = { 'User-Agent' => 'xyz' }
+
+ ...
+ ```
+
+ Here is the list of available options for `Sinew`:
+
+ - **headers** - default HTTP headers to use on every request
+ - **ignore_params** - ignore these query params when generating httpdisk cache keys
+ - **insecure** - ignore SSL errors
+ - **params** - default query parameters to use on every request
+ - **rate_limit** - minimum time between network requests
+ - **retries** - number of times to retry each failed request
+ - **url_prefix** - default URL base to use on every request
+
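To make the `ignore_params` option more concrete, here is one way a cache key could ignore selected query parameters. This is an illustrative sketch only and not httpdisk's actual key algorithm: `cache_key` is a hypothetical helper, parameter names are assumed to be strings, and sorting the remaining parameters is an extra assumption made so that ordering does not matter.

```ruby
require "digest"
require "uri"

# Hypothetical helper: drop the ignored query params, sort the rest, and
# digest the resulting URL. Not httpdisk's real algorithm.
def cache_key(url, ignore_params: [])
  uri = URI.parse(url)
  pairs = URI.decode_www_form(uri.query.to_s)
  kept = pairs.reject { |name, _| ignore_params.include?(name) }.sort
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  Digest::SHA256.hexdigest(uri.to_s)
end

a = cache_key("https://example.com/search?q=darwin&session=abc", ignore_params: ["session"])
b = cache_key("https://example.com/search?session=xyz&q=darwin", ignore_params: ["session"])
c = cache_key("https://example.com/search?q=darwin&session=abc")

puts a == b # same key once the session param is ignored
puts a == c # different key when nothing is ignored
```

Under these assumptions, requests that differ only in a volatile parameter such as a session id would share one cache entry.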
+ ## Reference
+
+ #### Making HTTP requests
+
+ - `sinew.get(url, params = nil, headers = nil)` - fetch a url with GET
+ - `sinew.post(url, body = nil, headers = nil)` - fetch a url with POST, using `body` as the URL encoded POST body.
+ - `sinew.post_json(url, body = nil, headers = nil)` - fetch a url with POST, using `body` as the JSON POST body.
 
  #### Parsing the response
 
- These variables are set after each HTTP request.
+ Each request method returns a `Sinew::Response`. The response has several helpers to make parsing easier:
 
- - `raw` - the raw response from the last request
- - `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
- - `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
- - `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
- - `json` - parse the response as JSON, with symbolized keys
- - `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
- - `uri` - the URI of the last request. This is useful for resolving relative URLs.
+ - `body` - the raw body
+ - `html` - like `body`, but with a handful of HTML-specific whitespace cleanups
+ - `noko` - parse as HTML and return a [Nokogiri](http://nokogiri.org) document
+ - `xml` - parse as XML and return a [Nokogiri](http://nokogiri.org) document
+ - `json` - parse as JSON, with symbolized keys
+ - `mash` - parse as JSON and return a [Hashie::Mash](https://github.com/hashie/hashie#mash)
+ - `url` - the url of the request. If the request goes through a redirect, `url` will reflect the final url.
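For readers unfamiliar with "symbolized keys", the standard library shows what that means. The sketch below parses a made-up JSON string (an assumption for illustration, not a real response) with `JSON.parse` and `symbolize_names: true`; whether `Sinew::Response#json` is implemented exactly this way is not stated in the README.

```ruby
require "json"

# A JSON body invented for this illustration.
body = '{"name": "sinew", "versions": [{"number": "4.0.0"}]}'

# symbolize_names: true turns every object key into a Symbol, at any depth.
data = JSON.parse(body, symbolize_names: true)

name = data[:name]
latest = data[:versions].first[:number]

puts name
puts latest
```

With `mash`, the same data would be reachable through method-style access, since Hashie::Mash exposes keys as methods.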
 
  #### Writing CSV
 
- - `csv_header(keys)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `csv_emit`.
- - `csv_emit(hash)` - append a row to the CSV file
+ - `sinew.csv_header(columns)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `sinew.csv_emit`.
+ - `sinew.csv_emit(hash)` - append a row to the CSV file
+
+ #### Advanced: Cache
+
+ Sinew has some advanced helpers for checking the httpdisk cache. For the following methods, `body` hashes default to form body type.
+
+ - `sinew.cached?(method, url, params = nil, body = nil)` - check if request is cached
+ - `sinew.uncache(method, url, params = nil, body = nil)` - remove cache file, if any
+ - `sinew.status(method, url, params = nil, body = nil)` - get httpdisk status
+
+ Plus some caching helpers in Sinew::Response:
+
+ - `diskpath` - the location on disk for the cached httpdisk response
+ - `uncache` - remove cache file for this response
 
  ## Hints
 
@@ -145,13 +199,15 @@ Writing Sinew recipes is fun and easy. The builtin caching means you can iterate
  - Sinew doesn't (yet) check robots.txt - please check it manually.
  - Prefer Nokogiri over regular expressions wherever possible. Learn [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
  - In Chrome, `$` in the console is your friend.
- - Fall back to regular expressions if you're desperate. Depending on the site, use either `raw` or `html`. `html` is probably your best bet. `raw` is good for crawling Javascript, but it's fragile if the site changes.
+ - Fall back to regular expressions if you're desperate. Depending on the site, use either `body` or `html`. `html` is probably your best bet. `body` is good for crawling Javascript, but it's fragile if the site changes.
  - Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
  - Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
  - Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
 
  ```ruby
- noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
+ noko.css("table")[4].css("td").select do
+   _1[:width].to_i > 80
+ end.map(&:text)
  ```
 
  - Debug your recipes using plain old `puts`, or better yet use `ap` from [amazing_print](https://github.com/amazing-print/amazing_print).
@@ -165,6 +221,11 @@ noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
 
  ## Changelog
 
+ #### 4.0.0 (July 2021)
+
+ - Rewritten to use simpler DSL
+ - Upgraded to httpdisk 0.5 to take advantage of the new encoding support
+
  #### 3.0.0 (May 2021)
 
  - Major rewrite of network and caching layer. See above.