tanakai 1.5.0 → 1.6.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.rspec +3 -0
- data/CHANGELOG.md +8 -0
- data/README.md +66 -48
- data/lib/tanakai/base.rb +8 -7
- data/lib/tanakai/base_helper.rb +3 -3
- data/lib/tanakai/browser_builder/selenium_chrome_builder.rb +1 -1
- data/lib/tanakai/capybara_ext/mechanize/driver.rb +1 -1
- data/lib/tanakai/template/Gemfile +1 -1
- data/lib/tanakai/version.rb +1 -1
- data/tanakai.gemspec +5 -4
- metadata +34 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7ea3cd20cfaedaebf473e853b66ebe58958e89b7525246444e3c8aeef46a4bf0
+  data.tar.gz: a2c51b86487d6392a58b533237731996639fe0037c9aca22a6140c3c968eaf7d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 52d9a730a0a9e08c0a49ee4177a0370f5ed2a12ac9e3925f0a83b0c232dcedb1645d1b6860cb19c8453bbc5777cec02403654e2282e57ad75c5c2cb898b6dc1b
+  data.tar.gz: '0969ee651ec787b9fa1e47b8d776571b6f4751c29d3dd15bb0c696181ceab8bc826db6f54486df6426151f25f626f4725bf79c51ec8cc8cebebbe6cfa057bfa3'
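Note on the checksums: `checksums.yaml` records SHA-256 and SHA-512 digests for the two archives packed inside the `.gem` file. The following is a minimal sketch of how digests of this shape can be produced with Ruby's standard `digest` library. The helper name `archive_checksums` and the sample bytes are illustrative assumptions, not part of the gem or of RubyGems itself.

```ruby
require 'digest'

# Hypothetical helper: returns hex digests for a blob of bytes,
# using the same two algorithms that checksums.yaml records.
def archive_checksums(bytes)
  {
    'SHA256' => Digest::SHA256.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Sample bytes standing in for the contents of data.tar.gz.
sums = archive_checksums('example bytes standing in for data.tar.gz')
puts sums['SHA256'].length # 64 hex characters
puts sums['SHA512'].length # 128 hex characters
```

To check a downloaded archive, the same helper could be fed `File.binread(path)` and the result compared against the values listed in the diff.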
data/.gitignore
CHANGED
data/.rspec
ADDED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -1,8 +1,19 @@
|
|
1
|
-
# Tanakai
|
1
|
+
# 🕷 Tanakai
|
2
2
|
|
3
|
-
|
3
|
+
<sub>[Liphistius tanakai](https://wsc.nmbe.ch/species/58479/Liphistius_tanakai)</sub>
|
4
4
|
|
5
|
-
Tanakai
|
5
|
+
Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of the box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
|
6
|
+
|
7
|
+
### Goals of this fork:
|
8
|
+
|
9
|
+
- [x] add support to [Apparition](https://github.com/twalpole/apparition) and [Cuprite](https://github.com/rubycdp/cuprite)
|
10
|
+
- [x] add support to Ruby 3
|
11
|
+
- [ ] write tests with RSpec
|
12
|
+
- [ ] improve configuration options for Apparition and Cuprite (both have been recently added)
|
13
|
+
- [ ] create an awesome logo in the likes of [this](https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png)
|
14
|
+
- [ ] have you as new contributor
|
15
|
+
|
16
|
+
Tanakai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
|
6
17
|
|
7
18
|
```ruby
|
8
19
|
# github_spider.rb
|
@@ -128,7 +139,7 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
|
|
128
139
|
```
|
129
140
|
</details><br>
|
130
141
|
|
131
|
-
Okay, that was easy. How about
|
142
|
+
Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
|
132
143
|
|
133
144
|
```ruby
|
134
145
|
# infinite_scroll_spider.rb
|
@@ -190,20 +201,20 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
190
201
|
|
191
202
|
|
192
203
|
## Features
|
193
|
-
* Scrape
|
204
|
+
* Scrape JavaScript rendered websites out of the box
|
194
205
|
* Supported engines: [Apparition](https://github.com/twalpole/apparition), [Cuprite](https://github.com/rubycdp/cuprite), [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
|
195
206
|
* Write spider code once, and use it with any supported engine later
|
196
207
|
* All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
|
197
208
|
* Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
|
198
|
-
* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates
|
209
|
+
* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
|
199
210
|
* Automatically [handle requests errors](#handle-request-errors)
|
200
211
|
* Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
|
201
212
|
* Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
|
202
213
|
* [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
|
203
214
|
* **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
|
204
215
|
* Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
|
205
|
-
* Automated [server environment setup](#setup) (for
|
206
|
-
* Command-line [runner](#runner) to run all project spiders one
|
216
|
+
* Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
|
217
|
+
* Command-line [runner](#runner) to run all project spiders one-by-one or in parallel
|
207
218
|
|
208
219
|
## Table of Contents
|
209
220
|
* [Tanakai](#tanakai)
|
@@ -221,7 +232,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
221
232
|
* [Skip duplicates](#skip-duplicates)
|
222
233
|
* [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
|
223
234
|
* [Storage object](#storage-object)
|
224
|
-
* [
|
235
|
+
* [Handling request errors](#handling-request-errors)
|
225
236
|
* [skip_request_errors](#skip_request_errors)
|
226
237
|
* [retry_request_errors](#retry_request_errors)
|
227
238
|
* [Logging custom events](#logging-custom-events)
|
@@ -231,7 +242,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
231
242
|
* [Active Support included](#active-support-included)
|
232
243
|
* [Schedule spiders using Cron](#schedule-spiders-using-cron)
|
233
244
|
* [Configuration options](#configuration-options)
|
234
|
-
* [Using Tanakai inside existing Ruby
|
245
|
+
* [Using Tanakai inside existing Ruby applications](#using-tanakai-inside-existing-ruby-applications)
|
235
246
|
* [crawl! method](#crawl-method)
|
236
247
|
* [parse! method](#parsemethod_name-url-method)
|
237
248
|
* [Tanakai.list and Tanakai.find_by_name](#tanakailist-and-tanakaifind_by_name)
|
@@ -256,7 +267,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
256
267
|
## Installation
|
257
268
|
Tanakai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
|
258
269
|
|
259
|
-
1) If your system doesn't have appropriate Ruby version, install it:
|
270
|
+
1) If your system doesn't have the appropriate Ruby version, install it:
|
260
271
|
|
261
272
|
<details/>
|
262
273
|
<summary>Ubuntu 18.04</summary>
|
@@ -288,7 +299,7 @@ gem install bundler
|
|
288
299
|
<summary>Mac OS X</summary>
|
289
300
|
|
290
301
|
```bash
|
291
|
-
# Install
|
302
|
+
# Install Homebrew if you don't have it https://brew.sh/
|
292
303
|
# Install rbenv and ruby-build:
|
293
304
|
brew install rbenv ruby-build
|
294
305
|
|
@@ -317,7 +328,7 @@ $ tanakai setup localhost --local --ask-sudo
|
|
317
328
|
```
|
318
329
|
It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/tanakai/automation).
|
319
330
|
|
320
|
-
If you chose automatic installation, you can skip
|
331
|
+
If you chose automatic installation, you can skip the rest of this section and go to ["Getting to Know"](#getting-to-know) part. In case if you want to install everything manually:
|
321
332
|
|
322
333
|
```bash
|
323
334
|
# Install basic tools
|
@@ -330,19 +341,19 @@ sudo apt install -q -y xvfb
|
|
330
341
|
sudo apt install -q -y chromium-browser firefox
|
331
342
|
|
332
343
|
# Instal chromedriver (2.44 version)
|
333
|
-
# All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
|
344
|
+
# All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads
|
334
345
|
cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
|
335
346
|
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
|
336
347
|
rm -f chromedriver_linux64.zip
|
337
348
|
|
338
349
|
# Install geckodriver (0.23.0 version)
|
339
|
-
# All versions located here https://github.com/mozilla/geckodriver/releases/
|
350
|
+
# All versions are located here: https://github.com/mozilla/geckodriver/releases/
|
340
351
|
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
|
341
352
|
sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
|
342
353
|
rm -f geckodriver-v0.23.0-linux64.tar.gz
|
343
354
|
|
344
355
|
# Install PhantomJS (2.1.1)
|
345
|
-
# All versions located here http://phantomjs.org/download.html
|
356
|
+
# All versions are located here: http://phantomjs.org/download.html
|
346
357
|
sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
|
347
358
|
cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
|
348
359
|
tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
|
@@ -371,7 +382,7 @@ brew install phantomjs
|
|
371
382
|
```
|
372
383
|
</details><br>
|
373
384
|
|
374
|
-
Also, if you want to save scraped items to
|
385
|
+
Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
|
375
386
|
|
376
387
|
<details/>
|
377
388
|
<summary>Ubuntu 18.04</summary>
|
@@ -390,7 +401,7 @@ sudo apt install -q -y postgresql-client libpq-dev
|
|
390
401
|
sudo apt install -q -y mongodb-clients
|
391
402
|
```
|
392
403
|
|
393
|
-
But if you want to save items to a local database, database server required as well:
|
404
|
+
But if you want to save items to a local database, a database server is required as well:
|
394
405
|
```bash
|
395
406
|
# Install MySQL client and server
|
396
407
|
sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
|
@@ -434,7 +445,7 @@ brew install mongodb
|
|
434
445
|
|
435
446
|
## Getting to Know
|
436
447
|
### Interactive console
|
437
|
-
Before you get to know all Tanakai features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
|
448
|
+
Before you get to know all of Tanakai's features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
|
438
449
|
|
439
450
|
```bash
|
440
451
|
$ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
|
@@ -499,25 +510,25 @@ $
|
|
499
510
|
```
|
500
511
|
</details><br>
|
501
512
|
|
502
|
-
CLI
|
513
|
+
CLI arguments:
|
503
514
|
* `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
|
504
|
-
* `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
|
515
|
+
* `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
|
505
516
|
|
506
517
|
### Available engines
|
507
|
-
Tanakai has support for following engines and mostly
|
518
|
+
Tanakai has support for the following engines and can mostly switch between them without the need to rewrite any code:
|
508
519
|
|
509
520
|
* `:apparition` - a Chrome driver for Capybara via [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). It started as a fork of Poltergeist and attempts to maintain as much compatibility with the Poltergeist API as possible.
|
510
521
|
* `:cuprite` - a pure Ruby driver for Capybara. It allows you to run Capybara tests on a headless Chrome or Chromium. Under the hood it uses [Ferrum](https://github.com/rubycdp/ferrum#index) which is high-level API to the browser by [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). The design of the driver is as close to Poltergeist as possible though it's not a goal.
|
511
|
-
* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render
|
512
|
-
* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
|
513
|
-
* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper
|
522
|
+
* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
|
523
|
+
* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage issues, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
|
524
|
+
* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper JavaScript rendering.
|
514
525
|
* `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
|
515
526
|
|
516
|
-
**Tip:**
|
527
|
+
**Tip:** prepend a `HEADLESS=false` environment variable on the command line (`$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
|
517
528
|
|
518
529
|
|
519
530
|
### Minimum required spider structure
|
520
|
-
> You can manually create a spider file, or use
|
531
|
+
> You can manually create a spider file, or use the generate command instead: `$ tanakai generate spider simple_spider`
|
521
532
|
|
522
533
|
```ruby
|
523
534
|
require 'tanakai'
|
@@ -535,10 +546,10 @@ SimpleSpider.crawl!
|
|
535
546
|
```
|
536
547
|
|
537
548
|
Where:
|
538
|
-
* `@name
|
539
|
-
* `@engine
|
540
|
-
* `@start_urls
|
541
|
-
*
|
549
|
+
* `@name`: name of a spider. You can omit name if use single-file spider
|
550
|
+
* `@engine`: engine for a spider
|
551
|
+
* `@start_urls`: array of start urls to process one by one inside `parse` method
|
552
|
+
* The `parse` method is the entry point, and should always be present in a spider class
|
542
553
|
|
543
554
|
|
544
555
|
### Method arguments `response`, `url` and `data`
|
@@ -548,9 +559,9 @@ def parse(response, url:, data: {})
|
|
548
559
|
end
|
549
560
|
```
|
550
561
|
|
551
|
-
* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object)
|
552
|
-
* `url` (String) url of a processed webpage
|
553
|
-
* `data` (Hash) uses to pass data between requests
|
562
|
+
* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object): contains parsed HTML code of a processed webpage
|
563
|
+
* `url` (String): url of a processed webpage
|
564
|
+
* `data` (Hash): uses to pass data between requests
|
554
565
|
|
555
566
|
<details/>
|
556
567
|
<summary><strong>Example how to use <code>data</code></strong></summary>
|
@@ -574,7 +585,7 @@ class ProductsSpider < Tanakai::Base
|
|
574
585
|
|
575
586
|
def parse_product(response, url:, data: {})
|
576
587
|
item = {}
|
577
|
-
# Assign item's category_name from data[:category_name]
|
588
|
+
# Assign an item's category_name from data[:category_name]
|
578
589
|
item[:category_name] = data[:category_name]
|
579
590
|
|
580
591
|
# ...
|
@@ -592,7 +603,7 @@ end
|
|
592
603
|
|
593
604
|
### `browser` object
|
594
605
|
|
595
|
-
|
606
|
+
A browser object is available from any spider instance method, which is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses it to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
|
596
607
|
|
597
608
|
But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
|
598
609
|
|
@@ -606,7 +617,7 @@ class GoogleSpider < Tanakai::Base
|
|
606
617
|
browser.fill_in "q", with: "Tanakai web scraping framework"
|
607
618
|
browser.click_button "Google Search"
|
608
619
|
|
609
|
-
# Update response
|
620
|
+
# Update response with current_response after interaction with a browser
|
610
621
|
response = browser.current_response
|
611
622
|
|
612
623
|
# Collect results
|
@@ -626,7 +637,7 @@ Check out **Capybara cheat sheets** where you can see all available methods **to
|
|
626
637
|
|
627
638
|
### `request_to` method
|
628
639
|
|
629
|
-
For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it). Example:
|
640
|
+
For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it) and `response_type` (defaults to `:html`). Example:
|
630
641
|
|
631
642
|
```ruby
|
632
643
|
class Spider < Tanakai::Base
|
@@ -635,11 +646,12 @@ class Spider < Tanakai::Base
|
|
635
646
|
|
636
647
|
def parse(response, url:, data: {})
|
637
648
|
# Process request to `parse_product` method with `https://example.com/some_product` url:
|
638
|
-
request_to :parse_product, url: "https://example.com/some_product"
|
649
|
+
request_to :parse_product, url: "https://example.com/some_product.json", response_type: :json
|
639
650
|
end
|
640
651
|
|
641
652
|
def parse_product(response, url:, data: {})
|
642
|
-
puts "
|
653
|
+
puts "JSON parsed from page https://example.com/some_product.json"
|
654
|
+
puts response
|
643
655
|
end
|
644
656
|
end
|
645
657
|
```
|
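Note on `response_type`: the updated README example passes `response_type: :json` so that the handler receives parsed JSON instead of an HTML document. A rough sketch of that idea, using only the standard `json` library, might look like the code below. The function `parse_body` is a hypothetical stand-in and is not the gem's actual `current_response` implementation, which this diff does not show.

```ruby
require 'json'

# Hypothetical stand-in for picking a parser based on response_type.
# The real framework hands back a Nokogiri document for :html; here the
# raw body is returned unchanged to keep the sketch dependency-free.
def parse_body(body, response_type = :html)
  case response_type
  when :json then JSON.parse(body)
  else body
  end
end

product = parse_body('{"name": "Widget", "price": 9.5}', :json)
puts product['name']
puts product['price']
```

With a parser chosen this way, a handler such as `parse_product` can read fields straight from a Hash rather than querying the response with CSS or XPath selectors.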
@@ -654,7 +666,7 @@ def request_to(handler, url:, data: {})
|
|
654
666
|
request_data = { url: url, data: data }
|
655
667
|
|
656
668
|
browser.visit(url)
|
657
|
-
public_send(handler, browser.current_response, request_data)
|
669
|
+
public_send(handler, browser.current_response, **request_data)
|
658
670
|
end
|
659
671
|
```
|
660
672
|
</details><br>
|
@@ -719,7 +731,7 @@ By default `save_to` add position key to an item hash. You can disable it with `
|
|
719
731
|
|
720
732
|
**How helper works:**
|
721
733
|
|
722
|
-
|
734
|
+
While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.
|
723
735
|
|
724
736
|
> If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
|
725
737
|
|
@@ -801,7 +813,7 @@ It is possible to automatically skip all already visited urls while calling `req
|
|
801
813
|
* `#clear!` - reset the whole storage by deleting all values from all scopes.
|
802
814
|
|
803
815
|
|
804
|
-
###
|
816
|
+
### Handling request errors
|
805
817
|
It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Tanakai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
|
806
818
|
|
807
819
|
#### skip_request_errors
|
@@ -1194,6 +1206,7 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
|
|
1194
1206
|
* `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # => 3`
|
1195
1207
|
* `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
|
1196
1208
|
* `config:` pass custom options to config (see [config section](#crawler-config))
|
1209
|
+
* `response_type:` response should be returned as `:html` or `:json`, defaults to `:html`
|
1197
1210
|
|
1198
1211
|
### Active Support included
|
1199
1212
|
|
@@ -1305,7 +1318,7 @@ Tanakai.configure do |config|
|
|
1305
1318
|
end
|
1306
1319
|
```
|
1307
1320
|
|
1308
|
-
### Using Tanakai inside existing Ruby
|
1321
|
+
### Using Tanakai inside existing Ruby applications
|
1309
1322
|
|
1310
1323
|
You can integrate Tanakai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:
|
1311
1324
|
|
@@ -1420,7 +1433,7 @@ Example:
|
|
1420
1433
|
$ tanakai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
|
1421
1434
|
```
|
1422
1435
|
|
1423
|
-
CLI
|
1436
|
+
CLI arguments:
|
1424
1437
|
* `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
|
1425
1438
|
* `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
|
1426
1439
|
* `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
|
@@ -1440,7 +1453,7 @@ Example:
|
|
1440
1453
|
$ tanakai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
|
1441
1454
|
```
|
1442
1455
|
|
1443
|
-
CLI
|
1456
|
+
CLI arguments: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
|
1444
1457
|
* `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
|
1445
1458
|
* `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
|
1446
1459
|
|
@@ -2030,9 +2043,14 @@ $ bundle exec tanakai runner --exclude github_spider
|
|
2030
2043
|
|
2031
2044
|
You can perform custom actions before runner starts and after runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/tanakai/template/config/application.rb) to see example.
|
2032
2045
|
|
2046
|
+
## Testing
|
2047
|
+
To run tests:
|
2048
|
+
```bash
|
2049
|
+
bundle exec rspec
|
2050
|
+
```
|
2033
2051
|
|
2034
2052
|
## Chat Support and Feedback
|
2035
|
-
|
2053
|
+
Submit an issue on GitHub and we'll try to address it in a timely manner.
|
2036
2054
|
|
2037
2055
|
## License
|
2038
|
-
|
2056
|
+
This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
data/lib/tanakai/base.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
require_relative 'base/saver'
|
2
2
|
require_relative 'base/storage'
|
3
|
+
require 'addressable/uri'
|
3
4
|
|
4
5
|
module Tanakai
|
5
6
|
class Base
|
@@ -201,7 +202,7 @@ module Tanakai
|
|
201
202
|
visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
|
202
203
|
return unless visited
|
203
204
|
|
204
|
-
public_send(handler, browser.current_response(response_type), { url: url, data: data })
|
205
|
+
public_send(handler, browser.current_response(response_type), **{ url: url, data: data })
|
205
206
|
end
|
206
207
|
|
207
208
|
def console(response = nil, url: nil, data: {})
|
@@ -224,9 +225,9 @@ module Tanakai
|
|
224
225
|
@savers[path] ||= begin
|
225
226
|
options = { format: format, position: position, append: append }
|
226
227
|
if self.with_info
|
227
|
-
self.class.savers[path] ||= Saver.new(path, options)
|
228
|
+
self.class.savers[path] ||= Saver.new(path, **options)
|
228
229
|
else
|
229
|
-
Saver.new(path, options)
|
230
|
+
Saver.new(path, **options)
|
230
231
|
end
|
231
232
|
end
|
232
233
|
|
@@ -286,7 +287,7 @@ module Tanakai
|
|
286
287
|
end
|
287
288
|
end
|
288
289
|
|
289
|
-
def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {})
|
290
|
+
def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {}, response_type: :html)
|
290
291
|
parts = urls.in_sorted_groups(threads, false)
|
291
292
|
urls_count = urls.size
|
292
293
|
|
@@ -304,12 +305,12 @@ module Tanakai
|
|
304
305
|
part.each do |url_data|
|
305
306
|
if url_data.class == Hash
|
306
307
|
if url_data[:url].present? && url_data[:data].present?
|
307
|
-
spider.request_to(handler, delay, url_data)
|
308
|
+
spider.request_to(handler, delay, **{ **url_data, response_type: response_type })
|
308
309
|
else
|
309
|
-
spider.public_send(handler, url_data)
|
310
|
+
spider.public_send(handler, **url_data)
|
310
311
|
end
|
311
312
|
else
|
312
|
-
spider.request_to(handler, delay, url: url_data, data: data)
|
313
|
+
spider.request_to(handler, delay, url: url_data, data: data, response_type: response_type)
|
313
314
|
end
|
314
315
|
end
|
315
316
|
ensure
|
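Note on the `**` changes in this file: Ruby 3 separates positional and keyword arguments, so a Hash passed as the last positional argument is no longer converted into keywords. That is presumably why the calls above now splat their option hashes. A small self-contained sketch of the behaviour follows. `DemoSpider` and its `parse` method are illustrative stand-ins that only mirror the handler signature `parse(response, url:, data: {})` shown in the README.

```ruby
# Stand-in for a spider handler with keyword arguments.
class DemoSpider
  def parse(response, url:, data: {})
    "#{response} from #{url} with #{data.size} data key(s)"
  end
end

spider = DemoSpider.new
request_data = { url: 'https://example.com', data: { page: 1 } }

# Splatting the hash with `**` passes its entries as keywords.
puts spider.public_send(:parse, 'body', **request_data)

# Passing the hash positionally raises ArgumentError on Ruby 3
# (Ruby 2.7 still converts it, with a deprecation warning).
begin
  spider.public_send(:parse, 'body', request_data)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end

# Adding one more keyword to an existing hash, in the same style as
# the `{ **url_data, response_type: response_type }` call in in_parallel.
merged = { **request_data, response_type: :json }
puts merged.keys.inspect
```

The same reasoning applies to `Saver.new(path, **options)` in this file and to the other `**` changes later in the diff.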
data/lib/tanakai/base_helper.rb
CHANGED
@@ -4,13 +4,13 @@ module Tanakai
|
|
4
4
|
|
5
5
|
def absolute_url(url, base:)
|
6
6
|
return unless url
|
7
|
-
URI.join(base, URI.escape(url)).to_s
|
7
|
+
Addressable::URI.join(base, Addressable::URI.escape(url)).to_s
|
8
8
|
end
|
9
9
|
|
10
10
|
def escape_url(url)
|
11
|
-
uri = URI.parse(url)
|
11
|
+
uri = Addressable::URI.parse(url)
|
12
12
|
rescue URI::InvalidURIError => e
|
13
|
-
URI.parse(URI.escape url).to_s rescue url
|
13
|
+
Addressable::URI.parse(Addressable::URI.escape url).to_s rescue url
|
14
14
|
else
|
15
15
|
url
|
16
16
|
end
|
data/lib/tanakai/browser_builder/selenium_chrome_builder.rb
CHANGED
@@ -30,7 +30,7 @@ module Tanakai::BrowserBuilder
|
|
30
30
|
end
|
31
31
|
|
32
32
|
# See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
|
33
|
-
driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
|
33
|
+
driver_options = Selenium::WebDriver::Chrome::Options.new(**opts)
|
34
34
|
|
35
35
|
# Window size
|
36
36
|
if size = @config[:window_size].presence
|
data/lib/tanakai/capybara_ext/mechanize/driver.rb
CHANGED
@@ -34,7 +34,7 @@ class Capybara::Mechanize::Driver
|
|
34
34
|
options[:name] ||= name
|
35
35
|
options[:value] ||= value
|
36
36
|
|
37
|
-
cookie = Mechanize::Cookie.new(options.merge
|
37
|
+
cookie = Mechanize::Cookie.new(**options.merge(path: "/"))
|
38
38
|
browser.agent.cookie_jar << cookie
|
39
39
|
end
|
40
40
|
|
data/lib/tanakai/version.rb
CHANGED
data/tanakai.gemspec
CHANGED
@@ -39,12 +39,13 @@ Gem::Specification.new do |spec|
   spec.add_dependency "headless"
   spec.add_dependency "pmap"
 
+  spec.add_dependency "addressable"
   spec.add_dependency "whenever"
 
-  spec.add_dependency "rbcat", "
-  spec.add_dependency "pry"
+  spec.add_dependency "rbcat", ">= 0.2.2", "< 0.3"
+  spec.add_dependency "pry-nav"
 
-  spec.add_development_dependency "bundler", "~>
+  spec.add_development_dependency "bundler", "~> 2"
   spec.add_development_dependency "rake", "~> 10.0"
-  spec.add_development_dependency "
+  spec.add_development_dependency "rspec", "~> 3"
 end
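Note on the version constraints: the new gemspec pins `rbcat` to the range `>= 0.2.2, < 0.3` and uses the pessimistic operator `~>` for `bundler` and `rspec`. A short sketch of how RubyGems evaluates such constraints is shown below, using the `Gem::Requirement` and `Gem::Version` classes that ship with Ruby. The version numbers being tested are arbitrary examples, not releases this diff refers to.

```ruby
require 'rubygems'

# Compound requirement, as used for rbcat in the gemspec.
rbcat = Gem::Requirement.new('>= 0.2.2', '< 0.3')
puts rbcat.satisfied_by?(Gem::Version.new('0.2.2')) # true
puts rbcat.satisfied_by?(Gem::Version.new('0.3.0')) # false

# Pessimistic requirement, as used for bundler: "~> 2" means >= 2 and < 3.
bundler = Gem::Requirement.new('~> 2')
puts bundler.satisfied_by?(Gem::Version.new('2.4.6')) # true
puts bundler.satisfied_by?(Gem::Version.new('3.0.0')) # false
```

By the same rule, the unchanged `"~> 10.0"` constraint on `rake` accepts any 10.x release and rejects 11.0 and later.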
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tanakai
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.6.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Victor Afanasev
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: exe
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2023-02-16 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: thor
|
@@ -199,6 +199,20 @@ dependencies:
|
|
199
199
|
- - ">="
|
200
200
|
- !ruby/object:Gem::Version
|
201
201
|
version: '0'
|
202
|
+
- !ruby/object:Gem::Dependency
|
203
|
+
name: addressable
|
204
|
+
requirement: !ruby/object:Gem::Requirement
|
205
|
+
requirements:
|
206
|
+
- - ">="
|
207
|
+
- !ruby/object:Gem::Version
|
208
|
+
version: '0'
|
209
|
+
type: :runtime
|
210
|
+
prerelease: false
|
211
|
+
version_requirements: !ruby/object:Gem::Requirement
|
212
|
+
requirements:
|
213
|
+
- - ">="
|
214
|
+
- !ruby/object:Gem::Version
|
215
|
+
version: '0'
|
202
216
|
- !ruby/object:Gem::Dependency
|
203
217
|
name: whenever
|
204
218
|
requirement: !ruby/object:Gem::Requirement
|
@@ -217,18 +231,24 @@ dependencies:
|
|
217
231
|
name: rbcat
|
218
232
|
requirement: !ruby/object:Gem::Requirement
|
219
233
|
requirements:
|
220
|
-
- - "
|
234
|
+
- - ">="
|
235
|
+
- !ruby/object:Gem::Version
|
236
|
+
version: 0.2.2
|
237
|
+
- - "<"
|
221
238
|
- !ruby/object:Gem::Version
|
222
|
-
version: '0.
|
239
|
+
version: '0.3'
|
223
240
|
type: :runtime
|
224
241
|
prerelease: false
|
225
242
|
version_requirements: !ruby/object:Gem::Requirement
|
226
243
|
requirements:
|
227
|
-
- - "
|
244
|
+
- - ">="
|
245
|
+
- !ruby/object:Gem::Version
|
246
|
+
version: 0.2.2
|
247
|
+
- - "<"
|
228
248
|
- !ruby/object:Gem::Version
|
229
|
-
version: '0.
|
249
|
+
version: '0.3'
|
230
250
|
- !ruby/object:Gem::Dependency
|
231
|
-
name: pry
|
251
|
+
name: pry-nav
|
232
252
|
requirement: !ruby/object:Gem::Requirement
|
233
253
|
requirements:
|
234
254
|
- - ">="
|
@@ -247,14 +267,14 @@ dependencies:
|
|
247
267
|
requirements:
|
248
268
|
- - "~>"
|
249
269
|
- !ruby/object:Gem::Version
|
250
|
-
version: '
|
270
|
+
version: '2'
|
251
271
|
type: :development
|
252
272
|
prerelease: false
|
253
273
|
version_requirements: !ruby/object:Gem::Requirement
|
254
274
|
requirements:
|
255
275
|
- - "~>"
|
256
276
|
- !ruby/object:Gem::Version
|
257
|
-
version: '
|
277
|
+
version: '2'
|
258
278
|
- !ruby/object:Gem::Dependency
|
259
279
|
name: rake
|
260
280
|
requirement: !ruby/object:Gem::Requirement
|
@@ -270,19 +290,19 @@ dependencies:
|
|
270
290
|
- !ruby/object:Gem::Version
|
271
291
|
version: '10.0'
|
272
292
|
- !ruby/object:Gem::Dependency
|
273
|
-
name:
|
293
|
+
name: rspec
|
274
294
|
requirement: !ruby/object:Gem::Requirement
|
275
295
|
requirements:
|
276
296
|
- - "~>"
|
277
297
|
- !ruby/object:Gem::Version
|
278
|
-
version: '
|
298
|
+
version: '3'
|
279
299
|
type: :development
|
280
300
|
prerelease: false
|
281
301
|
version_requirements: !ruby/object:Gem::Requirement
|
282
302
|
requirements:
|
283
303
|
- - "~>"
|
284
304
|
- !ruby/object:Gem::Version
|
285
|
-
version: '
|
305
|
+
version: '3'
|
286
306
|
description:
|
287
307
|
email:
|
288
308
|
- vicfreefly@gmail.com
|
@@ -292,6 +312,7 @@ extensions: []
|
|
292
312
|
extra_rdoc_files: []
|
293
313
|
files:
|
294
314
|
- ".gitignore"
|
315
|
+
- ".rspec"
|
295
316
|
- ".travis.yml"
|
296
317
|
- CHANGELOG.md
|
297
318
|
- Gemfile
|
@@ -374,7 +395,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
374
395
|
- !ruby/object:Gem::Version
|
375
396
|
version: '0'
|
376
397
|
requirements: []
|
377
|
-
rubygems_version: 3.
|
398
|
+
rubygems_version: 3.2.15
|
378
399
|
signing_key:
|
379
400
|
specification_version: 4
|
380
401
|
summary: Maintained fork of Kimurai, a modern web scraping framework written in Ruby
|