tanakai 1.5.1 → 1.6.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.rspec +3 -0
- data/CHANGELOG.md +4 -0
- data/README.md +61 -45
- data/lib/tanakai/base.rb +6 -5
- data/lib/tanakai/base_helper.rb +3 -3
- data/lib/tanakai/browser_builder/selenium_chrome_builder.rb +1 -1
- data/lib/tanakai/capybara_ext/mechanize/driver.rb +1 -1
- data/lib/tanakai/version.rb +1 -1
- data/tanakai.gemspec +5 -4
- metadata +34 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7ea3cd20cfaedaebf473e853b66ebe58958e89b7525246444e3c8aeef46a4bf0
|
4
|
+
data.tar.gz: a2c51b86487d6392a58b533237731996639fe0037c9aca22a6140c3c968eaf7d
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 52d9a730a0a9e08c0a49ee4177a0370f5ed2a12ac9e3925f0a83b0c232dcedb1645d1b6860cb19c8453bbc5777cec02403654e2282e57ad75c5c2cb898b6dc1b
|
7
|
+
data.tar.gz: '0969ee651ec787b9fa1e47b8d776571b6f4751c29d3dd15bb0c696181ceab8bc826db6f54486df6426151f25f626f4725bf79c51ec8cc8cebebbe6cfa057bfa3'
|
data/.gitignore
CHANGED
data/.rspec
ADDED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -1,8 +1,19 @@
|
|
1
|
-
# Tanakai
|
1
|
+
# 🕷 Tanakai
|
2
2
|
|
3
|
-
|
3
|
+
<sub>[Liphistius tanakai](https://wsc.nmbe.ch/species/58479/Liphistius_tanakai)</sub>
|
4
4
|
|
5
|
-
Tanakai
|
5
|
+
Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of the box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
|
6
|
+
|
7
|
+
### Goals of this fork:
|
8
|
+
|
9
|
+
- [x] add support to [Apparition](https://github.com/twalpole/apparition) and [Cuprite](https://github.com/rubycdp/cuprite)
|
10
|
+
- [x] add support to Ruby 3
|
11
|
+
- [ ] write tests with RSpec
|
12
|
+
- [ ] improve configuration options for Apparition and Cuprite (both have been recently added)
|
13
|
+
- [ ] create an awesome logo in the likes of [this](https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png)
|
14
|
+
- [ ] have you as new contributor
|
15
|
+
|
16
|
+
Tanakai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
|
6
17
|
|
7
18
|
```ruby
|
8
19
|
# github_spider.rb
|
@@ -128,7 +139,7 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
|
|
128
139
|
```
|
129
140
|
</details><br>
|
130
141
|
|
131
|
-
Okay, that was easy. How about
|
142
|
+
Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
|
132
143
|
|
133
144
|
```ruby
|
134
145
|
# infinite_scroll_spider.rb
|
@@ -190,20 +201,20 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
190
201
|
|
191
202
|
|
192
203
|
## Features
|
193
|
-
* Scrape
|
204
|
+
* Scrape JavaScript rendered websites out of the box
|
194
205
|
* Supported engines: [Apparition](https://github.com/twalpole/apparition), [Cuprite](https://github.com/rubycdp/cuprite), [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
|
195
206
|
* Write spider code once, and use it with any supported engine later
|
196
207
|
* All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
|
197
208
|
* Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
|
198
|
-
* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates
|
209
|
+
* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
|
199
210
|
* Automatically [handle requests errors](#handle-request-errors)
|
200
211
|
* Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
|
201
212
|
* Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
|
202
213
|
* [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
|
203
214
|
* **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
|
204
215
|
* Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
|
205
|
-
* Automated [server environment setup](#setup) (for
|
206
|
-
* Command-line [runner](#runner) to run all project spiders one
|
216
|
+
* Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
|
217
|
+
* Command-line [runner](#runner) to run all project spiders one-by-one or in parallel
|
207
218
|
|
208
219
|
## Table of Contents
|
209
220
|
* [Tanakai](#tanakai)
|
@@ -221,7 +232,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
221
232
|
* [Skip duplicates](#skip-duplicates)
|
222
233
|
* [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
|
223
234
|
* [Storage object](#storage-object)
|
224
|
-
* [
|
235
|
+
* [Handling request errors](#handling-request-errors)
|
225
236
|
* [skip_request_errors](#skip_request_errors)
|
226
237
|
* [retry_request_errors](#retry_request_errors)
|
227
238
|
* [Logging custom events](#logging-custom-events)
|
@@ -231,7 +242,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
231
242
|
* [Active Support included](#active-support-included)
|
232
243
|
* [Schedule spiders using Cron](#schedule-spiders-using-cron)
|
233
244
|
* [Configuration options](#configuration-options)
|
234
|
-
* [Using Tanakai inside existing Ruby
|
245
|
+
* [Using Tanakai inside existing Ruby applications](#using-tanakai-inside-existing-ruby-applications)
|
235
246
|
* [crawl! method](#crawl-method)
|
236
247
|
* [parse! method](#parsemethod_name-url-method)
|
237
248
|
* [Tanakai.list and Tanakai.find_by_name](#tanakailist-and-tanakaifind_by_name)
|
@@ -256,7 +267,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
|
|
256
267
|
## Installation
|
257
268
|
Tanakai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
|
258
269
|
|
259
|
-
1) If your system doesn't have appropriate Ruby version, install it:
|
270
|
+
1) If your system doesn't have the appropriate Ruby version, install it:
|
260
271
|
|
261
272
|
<details/>
|
262
273
|
<summary>Ubuntu 18.04</summary>
|
@@ -288,7 +299,7 @@ gem install bundler
|
|
288
299
|
<summary>Mac OS X</summary>
|
289
300
|
|
290
301
|
```bash
|
291
|
-
# Install
|
302
|
+
# Install Homebrew if you don't have it https://brew.sh/
|
292
303
|
# Install rbenv and ruby-build:
|
293
304
|
brew install rbenv ruby-build
|
294
305
|
|
@@ -317,7 +328,7 @@ $ tanakai setup localhost --local --ask-sudo
|
|
317
328
|
```
|
318
329
|
It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/tanakai/automation).
|
319
330
|
|
320
|
-
If you chose automatic installation, you can skip
|
331
|
+
If you chose automatic installation, you can skip the rest of this section and go to ["Getting to Know"](#getting-to-know) part. In case if you want to install everything manually:
|
321
332
|
|
322
333
|
```bash
|
323
334
|
# Install basic tools
|
@@ -330,19 +341,19 @@ sudo apt install -q -y xvfb
|
|
330
341
|
sudo apt install -q -y chromium-browser firefox
|
331
342
|
|
332
343
|
# Instal chromedriver (2.44 version)
|
333
|
-
# All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
|
344
|
+
# All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads
|
334
345
|
cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
|
335
346
|
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
|
336
347
|
rm -f chromedriver_linux64.zip
|
337
348
|
|
338
349
|
# Install geckodriver (0.23.0 version)
|
339
|
-
# All versions located here https://github.com/mozilla/geckodriver/releases/
|
350
|
+
# All versions are located here: https://github.com/mozilla/geckodriver/releases/
|
340
351
|
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
|
341
352
|
sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
|
342
353
|
rm -f geckodriver-v0.23.0-linux64.tar.gz
|
343
354
|
|
344
355
|
# Install PhantomJS (2.1.1)
|
345
|
-
# All versions located here http://phantomjs.org/download.html
|
356
|
+
# All versions are located here: http://phantomjs.org/download.html
|
346
357
|
sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
|
347
358
|
cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
|
348
359
|
tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
|
@@ -371,7 +382,7 @@ brew install phantomjs
|
|
371
382
|
```
|
372
383
|
</details><br>
|
373
384
|
|
374
|
-
Also, if you want to save scraped items to
|
385
|
+
Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
|
375
386
|
|
376
387
|
<details/>
|
377
388
|
<summary>Ubuntu 18.04</summary>
|
@@ -390,7 +401,7 @@ sudo apt install -q -y postgresql-client libpq-dev
|
|
390
401
|
sudo apt install -q -y mongodb-clients
|
391
402
|
```
|
392
403
|
|
393
|
-
But if you want to save items to a local database, database server required as well:
|
404
|
+
But if you want to save items to a local database, a database server is required as well:
|
394
405
|
```bash
|
395
406
|
# Install MySQL client and server
|
396
407
|
sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
|
@@ -434,7 +445,7 @@ brew install mongodb
|
|
434
445
|
|
435
446
|
## Getting to Know
|
436
447
|
### Interactive console
|
437
|
-
Before you get to know all Tanakai features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
|
448
|
+
Before you get to know all of Tanakai's features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
|
438
449
|
|
439
450
|
```bash
|
440
451
|
$ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
|
@@ -499,25 +510,25 @@ $
|
|
499
510
|
```
|
500
511
|
</details><br>
|
501
512
|
|
502
|
-
CLI
|
513
|
+
CLI arguments:
|
503
514
|
* `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
|
504
|
-
* `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
|
515
|
+
* `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
|
505
516
|
|
506
517
|
### Available engines
|
507
|
-
Tanakai has support for following engines and mostly
|
518
|
+
Tanakai has support for the following engines and can mostly switch between them without the need to rewrite any code:
|
508
519
|
|
509
520
|
* `:apparition` - a Chrome driver for Capybara via [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). It started as a fork of Poltergeist and attempts to maintain as much compatibility with the Poltergeist API as possible.
|
510
521
|
* `:cuprite` - a pure Ruby driver for Capybara. It allows you to run Capybara tests on a headless Chrome or Chromium. Under the hood it uses [Ferrum](https://github.com/rubycdp/ferrum#index) which is high-level API to the browser by [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). The design of the driver is as close to Poltergeist as possible though it's not a goal.
|
511
|
-
* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render
|
512
|
-
* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
|
513
|
-
* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper
|
522
|
+
* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
|
523
|
+
* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage issues, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
|
524
|
+
* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper JavaScript rendering.
|
514
525
|
* `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
|
515
526
|
|
516
|
-
**Tip:**
|
527
|
+
**Tip:** prepend a `HEADLESS=false` environment variable on the command line (`$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
|
517
528
|
|
518
529
|
|
519
530
|
### Minimum required spider structure
|
520
|
-
> You can manually create a spider file, or use
|
531
|
+
> You can manually create a spider file, or use the generate command instead: `$ tanakai generate spider simple_spider`
|
521
532
|
|
522
533
|
```ruby
|
523
534
|
require 'tanakai'
|
@@ -535,10 +546,10 @@ SimpleSpider.crawl!
|
|
535
546
|
```
|
536
547
|
|
537
548
|
Where:
|
538
|
-
* `@name
|
539
|
-
* `@engine
|
540
|
-
* `@start_urls
|
541
|
-
*
|
549
|
+
* `@name`: name of a spider. You can omit name if use single-file spider
|
550
|
+
* `@engine`: engine for a spider
|
551
|
+
* `@start_urls`: array of start urls to process one by one inside `parse` method
|
552
|
+
* The `parse` method is the entry point, and should always be present in a spider class
|
542
553
|
|
543
554
|
|
544
555
|
### Method arguments `response`, `url` and `data`
|
@@ -548,9 +559,9 @@ def parse(response, url:, data: {})
|
|
548
559
|
end
|
549
560
|
```
|
550
561
|
|
551
|
-
* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object)
|
552
|
-
* `url` (String) url of a processed webpage
|
553
|
-
* `data` (Hash) uses to pass data between requests
|
562
|
+
* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object): contains parsed HTML code of a processed webpage
|
563
|
+
* `url` (String): url of a processed webpage
|
564
|
+
* `data` (Hash): uses to pass data between requests
|
554
565
|
|
555
566
|
<details/>
|
556
567
|
<summary><strong>Example how to use <code>data</code></strong></summary>
|
@@ -574,7 +585,7 @@ class ProductsSpider < Tanakai::Base
|
|
574
585
|
|
575
586
|
def parse_product(response, url:, data: {})
|
576
587
|
item = {}
|
577
|
-
# Assign item's category_name from data[:category_name]
|
588
|
+
# Assign an item's category_name from data[:category_name]
|
578
589
|
item[:category_name] = data[:category_name]
|
579
590
|
|
580
591
|
# ...
|
@@ -592,7 +603,7 @@ end
|
|
592
603
|
|
593
604
|
### `browser` object
|
594
605
|
|
595
|
-
|
606
|
+
A browser object is available from any spider instance method, which is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses it to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
|
596
607
|
|
597
608
|
But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
|
598
609
|
|
@@ -606,7 +617,7 @@ class GoogleSpider < Tanakai::Base
|
|
606
617
|
browser.fill_in "q", with: "Tanakai web scraping framework"
|
607
618
|
browser.click_button "Google Search"
|
608
619
|
|
609
|
-
# Update response
|
620
|
+
# Update response with current_response after interaction with a browser
|
610
621
|
response = browser.current_response
|
611
622
|
|
612
623
|
# Collect results
|
@@ -655,7 +666,7 @@ def request_to(handler, url:, data: {})
|
|
655
666
|
request_data = { url: url, data: data }
|
656
667
|
|
657
668
|
browser.visit(url)
|
658
|
-
public_send(handler, browser.current_response, request_data)
|
669
|
+
public_send(handler, browser.current_response, **request_data)
|
659
670
|
end
|
660
671
|
```
|
661
672
|
</details><br>
|
@@ -720,7 +731,7 @@ By default `save_to` add position key to an item hash. You can disable it with `
|
|
720
731
|
|
721
732
|
**How helper works:**
|
722
733
|
|
723
|
-
|
734
|
+
While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.
|
724
735
|
|
725
736
|
> If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
|
726
737
|
|
@@ -802,7 +813,7 @@ It is possible to automatically skip all already visited urls while calling `req
|
|
802
813
|
* `#clear!` - reset the whole storage by deleting all values from all scopes.
|
803
814
|
|
804
815
|
|
805
|
-
###
|
816
|
+
### Handling request errors
|
806
817
|
It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Tanakai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
|
807
818
|
|
808
819
|
#### skip_request_errors
|
@@ -1307,7 +1318,7 @@ Tanakai.configure do |config|
|
|
1307
1318
|
end
|
1308
1319
|
```
|
1309
1320
|
|
1310
|
-
### Using Tanakai inside existing Ruby
|
1321
|
+
### Using Tanakai inside existing Ruby applications
|
1311
1322
|
|
1312
1323
|
You can integrate Tanakai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:
|
1313
1324
|
|
@@ -1422,7 +1433,7 @@ Example:
|
|
1422
1433
|
$ tanakai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
|
1423
1434
|
```
|
1424
1435
|
|
1425
|
-
CLI
|
1436
|
+
CLI arguments:
|
1426
1437
|
* `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
|
1427
1438
|
* `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
|
1428
1439
|
* `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
|
@@ -1442,7 +1453,7 @@ Example:
|
|
1442
1453
|
$ tanakai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
|
1443
1454
|
```
|
1444
1455
|
|
1445
|
-
CLI
|
1456
|
+
CLI arguments: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
|
1446
1457
|
* `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
|
1447
1458
|
* `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
|
1448
1459
|
|
@@ -2032,9 +2043,14 @@ $ bundle exec tanakai runner --exclude github_spider
|
|
2032
2043
|
|
2033
2044
|
You can perform custom actions before runner starts and after runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/tanakai/template/config/application.rb) to see example.
|
2034
2045
|
|
2046
|
+
## Testing
|
2047
|
+
To run tests:
|
2048
|
+
```bash
|
2049
|
+
bundle exec rspec
|
2050
|
+
```
|
2035
2051
|
|
2036
2052
|
## Chat Support and Feedback
|
2037
|
-
|
2053
|
+
Submit an issue on GitHub and we'll try to address it in a timely manner.
|
2038
2054
|
|
2039
2055
|
## License
|
2040
|
-
|
2056
|
+
This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
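The `save_to` hunk in the README diff above describes the helper's file handling: the output file is cleared at the start of a run, each item is then appended, and `append: true` skips the clearing step. A minimal sketch of that behavior follows, assuming a JSON lines output. `JsonLinesSaver` and its `save` method are hypothetical names for illustration and are not Tanakai's actual `Saver` implementation.

```ruby
require "json"
require "tmpdir"

# Hypothetical illustration only; this is not Tanakai's Saver class.
class JsonLinesSaver
  def initialize(path, append: false)
    @path = path
    # Truncate the file once per run unless the caller asked to keep old items.
    File.write(@path, "") unless append
  end

  # Append one item to the output file as a single JSON line.
  def save(item)
    File.open(@path, "a") { |file| file.puts(JSON.generate(item)) }
  end
end

path = File.join(Dir.mktmpdir, "items.jsonl")

first_run = JsonLinesSaver.new(path)
first_run.save({ title: "a" })
first_run.save({ title: "b" })

# A second run without append: true starts from an empty file again.
second_run = JsonLinesSaver.new(path)
second_run.save({ title: "c" })

# A third run with append: true keeps what the second run wrote.
third_run = JsonLinesSaver.new(path, append: true)
third_run.save({ title: "d" })

lines = File.readlines(path, chomp: true)
```

After the three runs, `lines` holds only the items from the second and third runs, which mirrors the "clear, then append" wording in the README.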
data/lib/tanakai/base.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
require_relative 'base/saver'
|
2
2
|
require_relative 'base/storage'
|
3
|
+
require 'addressable/uri'
|
3
4
|
|
4
5
|
module Tanakai
|
5
6
|
class Base
|
@@ -201,7 +202,7 @@ module Tanakai
|
|
201
202
|
visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
|
202
203
|
return unless visited
|
203
204
|
|
204
|
-
public_send(handler, browser.current_response(response_type), { url: url, data: data })
|
205
|
+
public_send(handler, browser.current_response(response_type), **{ url: url, data: data })
|
205
206
|
end
|
206
207
|
|
207
208
|
def console(response = nil, url: nil, data: {})
|
@@ -224,9 +225,9 @@ module Tanakai
|
|
224
225
|
@savers[path] ||= begin
|
225
226
|
options = { format: format, position: position, append: append }
|
226
227
|
if self.with_info
|
227
|
-
self.class.savers[path] ||= Saver.new(path, options)
|
228
|
+
self.class.savers[path] ||= Saver.new(path, **options)
|
228
229
|
else
|
229
|
-
Saver.new(path, options)
|
230
|
+
Saver.new(path, **options)
|
230
231
|
end
|
231
232
|
end
|
232
233
|
|
@@ -304,9 +305,9 @@ module Tanakai
|
|
304
305
|
part.each do |url_data|
|
305
306
|
if url_data.class == Hash
|
306
307
|
if url_data[:url].present? && url_data[:data].present?
|
307
|
-
spider.request_to(handler, delay, url_data, response_type: response_type)
|
308
|
+
spider.request_to(handler, delay, **{ **url_data, response_type: response_type })
|
308
309
|
else
|
309
|
-
spider.public_send(handler, url_data)
|
310
|
+
spider.public_send(handler, **url_data)
|
310
311
|
end
|
311
312
|
else
|
312
313
|
spider.request_to(handler, delay, url: url_data, data: data, response_type: response_type)
|
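The `**` additions in this file are consistent with Ruby 3's separation of positional and keyword arguments: a hash passed as the last positional argument is no longer converted to keyword arguments implicitly. The sketch below shows the difference with a hypothetical `DemoSaver` class (a stand-in, not Tanakai's `Saver`) whose initializer takes keyword arguments.

```ruby
# Hypothetical stand-in for a class whose initializer takes keyword arguments.
class DemoSaver
  attr_reader :path, :format, :append

  def initialize(path, format:, append: false)
    @path = path
    @format = format
    @append = append
  end
end

options = { format: :json, append: true }

# Double splat: the hash is expanded into keyword arguments.
# This form works on both Ruby 2.x and Ruby 3.x.
saver = DemoSaver.new("items.json", **options)

# Passing the hash positionally: on Ruby 3+ the hash counts as a second
# positional argument, so the call raises ArgumentError. Ruby 2.7 still
# accepts it and prints a deprecation warning instead.
positional_call_error =
  begin
    DemoSaver.new("items.json", options)
    nil
  rescue ArgumentError => e
    e
  end
```

The same reasoning applies to the `public_send(handler, ..., **{ url: url, data: data })` change: the handler methods declare `url:` and `data:` as keywords, so the hash has to be splatted explicitly.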
data/lib/tanakai/base_helper.rb
CHANGED
@@ -4,13 +4,13 @@ module Tanakai
|
|
4
4
|
|
5
5
|
def absolute_url(url, base:)
|
6
6
|
return unless url
|
7
|
-
URI.join(base, URI.escape(url)).to_s
|
7
|
+
Addressable::URI.join(base, Addressable::URI.escape(url)).to_s
|
8
8
|
end
|
9
9
|
|
10
10
|
def escape_url(url)
|
11
|
-
uri = URI.parse(url)
|
11
|
+
uri = Addressable::URI.parse(url)
|
12
12
|
rescue URI::InvalidURIError => e
|
13
|
-
URI.parse(URI.escape url).to_s rescue url
|
13
|
+
Addressable::URI.parse(Addressable::URI.escape url).to_s rescue url
|
14
14
|
else
|
15
15
|
url
|
16
16
|
end
|
data/lib/tanakai/browser_builder/selenium_chrome_builder.rb
CHANGED
@@ -30,7 +30,7 @@ module Tanakai::BrowserBuilder
|
|
30
30
|
end
|
31
31
|
|
32
32
|
# See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
|
33
|
-
driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
|
33
|
+
driver_options = Selenium::WebDriver::Chrome::Options.new(**opts)
|
34
34
|
|
35
35
|
# Window size
|
36
36
|
if size = @config[:window_size].presence
|
data/lib/tanakai/capybara_ext/mechanize/driver.rb
CHANGED
@@ -34,7 +34,7 @@ class Capybara::Mechanize::Driver
|
|
34
34
|
options[:name] ||= name
|
35
35
|
options[:value] ||= value
|
36
36
|
|
37
|
-
cookie = Mechanize::Cookie.new(options.merge
|
37
|
+
cookie = Mechanize::Cookie.new(**options.merge(path: "/"))
|
38
38
|
browser.agent.cookie_jar << cookie
|
39
39
|
end
|
40
40
|
|
data/lib/tanakai/version.rb
CHANGED
data/tanakai.gemspec
CHANGED
@@ -39,12 +39,13 @@ Gem::Specification.new do |spec|
|
|
39
39
|
spec.add_dependency "headless"
|
40
40
|
spec.add_dependency "pmap"
|
41
41
|
|
42
|
+
spec.add_dependency "addressable"
|
42
43
|
spec.add_dependency "whenever"
|
43
44
|
|
44
|
-
spec.add_dependency "rbcat", "
|
45
|
-
spec.add_dependency "pry"
|
45
|
+
spec.add_dependency "rbcat", ">= 0.2.2", "< 0.3"
|
46
|
+
spec.add_dependency "pry-nav"
|
46
47
|
|
47
|
-
spec.add_development_dependency "bundler", "~>
|
48
|
+
spec.add_development_dependency "bundler", "~> 2"
|
48
49
|
spec.add_development_dependency "rake", "~> 10.0"
|
49
|
-
spec.add_development_dependency "
|
50
|
+
spec.add_development_dependency "rspec", "~> 3"
|
50
51
|
end
|
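The version constraints in the gemspec hunk above can be checked with RubyGems' own `Gem::Requirement` class. The sketch below evaluates the new `rbcat` range and the `"~> 2"` bundler constraint against a few sample version numbers; the sample versions are chosen for illustration and are not taken from the diff.

```ruby
require "rubygems"

# Constraint copied from the gemspec hunk: at least 0.2.2, but below 0.3.
rbcat = Gem::Requirement.new(">= 0.2.2", "< 0.3")

rbcat_checks = {
  "0.2.1" => rbcat.satisfied_by?(Gem::Version.new("0.2.1")),
  "0.2.2" => rbcat.satisfied_by?(Gem::Version.new("0.2.2")),
  "0.2.9" => rbcat.satisfied_by?(Gem::Version.new("0.2.9")),
  "0.3.0" => rbcat.satisfied_by?(Gem::Version.new("0.3.0"))
}

# "~> 2" is the pessimistic operator with one segment: >= 2 and < 3.
bundler = Gem::Requirement.new("~> 2")

bundler_checks = {
  "1.17.3" => bundler.satisfied_by?(Gem::Version.new("1.17.3")),
  "2.4.6"  => bundler.satisfied_by?(Gem::Version.new("2.4.6")),
  "3.0.0"  => bundler.satisfied_by?(Gem::Version.new("3.0.0"))
}

rbcat_checks.each { |version, ok| puts "rbcat #{version}: #{ok}" }
bundler_checks.each { |version, ok| puts "bundler #{version}: #{ok}" }
```

Only the 0.2.x releases from 0.2.2 onward satisfy the `rbcat` range, and only 2.x releases satisfy `"~> 2"`.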
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tanakai
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.6.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Victor Afanasev
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: exe
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2023-02-16 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: thor
|
@@ -199,6 +199,20 @@ dependencies:
|
|
199
199
|
- - ">="
|
200
200
|
- !ruby/object:Gem::Version
|
201
201
|
version: '0'
|
202
|
+
- !ruby/object:Gem::Dependency
|
203
|
+
name: addressable
|
204
|
+
requirement: !ruby/object:Gem::Requirement
|
205
|
+
requirements:
|
206
|
+
- - ">="
|
207
|
+
- !ruby/object:Gem::Version
|
208
|
+
version: '0'
|
209
|
+
type: :runtime
|
210
|
+
prerelease: false
|
211
|
+
version_requirements: !ruby/object:Gem::Requirement
|
212
|
+
requirements:
|
213
|
+
- - ">="
|
214
|
+
- !ruby/object:Gem::Version
|
215
|
+
version: '0'
|
202
216
|
- !ruby/object:Gem::Dependency
|
203
217
|
name: whenever
|
204
218
|
requirement: !ruby/object:Gem::Requirement
|
@@ -217,18 +231,24 @@ dependencies:
|
|
217
231
|
name: rbcat
|
218
232
|
requirement: !ruby/object:Gem::Requirement
|
219
233
|
requirements:
|
220
|
-
- - "
|
234
|
+
- - ">="
|
235
|
+
- !ruby/object:Gem::Version
|
236
|
+
version: 0.2.2
|
237
|
+
- - "<"
|
221
238
|
- !ruby/object:Gem::Version
|
222
|
-
version: '0.
|
239
|
+
version: '0.3'
|
223
240
|
type: :runtime
|
224
241
|
prerelease: false
|
225
242
|
version_requirements: !ruby/object:Gem::Requirement
|
226
243
|
requirements:
|
227
|
-
- - "
|
244
|
+
- - ">="
|
245
|
+
- !ruby/object:Gem::Version
|
246
|
+
version: 0.2.2
|
247
|
+
- - "<"
|
228
248
|
- !ruby/object:Gem::Version
|
229
|
-
version: '0.
|
249
|
+
version: '0.3'
|
230
250
|
- !ruby/object:Gem::Dependency
|
231
|
-
name: pry
|
251
|
+
name: pry-nav
|
232
252
|
requirement: !ruby/object:Gem::Requirement
|
233
253
|
requirements:
|
234
254
|
- - ">="
|
@@ -247,14 +267,14 @@ dependencies:
|
|
247
267
|
requirements:
|
248
268
|
- - "~>"
|
249
269
|
- !ruby/object:Gem::Version
|
250
|
-
version: '
|
270
|
+
version: '2'
|
251
271
|
type: :development
|
252
272
|
prerelease: false
|
253
273
|
version_requirements: !ruby/object:Gem::Requirement
|
254
274
|
requirements:
|
255
275
|
- - "~>"
|
256
276
|
- !ruby/object:Gem::Version
|
257
|
-
version: '
|
277
|
+
version: '2'
|
258
278
|
- !ruby/object:Gem::Dependency
|
259
279
|
name: rake
|
260
280
|
requirement: !ruby/object:Gem::Requirement
|
@@ -270,19 +290,19 @@ dependencies:
|
|
270
290
|
- !ruby/object:Gem::Version
|
271
291
|
version: '10.0'
|
272
292
|
- !ruby/object:Gem::Dependency
|
273
|
-
name:
|
293
|
+
name: rspec
|
274
294
|
requirement: !ruby/object:Gem::Requirement
|
275
295
|
requirements:
|
276
296
|
- - "~>"
|
277
297
|
- !ruby/object:Gem::Version
|
278
|
-
version: '
|
298
|
+
version: '3'
|
279
299
|
type: :development
|
280
300
|
prerelease: false
|
281
301
|
version_requirements: !ruby/object:Gem::Requirement
|
282
302
|
requirements:
|
283
303
|
- - "~>"
|
284
304
|
- !ruby/object:Gem::Version
|
285
|
-
version: '
|
305
|
+
version: '3'
|
286
306
|
description:
|
287
307
|
email:
|
288
308
|
- vicfreefly@gmail.com
|
@@ -292,6 +312,7 @@ extensions: []
|
|
292
312
|
extra_rdoc_files: []
|
293
313
|
files:
|
294
314
|
- ".gitignore"
|
315
|
+
- ".rspec"
|
295
316
|
- ".travis.yml"
|
296
317
|
- CHANGELOG.md
|
297
318
|
- Gemfile
|
@@ -374,7 +395,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
374
395
|
- !ruby/object:Gem::Version
|
375
396
|
version: '0'
|
376
397
|
requirements: []
|
377
|
-
rubygems_version: 3.
|
398
|
+
rubygems_version: 3.2.15
|
378
399
|
signing_key:
|
379
400
|
specification_version: 4
|
380
401
|
summary: Maintained fork of Kimurai, a modern web scraping framework written in Ruby
|