RubyGems - tanakai - Versions diffs - 1.5.0 → 1.6.0 - Mend

tanakai 1.5.0 → 1.6.0

Files changed (13)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/.rspec +3 -0
data/CHANGELOG.md +8 -0
data/README.md +66 -48
data/lib/tanakai/base.rb +8 -7
data/lib/tanakai/base_helper.rb +3 -3
data/lib/tanakai/browser_builder/selenium_chrome_builder.rb +1 -1
data/lib/tanakai/capybara_ext/mechanize/driver.rb +1 -1
data/lib/tanakai/template/Gemfile +1 -1
data/lib/tanakai/version.rb +1 -1
data/tanakai.gemspec +5 -4
metadata +34 -13

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '0363335680ba18ca855d2413e4efdce1957decab5f31c954e2b04f4f91660ac6'
-  data.tar.gz: 3fee8a56e284ef3bae724d1ffe3cc7f1614ad374efd59d60a002a7f71353cf06
+  metadata.gz: 7ea3cd20cfaedaebf473e853b66ebe58958e89b7525246444e3c8aeef46a4bf0
+  data.tar.gz: a2c51b86487d6392a58b533237731996639fe0037c9aca22a6140c3c968eaf7d
 SHA512:
-  metadata.gz: fabeeb2270349d0961294de34abe055906c38477cd4f744da9e033c626939e2672b86b30053a3d8c89bea1889f6392370e1b58296a0327f1a891d4915132478c
-  data.tar.gz: 8e34927825ef45893de6e00c676621823b4d3ca7c28210c73720409000c04bb228b1ea0dd75293144afd1dbff6f4f370ca0312847d781895434896374bb77f7b
+  metadata.gz: 52d9a730a0a9e08c0a49ee4177a0370f5ed2a12ac9e3925f0a83b0c232dcedb1645d1b6860cb19c8453bbc5777cec02403654e2282e57ad75c5c2cb898b6dc1b
+  data.tar.gz: '0969ee651ec787b9fa1e47b8d776571b6f4751c29d3dd15bb0c696181ceab8bc826db6f54486df6426151f25f626f4725bf79c51ec8cc8cebebbe6cfa057bfa3'
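
Note on the hunk above: both archives inside the gem were rebuilt, so all four digests change. As a rough sketch of how a digest from `checksums.yaml` could be checked against a file, the following uses Ruby's standard `digest` library. The helper name `digest_matches?` and the empty-string sample are illustrative assumptions, not part of tanakai or RubyGems.

```ruby
require "digest"

# Hypothetical helper (not part of the gem): true when the SHA256 digest of
# `bytes` equals the expected hex string, ignoring letter case.
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Sample input only: the SHA256 digest of the empty string is a well-known constant.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

puts digest_matches?("", EMPTY_SHA256)
puts digest_matches?("tampered", EMPTY_SHA256)
```

For a real check, the bytes would come from the unpacked archive (for example `File.binread("data.tar.gz")`) and the expected value from the `SHA256:` entry above.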

data/.gitignore CHANGED

@@ -10,3 +10,4 @@ Gemfile.lock
 *.retry
 .tags*
+*.gem

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--format documentation
+--color
+--require spec_helper

data/CHANGELOG.md CHANGED

@@ -1,5 +1,13 @@
 # CHANGELOG
+## 1.6.0
+### New
+* Add support to Ruby 3
+## 1.5.1
+### New
+* Add `response_type` to `in_parallel`
 ## 1.5.0
 ### New
 * First release as Tanakai

data/README.md CHANGED

@@ -1,8 +1,19 @@
-# Tanakai
+# 🕷 Tanakai
-Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
+<sub>[Liphistius tanakai](https://wsc.nmbe.ch/species/58479/Liphistius_tanakai)</sub>
-Tanakai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
+Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of the box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
+### Goals of this fork:
+- [x] add support to [Apparition](https://github.com/twalpole/apparition) and [Cuprite](https://github.com/rubycdp/cuprite)
+- [x] add support to Ruby 3
+- [ ] write tests with RSpec
+- [ ] improve configuration options for Apparition and Cuprite (both have been recently added)
+- [ ] create an awesome logo in the likes of [this](https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png)
+- [ ] have you as new contributor
+Tanakai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
 ```ruby
 # github_spider.rb
@@ -128,7 +139,7 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider:
 ```
 </details><br>
-Okay, that was easy. How about javascript rendered websites with dynamic HTML? Lets scrape a page with infinite scroll:
+Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
 ```ruby
 # infinite_scroll_spider.rb
@@ -190,20 +201,20 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scrol
 ## Features
-* Scrape javascript rendered websites out of box
+* Scrape JavaScript rendered websites out of the box
 * Supported engines: [Apparition](https://github.com/twalpole/apparition), [Cuprite](https://github.com/rubycdp/cuprite), [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
 * Write spider code once, and use it with any supported engine later
 * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
 * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
-* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
+* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
 * Automatically [handle requests errors](#handle-request-errors)
 * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
 * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
 * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
 * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
 * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
-* Automated [server environment setup](#setup) (for ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
-* Command-line [runner](#runner) to run all project spiders one by one or in parallel
+* Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
+* Command-line [runner](#runner) to run all project spiders one-by-one or in parallel
 ## Table of Contents
 * [Tanakai](#tanakai)
@@ -221,7 +232,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scrol
     * [Skip duplicates](#skip-duplicates)
       * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
       * [Storage object](#storage-object)
-    * [Handle request errors](#handle-request-errors)
+    * [Handling request errors](#handling-request-errors)
       * [skip_request_errors](#skip_request_errors)
       * [retry_request_errors](#retry_request_errors)
     * [Logging custom events](#logging-custom-events)
@@ -231,7 +242,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scrol
     * [Active Support included](#active-support-included)
     * [Schedule spiders using Cron](#schedule-spiders-using-cron)
     * [Configuration options](#configuration-options)
-    * [Using Tanakai inside existing Ruby application](#using-tanakai-inside-existing-ruby-application)
+    * [Using Tanakai inside existing Ruby applications](#using-tanakai-inside-existing-ruby-applications)
       * [crawl! method](#crawl-method)
       * [parse! method](#parsemethod_name-url-method)
       * [Tanakai.list and Tanakai.find_by_name](#tanakailist-and-tanakaifind_by_name)
@@ -256,7 +267,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scrol
 ## Installation
 Tanakai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
-1) If your system doesn't have appropriate Ruby version, install it:
+1) If your system doesn't have the appropriate Ruby version, install it:
 <details/>
   <summary>Ubuntu 18.04</summary>
@@ -288,7 +299,7 @@ gem install bundler
   <summary>Mac OS X</summary>
 ```bash
-# Install homebrew if you don't have it https://brew.sh/
+# Install Homebrew if you don't have it https://brew.sh/
 # Install rbenv and ruby-build:
 brew install rbenv ruby-build
@@ -317,7 +328,7 @@ $ tanakai setup localhost --local --ask-sudo
 ```
 It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/tanakai/automation).
-If you chose automatic installation, you can skip following and go to "Getting To Know" part. In case if you want to install everything manually:
+If you chose automatic installation, you can skip the rest of this section and go to ["Getting to Know"](#getting-to-know) part. In case if you want to install everything manually:
 ```bash
 # Install basic tools
@@ -330,19 +341,19 @@ sudo apt install -q -y xvfb
 sudo apt install -q -y chromium-browser firefox
 # Instal chromedriver (2.44 version)
-# All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
+# All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads
 cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
 sudo unzip chromedriver_linux64.zip -d /usr/local/bin
 rm -f chromedriver_linux64.zip
 # Install geckodriver (0.23.0 version)
-# All versions located here https://github.com/mozilla/geckodriver/releases/
+# All versions are located here: https://github.com/mozilla/geckodriver/releases/
 cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
 sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
 rm -f geckodriver-v0.23.0-linux64.tar.gz
 # Install PhantomJS (2.1.1)
-# All versions located here http://phantomjs.org/download.html
+# All versions are located here: http://phantomjs.org/download.html
 sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
 cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
 tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
@@ -371,7 +382,7 @@ brew install phantomjs
 ```
 </details><br>
-Also, if you want to save scraped items to the database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
+Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
 <details/>
   <summary>Ubuntu 18.04</summary>
@@ -390,7 +401,7 @@ sudo apt install -q -y postgresql-client libpq-dev
 sudo apt install -q -y mongodb-clients
 ```
-But if you want to save items to a local database, database server required as well:
+But if you want to save items to a local database, a database server is required as well:
 ```bash
 # Install MySQL client and server
 sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
@@ -434,7 +445,7 @@ brew install mongodb
 ## Getting to Know
 ### Interactive console
-Before you get to know all Tanakai features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
+Before you get to know all of Tanakai's features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
 ```bash
 $ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
@@ -499,25 +510,25 @@ $
 ```
 </details><br>
-CLI options:
+CLI arguments:
 * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
-* `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
+* `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
 ### Available engines
-Tanakai has support for following engines and mostly can switch between them without need to rewrite any code:
+Tanakai has support for the following engines and can mostly switch between them without the need to rewrite any code:
 * `:apparition` - a Chrome driver for Capybara via [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). It started as a fork of Poltergeist and attempts to maintain as much compatibility with the Poltergeist API as possible.
 * `:cuprite` - a pure Ruby driver for Capybara. It allows you to run Capybara tests on a headless Chrome or Chromium. Under the hood it uses [Ferrum](https://github.com/rubycdp/ferrum#index) which is high-level API to the browser by [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). The design of the driver is as close to Poltergeist as possible though it's not a goal.
-* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render javascript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use javascript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
-* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
-* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper javascript rendering.
+* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
+* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage issues, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
+* `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper JavaScript rendering.
 * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
-**Tip:** add `HEADLESS=false` ENV variable before command (`$ HEADLESS=false ruby spider.rb`) to run browser in normal (not headless) mode and see it's window (only for selenium-like engines). It works for [console](#interactive-console) command as well.
+**Tip:** prepend a `HEADLESS=false` environment variable on the command line (`$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
 ### Minimum required spider structure
-> You can manually create a spider file, or use generator instead: `$ tanakai generate spider simple_spider`
+> You can manually create a spider file, or use the generate command instead: `$ tanakai generate spider simple_spider`
 ```ruby
 require 'tanakai'
@@ -535,10 +546,10 @@ SimpleSpider.crawl!
 ```
 Where:
-* `@name` name of a spider. You can omit name if use single-file spider
-* `@engine` engine for a spider
-* `@start_urls` array of start urls to process one by one inside `parse` method
-* Method `parse` is the start method, should be always present in spider class
+* `@name`: name of a spider. You can omit name if use single-file spider
+* `@engine`: engine for a spider
+* `@start_urls`: array of start urls to process one by one inside `parse` method
+* The `parse` method is the entry point, and should always be present in a spider class
 ### Method arguments `response`, `url` and `data`
@@ -548,9 +559,9 @@ def parse(response, url:, data: {})
 end
 ```
-* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
-* `url` (String) url of a processed webpage
-* `data` (Hash) uses to pass data between requests
+* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object): contains parsed HTML code of a processed webpage
+* `url` (String): url of a processed webpage
+* `data` (Hash): uses to pass data between requests
 <details/>
   <summary><strong>Example how to use <code>data</code></strong></summary>
@@ -574,7 +585,7 @@ class ProductsSpider < Tanakai::Base
   def parse_product(response, url:, data: {})
     item = {}
-    # Assign item's category_name from data[:category_name]
+    # Assign an item's category_name from data[:category_name]
     item[:category_name] = data[:category_name]
     # ...
@@ -592,7 +603,7 @@ end
 ### `browser` object
-From any spider instance method there is available `browser` object, which is [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
+A browser object is available from any spider instance method, which is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses it to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
 But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
@@ -606,7 +617,7 @@ class GoogleSpider < Tanakai::Base
     browser.fill_in "q", with: "Tanakai web scraping framework"
     browser.click_button "Google Search"
-    # Update response to current response after interaction with a browser
+    # Update response with current_response after interaction with a browser
     response = browser.current_response
     # Collect results
@@ -626,7 +637,7 @@ Check out **Capybara cheat sheets** where you can see all available methods **to
 ### `request_to` method
-For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it). Example:
+For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it) and `response_type` (defaults to `:html`). Example:
 ```ruby
 class Spider < Tanakai::Base
@@ -635,11 +646,12 @@ class Spider < Tanakai::Base
   def parse(response, url:, data: {})
     # Process request to `parse_product` method with `https://example.com/some_product` url:
-    request_to :parse_product, url: "https://example.com/some_product"
+    request_to :parse_product, url: "https://example.com/some_product.json", response_type: :json
   end
   def parse_product(response, url:, data: {})
-    puts "From page https://example.com/some_product !"
+    puts "JSON parsed from page https://example.com/some_product.json"
+    puts response
   end
 end
 ```
@@ -654,7 +666,7 @@ def request_to(handler, url:, data: {})
   request_data = { url: url, data: data }
   browser.visit(url)
-  public_send(handler, browser.current_response, request_data)
+  public_send(handler, browser.current_response, **request_data)
 end
 ```
 </details><br>
@@ -719,7 +731,7 @@ By default `save_to` add position key to an item hash. You can disable it with `
 **How helper works:**
-Until spider stops, each new item will be appended to a file. At the next run, helper will clear the content of a file first, and then start again appending items to it.
+While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.
 > If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
@@ -801,7 +813,7 @@ It is possible to automatically skip all already visited urls while calling `req
 * `#clear!` - reset the whole storage by deleting all values from all scopes.
-### Handle request errors
+### Handling request errors
 It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Tanakai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
 #### skip_request_errors
@@ -1194,6 +1206,7 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider:
 * `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # => 3`
 * `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
 * `config:` pass custom options to config (see [config section](#crawler-config))
+* `response_type:` response should be returned as `:html` or `:json`, defaults to `:html`
 ### Active Support included
@@ -1305,7 +1318,7 @@ Tanakai.configure do |config|
 end
 ```
-### Using Tanakai inside existing Ruby application
+### Using Tanakai inside existing Ruby applications
 You can integrate Tanakai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:
@@ -1420,7 +1433,7 @@ Example:
 $ tanakai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
 ```
-CLI options:
+CLI arguments:
 * `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
 * `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
 * `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
@@ -1440,7 +1453,7 @@ Example:
 $ tanakai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
 ```
-CLI options: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
+CLI arguments: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
 * `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
 * `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
@@ -2030,9 +2043,14 @@ $ bundle exec tanakai runner --exclude github_spider
 You can perform custom actions before runner starts and after runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/tanakai/template/config/application.rb) to see example.
+## Testing
+To run tests:
+```bash
+bundle exec rspec
+```
 ## Chat Support and Feedback
-Will be updated
+Submit an issue on GitHub and we'll try to address it in a timely manner.
 ## License
-The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

data/lib/tanakai/base.rb CHANGED

@@ -1,5 +1,6 @@
 require_relative 'base/saver'
 require_relative 'base/storage'
+require 'addressable/uri'
 module Tanakai
   class Base
@@ -201,7 +202,7 @@ module Tanakai
       visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
       return unless visited
-      public_send(handler, browser.current_response(response_type), { url: url, data: data })
+      public_send(handler, browser.current_response(response_type), **{ url: url, data: data })
     end
     def console(response = nil, url: nil, data: {})
@@ -224,9 +225,9 @@ module Tanakai
       @savers[path] ||= begin
         options = { format: format, position: position, append: append }
         if self.with_info
-          self.class.savers[path] ||= Saver.new(path, options)
+          self.class.savers[path] ||= Saver.new(path, **options)
         else
-          Saver.new(path, options)
+          Saver.new(path, **options)
         end
       end
@@ -286,7 +287,7 @@ module Tanakai
       end
     end
-    def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {})
+    def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {}, response_type: :html)
       parts = urls.in_sorted_groups(threads, false)
       urls_count = urls.size
@@ -304,12 +305,12 @@ module Tanakai
           part.each do |url_data|
             if url_data.class == Hash
               if url_data[:url].present? && url_data[:data].present?
-                spider.request_to(handler, delay, url_data)
+                spider.request_to(handler, delay, **{ **url_data, response_type: response_type })
               else
-                spider.public_send(handler, url_data)
+                spider.public_send(handler, **url_data)
               end
             else
-              spider.request_to(handler, delay, url: url_data, data: data)
+              spider.request_to(handler, delay, url: url_data, data: data, response_type: response_type)
             end
           end
         ensure
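
Note on the hunks above: most of the `**` additions in this release (here, in `Saver.new`, `Chrome::Options.new` and `Mechanize::Cookie.new`) follow from Ruby 3 separating positional and keyword arguments. The sketch below illustrates that rule in isolation. `parse_product` is a simplified stand-in for a spider handler, and the URL and data values are made up for the example.

```ruby
# Simplified stand-in for a spider handler (same signature shape as in the README).
def parse_product(response, url:, data: {})
  "#{response} | #{url} | #{data[:category]}"
end

# Illustrative values only.
REQUEST_DATA = { url: "https://example.com/p", data: { category: "books" } }.freeze

# On Ruby 3.0+ a trailing positional Hash is no longer converted to keywords,
# so this call raises ArgumentError. On Ruby 2.7 it only emits a warning.
begin
  parse_product("<html>", REQUEST_DATA)
rescue ArgumentError => e
  puts "positional hash rejected: #{e.class}"
end

# The double splat passes the Hash entries as keyword arguments on both versions.
puts parse_product("<html>", **REQUEST_DATA)

# The same operator merges extra keys into a new Hash, as `in_parallel` now does
# when it forwards `response_type`.
MERGED = { **REQUEST_DATA, response_type: :json }.freeze
puts MERGED.keys.inspect
```

The splat form is used here because it behaves the same on Ruby 2.5 through 3.x, which matches the `ruby '>= 2.5'` requirement the template Gemfile keeps.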

data/lib/tanakai/base_helper.rb CHANGED

@@ -4,13 +4,13 @@ module Tanakai
     def absolute_url(url, base:)
       return unless url
-      URI.join(base, URI.escape(url)).to_s
+      Addressable::URI.join(base, Addressable::URI.escape(url)).to_s
     end
     def escape_url(url)
-      uri = URI.parse(url)
+      uri = Addressable::URI.parse(url)
     rescue URI::InvalidURIError => e
-      URI.parse(URI.escape url).to_s rescue url
+      Addressable::URI.parse(Addressable::URI.escape url).to_s rescue url
     else
       url
     end

data/lib/tanakai/browser_builder/selenium_chrome_builder.rb CHANGED

@@ -30,7 +30,7 @@ module Tanakai::BrowserBuilder
         end
         # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
-        driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
+        driver_options = Selenium::WebDriver::Chrome::Options.new(**opts)
         # Window size
         if size = @config[:window_size].presence

data/lib/tanakai/capybara_ext/mechanize/driver.rb CHANGED

@@ -34,7 +34,7 @@ class Capybara::Mechanize::Driver
     options[:name]  ||= name
     options[:value] ||= value
-    cookie = Mechanize::Cookie.new(options.merge path: "/")
+    cookie = Mechanize::Cookie.new(**options.merge(path: "/"))
     browser.agent.cookie_jar << cookie
   end

data/lib/tanakai/template/Gemfile CHANGED

@@ -4,7 +4,7 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }
 ruby '>= 2.5'
 # Framework
-gem 'tanakai'
+gem 'tanakai', '~> 1.5'
 # Require files in directory and child directories recursively
 gem 'require_all'

data/lib/tanakai/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Tanakai
-  VERSION = "1.5.0"
+  VERSION = "1.6.0"
 end

data/tanakai.gemspec CHANGED

@@ -39,12 +39,13 @@ Gem::Specification.new do |spec|
   spec.add_dependency "headless"
   spec.add_dependency "pmap"
+  spec.add_dependency "addressable"
   spec.add_dependency "whenever"
-  spec.add_dependency "rbcat", "~> 0.2"
-  spec.add_dependency "pry"
+  spec.add_dependency "rbcat", ">= 0.2.2", "< 0.3"
+  spec.add_dependency "pry-nav"
-  spec.add_development_dependency "bundler", "~> 1.16"
+  spec.add_development_dependency "bundler", "~> 2"
   spec.add_development_dependency "rake", "~> 10.0"
-  spec.add_development_dependency "minitest", "~> 5.0"
+  spec.add_development_dependency "rspec", "~> 3"
 end
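
Note on the hunk above: the `rbcat` constraint is narrowed, not just reworded. `~> 0.2` accepts any version from 0.2 up to, but not including, 1.0, while `>= 0.2.2, < 0.3` stays within the 0.2.x series and excludes 0.2.0 and 0.2.1. A small sketch with RubyGems' own `Gem::Requirement` shows the difference. The candidate version numbers are chosen for illustration and are not claimed to be published `rbcat` releases.

```ruby
require "rubygems"

# Constraints copied from the gemspec diff above.
OLD_RBCAT = Gem::Requirement.new("~> 0.2")
NEW_RBCAT = Gem::Requirement.new(">= 0.2.2", "< 0.3")

# Illustrative candidate versions only.
%w[0.2.1 0.2.2 0.5.0].each do |candidate|
  version = Gem::Version.new(candidate)
  old_ok = OLD_RBCAT.satisfied_by?(version)
  new_ok = NEW_RBCAT.satisfied_by?(version)
  puts "#{candidate}: old=#{old_ok} new=#{new_ok}"
end
```

The same reading applies to the other constraints touched in this release: `'~> 1.5'` in the template Gemfile allows 1.5 and later 1.x versions, and `"~> 2"` for bundler allows any 2.x version.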

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tanakai
 version: !ruby/object:Gem::Version
-  version: 1.5.0
+  version: 1.6.0
 platform: ruby
 authors:
 - Victor Afanasev
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-08-13 00:00:00.000000000 Z
+date: 2023-02-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: thor
@@ -199,6 +199,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: addressable
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: whenever
   requirement: !ruby/object:Gem::Requirement
@@ -217,18 +231,24 @@ dependencies:
   name: rbcat
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "~>"
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.2.2
+    - - "<"
       - !ruby/object:Gem::Version
-        version: '0.2'
+        version: '0.3'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "~>"
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.2.2
+    - - "<"
       - !ruby/object:Gem::Version
-        version: '0.2'
+        version: '0.3'
 - !ruby/object:Gem::Dependency
-  name: pry
+  name: pry-nav
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -247,14 +267,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.16'
+        version: '2'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.16'
+        version: '2'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -270,19 +290,19 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '10.0'
 - !ruby/object:Gem::Dependency
-  name: minitest
+  name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.0'
+        version: '3'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.0'
+        version: '3'
 description:
 email:
 - vicfreefly@gmail.com
@@ -292,6 +312,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - ".gitignore"
+- ".rspec"
 - ".travis.yml"
 - CHANGELOG.md
 - Gemfile
@@ -374,7 +395,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.2
+rubygems_version: 3.2.15
 signing_key:
 specification_version: 4
 summary: Maintained fork of Kimurai, a modern web scraping framework written in Ruby