tanakai 1.5.0 → 1.6.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '0363335680ba18ca855d2413e4efdce1957decab5f31c954e2b04f4f91660ac6'
- data.tar.gz: 3fee8a56e284ef3bae724d1ffe3cc7f1614ad374efd59d60a002a7f71353cf06
+ metadata.gz: 7ea3cd20cfaedaebf473e853b66ebe58958e89b7525246444e3c8aeef46a4bf0
+ data.tar.gz: a2c51b86487d6392a58b533237731996639fe0037c9aca22a6140c3c968eaf7d
  SHA512:
- metadata.gz: fabeeb2270349d0961294de34abe055906c38477cd4f744da9e033c626939e2672b86b30053a3d8c89bea1889f6392370e1b58296a0327f1a891d4915132478c
- data.tar.gz: 8e34927825ef45893de6e00c676621823b4d3ca7c28210c73720409000c04bb228b1ea0dd75293144afd1dbff6f4f370ca0312847d781895434896374bb77f7b
+ metadata.gz: 52d9a730a0a9e08c0a49ee4177a0370f5ed2a12ac9e3925f0a83b0c232dcedb1645d1b6860cb19c8453bbc5777cec02403654e2282e57ad75c5c2cb898b6dc1b
+ data.tar.gz: '0969ee651ec787b9fa1e47b8d776571b6f4751c29d3dd15bb0c696181ceab8bc826db6f54486df6426151f25f626f4725bf79c51ec8cc8cebebbe6cfa057bfa3'
data/.gitignore CHANGED
@@ -10,3 +10,4 @@ Gemfile.lock

  *.retry
  .tags*
+ *.gem
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
  # CHANGELOG

+ ## 1.6.0
+ ### New
+ * Add support for Ruby 3
+
+ ## 1.5.1
+ ### New
+ * Add `response_type` to `in_parallel`
+
  ## 1.5.0
  ### New
  * First release as Tanakai
data/README.md CHANGED
@@ -1,8 +1,19 @@
- # Tanakai
+ # 🕷 Tanakai

- Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
+ <sub>[Liphistius tanakai](https://wsc.nmbe.ch/species/58479/Liphistius_tanakai)</sub>

- Tanakai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
+ Tanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of the box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests, and **allows you to scrape and interact with JavaScript rendered websites.**
+
+ ### Goals of this fork:
+
+ - [x] add support for [Apparition](https://github.com/twalpole/apparition) and [Cuprite](https://github.com/rubycdp/cuprite)
+ - [x] add support for Ruby 3
+ - [ ] write tests with RSpec
+ - [ ] improve configuration options for Apparition and Cuprite (both have been recently added)
+ - [ ] create an awesome logo along the lines of [this](https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png)
+ - [ ] have you as a new contributor
+
+ Tanakai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:

  ```ruby
  # github_spider.rb
@@ -128,7 +139,7 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
  ```
  </details><br>

- Okay, that was easy. How about javascript rendered websites with dynamic HTML? Lets scrape a page with infinite scroll:
+ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:

  ```ruby
  # infinite_scroll_spider.rb
@@ -190,20 +201,20 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol


  ## Features
- * Scrape javascript rendered websites out of box
+ * Scrape JavaScript rendered websites out of the box
  * Supported engines: [Apparition](https://github.com/twalpole/apparition), [Cuprite](https://github.com/rubycdp/cuprite), [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
  * Write spider code once, and use it with any supported engine later
  * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
  * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
- * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
  * Automatically [handle requests errors](#handle-request-errors)
  * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
  * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
  * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
  * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
  * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
- * Automated [server environment setup](#setup) (for ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
- * Command-line [runner](#runner) to run all project spiders one by one or in parallel
+ * Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
+ * Command-line [runner](#runner) to run all project spiders one-by-one or in parallel

  ## Table of Contents
  * [Tanakai](#tanakai)
@@ -221,7 +232,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Skip duplicates](#skip-duplicates)
  * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
  * [Storage object](#storage-object)
- * [Handle request errors](#handle-request-errors)
+ * [Handling request errors](#handling-request-errors)
  * [skip_request_errors](#skip_request_errors)
  * [retry_request_errors](#retry_request_errors)
  * [Logging custom events](#logging-custom-events)
@@ -231,7 +242,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Active Support included](#active-support-included)
  * [Schedule spiders using Cron](#schedule-spiders-using-cron)
  * [Configuration options](#configuration-options)
- * [Using Tanakai inside existing Ruby application](#using-tanakai-inside-existing-ruby-application)
+ * [Using Tanakai inside existing Ruby applications](#using-tanakai-inside-existing-ruby-applications)
  * [crawl! method](#crawl-method)
  * [parse! method](#parsemethod_name-url-method)
  * [Tanakai.list and Tanakai.find_by_name](#tanakailist-and-tanakaifind_by_name)
@@ -256,7 +267,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  ## Installation
  Tanakai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.

- 1) If your system doesn't have appropriate Ruby version, install it:
+ 1) If your system doesn't have the appropriate Ruby version, install it:

  <details/>
  <summary>Ubuntu 18.04</summary>
@@ -288,7 +299,7 @@ gem install bundler
  <summary>Mac OS X</summary>

  ```bash
- # Install homebrew if you don't have it https://brew.sh/
+ # Install Homebrew if you don't have it https://brew.sh/
  # Install rbenv and ruby-build:
  brew install rbenv ruby-build

@@ -317,7 +328,7 @@ $ tanakai setup localhost --local --ask-sudo
  ```
  It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/tanakai/automation).

- If you chose automatic installation, you can skip following and go to "Getting To Know" part. In case if you want to install everything manually:
+ If you chose automatic installation, you can skip the rest of this section and go to the ["Getting to Know"](#getting-to-know) section. If you want to install everything manually:

  ```bash
  # Install basic tools
@@ -330,19 +341,19 @@ sudo apt install -q -y xvfb
  sudo apt install -q -y chromium-browser firefox

  # Instal chromedriver (2.44 version)
- # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
+ # All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads
  cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
  sudo unzip chromedriver_linux64.zip -d /usr/local/bin
  rm -f chromedriver_linux64.zip

  # Install geckodriver (0.23.0 version)
- # All versions located here https://github.com/mozilla/geckodriver/releases/
+ # All versions are located here: https://github.com/mozilla/geckodriver/releases/
  cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
  sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
  rm -f geckodriver-v0.23.0-linux64.tar.gz

  # Install PhantomJS (2.1.1)
- # All versions located here http://phantomjs.org/download.html
+ # All versions are located here: http://phantomjs.org/download.html
  sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
  cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
  tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
@@ -371,7 +382,7 @@ brew install phantomjs
  ```
  </details><br>

- Also, if you want to save scraped items to the database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
+ Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:

  <details/>
  <summary>Ubuntu 18.04</summary>
@@ -390,7 +401,7 @@ sudo apt install -q -y postgresql-client libpq-dev
  sudo apt install -q -y mongodb-clients
  ```

- But if you want to save items to a local database, database server required as well:
+ But if you want to save items to a local database, a database server is required as well:
  ```bash
  # Install MySQL client and server
  sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
@@ -434,7 +445,7 @@ brew install mongodb

  ## Getting to Know
  ### Interactive console
- Before you get to know all Tanakai features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
+ Before you get to know all of Tanakai's features, there is the `$ tanakai console` command: an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).

  ```bash
  $ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
@@ -499,25 +510,25 @@ $
  ```
  </details><br>

- CLI options:
+ CLI arguments:
  * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
- * `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
+ * `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use the [browser](#browser-object) object to navigate to any webpage).

  ### Available engines
- Tanakai has support for following engines and mostly can switch between them without need to rewrite any code:
+ Tanakai has support for the following engines and can mostly switch between them without the need to rewrite any code:

  * `:apparition` - a Chrome driver for Capybara via [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). It started as a fork of Poltergeist and attempts to maintain as much compatibility with the Poltergeist API as possible.
  * `:cuprite` - a pure Ruby driver for Capybara. It allows you to run Capybara tests on a headless Chrome or Chromium. Under the hood it uses [Ferrum](https://github.com/rubycdp/ferrum#index) which is high-level API to the browser by [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). The design of the driver is as close to Poltergeist as possible though it's not a goal.
- * `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render javascript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use javascript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
- * `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Tanakai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
- * `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper javascript rendering.
+ * `:mechanize` - a [pure Ruby fake HTTP browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML of a page. Because of this, mechanize is much faster, takes much less memory, and is in general much more stable than any real browser. Use mechanize when you can, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
+ * `:poltergeist_phantomjs` - the [PhantomJS headless browser](https://github.com/ariya/phantomjs); it can render JavaScript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage issues, but Tanakai has a [memory control feature](#crawler-config), so you shouldn't consider it a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike the selenium engines), `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
+ * `:selenium_chrome` Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
  * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.

- **Tip:** add `HEADLESS=false` ENV variable before command (`$ HEADLESS=false ruby spider.rb`) to run browser in normal (not headless) mode and see it's window (only for selenium-like engines). It works for [console](#interactive-console) command as well.
+ **Tip:** prepend a `HEADLESS=false` environment variable on the command line (`$ HEADLESS=false ruby spider.rb`) to launch the browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.


  ### Minimum required spider structure
- > You can manually create a spider file, or use generator instead: `$ tanakai generate spider simple_spider`
+ > You can manually create a spider file, or use the generate command instead: `$ tanakai generate spider simple_spider`

  ```ruby
  require 'tanakai'
@@ -535,10 +546,10 @@ SimpleSpider.crawl!
  ```

  Where:
- * `@name` name of a spider. You can omit name if use single-file spider
- * `@engine` engine for a spider
- * `@start_urls` array of start urls to process one by one inside `parse` method
- * Method `parse` is the start method, should be always present in spider class
+ * `@name`: name of the spider. You can omit the name for a single-file spider
+ * `@engine`: engine for the spider
+ * `@start_urls`: array of start urls to process one by one inside the `parse` method
+ * The `parse` method is the entry point, and should always be present in a spider class


  ### Method arguments `response`, `url` and `data`
@@ -548,9 +559,9 @@ def parse(response, url:, data: {})
  end
  ```

- * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
- * `url` (String) url of a processed webpage
- * `data` (Hash) uses to pass data between requests
+ * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object): contains the parsed HTML of the processed webpage
+ * `url` (String): url of the processed webpage
+ * `data` (Hash): used to pass data between requests

  <details/>
  <summary><strong>Example how to use <code>data</code></strong></summary>
@@ -574,7 +585,7 @@ class ProductsSpider < Tanakai::Base

  def parse_product(response, url:, data: {})
  item = {}
- # Assign item's category_name from data[:category_name]
+ # Assign an item's category_name from data[:category_name]
  item[:category_name] = data[:category_name]

  # ...
@@ -592,7 +603,7 @@ end

  ### `browser` object

- From any spider instance method there is available `browser` object, which is [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
+ A `browser` object is available from any spider instance method. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and is used to process requests and get the page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains the page response after it was loaded.

  But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:

@@ -606,7 +617,7 @@ class GoogleSpider < Tanakai::Base
  browser.fill_in "q", with: "Tanakai web scraping framework"
  browser.click_button "Google Search"

- # Update response to current response after interaction with a browser
+ # Update response with current_response after interaction with a browser
  response = browser.current_response

  # Collect results
@@ -626,7 +637,7 @@ Check out **Capybara cheat sheets** where you can see all available methods **to

  ### `request_to` method

- For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it). Example:
+ To make a request and hand the response to a particular method, there is `request_to`. It requires a minimum of two arguments: `:method_name` and `url:`. Optional arguments are `data:` (see above for what it is used for) and `response_type:` (defaults to `:html`). Example:

  ```ruby
  class Spider < Tanakai::Base
@@ -635,11 +646,12 @@ class Spider < Tanakai::Base

  def parse(response, url:, data: {})
  # Process request to `parse_product` method with `https://example.com/some_product` url:
- request_to :parse_product, url: "https://example.com/some_product"
+ request_to :parse_product, url: "https://example.com/some_product.json", response_type: :json
  end

  def parse_product(response, url:, data: {})
- puts "From page https://example.com/some_product !"
+ puts "JSON parsed from page https://example.com/some_product.json"
+ puts response
  end
  end
  ```
@@ -654,7 +666,7 @@ def request_to(handler, url:, data: {})
  request_data = { url: url, data: data }

  browser.visit(url)
- public_send(handler, browser.current_response, request_data)
+ public_send(handler, browser.current_response, **request_data)
  end
  ```
  </details><br>
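The `**request_data` change in the pseudocode above reflects Ruby 3's strict separation of positional and keyword arguments: a plain hash is no longer implicitly converted into keywords. A minimal stdlib-only sketch of the difference (handler name and values are illustrative, not Tanakai internals):

```ruby
# A handler with keyword parameters, like a spider's parse_* methods
def handler(response, url:, data: {})
  { response: response, url: url, data: data }
end

request_data = { url: "https://example.com", data: { page: 1 } }

# In Ruby 2.x, `handler("<html>", request_data)` worked via implicit
# hash-to-keywords conversion; in Ruby 3 it raises ArgumentError.
# The double splat passes the hash explicitly as keyword arguments:
result = handler("<html>", **request_data)
```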
@@ -719,7 +731,7 @@ By default `save_to` add position key to an item hash. You can disable it with `

  **How helper works:**

- Until spider stops, each new item will be appended to a file. At the next run, helper will clear the content of a file first, and then start again appending items to it.
+ While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.

  > If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
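The clear-then-append behavior described above maps onto Ruby's standard file modes; a stdlib-only sketch of the idea (simplified, not Tanakai's actual `Saver` implementation):

```ruby
require "json"
require "tmpdir"

# The first write of a run truncates the file ("w"); later writes append ("a"),
# mirroring how the helper clears the file once per run and then appends items.
def save_item(path, item, first_write:)
  File.open(path, first_write ? "w" : "a") { |f| f.puts(item.to_json) }
end

path = File.join(Dir.tmpdir, "scraped_products.jsonl")
save_item(path, { id: 1 }, first_write: true)  # clears leftovers from a previous run
save_item(path, { id: 2 }, first_write: false) # appends to the current run's file
```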
 
@@ -801,7 +813,7 @@ It is possible to automatically skip all already visited urls while calling `req
  * `#clear!` - reset the whole storage by deleting all values from all scopes.


- ### Handle request errors
+ ### Handling request errors
  It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Tanakai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:

  #### skip_request_errors
@@ -1194,6 +1206,7 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
  * `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # => 3`
  * `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
  * `config:` pass custom options to config (see [config section](#crawler-config))
+ * `response_type:` whether the response should be parsed as `:html` or `:json`; defaults to `:html`

  ### Active Support included

@@ -1305,7 +1318,7 @@ Tanakai.configure do |config|
  end
  ```

- ### Using Tanakai inside existing Ruby application
+ ### Using Tanakai inside existing Ruby applications

  You can integrate Tanakai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:

@@ -1420,7 +1433,7 @@ Example:
  $ tanakai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
  ```

- CLI options:
+ CLI arguments:
  * `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
  * `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
  * `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
@@ -1440,7 +1453,7 @@ Example:
  $ tanakai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
  ```

- CLI options: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
+ CLI arguments: _same as for the [setup](#setup) command_ (except `--ask-sudo`), plus
  * `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
  * `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)

@@ -2030,9 +2043,14 @@ $ bundle exec tanakai runner --exclude github_spider

  You can perform custom actions before runner starts and after runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/tanakai/template/config/application.rb) to see example.

+ ## Testing
+ To run tests:
+ ```bash
+ bundle exec rspec
+ ```

  ## Chat Support and Feedback
- Will be updated
+ Submit an issue on GitHub and we'll try to address it in a timely manner.

  ## License
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/lib/tanakai/base.rb CHANGED
@@ -1,5 +1,6 @@
  require_relative 'base/saver'
  require_relative 'base/storage'
+ require 'addressable/uri'

  module Tanakai
  class Base
@@ -201,7 +202,7 @@ module Tanakai
  visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
  return unless visited

- public_send(handler, browser.current_response(response_type), { url: url, data: data })
+ public_send(handler, browser.current_response(response_type), **{ url: url, data: data })
  end

  def console(response = nil, url: nil, data: {})
@@ -224,9 +225,9 @@ module Tanakai
  @savers[path] ||= begin
  options = { format: format, position: position, append: append }
  if self.with_info
- self.class.savers[path] ||= Saver.new(path, options)
+ self.class.savers[path] ||= Saver.new(path, **options)
  else
- Saver.new(path, options)
+ Saver.new(path, **options)
  end
  end

@@ -286,7 +287,7 @@ module Tanakai
  end
  end

- def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {})
+ def in_parallel(handler, urls, threads:, data: {}, delay: nil, engine: @engine, config: {}, response_type: :html)
  parts = urls.in_sorted_groups(threads, false)
  urls_count = urls.size

@@ -304,12 +305,12 @@ module Tanakai
  part.each do |url_data|
  if url_data.class == Hash
  if url_data[:url].present? && url_data[:data].present?
- spider.request_to(handler, delay, url_data)
+ spider.request_to(handler, delay, **{ **url_data, response_type: response_type })
  else
- spider.public_send(handler, url_data)
+ spider.public_send(handler, **url_data)
  end
  else
- spider.request_to(handler, delay, url: url_data, data: data)
+ spider.request_to(handler, delay, url: url_data, data: data, response_type: response_type)
  end
  end
  ensure
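For context on the hunk above: `in_parallel` splits the url list into groups and processes each group in its own spider thread, now threading `response_type:` through to `request_to`. A stdlib-only sketch of that distribution (simplified; `in_sorted_groups` is Tanakai's own helper, approximated here with `each_slice`):

```ruby
urls = (1..7).map { |i| "https://example.com/page#{i}" }
threads_count = 3

# Split urls into at most `threads_count` groups, one group per worker thread
parts = urls.each_slice((urls.size.to_f / threads_count).ceil).to_a

processed = Queue.new
workers = parts.map do |part|
  Thread.new do
    part.each { |url| processed << url } # a real spider would call request_to here
  end
end
workers.each(&:join)
```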
@@ -4,13 +4,13 @@ module Tanakai

  def absolute_url(url, base:)
  return unless url
- URI.join(base, URI.escape(url)).to_s
+ Addressable::URI.join(base, Addressable::URI.escape(url)).to_s
  end

  def escape_url(url)
- uri = URI.parse(url)
+ uri = Addressable::URI.parse(url)
  rescue URI::InvalidURIError => e
- URI.parse(URI.escape url).to_s rescue url
+ Addressable::URI.parse(Addressable::URI.escape url).to_s rescue url
  else
  url
  end
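Background for the hunk above: `URI.escape` was deprecated in Ruby 2.7 and removed in Ruby 3.0, which is why these helpers switch to the `addressable` gem. A stdlib-only sketch of the escaping behavior being preserved (illustrative; the actual code above uses `Addressable::URI`):

```ruby
require "uri"

# URI.escape is gone in Ruby 3, but URI::DEFAULT_PARSER.escape still
# percent-encodes unsafe characters the same way the old method did:
escaped = URI::DEFAULT_PARSER.escape("/path with spaces")
joined  = URI.join("https://example.com", escaped).to_s
```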
@@ -30,7 +30,7 @@ module Tanakai::BrowserBuilder
  end

  # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
- driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
+ driver_options = Selenium::WebDriver::Chrome::Options.new(**opts)

  # Window size
  if size = @config[:window_size].presence
@@ -34,7 +34,7 @@ class Capybara::Mechanize::Driver
  options[:name] ||= name
  options[:value] ||= value

- cookie = Mechanize::Cookie.new(options.merge path: "/")
+ cookie = Mechanize::Cookie.new(**options.merge(path: "/"))
  browser.agent.cookie_jar << cookie
  end

@@ -4,7 +4,7 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }
  ruby '>= 2.5'
 
  # Framework
- gem 'tanakai'
+ gem 'tanakai', '~> 1.5'
 
  # Require files in directory and child directories recursively
  gem 'require_all'
@@ -1,3 +1,3 @@
  module Tanakai
- VERSION = "1.5.0"
+ VERSION = "1.6.0"
  end
data/tanakai.gemspec CHANGED
@@ -39,12 +39,13 @@ Gem::Specification.new do |spec|
  spec.add_dependency "headless"
  spec.add_dependency "pmap"
 
+ spec.add_dependency "addressable"
  spec.add_dependency "whenever"
 
- spec.add_dependency "rbcat", "~> 0.2"
- spec.add_dependency "pry"
+ spec.add_dependency "rbcat", ">= 0.2.2", "< 0.3"
+ spec.add_dependency "pry-nav"
 
- spec.add_development_dependency "bundler", "~> 1.16"
+ spec.add_development_dependency "bundler", "~> 2"
  spec.add_development_dependency "rake", "~> 10.0"
- spec.add_development_dependency "minitest", "~> 5.0"
+ spec.add_development_dependency "rspec", "~> 3"
  end
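The version constraints tightened here can be checked with stdlib's `Gem::Requirement`; a quick sketch of what the new pins accept:

```ruby
require "rubygems"  # provides Gem::Requirement and Gem::Version (stdlib)

# "~> 1.5" is the pessimistic operator: >= 1.5 and < 2.0,
# so this 1.6.0 release still satisfies the Gemfile template's pin.
tanakai = Gem::Requirement.new("~> 1.5")
tanakai.satisfied_by?(Gem::Version.new("1.6.0"))  # => true
tanakai.satisfied_by?(Gem::Version.new("2.0.0"))  # => false

# rbcat moves from "~> 0.2" (>= 0.2, < 1.0) to an explicit range that
# requires at least 0.2.2 while staying below 0.3.
rbcat = Gem::Requirement.new(">= 0.2.2", "< 0.3")
rbcat.satisfied_by?(Gem::Version.new("0.2.2"))  # => true
rbcat.satisfied_by?(Gem::Version.new("0.3.0"))  # => false
```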
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tanakai
  version: !ruby/object:Gem::Version
- version: 1.5.0
+ version: 1.6.0
  platform: ruby
  authors:
  - Victor Afanasev
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-08-13 00:00:00.000000000 Z
+ date: 2023-02-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: thor
@@ -199,6 +199,20 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: addressable
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: whenever
  requirement: !ruby/object:Gem::Requirement
@@ -217,18 +231,24 @@ dependencies:
  name: rbcat
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 0.2.2
+ - - "<"
  - !ruby/object:Gem::Version
- version: '0.2'
+ version: '0.3'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 0.2.2
+ - - "<"
  - !ruby/object:Gem::Version
- version: '0.2'
+ version: '0.3'
  - !ruby/object:Gem::Dependency
- name: pry
+ name: pry-nav
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -247,14 +267,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.16'
+ version: '2'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.16'
+ version: '2'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -270,19 +290,19 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '10.0'
  - !ruby/object:Gem::Dependency
- name: minitest
+ name: rspec
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.0'
+ version: '3'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.0'
+ version: '3'
  description:
  email:
  - vicfreefly@gmail.com
@@ -292,6 +312,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".rspec"
  - ".travis.yml"
  - CHANGELOG.md
  - Gemfile
@@ -374,7 +395,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.2
+ rubygems_version: 3.2.15
  signing_key:
  specification_version: 4
  summary: Maintained fork of Kimurai, a modern web scraping framework written in Ruby