kimurai 1.0.1 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +56 -1
- data/README.md +183 -69
- data/kimurai.gemspec +1 -1
- data/lib/kimurai/base.rb +96 -36
- data/lib/kimurai/base/{simple_saver.rb → saver.rb} +25 -17
- data/lib/kimurai/base/storage.rb +91 -0
- data/lib/kimurai/browser_builder.rb +6 -0
- data/lib/kimurai/browser_builder/mechanize_builder.rb +22 -18
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +25 -20
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +21 -23
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +22 -18
- data/lib/kimurai/capybara_ext/mechanize/driver.rb +1 -1
- data/lib/kimurai/capybara_ext/session.rb +47 -7
- data/lib/kimurai/cli.rb +2 -1
- data/lib/kimurai/pipeline.rb +6 -2
- data/lib/kimurai/template/Gemfile +8 -0
- data/lib/kimurai/template/spiders/application_spider.rb +50 -35
- data/lib/kimurai/version.rb +1 -1
- metadata +5 -5
- data/lib/kimurai/base/uniq_checker.rb +0 -22
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 67ee49692e64813bc980eb7562b711d5b5d2c47b50a995acb4759709703da0f9
+  data.tar.gz: baba361bc5039d303ae4a6c9a1dd2109368f8e4c7a641d0a782cfc6a7776ade4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0173d3859b5f8776371fad454ff5575fdc453aa3c6038d8a8399651c46c5eaae789273772227ea014b6ce39b13586e6805bd7f69156eafeacf653804f954003c
+  data.tar.gz: b05889c0cb030aed06fe1df5cc5411154d24019667e1f00f9f4248d598fc93990f86a4aae78430af3140f3dc3989e856cfc3e2316f455984c898442fccad15db
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,60 @@
 # CHANGELOG
-##
+## 1.1.0
+### Breaking changes 1.1.0
+The `browser` config option is deprecated. All sub-options that used to sit inside `browser` should now be placed directly in the `@config` hash, without the `browser` parent key. Example:
+
+```ruby
+# Was:
+@config = {
+  browser: {
+    retry_request_errors: [Net::ReadTimeout],
+    restart_if: {
+      memory_limit: 350_000,
+      requests_limit: 100
+    },
+    before_request: {
+      change_proxy: true,
+      change_user_agent: true,
+      clear_cookies: true,
+      clear_and_set_cookies: true,
+      delay: 1..3
+    }
+  }
+}
+
+# Now:
+@config = {
+  retry_request_errors: [Net::ReadTimeout],
+  restart_if: {
+    memory_limit: 350_000,
+    requests_limit: 100
+  },
+  before_request: {
+    change_proxy: true,
+    change_user_agent: true,
+    clear_cookies: true,
+    clear_and_set_cookies: true,
+    delay: 1..3
+  }
+}
+```
+
+### New
+* Add `storage` object with additional methods and a persistence database feature
+* Add events feature to `run_info`
+* Add `skip_duplicate_requests` config option to automatically skip already visited urls when using `request_to`
+* Add `extensions` config option to allow injecting JS code into the browser (supported only by the poltergeist_phantomjs engine)
+* Add `Capybara::Session#within_new_window_by` method
+
+### Improvements
+* Add the last backtrace line to the pipeline output when an item is dropped
+* Do not destroy the driver if it doesn't exist (for the `Base.parse!` method)
+* Handle a possible `Net::ReadTimeout` error while trying to `#quit` the driver
+
+### Fixes
+* Fix `Mechanize::Driver#proxy` (there was a bug when using a proxy with the mechanize engine without authorization)
+* Fix request retries logic
+
 ## 1.0.1
 * Add missing `logger` method to pipeline
data/README.md
CHANGED
@@ -1,9 +1,9 @@
 <div align="center">
-  <a href="https://github.com/
+  <a href="https://github.com/vifreefly/kimuraframework">
     <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
   </a>
 
-  <h1>
+  <h1>Kimurai Scraping Framework</h1>
 </div>
 
 > **Note about v1.0.0 version:**
@@ -18,6 +18,8 @@
 
 <br>
 
+> Note: this readme is for `1.1.0` gem version. CHANGELOG [here](CHANGELOG.md).
+
 Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
 Kimurai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
@@ -32,9 +34,7 @@ class GithubSpider < Kimurai::Base
   @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
   @config = {
     user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
-
-      before_request: { delay: 4..7 }
-    }
+    before_request: { delay: 4..7 }
   }
 
   def parse(response, url:, data: {})
@@ -238,7 +238,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * [browser object](#browser-object)
 * [request_to method](#request_to-method)
 * [save_to helper](#save_to-helper)
-* [Skip duplicates
+* [Skip duplicates](#skip-duplicates)
+  * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
+* [Storage object](#storage-object)
+  * [Persistence database for the storage](#persistence-database-for-the-storage)
 * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
 * [KIMURAI_ENV](#kimurai_env)
 * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -451,19 +454,19 @@ brew install mongodb
 Before you get to know all Kimurai features, there is `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
 
 ```bash
-$ kimurai console --engine selenium_chrome --url https://github.com/
+$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
 ```
 
 <details/>
 <summary>Show output</summary>
 
 ```
-$ kimurai console --engine selenium_chrome --url https://github.com/
+$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
 
 D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
 D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
-I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/
-I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/
+I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
+I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
 D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701
 
 From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
@@ -473,7 +476,7 @@ From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#con
     190: end
 
 [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
-=> "GitHub -
+=> "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
 
 [2] pry(#<Kimurai::Base>)> ls
 Kimurai::Base#methods: browser console logger request_to save_to unique?
@@ -733,9 +736,11 @@ By default `save_to` add position key to an item hash. You can disable it with `
 
 Until spider stops, each new item will be appended to a file. At the next run, helper will clear the content of a file first, and then start again appending items to it.
 
-
+> If you don't want the file to be cleared before each run, add the option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
+
+### Skip duplicates
 
-It's pretty common when websites have duplicated pages. For example when an e-commerce shop has the same products in different categories. To skip duplicates, there is `unique?` helper:
+It's pretty common when websites have duplicated pages. For example when an e-commerce shop has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
 
 ```ruby
 class ProductsSpider < Kimurai::Base
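As a quick illustration of the `append: true` option mentioned in the note above, here is a minimal sketch; the spider name, url and item fields are hypothetical and only show the shape of the call:

```ruby
# Sketch of save_to with append: true (spider, url and item fields are illustrative).
class ExampleProductsSpider < Kimurai::Base
  @name = "example_products_spider"
  @start_urls = ["https://example.com/products"]

  def parse(response, url:, data: {})
    item = { title: response.xpath("//title").text.strip, url: url }

    # With append: true the output file is not cleared at the start of the
    # next run; new items are simply appended to it.
    save_to "scraped_products.json", item, format: :json, append: true
  end
end
```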
@@ -796,6 +801,100 @@ unique?(:id, 324234232)
 unique?(:custom, "Lorem Ipsum")
 ```
 
+#### Automatically skip all duplicated requests urls
+
+It is possible to automatically skip all already visited urls when calling the `request_to` method, using the [@config](#all-available-config-options) option `skip_duplicate_requests: true`. With this option, all already visited urls will be skipped automatically. Also check [@config](#all-available-config-options) for the additional options of this setting.
+
+#### `storage` object
+
+The `unique?` method is just an alias for `storage#unique?`. Storage has several methods:
+
+* `#all` - returns the storage hash, where keys are the existing scopes.
+* `#include?(scope, value)` - returns `true` if the value exists in the scope, and `false` if not.
+* `#add(scope, value)` - adds a value to the scope.
+* `#unique?(scope, value)` - the method described above: returns `false` if the value already exists in the scope, otherwise adds the value to the scope and returns `true`.
+* `#clear!` - resets the whole storage by deleting all values from all scopes.
+
+#### Persistence database for the storage
+
+It's pretty common for a spider to fail (IP blocking, etc.) while crawling a huge website with 5k+ listings. In this case, it's not convenient to start everything over again.
+
+Kimurai can use a persistent database for the `storage`, built on Ruby's built-in [PStore](https://ruby-doc.org/stdlib-2.5.1/libdoc/pstore/rdoc/PStore.html) database. With this option, you can automatically skip already visited urls in the next run _if the previous run failed_; otherwise _(if the run was successful)_ the storage database will be removed before the spider stops.
+
+Also, with persistent storage enabled, the [save_to](#save_to-helper) method will keep adding items to an existing file (it will not be cleared before each run).
+
+To use persistent storage, provide the `continue: true` option to the `.crawl!` method: `SomeSpider.crawl!(continue: true)`.
+
+There are two approaches to using the persistent storage to skip already processed item pages. The first is to manually add the required urls to the storage:
+
+```ruby
+class ProductsSpider < Kimurai::Base
+  @start_urls = ["https://example-shop.com/"]
+
+  def parse(response, url:, data: {})
+    response.xpath("//categories/path").each do |category|
+      request_to :parse_category, url: category[:href]
+    end
+  end
+
+  def parse_category(response, url:, data: {})
+    response.xpath("//products/path").each do |product|
+      # Check if the product url is already in the scope `:product_urls`; if so, skip the request:
+      next if storage.include?(:product_urls, product[:href])
+      # Otherwise process it:
+      request_to :parse_product, url: product[:href]
+    end
+  end
+
+  def parse_product(response, url:, data: {})
+    # Add the visited item to the storage:
+    storage.add(:product_urls, url)
+
+    # ...
+  end
+end
+
+# Run the spider with the persistence database option:
+ProductsSpider.crawl!(continue: true)
+```
+
+The second approach is to automatically skip already processed item urls using the `@config` option `skip_duplicate_requests:`:
+
+```ruby
+class ProductsSpider < Kimurai::Base
+  @start_urls = ["https://example-shop.com/"]
+  @config = {
+    # Configure the skip_duplicate_requests option:
+    skip_duplicate_requests: { scope: :product_urls, check_only: true }
+  }
+
+  def parse(response, url:, data: {})
+    response.xpath("//categories/path").each do |category|
+      request_to :parse_category, url: category[:href]
+    end
+  end
+
+  def parse_category(response, url:, data: {})
+    response.xpath("//products/path").each do |product|
+      # Before visiting the url, `request_to` will check if it is already
+      # in the storage scope `:product_urls`; if so, the request will be skipped:
+      request_to :parse_product, url: product[:href]
+    end
+  end
+
+  def parse_product(response, url:, data: {})
+    # Add the visited item url to the storage scope `:product_urls`:
+    storage.add(:product_urls, url)
+
+    # ...
+  end
+end
+
+# Run the spider with the persistence database option:
+ProductsSpider.crawl!(continue: true)
+```
+
+
 ### `open_spider` and `close_spider` callbacks
 
 You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before spider started or after spider has been stopped:
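To make the `storage` method list above concrete, here is a brief sketch of those calls inside a spider's `parse` method; the spider name, scope and urls are made up purely for illustration:

```ruby
# Illustrative-only walkthrough of the storage API described above.
class StorageDemoSpider < Kimurai::Base
  @name = "storage_demo_spider"
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    storage.add(:seen_urls, url)           # add a value to a scope
    storage.include?(:seen_urls, url)      # => true, the value is now in the scope
    storage.unique?(:seen_urls, url)       # => false, the value already exists
    storage.unique?(:seen_urls, "https://example.com/other") # => true, and the value is added

    storage.all     # the storage hash, keyed by scope (here :seen_urls)
    storage.clear!  # remove all values from all scopes
  end
end
```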
@@ -1316,6 +1415,8 @@ end # =>
 # "reddit: the front page of the internetHotHot"
 ```
 
+Keep in mind that the [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe when used with the `.parse!` method.
+
 #### `Kimurai.list` and `Kimurai.find_by_name()`
 
 ```ruby
|
|
1399
1500
|
proxy: -> { PROXIES.sample },
|
1400
1501
|
window_size: [1366, 768],
|
1401
1502
|
disable_images: true,
|
1402
|
-
|
1403
|
-
|
1404
|
-
|
1405
|
-
|
1406
|
-
|
1407
|
-
|
1408
|
-
|
1409
|
-
|
1410
|
-
|
1411
|
-
|
1412
|
-
|
1413
|
-
|
1414
|
-
|
1415
|
-
delay: 1..3
|
1416
|
-
}
|
1503
|
+
restart_if: {
|
1504
|
+
# Restart browser if provided memory limit (in kilobytes) is exceeded:
|
1505
|
+
memory_limit: 350_000
|
1506
|
+
},
|
1507
|
+
before_request: {
|
1508
|
+
# Change user agent before each request:
|
1509
|
+
change_user_agent: true,
|
1510
|
+
# Change proxy before each request:
|
1511
|
+
change_proxy: true,
|
1512
|
+
# Clear all cookies and set default cookies (if provided) before each request:
|
1513
|
+
clear_and_set_cookies: true,
|
1514
|
+
# Process delay before each request:
|
1515
|
+
delay: 1..3
|
1417
1516
|
}
|
1418
1517
|
}
|
1419
1518
|
|
@@ -1475,41 +1574,56 @@ end
     # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
     ssl_cert_path: "path/to/ssl_cert",
 
-    #
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    # Inject some JavaScript code into the browser.
+    # Format: array of strings, where each string is a path to a JS file.
+    # Works only for the poltergeist_phantomjs engine (Selenium doesn't support JS code injection).
+    extensions: ["lib/code_to_inject.js"],
+
+    # Automatically skip duplicated (already visited) urls when using the `request_to` method.
+    # Possible values: `true` or a hash with options.
+    # In case of `true`, all visited urls will be added to the storage scope `:requests_urls`,
+    # and if a url is already contained in this scope, the request will be skipped.
+    # You can configure this setting by providing additional options as a hash:
+    # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+    # `scope:` - use a custom scope instead of `:requests_urls`
+    # `check_only:` - if true, the scope will only be checked for the url; the url will
+    # not be added to the scope automatically.
+    # Works for all drivers.
+    skip_duplicate_requests: true,
+
+    # Array of possible errors to retry while processing a request:
+    retry_request_errors: [Net::ReadTimeout],
+
+    # Restart browser if one of the options is true:
+    restart_if: {
+      # Restart browser if the provided memory limit (in kilobytes) is exceeded (works for all engines)
+      memory_limit: 350_000,
+
+      # Restart browser if the provided requests limit is exceeded (works for all engines)
+      requests_limit: 100
+    },
+    before_request: {
+      # Change proxy before each request. The `proxy:` option above should be present
+      # and have lambda format. Works only for the poltergeist and mechanize engines
+      # (Selenium doesn't support proxy rotation).
+      change_proxy: true,
+
+      # Change user agent before each request. The `user_agent:` option above should be present
+      # and have lambda format. Works only for the poltergeist and mechanize engines
+      # (Selenium doesn't support getting/setting headers).
+      change_user_agent: true,
+
+      # Clear all cookies before each request, works for all engines
+      clear_cookies: true,
+
+      # If you want to clear all cookies and also set custom cookies (the `cookies:` option above should be present),
+      # use this option instead (works for all engines)
+      clear_and_set_cookies: true,
+
+      # Global option to set a delay between requests.
+      # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
+      # the delay number will be chosen randomly for each request: `rand(2..5) # => 3`
+      delay: 1..3
     }
   }
 ```
@@ -1525,10 +1639,8 @@ class ApplicationSpider < Kimurai::Base
   @config = {
     user_agent: "Firefox",
     disable_images: true,
-
-
-    before_request: { delay: 1..2 }
-  }
+    restart_if: { memory_limit: 350_000 },
+    before_request: { delay: 1..2 }
   }
 end
 
@@ -1536,7 +1648,7 @@ class CustomSpider < ApplicationSpider
   @name = "custom_spider"
   @start_urls = ["https://example.com/"]
   @config = {
-
+    before_request: { delay: 4..6 }
   }
 
   def parse(response, url:, data: {})
@@ -1628,6 +1740,8 @@ end
 ### Crawl
 To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before command to load required environment.
 
+You can provide an additional option `--continue` to use the [persistence storage database](#persistence-database-for-the-storage) feature.
+
 ### List
 To list all project spiders, run: `$ bundle exec kimurai list`
 
@@ -1786,7 +1900,7 @@ class GithubSpider < ApplicationSpider
   @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
   @config = {
     user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
-
+    before_request: { delay: 4..7 }
   }
 
   def parse(response, url:, data: {})
data/kimurai.gemspec
CHANGED
@@ -10,7 +10,7 @@ Gem::Specification.new do |spec|
   spec.email = ["vicfreefly@gmail.com"]
 
   spec.summary = "Modern web scraping framework written in Ruby and based on Capybara/Nokogiri"
-  spec.homepage = "https://github.com/vifreefly/
+  spec.homepage = "https://github.com/vifreefly/kimuraframework"
   spec.license = "MIT"
 
   # Specify which files should be added to the gem when it is released.
data/lib/kimurai/base.rb
CHANGED
@@ -1,5 +1,5 @@
-require_relative 'base/
-require_relative 'base/
+require_relative 'base/saver'
+require_relative 'base/storage'
 
 module Kimurai
   class Base
@@ -21,7 +21,7 @@ module Kimurai
     ###
 
     class << self
-      attr_reader :run_info
+      attr_reader :run_info, :savers, :storage
    end
 
     def self.running?
@@ -46,10 +46,12 @@ module Kimurai
 
     def self.update(type, subtype)
       return unless @run_info
+      @update_mutex.synchronize { @run_info[type][subtype] += 1 }
+    end
 
-
-
-
+    def self.add_event(scope, event)
+      return unless @run_info
+      @update_mutex.synchronize { @run_info[:events][scope][event] += 1 }
     end
 
     ###
@@ -58,8 +60,6 @@ module Kimurai
     @pipelines = []
     @config = {}
 
-    ###
-
     def self.name
       @name
     end
@@ -90,34 +90,27 @@ module Kimurai
       end
     end
 
-
-
-    def self.checker
-      @checker ||= UniqChecker.new
-    end
-
-    def unique?(scope, value)
-      self.class.checker.unique?(scope, value)
-    end
-
-    def self.saver
-      @saver ||= SimpleSaver.new
-    end
+    def self.crawl!(continue: false)
+      logger.error "Spider: already running: #{name}" and return false if running?
 
-
-
-
+      storage_path =
+        if continue
+          Dir.exists?("tmp") ? "tmp/#{name}.pstore" : "#{name}.pstore"
+        end
 
-
+      @storage = Storage.new(storage_path)
+      @savers = {}
+      @update_mutex = Mutex.new
 
-    def self.crawl!
-      logger.error "Spider: already running: #{name}" and return false if running?
       @run_info = {
-        spider_name: name, status: :running, environment: Kimurai.env,
+        spider_name: name, status: :running, error: nil, environment: Kimurai.env,
         start_time: Time.new, stop_time: nil, running_time: nil,
-        visits: { requests: 0, responses: 0 }, items: { sent: 0, processed: 0 },
+        visits: { requests: 0, responses: 0 }, items: { sent: 0, processed: 0 },
+        events: { requests_errors: Hash.new(0), drop_items_errors: Hash.new(0), custom: Hash.new(0) }
       }
 
+      ###
+
       logger.info "Spider: started: #{name}"
       open_spider if self.respond_to? :open_spider
 
@@ -130,12 +123,11 @@ module Kimurai
       else
         spider.parse
       end
-    rescue StandardError, SignalException => e
+    rescue StandardError, SignalException, SystemExit => e
      @run_info.merge!(status: :failed, error: e.inspect)
       raise e
     else
-      @run_info
-      @run_info
+      @run_info.merge!(status: :completed)
     ensure
       if spider
        spider.browser.destroy_driver!
@@ -145,10 +137,20 @@ module Kimurai
         @run_info.merge!(stop_time: stop_time, running_time: total_time)
 
         close_spider if self.respond_to? :close_spider
+
+        if @storage.path
+          if completed?
+            @storage.delete!
+            logger.info "Spider: storage: persistence database #{@storage.path} was removed (successful run)"
+          else
+            logger.info "Spider: storage: persistence database #{@storage.path} wasn't removed (failed run)"
+          end
+        end
+
         message = "Spider: stopped: #{@run_info.merge(running_time: @run_info[:running_time]&.duration)}"
-        failed? ?
+        failed? ? logger.fatal(message) : logger.info(message)
 
-        @run_info, @
+        @run_info, @storage, @savers, @update_mutex = nil
       end
     end
 
@@ -156,7 +158,7 @@ module Kimurai
       spider = engine ? self.new(engine) : self.new
       url.present? ? spider.request_to(handler, url: url, data: data) : spider.public_send(handler)
     ensure
-      spider.browser.destroy_driver!
+      spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
     end
 
     ###
@@ -175,6 +177,7 @@ module Kimurai
       end.to_h
 
       @logger = self.class.logger
+      @savers = {}
     end
 
     def browser
@@ -182,6 +185,11 @@ module Kimurai
     end
 
     def request_to(handler, delay = nil, url:, data: {})
+      if @config[:skip_duplicate_requests] && !unique_request?(url)
+        add_event(:duplicate_requests) if self.with_info
+        logger.warn "Spider: request_to: url is not unique: #{url}, skipped" and return
+      end
+
       request_data = { url: url, data: data }
       delay ? browser.visit(url, delay: delay) : browser.visit(url)
       public_send(handler, browser.current_response, request_data)
@@ -191,8 +199,59 @@ module Kimurai
       binding.pry
     end
 
+    ###
+
+    def storage
+      # Note: for `.crawl!` uses shared thread safe Storage instance,
+      # otherwise, each spider instance will have it's own Storage
+      @storage ||= self.with_info ? self.class.storage : Storage.new
+    end
+
+    def unique?(scope, value)
+      storage.unique?(scope, value)
+    end
+
+    def save_to(path, item, format:, position: true, append: false)
+      @savers[path] ||= begin
+        options = { format: format, position: position, append: storage.path ? true : append }
+        if self.with_info
+          self.class.savers[path] ||= Saver.new(path, options)
+        else
+          Saver.new(path, options)
+        end
+      end
+
+      @savers[path].save(item)
+    end
+
+    ###
+
+    def add_event(scope = :custom, event)
+      unless self.with_info
+        raise "It's allowed to use `add_event` only while performing a full run (`.crawl!` method)"
+      end
+
+      self.class.add_event(scope, event)
+    end
+
+    ###
+
     private
 
+    def unique_request?(url)
+      options = @config[:skip_duplicate_requests]
+      if options.class == Hash
+        scope = options[:scope] || :requests_urls
+        if options[:check_only]
+          storage.include?(scope, url) ? false : true
+        else
+          storage.unique?(scope, url) ? true : false
+        end
+      else
+        storage.unique?(:requests_urls, url) ? true : false
+      end
+    end
+
    def send_item(item, options = {})
       logger.debug "Pipeline: starting processing item through #{@pipelines.size} #{'pipeline'.pluralize(@pipelines.size)}..."
       self.class.update(:items, :sent) if self.with_info
@@ -201,7 +260,8 @@ module Kimurai
         item = options[name] ? instance.process_item(item, options: options[name]) : instance.process_item(item)
       end
     rescue => e
-      logger.error "Pipeline: dropped: #{e.inspect}, item: #{item}"
+      logger.error "Pipeline: dropped: #{e.inspect} (#{e.backtrace.first}), item: #{item}"
+      add_event(:drop_items_errors, e.inspect) if self.with_info
       false
     else
       self.class.update(:items, :processed) if self.with_info