kimurai 1.0.1 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 7f2185614ca5aa8486c17e0c43b3b035cf22cd18d51617430f556a12af3dc7c8
- data.tar.gz: 9e5c296feb5d020aa13bcfaa7f6f4c77d839ff373e8aa9d3e0abcc953aaa89de
+ metadata.gz: 67ee49692e64813bc980eb7562b711d5b5d2c47b50a995acb4759709703da0f9
+ data.tar.gz: baba361bc5039d303ae4a6c9a1dd2109368f8e4c7a641d0a782cfc6a7776ade4
  SHA512:
- metadata.gz: 07d92edd8719cbfc701ac7d82975d4c06f5ba9f6adb0bdbbc6731f81655d70d077d140efa54b473b462058042078abab9218f5f00dab244f7478f91c62c8e24b
- data.tar.gz: 5dc6a70b6379a46c58c917455a7eace96c1093944125888cbc2f9b2af93cf065de0ca00e9d98e786d9a2fbc3a53a1ea3dbf1712e703984d063dfc937ad5e0c71
+ metadata.gz: 0173d3859b5f8776371fad454ff5575fdc453aa3c6038d8a8399651c46c5eaae789273772227ea014b6ce39b13586e6805bd7f69156eafeacf653804f954003c
+ data.tar.gz: b05889c0cb030aed06fe1df5cc5411154d24019667e1f00f9f4248d598fc93990f86a4aae78430af3140f3dc3989e856cfc3e2316f455984c898442fccad15db
CHANGELOG.md CHANGED
@@ -1,5 +1,60 @@
  # CHANGELOG
- ## HEAD
+ ## 1.1.0
+ ### Breaking changes 1.1.0
+ The `browser` config option is deprecated. All sub-options previously nested inside `browser` should now be placed directly in the `@config` hash, without the `browser` parent key. Example:
+
+ ```ruby
+ # Was:
+ @config = {
+   browser: {
+     retry_request_errors: [Net::ReadTimeout],
+     restart_if: {
+       memory_limit: 350_000,
+       requests_limit: 100
+     },
+     before_request: {
+       change_proxy: true,
+       change_user_agent: true,
+       clear_cookies: true,
+       clear_and_set_cookies: true,
+       delay: 1..3
+     }
+   }
+ }
+
+ # Now:
+ @config = {
+   retry_request_errors: [Net::ReadTimeout],
+   restart_if: {
+     memory_limit: 350_000,
+     requests_limit: 100
+   },
+   before_request: {
+     change_proxy: true,
+     change_user_agent: true,
+     clear_cookies: true,
+     clear_and_set_cookies: true,
+     delay: 1..3
+   }
+ }
+ ```
+
+ ### New
+ * Add `storage` object with additional methods and a persistence database feature
+ * Add events feature to `run_info`
+ * Add `skip_duplicate_requests` config option to automatically skip already visited urls when using `request_to`
+ * Add `extensions` config option to allow injecting JS code into the browser (supported only by the poltergeist_phantomjs engine)
+ * Add `Capybara::Session#within_new_window_by` method
+
+ ### Improvements
+ * Add the last backtrace line to the pipeline output when an item is dropped
+ * Do not destroy the driver if it doesn't exist (for the `Base.parse!` method)
+ * Handle possible `Net::ReadTimeout` error while trying to `#quit` the driver
+
+ ### Fixes
+ * Fix `Mechanize::Driver#proxy` (there was a bug when using a proxy without authorization for the mechanize engine)
+ * Fix requests retries logic
+

  ## 1.0.1
  * Add missing `logger` method to pipeline
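
The events feature above deserves a quick illustration. A minimal, hypothetical sketch (the spider and event name are invented): counters are tallied under `run_info[:events]`, and, per the `base.rb` diff further below, the instance-level `add_event` is only allowed during a full run via `.crawl!`:

```ruby
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # With a single argument, the event is tallied under the default scope:
    # run_info[:events][:custom]["out of stock"] += 1
    add_event("out of stock") if response.text.include?("Out of stock")
  end
end

# `add_event` raises if used outside a full run, so run via `.crawl!`:
ExampleSpider.crawl!
```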
data/README.md CHANGED
@@ -1,9 +1,9 @@
  <div align="center">
- <a href="https://github.com/vfreefly/kimurai">
+ <a href="https://github.com/vifreefly/kimuraframework">
  <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
  </a>

- <h1>Kimura Framework</h1>
+ <h1>Kimurai Scraping Framework</h1>
  </div>

  > **Note about v1.0.0 version:**
@@ -18,6 +18,8 @@

  <br>

+ > Note: this readme is for the `1.1.0` gem version. CHANGELOG [here](CHANGELOG.md).
+
  Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript-rendered websites.**

  Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's see:
@@ -32,9 +34,7 @@ class GithubSpider < Kimurai::Base
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
-   browser: {
-     before_request: { delay: 4..7 }
-   }
+   before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
@@ -238,7 +238,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [browser object](#browser-object)
  * [request_to method](#request_to-method)
  * [save_to helper](#save_to-helper)
- * [Skip duplicates, unique? helper](#skip-duplicates-unique-helper)
+ * [Skip duplicates](#skip-duplicates)
+   * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
+   * [Storage object](#storage-object)
+   * [Persistence database for the storage](#persistence-database-for-the-storage)
  * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
  * [KIMURAI_ENV](#kimurai_env)
  * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -451,19 +454,19 @@ brew install mongodb
  Before you get to know all Kimurai features, there is the `$ kimurai console` command, which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).

  ```bash
- $ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai
+ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
  ```

  <details/>
  <summary>Show output</summary>

  ```
- $ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai
+ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework

  D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
  D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
- I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vfreefly/kimurai
- I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vfreefly/kimurai
+ I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
+ I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
  D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701

  From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
@@ -473,7 +476,7 @@ From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#con
  190: end

  [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
- => "GitHub - vfreefly/kimurai: Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
+ => "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"

  [2] pry(#<Kimurai::Base>)> ls
  Kimurai::Base#methods: browser console logger request_to save_to unique?
@@ -733,9 +736,11 @@ By default `save_to` add position key to an item hash. You can disable it with `

  Until the spider stops, each new item will be appended to the file. At the next run, the helper will clear the contents of the file first, and then start appending items to it again.

- ### Skip duplicates, `unique?` helper
+ > If you don't want the file to be cleared before each run, add the option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
+
+ ### Skip duplicates

- It's pretty common when websites have duplicated pages. For example when an e-commerce shop has the same products in different categories. To skip duplicates, there is `unique?` helper:
+ It's pretty common for websites to have duplicated pages, for example when an e-commerce shop has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:

  ```ruby
  class ProductsSpider < Kimurai::Base
@@ -796,6 +801,100 @@ unique?(:id, 324234232)
  unique?(:custom, "Lorem Ipsum")
  ```

+ #### Automatically skip all duplicated requests urls
+
+ It is possible to automatically skip all already visited urls when calling the `request_to` method, using the [@config](#all-available-config-options) option `skip_duplicate_requests: true`, as in the sketch below. Check the [@config](#all-available-config-options) section for additional options of this setting.
+
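A minimal sketch of the simple `true` form (the shop domain and xpath are hypothetical); every url passed to `request_to` is checked against the default storage scope `:requests_urls` and skipped if already present:

```ruby
class ProductsSpider < Kimurai::Base
  @start_urls = ["https://example-shop.com/"]
  @config = {
    # Skip any `request_to` call to a url that was already visited:
    skip_duplicate_requests: true
  }

  def parse(response, url:, data: {})
    response.xpath("//products/path").each do |product|
      # The same href appearing in several categories is requested only once:
      request_to :parse_product, url: product[:href]
    end
  end

  def parse_product(response, url:, data: {})
    # ...
  end
end
```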
+ #### `storage` object
+
+ The `unique?` method is just an alias for `storage#unique?`. Storage has several methods (see the usage sketch after this list):
+
+ * `#all` - returns the storage hash, where the keys are the existing scopes.
+ * `#include?(scope, value)` - returns `true` if the value exists in the scope, and `false` if not.
+ * `#add(scope, value)` - adds the value to the scope.
+ * `#unique?(scope, value)` - the method described above: returns `false` if the value already exists in the scope; otherwise adds the value to the scope and returns `true`.
+ * `#clear!` - resets the whole storage by deleting all values from all scopes.
+
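A quick sketch of these methods (the scope name and urls are hypothetical); it can be tried inside a spider method or in `$ kimurai console`:

```ruby
url = "https://example-shop.com/product/1"

storage.add(:product_urls, url)
storage.include?(:product_urls, url) # => true
storage.unique?(:product_urls, url)  # => false, the value is already in the scope
storage.unique?(:product_urls, "https://example-shop.com/product/2") # => true, and the value is added
storage.all    # => hash of all scopes, e.g. { product_urls: [...] }
storage.clear! # reset all scopes
```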
+ #### Persistence database for the storage
+
+ It's pretty common for a spider to fail (IP blocking, etc.) while crawling a huge website with 5k+ listings. In this case, it's not convenient to start everything over again.
+
+ Kimurai can use a persistent database for the `storage`, built on Ruby's built-in [PStore](https://ruby-doc.org/stdlib-2.5.1/libdoc/pstore/rdoc/PStore.html) database. With this option you can automatically skip already visited urls in the next run _if the previous run failed_; otherwise _(if the run was successful)_ the storage database will be removed before the spider stops.
+
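For context, PStore is a small transactional, file-backed key-value store from the Ruby standard library. A standalone sketch (the file name and scope are arbitrary):

```ruby
require "pstore"

store = PStore.new("example_spider.pstore")

# Writes happen inside a transaction and are flushed to disk on commit:
store.transaction do
  store[:requests_urls] ||= []
  store[:requests_urls] << "https://example.com/"
end

# A read-only transaction (pass `true`):
store.transaction(true) { store[:requests_urls] } # => ["https://example.com/"]
```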
+ Also, with persistence storage enabled, the [save_to](#save_to-helper) method will keep adding items to an existing file (it will not be cleared before each run).
+
+ To use persistence storage, provide the `continue: true` option to the `.crawl!` method: `SomeSpider.crawl!(continue: true)`.
+
+ There are two approaches to using persistence storage to skip already processed item pages. The first is to manually add the required urls to the storage:
+
+ ```ruby
+ class ProductsSpider < Kimurai::Base
+   @start_urls = ["https://example-shop.com/"]
+
+   def parse(response, url:, data: {})
+     response.xpath("//categories/path").each do |category|
+       request_to :parse_category, url: category[:href]
+     end
+   end
+
+   def parse_category(response, url:, data: {})
+     response.xpath("//products/path").each do |product|
+       # Check if the product url is already in the `:product_urls` scope; if so, skip the request:
+       next if storage.include?(:product_urls, product[:href])
+       # Otherwise process it:
+       request_to :parse_product, url: product[:href]
+     end
+   end
+
+   def parse_product(response, url:, data: {})
+     # Add the visited url to the storage:
+     storage.add(:product_urls, url)
+
+     # ...
+   end
+ end
+
+ # Run the spider with the persistence database option:
+ ProductsSpider.crawl!(continue: true)
+ ```
+
+ The second approach is to automatically skip already processed item urls using the `@config` option `skip_duplicate_requests:`:
+
+ ```ruby
+ class ProductsSpider < Kimurai::Base
+   @start_urls = ["https://example-shop.com/"]
+   @config = {
+     # Configure the skip_duplicate_requests option:
+     skip_duplicate_requests: { scope: :product_urls, check_only: true }
+   }
+
+   def parse(response, url:, data: {})
+     response.xpath("//categories/path").each do |category|
+       request_to :parse_category, url: category[:href]
+     end
+   end
+
+   def parse_category(response, url:, data: {})
+     response.xpath("//products/path").each do |product|
+       # Before visiting the url, `request_to` will check whether it is already in the
+       # storage scope `:product_urls`; if so, the request will be skipped:
+       request_to :parse_product, url: product[:href]
+     end
+   end
+
+   def parse_product(response, url:, data: {})
+     # Add the visited url to the storage scope `:product_urls`:
+     storage.add(:product_urls, url)
+
+     # ...
+   end
+ end
+
+ # Run the spider with the persistence database option:
+ ProductsSpider.crawl!(continue: true)
+ ```
+
  ### `open_spider` and `close_spider` callbacks

  You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before the spider starts or after the spider has been stopped:
@@ -1316,6 +1415,8 @@ end # =>
  # "reddit: the front page of the internetHotHot"
  ```

+ Keep in mind that the [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe when using the `.parse!` method.
+
  #### `Kimurai.list` and `Kimurai.find_by_name()`

  ```ruby
@@ -1399,21 +1500,19 @@ class Spider < Kimurai::Base
    proxy: -> { PROXIES.sample },
    window_size: [1366, 768],
    disable_images: true,
-   browser: {
-     restart_if: {
-       # Restart browser if provided memory limit (in kilobytes) is exceeded:
-       memory_limit: 350_000
-     },
-     before_request: {
-       # Change user agent before each request:
-       change_user_agent: true,
-       # Change proxy before each request:
-       change_proxy: true,
-       # Clear all cookies and set default cookies (if provided) before each request:
-       clear_and_set_cookies: true,
-       # Process delay before each request:
-       delay: 1..3
-     }
+   restart_if: {
+     # Restart browser if provided memory limit (in kilobytes) is exceeded:
+     memory_limit: 350_000
+   },
+   before_request: {
+     # Change user agent before each request:
+     change_user_agent: true,
+     # Change proxy before each request:
+     change_proxy: true,
+     # Clear all cookies and set default cookies (if provided) before each request:
+     clear_and_set_cookies: true,
+     # Process delay before each request:
+     delay: 1..3
    }
  }
@@ -1475,41 +1574,56 @@ end
  # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
  ssl_cert_path: "path/to/ssl_cert",

- # Browser (Capybara session instance) options:
- browser: {
-   # Array of errors to retry while processing a request
-   retry_request_errors: [Net::ReadTimeout],
-   # Restart browser if one of the options is true:
-   restart_if: {
-     # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
-     memory_limit: 350_000,
-
-     # Restart browser if provided requests limit is exceeded (works for all engines)
-     requests_limit: 100
-   },
-   before_request: {
-     # Change proxy before each request. The `proxy:` option above should be presented
-     # and has lambda format. Works only for poltergeist and mechanize engines
-     # (Selenium doesn't support proxy rotation).
-     change_proxy: true,
-
-     # Change user agent before each request. The `user_agent:` option above should be presented
-     # and has lambda format. Works only for poltergeist and mechanize engines
-     # (selenium doesn't support to get/set headers).
-     change_user_agent: true,
-
-     # Clear all cookies before each request, works for all engines
-     clear_cookies: true,
-
-     # If you want to clear all cookies + set custom cookies (`cookies:` option above should be presented)
-     # use this option instead (works for all engines)
-     clear_and_set_cookies: true,
-
-     # Global option to set delay between requests.
-     # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
-     # delay number will be chosen randomly for each request: `rand (2..5) # => 3`
-     delay: 1..3
-   }
+ # Inject some JavaScript code into the browser.
+ # Format: array of strings, where each string is a path to a JS file.
+ # Works only for the poltergeist_phantomjs engine (Selenium doesn't support JS code injection).
+ extensions: ["lib/code_to_inject.js"],
+
+ # Automatically skip duplicated (already visited) urls when using the `request_to` method.
+ # Possible values: `true` or a hash with options.
+ # In the case of `true`, all visited urls will be added to the storage scope `:requests_urls`,
+ # and if a url is already in this scope, the request will be skipped.
+ # You can configure this setting by providing additional options as a hash:
+ # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+ # `scope:` - use a custom scope instead of `:requests_urls`
+ # `check_only:` - if true, the url will only be checked against the scope; it will not
+ # be added to the scope automatically.
+ # Works for all engines.
+ skip_duplicate_requests: true,
+
+ # Array of possible errors to retry while processing a request:
+ retry_request_errors: [Net::ReadTimeout],
+
+ # Restart browser if one of the options is true:
+ restart_if: {
+   # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
+   memory_limit: 350_000,
+
+   # Restart browser if provided requests limit is exceeded (works for all engines)
+   requests_limit: 100
+ },
+ before_request: {
+   # Change proxy before each request. The `proxy:` option above should be present
+   # and be in lambda format. Works only for poltergeist and mechanize engines
+   # (Selenium doesn't support proxy rotation).
+   change_proxy: true,
+
+   # Change user agent before each request. The `user_agent:` option above should be present
+   # and be in lambda format. Works only for poltergeist and mechanize engines
+   # (Selenium doesn't support getting/setting headers).
+   change_user_agent: true,
+
+   # Clear all cookies before each request, works for all engines
+   clear_cookies: true,
+
+   # If you want to clear all cookies + set custom cookies (the `cookies:` option above should be present),
+   # use this option instead (works for all engines)
+   clear_and_set_cookies: true,
+
+   # Global option to set delay between requests.
+   # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
+   # the delay number will be chosen randomly for each request: `rand(2..5) # => 3`
+   delay: 1..3
  }
  }
  ```
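As an illustration of the new `extensions` option from the listing above, a minimal, hypothetical sketch (the spider name, file path, and injected code are invented; it works only with the poltergeist_phantomjs engine):

```ruby
class InjectSpider < Kimurai::Base
  @name = "inject_spider"
  @engine = :poltergeist_phantomjs
  @start_urls = ["https://example.com/"]
  @config = {
    # Each entry is a path to a JS file whose code is injected into the browser:
    extensions: ["lib/code_to_inject.js"]
  }

  def parse(response, url:, data: {})
    # The code from lib/code_to_inject.js is injected into the browser session.
  end
end
```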
@@ -1525,10 +1639,8 @@ class ApplicationSpider < Kimurai::Base
  @config = {
    user_agent: "Firefox",
    disable_images: true,
-   browser: {
-     restart_if: { memory_limit: 350_000 },
-     before_request: { delay: 1..2 }
-   }
+   restart_if: { memory_limit: 350_000 },
+   before_request: { delay: 1..2 }
  }
end

@@ -1536,7 +1648,7 @@ class CustomSpider < ApplicationSpider
  @name = "custom_spider"
  @start_urls = ["https://example.com/"]
  @config = {
-   browser: { before_request: { delay: 4..6 }}
+   before_request: { delay: 4..6 }
  }

  def parse(response, url:, data: {})
@@ -1628,6 +1740,8 @@ end
  ### Crawl
  To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.

+ You can provide the additional option `--continue` to use the [persistence database for the storage](#persistence-database-for-the-storage) feature, for example:
+
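For example (the spider name is illustrative):

```bash
$ bundle exec kimurai crawl example_spider --continue
```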

  ### List
  To list all project spiders, run: `$ bundle exec kimurai list`
@@ -1786,7 +1900,7 @@ class GithubSpider < ApplicationSpider
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
-   browser: { before_request: { delay: 4..7 } }
+   before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
kimurai.gemspec CHANGED
@@ -10,7 +10,7 @@ Gem::Specification.new do |spec|
  spec.email = ["vicfreefly@gmail.com"]

  spec.summary = "Modern web scraping framework written in Ruby and based on Capybara/Nokogiri"
- spec.homepage = "https://github.com/vifreefly/kimurai"
+ spec.homepage = "https://github.com/vifreefly/kimuraframework"
  spec.license = "MIT"

  # Specify which files should be added to the gem when it is released.
lib/kimurai/base.rb CHANGED
@@ -1,5 +1,5 @@
- require_relative 'base/simple_saver'
- require_relative 'base/uniq_checker'
+ require_relative 'base/saver'
+ require_relative 'base/storage'

  module Kimurai
    class Base
@@ -21,7 +21,7 @@ module Kimurai
  ###

  class << self
-   attr_reader :run_info
+   attr_reader :run_info, :savers, :storage
  end

  def self.running?
@@ -46,10 +46,12 @@

  def self.update(type, subtype)
    return unless @run_info
+   @update_mutex.synchronize { @run_info[type][subtype] += 1 }
+ end

- (@update_mutex ||= Mutex.new).synchronize do
-   @run_info[type][subtype] += 1
- end
+ def self.add_event(scope, event)
+   return unless @run_info
+   @update_mutex.synchronize { @run_info[:events][scope][event] += 1 }
  end

  ###
@@ -58,8 +60,6 @@
  @pipelines = []
  @config = {}

- ###
-
  def self.name
    @name
  end
@@ -90,34 +90,27 @@
    end
  end

- ###
-
- def self.checker
-   @checker ||= UniqChecker.new
- end
-
- def unique?(scope, value)
-   self.class.checker.unique?(scope, value)
- end
-
- def self.saver
-   @saver ||= SimpleSaver.new
- end
+ def self.crawl!(continue: false)
+   logger.error "Spider: already running: #{name}" and return false if running?

- def save_to(path, item, format:, position: true)
-   self.class.saver.save(path, item, format: format, position: position)
- end
+   storage_path =
+     if continue
+       Dir.exists?("tmp") ? "tmp/#{name}.pstore" : "#{name}.pstore"
+     end

- ###
+   @storage = Storage.new(storage_path)
+   @savers = {}
+   @update_mutex = Mutex.new

- def self.crawl!
-   logger.error "Spider: already running: #{name}" and return false if running?
    @run_info = {
-     spider_name: name, status: :running, environment: Kimurai.env,
+     spider_name: name, status: :running, error: nil, environment: Kimurai.env,
      start_time: Time.new, stop_time: nil, running_time: nil,
-     visits: { requests: 0, responses: 0 }, items: { sent: 0, processed: 0 }, error: nil
+     visits: { requests: 0, responses: 0 }, items: { sent: 0, processed: 0 },
+     events: { requests_errors: Hash.new(0), drop_items_errors: Hash.new(0), custom: Hash.new(0) }
    }

+   ###
+
    logger.info "Spider: started: #{name}"
    open_spider if self.respond_to? :open_spider

@@ -130,12 +123,11 @@
    else
      spider.parse
    end
- rescue StandardError, SignalException => e
+ rescue StandardError, SignalException, SystemExit => e
    @run_info.merge!(status: :failed, error: e.inspect)
    raise e
  else
-   @run_info[:status] = :completed
-   @run_info
+   @run_info.merge!(status: :completed)
  ensure
    if spider
      spider.browser.destroy_driver!
@@ -145,10 +137,20 @@
    @run_info.merge!(stop_time: stop_time, running_time: total_time)

    close_spider if self.respond_to? :close_spider
+
+   if @storage.path
+     if completed?
+       @storage.delete!
+       logger.info "Spider: storage: persistence database #{@storage.path} was removed (successful run)"
+     else
+       logger.info "Spider: storage: persistence database #{@storage.path} wasn't removed (failed run)"
+     end
+   end
+
    message = "Spider: stopped: #{@run_info.merge(running_time: @run_info[:running_time]&.duration)}"
-   failed? ? @logger.fatal(message) : @logger.info(message)
+   failed? ? logger.fatal(message) : logger.info(message)

-   @run_info, @checker, @saver = nil
+   @run_info, @storage, @savers, @update_mutex = nil
  end
end

@@ -156,7 +158,7 @@
    spider = engine ? self.new(engine) : self.new
    url.present? ? spider.request_to(handler, url: url, data: data) : spider.public_send(handler)
  ensure
-   spider.browser.destroy_driver!
+   spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
  end

  ###
@@ -175,6 +177,7 @@
    end.to_h

    @logger = self.class.logger
+   @savers = {}
  end

  def browser
@@ -182,6 +185,11 @@
  end

  def request_to(handler, delay = nil, url:, data: {})
+   if @config[:skip_duplicate_requests] && !unique_request?(url)
+     add_event(:duplicate_requests) if self.with_info
+     logger.warn "Spider: request_to: url is not unique: #{url}, skipped" and return
+   end
+
    request_data = { url: url, data: data }
    delay ? browser.visit(url, delay: delay) : browser.visit(url)
    public_send(handler, browser.current_response, request_data)
@@ -191,8 +199,59 @@
    binding.pry
  end

+ ###
+
+ def storage
+   # Note: `.crawl!` uses a shared thread-safe Storage instance;
+   # otherwise, each spider instance will have its own Storage
+   @storage ||= self.with_info ? self.class.storage : Storage.new
+ end
+
+ def unique?(scope, value)
+   storage.unique?(scope, value)
+ end
+
+ def save_to(path, item, format:, position: true, append: false)
+   @savers[path] ||= begin
+     options = { format: format, position: position, append: storage.path ? true : append }
+     if self.with_info
+       self.class.savers[path] ||= Saver.new(path, options)
+     else
+       Saver.new(path, options)
+     end
+   end
+
+   @savers[path].save(item)
+ end
+
+ ###
+
+ def add_event(scope = :custom, event)
+   unless self.with_info
+     raise "It's allowed to use `add_event` only while performing a full run (`.crawl!` method)"
+   end
+
+   self.class.add_event(scope, event)
+ end
+
+ ###
+
  private

+ def unique_request?(url)
+   options = @config[:skip_duplicate_requests]
+   if options.class == Hash
+     scope = options[:scope] || :requests_urls
+     if options[:check_only]
+       storage.include?(scope, url) ? false : true
+     else
+       storage.unique?(scope, url) ? true : false
+     end
+   else
+     storage.unique?(:requests_urls, url) ? true : false
+   end
+ end
+
  def send_item(item, options = {})
    logger.debug "Pipeline: starting processing item through #{@pipelines.size} #{'pipeline'.pluralize(@pipelines.size)}..."
    self.class.update(:items, :sent) if self.with_info
@@ -201,7 +260,8 @@
    item = options[name] ? instance.process_item(item, options: options[name]) : instance.process_item(item)
  end
  rescue => e
-   logger.error "Pipeline: dropped: #{e.inspect}, item: #{item}"
+   logger.error "Pipeline: dropped: #{e.inspect} (#{e.backtrace.first}), item: #{item}"
+   add_event(:drop_items_errors, e.inspect) if self.with_info
    false
  else
    self.class.update(:items, :processed) if self.with_info