kimurai 1.3.2 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +29 -0
  4. data/Gemfile +2 -2
  5. data/README.md +478 -649
  6. data/Rakefile +6 -6
  7. data/bin/console +3 -4
  8. data/exe/kimurai +0 -1
  9. data/kimurai.gemspec +38 -37
  10. data/lib/kimurai/base/saver.rb +15 -19
  11. data/lib/kimurai/base/storage.rb +1 -1
  12. data/lib/kimurai/base.rb +42 -38
  13. data/lib/kimurai/base_helper.rb +5 -4
  14. data/lib/kimurai/browser_builder/mechanize_builder.rb +44 -38
  15. data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +63 -51
  16. data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +61 -55
  17. data/lib/kimurai/browser_builder.rb +7 -31
  18. data/lib/kimurai/capybara_configuration.rb +1 -1
  19. data/lib/kimurai/capybara_ext/driver/base.rb +50 -46
  20. data/lib/kimurai/capybara_ext/mechanize/driver.rb +51 -50
  21. data/lib/kimurai/capybara_ext/selenium/driver.rb +33 -29
  22. data/lib/kimurai/capybara_ext/session/config.rb +1 -1
  23. data/lib/kimurai/capybara_ext/session.rb +40 -38
  24. data/lib/kimurai/cli/generator.rb +15 -15
  25. data/lib/kimurai/cli.rb +52 -85
  26. data/lib/kimurai/core_ext/array.rb +2 -2
  27. data/lib/kimurai/core_ext/hash.rb +1 -1
  28. data/lib/kimurai/core_ext/numeric.rb +4 -4
  29. data/lib/kimurai/pipeline.rb +2 -1
  30. data/lib/kimurai/runner.rb +6 -6
  31. data/lib/kimurai/template/Gemfile +2 -2
  32. data/lib/kimurai/template/config/boot.rb +4 -4
  33. data/lib/kimurai/template/config/schedule.rb +15 -15
  34. data/lib/kimurai/template/spiders/application_spider.rb +14 -14
  35. data/lib/kimurai/version.rb +1 -1
  36. data/lib/kimurai.rb +7 -3
  37. metadata +58 -65
  38. data/.travis.yml +0 -5
  39. data/lib/kimurai/automation/deploy.yml +0 -54
  40. data/lib/kimurai/automation/setup/chromium_chromedriver.yml +0 -26
  41. data/lib/kimurai/automation/setup/firefox_geckodriver.yml +0 -20
  42. data/lib/kimurai/automation/setup/phantomjs.yml +0 -33
  43. data/lib/kimurai/automation/setup/ruby_environment.yml +0 -124
  44. data/lib/kimurai/automation/setup.yml +0 -44
  45. data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +0 -171
  46. data/lib/kimurai/capybara_ext/poltergeist/driver.rb +0 -13
  47. data/lib/kimurai/cli/ansible_command_builder.rb +0 -71
  48. data/lib/kimurai/template/config/automation.yml +0 -13
data/README.md CHANGED
@@ -1,28 +1,8 @@
- <div align="center">
-   <a href="https://github.com/vifreefly/kimuraframework">
-     <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
-   </a>
+ # Kimurai
 
-   <h1>Kimurai Scraping Framework</h1>
- </div>
+ Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
 
- > **Note about v1.0.0 version:**
- > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
- > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
- > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
- > * No spaghetti code with `case/when/end` blocks anymore. All drivers [were extended](lib/kimurai/capybara_ext) to support unified methods for cookies, proxies, headers, etc.
- > * `selenium_url_to_set_cookies` @config option don't need anymore if you're use Selenium-like engine with custom cookies setting.
- > * Small changes in design (check the readme again to see what was changed)
- > * Stats database with a web dashboard were removed
- > * Again, massive refactor. Code now looks much better than it was before.
-
- <br>
-
- > Note: this readme is for `1.3.2` gem version. CHANGELOG [here](CHANGELOG.md).
-
- Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
-
- Kimurai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
+ Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
 
 ```ruby
 # github_spider.rb
@@ -31,18 +11,17 @@ require 'kimurai'
 class GithubSpider < Kimurai::Base
   @name = "github_spider"
   @engine = :selenium_chrome
-  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
+  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
   @config = {
-    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
-    before_request: { delay: 4..7 }
+    before_request: { delay: 3..5 }
   }
 
   def parse(response, url:, data: {})
-    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
+    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
       request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
     end
 
-    if next_page = response.at_xpath("//a[@class='next_page']")
+    if next_page = response.at_xpath("//a[@rel='next']")
       request_to :parse, url: absolute_url(next_page[:href], base: url)
     end
   end
@@ -50,15 +29,15 @@ class GithubSpider < Kimurai::Base
   def parse_repo_page(response, url:, data: {})
     item = {}
 
-    item[:owner] = response.xpath("//h1//a[@rel='author']").text
-    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
+    item[:owner] = response.xpath("//a[@rel='author']").text.squish
+    item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
     item[:repo_url] = url
-    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
-    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
-    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
-    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
-    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
-    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
+    item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
+    item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
+    item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
+    item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
+    item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
+    item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish
 
     save_to "results.json", item, format: :pretty_json
   end
@@ -71,33 +50,25 @@ GithubSpider.crawl!
 <summary>Run: <code>$ ruby github_spider.rb</code></summary>
 
 ```
- I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: started: github_spider
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
- D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
- D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
- I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
- I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
- I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 1, responses: 1
- D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
- D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
- I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
- I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
- I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 2, responses: 2
- D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
- D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
- I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
+ $ ruby github_spider.rb
+
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Spider: started: github_spider
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Info: visits: requests: 1, responses: 1
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Info: visits: requests: 2, responses: 2
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Info: visits: requests: 3, responses: 3
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Info: visits: requests: 4, responses: 4
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
+ I, [2025-12-16 12:16:17] INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework
 
 ...
-
- I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
- D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
-
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
 ```
 </details>
 
@@ -107,48 +78,71 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
 ```json
 [
   {
-    "owner": "lorien",
-    "repo_name": "awesome-web-scraping",
-    "repo_url": "https://github.com/lorien/awesome-web-scraping",
-    "description": "List of libraries, tools and APIs for web scraping and data processing.",
-    "tags": [
-      "awesome",
-      "awesome-list",
-      "web-scraping",
-      "data-processing",
-      "python",
-      "javascript",
-      "php",
-      "ruby"
-    ],
-    "watch_count": "159",
-    "star_count": "2,423",
-    "fork_count": "358",
-    "last_commit": "4 days ago",
+    "owner": "sparklemotion",
+    "repo_name": "mechanize",
+    "repo_url": "https://github.com/sparklemotion/mechanize",
+    "description": "Mechanize is a ruby library that makes automated web interaction easy.",
+    "tags": ["ruby", "web", "scraping"],
+    "watch_count": "79",
+    "star_count": "4.4k",
+    "fork_count": "480",
+    "last_commit": "Sep 30, 2025",
     "position": 1
   },
-
-  ...
-
   {
-    "owner": "preston",
-    "repo_name": "idclight",
-    "repo_url": "https://github.com/preston/idclight",
-    "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
-    "tags": [
-
-    ],
-    "watch_count": "6",
-    "star_count": "1",
+    "owner": "jaimeiniesta",
+    "repo_name": "metainspector",
+    "repo_url": "https://github.com/jaimeiniesta/metainspector",
+    "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
+    "tags": [],
+    "watch_count": "20",
+    "star_count": "1k",
+    "fork_count": "166",
+    "last_commit": "Oct 8, 2025",
+    "position": 2
+  },
+  {
+    "owner": "Germey",
+    "repo_name": "AwesomeWebScraping",
+    "repo_url": "https://github.com/Germey/AwesomeWebScraping",
+    "description": "List of libraries, tools and APIs for web scraping and data processing.",
+    "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
+    "watch_count": "5",
+    "star_count": "253",
+    "fork_count": "33",
+    "last_commit": "Apr 5, 2024",
+    "position": 3
+  },
+  {
+    "owner": "vifreefly",
+    "repo_name": "kimuraframework",
+    "repo_url": "https://github.com/vifreefly/kimuraframework",
+    "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
+    "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
+    "watch_count": "28",
+    "star_count": "1k",
+    "fork_count": "158",
+    "last_commit": "Dec 12, 2025",
+    "position": 4
+  },
+  // ...
+  {
+    "owner": "citixenken",
+    "repo_name": "web_scraping_with_ruby",
+    "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
+    "description": "",
+    "tags": [],
+    "watch_count": "1",
+    "star_count": "0",
     "fork_count": "0",
-    "last_commit": "on Apr 12, 2012",
-    "position": 127
+    "last_commit": "Aug 29, 2022",
+    "position": 118
   }
 ]
 ```
 </details><br>
 
- Okay, that was easy. How about javascript rendered websites with dynamic HTML? Lets scrape a page with infinite scroll:
+ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
 
 ```ruby
 # infinite_scroll_spider.rb
@@ -172,7 +166,7 @@ class InfiniteScrollSpider < Kimurai::Base
         logger.info "> Pagination is done" and break
       else
         count = new_count
-        logger.info "> Continue scrolling, current count is #{count}..."
+        logger.info "> Continue scrolling, current posts count is #{count}..."
       end
     end
 
@@ -188,49 +182,46 @@ InfiniteScrollSpider.crawl!
 <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
 
 ```
- I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
- I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
- D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
- I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
- I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
- I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
- I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
- I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
+ $ ruby infinite_scroll_spider.rb
 
+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
+ I, [2025-12-16 12:47:11] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
+ I, [2025-12-16 12:47:13] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
+ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
+ I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
+ I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
 ```
 </details><br>
 
 
 ## Features
- * Scrape javascript rendered websites out of box
- * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
+ * Scrape JavaScript rendered websites out of the box
+ * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
 * Write spider code once, and use it with any supported engine later
 * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
 * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
- * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
 * Automatically [handle requests errors](#handle-request-errors)
 * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
 * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
 * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel` (see the sketch after this list)
 * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
 * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
- * Automated [server environment setup](#setup) (for ubuntu 18.04) and [deploy](#deploy) using commands `kimurai setup` and `kimurai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
- * Command-line [runner](#runner) to run all project spiders one by one or in parallel
+ * Command-line [runner](#runner) to run all project spiders one-by-one or in parallel
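A minimal sketch of the parallel feature flagged above (hypothetical spider and site; it assumes the `in_parallel(:method_name, urls, threads:)` signature documented in the parallel crawling section of this README):

```ruby
require 'kimurai'

class ParallelSpider < Kimurai::Base
  @name = "parallel_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/products"] # hypothetical listing page

  def parse(response, url:, data: {})
    # Collect absolute product urls from the listing, then scrape them
    # concurrently in 3 spider instances instead of one by one
    urls = response.xpath("//a[@class='product-link']").map { |a| absolute_url(a[:href], base: url) }
    in_parallel(:parse_product, urls, threads: 3)
  end

  def parse_product(response, url:, data: {})
    save_to "products.json", { url: url, title: response.xpath("//title").text.squish }, format: :json
  end
end

ParallelSpider.crawl!
```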
 
 ## Table of Contents
 * [Kimurai](#kimurai)
 * [Features](#features)
 * [Table of Contents](#table-of-contents)
 * [Installation](#installation)
- * [Getting to Know](#getting-to-know)
+ * [Getting to know Kimurai](#getting-to-know-kimurai)
 * [Interactive console](#interactive-console)
 * [Available engines](#available-engines)
 * [Minimum required spider structure](#minimum-required-spider-structure)
@@ -239,9 +230,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * [request_to method](#request_to-method)
 * [save_to helper](#save_to-helper)
 * [Skip duplicates](#skip-duplicates)
- * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
+ * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
 * [Storage object](#storage-object)
- * [Handle request errors](#handle-request-errors)
+ * [Handling request errors](#handling-request-errors)
 * [skip_request_errors](#skip_request_errors)
 * [retry_request_errors](#retry_request_errors)
 * [Logging custom events](#logging-custom-events)
@@ -251,13 +242,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * [Active Support included](#active-support-included)
 * [Schedule spiders using Cron](#schedule-spiders-using-cron)
 * [Configuration options](#configuration-options)
- * [Using Kimurai inside existing Ruby application](#using-kimurai-inside-existing-ruby-application)
+ * [Using Kimurai inside existing Ruby applications](#using-kimurai-inside-existing-ruby-applications)
 * [crawl! method](#crawl-method)
 * [parse! method](#parsemethod_name-url-method)
 * [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
- * [Automated sever setup and deployment](#automated-sever-setup-and-deployment)
-   * [Setup](#setup)
-   * [Deploy](#deploy)
 * [Spider @config](#spider-config)
 * [All available @config options](#all-available-config-options)
 * [@config settings inheritance](#config-settings-inheritance)
@@ -274,187 +262,111 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 
 
 ## Installation
- Kimurai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
+ Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.
 
- 1) If your system doesn't have appropriate Ruby version, install it:
+ 1) If your system doesn't have the appropriate Ruby version, install it:
 
 <details/>
- <summary>Ubuntu 18.04</summary>
+ <summary>Ubuntu 24.04</summary>
 
 ```bash
- # Install required packages for ruby-build
+ # Install required system packages
 sudo apt update
- sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev
-
- # Install rbenv and ruby-build
- cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
- echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
- echo 'eval "$(rbenv init -)"' >> ~/.bashrc
- exec $SHELL
+ sudo apt install build-essential rustc libssl-dev libyaml-dev zlib1g-dev libgmp-dev
 
- git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
- echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
- exec $SHELL
+ # Install the mise version manager
+ curl https://mise.run | sh
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.bashrc
+ source ~/.bashrc
 
 # Install latest Ruby
- rbenv install 2.5.3
- rbenv global 2.5.3
-
- gem install bundler
+ mise use --global ruby@3
+ gem update --system
 ```
 </details>
 
 <details/>
- <summary>Mac OS X</summary>
+ <summary>macOS</summary>
 
 ```bash
- # Install homebrew if you don't have it https://brew.sh/
- # Install rbenv and ruby-build:
- brew install rbenv ruby-build
+ # Install Homebrew if you don't have it https://brew.sh/
+ brew install openssl@3 libyaml gmp rust
 
- # Add rbenv to bash so that it loads every time you open a terminal
- echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
- source ~/.bash_profile
+ # Install the mise version manager
+ curl https://mise.run | sh
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.zshrc
+ source ~/.zshrc
 
 # Install latest Ruby
- rbenv install 2.5.3
- rbenv global 2.5.3
-
- gem install bundler
+ mise use --global ruby@3
+ gem update --system
 ```
 </details>
 
 2) Install Kimurai gem: `$ gem install kimurai`
 
- 3) Install browsers with webdrivers:
+ 3) Install browsers:
 
 <details/>
- <summary>Ubuntu 18.04</summary>
-
- Note: for Ubuntu 16.04-18.04 there is available automatic installation using `setup` command:
- ```bash
- $ kimurai setup localhost --local --ask-sudo
- ```
- It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/kimurai/automation).
-
- If you chose automatic installation, you can skip following and go to "Getting To Know" part. In case if you want to install everything manually:
+ <summary>Ubuntu 24.04</summary>
 
 ```bash
 # Install basic tools
 sudo apt install -q -y unzip wget tar openssl
 
- # Install xvfb (for virtual_display headless mode, in additional to native)
+ # Install xvfb (for virtual_display headless mode, in addition to native)
 sudo apt install -q -y xvfb
-
- # Install chromium-browser and firefox
- sudo apt install -q -y chromium-browser firefox
-
- # Instal chromedriver (2.44 version)
- # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
- cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
- sudo unzip chromedriver_linux64.zip -d /usr/local/bin
- rm -f chromedriver_linux64.zip
-
- # Install geckodriver (0.23.0 version)
- # All versions located here https://github.com/mozilla/geckodriver/releases/
- cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
- sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
- rm -f geckodriver-v0.23.0-linux64.tar.gz
-
- # Install PhantomJS (2.1.1)
- # All versions located here http://phantomjs.org/download.html
- sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
- cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
- tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
- sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
- sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
- rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
 ```
 
- </details>
-
- <details/>
- <summary>Mac OS X</summary>
+ The latest automatically installed Selenium drivers don't work well with the Ubuntu Snap versions of Chrome and Firefox, so we need to install the classic .deb versions and make sure they take precedence over the Snap versions:
 
 ```bash
- # Install chrome and firefox
- brew cask install google-chrome firefox
-
- # Install chromedriver (latest)
- brew cask install chromedriver
-
- # Install geckodriver (latest)
- brew install geckodriver
-
- # Install PhantomJS (latest)
- brew install phantomjs
+ # Install google chrome
+ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
+ sudo apt-get install -y ./google-chrome-stable_current_amd64.deb
 ```
- </details><br>
 
- Also, if you want to save scraped items to the database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
+ ```bash
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
+ # See https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04
+ sudo snap remove firefox
 
- <details/>
- <summary>Ubuntu 18.04</summary>
+ sudo install -d -m 0755 /etc/apt/keyrings
+ wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg -O- | sudo tee /etc/apt/keyrings/packages.mozilla.org.asc > /dev/null
 
- SQlite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.
+ echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" | sudo tee -a /etc/apt/sources.list.d/mozilla.list > /dev/null
 
- If you want to connect to a remote database, you don't need database server on a local machine (only client):
- ```bash
- # Install MySQL client
- sudo apt -q -y install mysql-client libmysqlclient-dev
+ echo '
+ Package: *
+ Pin: origin packages.mozilla.org
+ Pin-Priority: 1000
 
- # Install Postgres client
- sudo apt install -q -y postgresql-client libpq-dev
+ Package: firefox*
+ Pin: release o=Ubuntu
+ Pin-Priority: -1' | sudo tee /etc/apt/preferences.d/mozilla
 
- # Install MongoDB client
- sudo apt install -q -y mongodb-clients
- ```
-
- But if you want to save items to a local database, database server required as well:
- ```bash
- # Install MySQL client and server
- sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
-
- # Install Postgres client and server
- sudo apt install -q -y postgresql postgresql-contrib libpq-dev
-
- # Install MongoDB client and server
- # version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
- sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
- # for 16.04:
- # echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
- # for 18.04:
- echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
- sudo apt update
- sudo apt install -q -y mongodb-org
- sudo service mongod start
+ sudo apt update && sudo apt remove firefox
+ sudo apt install firefox
 ```
 </details>
 
 <details/>
- <summary>Mac OS X</summary>
-
- SQlite: `$ brew install sqlite3`
+ <summary>macOS</summary>
 
 ```bash
- # Install MySQL client and server
- brew install mysql
- # Start server if you need it: brew services start mysql
-
- # Install Postgres client and server
- brew install postgresql
- # Start server if you need it: brew services start postgresql
-
- # Install MongoDB client and server
- brew install mongodb
- # Start server if you need it: brew services start mongodb
+ # Install google chrome
+ brew install google-chrome
 ```
- </details>
 
+ ```bash
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
+ brew install firefox
+ ```
+ </details><br>
 
- ## Getting to Know
+ ## Getting to know Kimurai
 ### Interactive console
- Before you get to know all Kimurai features, there is `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
+ Before you get to know all of Kimurai's features, there is a `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
 
 ```bash
 $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
@@ -466,76 +378,45 @@ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/ki
 ```
 $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
 
- D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
- D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
- I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
- I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
- D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701
-
- From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
-
-     188: def console(response = nil, url: nil, data: {})
-  => 189:   binding.pry
-     190: end
-
- [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
- => "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
-
- [2] pry(#<Kimurai::Base>)> ls
- Kimurai::Base#methods: browser  console  logger  request_to  save_to  unique?
- instance variables: @browser  @config  @engine  @logger  @pipelines
- locals: _  __  _dir_  _ex_  _file_  _in_  _out_  _pry_  data  response  url
-
- [3] pry(#<Kimurai::Base>)> ls response
- Nokogiri::XML::PP::Node#methods: inspect  pretty_print
- Nokogiri::XML::Searchable#methods: %  /  at  at_css  at_xpath  css  search  xpath
- Enumerable#methods:
-   all?  collect  drop  each_with_index  find_all  grep_v  lazy  member?  none?  reject  slice_when  take_while  without
-   any?  collect_concat  drop_while  each_with_object  find_index  group_by  many?  min  one?  reverse_each  sort  to_a  zip
-   as_json  count  each_cons  entries  first  include?  map  min_by  partition  select  sort_by  to_h
-   chunk  cycle  each_entry  exclude?  flat_map  index_by  max  minmax  pluck  slice_after  sum  to_set
-   chunk_while  detect  each_slice  find  grep  inject  max_by  minmax_by  reduce  slice_before  take  uniq
- Nokogiri::XML::Node#methods:
-   <=>  append_class  classes  document?  has_attribute?  matches?  node_name=  processing_instruction?  to_str
-   ==  attr  comment?  each  html?  name=  node_type  read_only?  to_xhtml
-   >  attribute  content  elem?  inner_html  namespace=  parent=  remove  traverse
-   []  attribute_nodes  content=  element?  inner_html=  namespace_scopes  parse  remove_attribute  unlink
-   []=  attribute_with_ns  create_external_subset  element_children  inner_text  namespaced_key?  path  remove_class  values
-   accept  before  create_internal_subset  elements  internal_subset  native_content=  pointer_id  replace  write_html_to
-   add_class  blank?  css_path  encode_special_chars  key?  next  prepend_child  set_attribute  write_to
-   add_next_sibling  cdata?  decorate!  external_subset  keys  next=  previous  text  write_xhtml_to
-   add_previous_sibling  child  delete  first_element_child  lang  next_element  previous=  text?  write_xml_to
-   after  children  description  fragment?  lang=  next_sibling  previous_element  to_html  xml?
-   ancestors  children=  do_xinclude  get_attribute  last_element_child  node_name  previous_sibling  to_s
- Nokogiri::XML::Document#methods:
-   <<  canonicalize  collect_namespaces  create_comment  create_entity  decorate  document  encoding  errors  name  remove_namespaces!  root=  to_java  url  version
-   add_child  clone  create_cdata  create_element  create_text_node  decorators  dup  encoding=  errors=  namespaces  root  slop!  to_xml  validate
- Nokogiri::HTML::Document#methods: fragment  meta_encoding  meta_encoding=  serialize  title  title=  type
- instance variables: @decorators  @errors  @node_cache
-
- [4] pry(#<Kimurai::Base>)> exit
- I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
- $
+ D, [2025-12-16 13:08:41 +0300#37718] [M: 1208] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
+ I, [2025-12-16 13:08:41 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
+ I, [2025-12-16 13:08:43 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
+
+ From: /Users/vic/code/spiders/kimuraframework/lib/kimurai/base.rb:208 Kimurai::Base#console:
+
+     207: def console(response = nil, url: nil, data: {})
+  => 208:   binding.pry
+     209: end
+
+ [1] pry(#<Kimurai::Base>)> response.css('title').text
+ => "GitHub - vifreefly/kimuraframework: Kimurai is a modern Ruby web scraping framework that supports scraping with antidetect Chrome/Firefox as well as HTTP requests"
+ [2] pry(#<Kimurai::Base>)> browser.current_url
+ => "https://github.com/vifreefly/kimuraframework"
+ [3] pry(#<Kimurai::Base>)> browser.visit('https://google.com')
+ I, [2025-12-16 13:09:24 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://google.com
+ I, [2025-12-16 13:09:26 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://google.com
+ => true
+ [4] pry(#<Kimurai::Base>)> browser.current_response.title
+ => "Google"
 ```
 </details><br>
 
- CLI options:
+ CLI arguments:
 * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
- * `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
+ * `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
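For example, since `--url` is optional, you can start a blank session (`$ kimurai console --engine mechanize`) and navigate by hand from inside pry. A hypothetical snippet (the calls mirror the transcript above; the target site is illustrative):

```ruby
browser.visit("https://example.com")  # navigate to any page; `response` and `url` start out as nil
response = browser.current_response   # grab the parsed page as a Nokogiri document
response.xpath("//title").text        # query it just like in any spider method
```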
 
 ### Available engines
- Kimurai has support for following engines and mostly can switch between them without need to rewrite any code:
+ Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:
 
- * `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render javascript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use javascript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
- * `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Kimurai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
- * `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper javascript rendering.
- * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
+ * `:mechanize` [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize whenever possible, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
+ * `:selenium_chrome` Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+ * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
 
- **Tip:** add `HEADLESS=false` ENV variable before command (`$ HEADLESS=false ruby spider.rb`) to run browser in normal (not headless) mode and see it's window (only for selenium-like engines). It works for [console](#interactive-console) command as well.
+ **Tip:** prepend a `HEADLESS=false` environment variable on the command line (e.g. `$ HEADLESS=false ruby spider.rb`) to launch the browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
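Because all engines expose the same spider API, switching engines is a one-line change. A minimal sketch (hypothetical spider and site):

```ruby
require 'kimurai'

class TitleSpider < Kimurai::Base
  @name = "title_spider"
  # Start with the fast, lightweight mechanize engine; swap in :selenium_chrome
  # or :selenium_firefox if the page needs JavaScript rendering. The parse
  # method below stays exactly the same.
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    save_to "titles.json", { url: url, title: response.xpath("//title").text.squish }, format: :json
  end
end

TitleSpider.crawl!
```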
 
 
 ### Minimum required spider structure
- > You can manually create a spider file, or use generator instead: `$ kimurai generate spider simple_spider`
+ > You can manually create a spider file, or use the generate command: `$ kimurai generate spider simple_spider`
 
 ```ruby
 require 'kimurai'
@@ -553,10 +434,10 @@ SimpleSpider.crawl!
 ```
 
 Where:
- * `@name` name of a spider. You can omit name if use single-file spider
- * `@engine` engine for a spider
- * `@start_urls` array of start urls to process one by one inside `parse` method
- * Method `parse` is the start method, should be always present in spider class
+ * `@name` a name for the spider
+ * `@engine` engine to use for the spider
+ * `@start_urls` array of urls to process one-by-one inside the `parse` method
+ * The `parse` method is the entry point, and should always be present in a spider class
 
 
 ### Method arguments `response`, `url` and `data`
@@ -566,14 +447,14 @@ def parse(response, url:, data: {})
 end
 ```
 
- * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
- * `url` (String) url of a processed webpage
- * `data` (Hash) uses to pass data between requests
+ * `response` [Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object – contains parsed HTML code of a processed webpage
+ * `url` String – url of a processed webpage
+ * `data` – Hash used to pass data between requests
 
 <details/>
- <summary><strong>Example how to use <code>data</code></strong></summary>
+ <summary><strong>An example of how to use <code>data</code></strong></summary>
 
- Imagine that there is a product page which doesn't contain product category. Category name present only on category page with pagination. This is the case where we can use `data` to pass category name from `parse` to `parse_product` method:
+ Imagine that there is a product page that doesn't contain a category name. The category name is only present on category pages with pagination. This is a case where we can use `data` to pass a category name from `parse` to `parse_product`:
 
 ```ruby
 class ProductsSpider < Kimurai::Base
@@ -583,7 +464,7 @@ class ProductsSpider < Kimurai::Base
   def parse(response, url:, data: {})
     category_name = response.xpath("//path/to/category/name").text
     response.xpath("//path/to/products/urls").each do |product_url|
-      # Merge category_name with current data hash and pass it next to parse_product method
+      # Merge category_name with current data hash and pass it to parse_product
       request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
     end
 
@@ -592,7 +473,7 @@ class ProductsSpider < Kimurai::Base
 
   def parse_product(response, url:, data: {})
     item = {}
-    # Assign item's category_name from data[:category_name]
+    # Assign an item's category_name from data[:category_name]
     item[:category_name] = data[:category_name]
 
     # ...
@@ -603,16 +484,16 @@ end
 </details><br>
 
 **You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:
- * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com
- * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com
- * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info
+ * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) ruby.bastardsbook.com
+ * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) readysteadycode.com
+ * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) rubydoc.info
 
 
 ### `browser` object
 
- From any spider instance method there is available `browser` object, which is [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
+ A `browser` object, which is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) instance, is available from any spider instance method and is used to process requests and fetch page responses (`current_response` method). Usually, you don't need to touch it directly because `response` (see above) contains the page response after it was loaded.
 
- But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
+ But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc), `browser` is ready for you:
 
 ```ruby
 class GoogleSpider < Kimurai::Base
@@ -624,7 +505,7 @@ class GoogleSpider < Kimurai::Base
     browser.fill_in "q", with: "Kimurai web scraping framework"
     browser.click_button "Google Search"
 
-    # Update response to current response after interaction with a browser
+    # Update response with current_response after interaction with a browser
     response = browser.current_response
 
     # Collect results
@@ -638,13 +519,13 @@ end
 ```
 
 Check out **Capybara cheat sheets** where you can see all available methods **to interact with browser**:
- * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com
- * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com
- * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info
+ * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) cheatrags.com
+ * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) thoughtbot.com
+ * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) rubydoc.info
 
 ### `request_to` method
 
- For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it). Example:
528
+ For making requests to a particular method, there is `request_to`. It requires at least two arguments: `:method_name` and `url:`. And, optionally `data:` (see above). Example:
648
529
 
649
530
  ```ruby
650
531
  class Spider < Kimurai::Base
@@ -662,7 +543,7 @@ class Spider < Kimurai::Base
662
543
  end
663
544
  ```
664
545
 
665
- Under the hood `request_to` simply call [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then required method with arguments:
546
+ Under the hood, `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then calls the provided method with its arguments:
666
547
 
667
548
  <details/>
668
549
  <summary>request_to</summary>
@@ -677,10 +558,10 @@ end
677
558
  ```
678
559
  </details><br>
679
560
 
680
- `request_to` just makes things simpler, and without it we could do something like:
561
+ The `request_to` helper method just makes things simpler; without it, we could do something like:
681
562
 
682
563
  <details/>
683
- <summary>Check the code</summary>
564
+ <summary>See the code</summary>
684
565
 
685
566
  ```ruby
686
567
  class Spider < Kimurai::Base
@@ -703,7 +584,7 @@ end
703
584
 
704
585
  ### `save_to` helper
705
586
 
706
- Sometimes all that you need is to simply save scraped data to a file format, like JSON or CSV. You can use `save_to` for it:
587
+ Sometimes all you need is to simply save scraped data to a file. You can use the `save_to` helper method like so:
707
588
 
708
589
  ```ruby
709
590
  class ProductsSpider < Kimurai::Base
@@ -719,31 +600,31 @@ class ProductsSpider < Kimurai::Base
719
600
  item[:description] = response.xpath("//desc/path").text.squish
720
601
  item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f
721
602
 
722
- # Add each new item to the `scraped_products.json` file:
603
+ # Append each new item to the `scraped_products.json` file:
723
604
  save_to "scraped_products.json", item, format: :json
724
605
  end
725
606
  end
726
607
  ```
727
608
 
728
609
  Supported formats:
729
- * `:json` JSON
730
- * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
731
- * `:jsonlines` [JSON Lines](http://jsonlines.org/)
732
- * `:csv` CSV
610
+ * `:json` – JSON
611
+ * `:pretty_json` – "pretty" JSON (`JSON.pretty_generate`)
612
+ * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
613
+ * `:csv` – CSV
733
614
 
734
- Note: `save_to` requires data (item to save) to be a `Hash`.
615
+ Note: `save_to` requires the item being saved to be a `Hash`.
735
616
 
736
- By default `save_to` add position key to an item hash. You can disable it with `position: false`: `save_to "scraped_products.json", item, format: :json, position: false`.
617
+ By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`
737
618
 
738
619
  **How helper works:**
739
620
 
740
- Until spider stops, each new item will be appended to a file. At the next run, helper will clear the content of a file first, and then start again appending items to it.
621
+ While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.
741
622
 
742
- > If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
623
+ > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`
743
624
 
744
625
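+ For example, a minimal sketch combining both options to accumulate CSV rows across runs (the file and field names are illustrative):
+
+ ```ruby
+ class ProductsSpider < Kimurai::Base
+   @start_urls = ["https://example.com/products/"]
+
+   def parse(response, url:, data: {})
+     response.xpath("//product/path").each do |product|
+       item = { title: product.xpath("title/path").text.squish }
+
+       # Append to products.csv, keep rows from previous runs,
+       # and don't add the auto-generated position column:
+       save_to "products.csv", item, format: :csv, append: true, position: false
+     end
+   end
+ end
+ ```
+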
  ### Skip duplicates
745
626
 
746
- It's pretty common when websites have duplicated pages. For example when an e-commerce shop has the same products in different categories. To skip duplicates, there is simple `unique?` helper:
627
+ It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
747
628
 
748
629
  ```ruby
749
630
  class ProductsSpider < Kimurai::Base
@@ -766,11 +647,11 @@ class ProductsSpider < Kimurai::Base
766
647
  end
767
648
  end
768
649
 
769
- # Or/and check products for uniqueness using product sku inside of parse_product:
650
+ # And/or check products for uniqueness using the product sku inside parse_product:
770
651
  def parse_product(response, url:, data: {})
771
652
  item = {}
772
653
  item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
773
- # Don't save product and return from method if there is already saved item with the same sku:
654
+ # Don't save the product if there is already an item with the same sku:
774
655
  return unless unique?(:sku, item[:sku])
775
656
 
776
657
  # ...
@@ -779,14 +660,14 @@ class ProductsSpider < Kimurai::Base
779
660
  end
780
661
  ```
781
662
 
782
- `unique?` helper works pretty simple:
663
+ The `unique?` helper works quite simply:
783
664
 
784
665
  ```ruby
785
- # Check string "http://example.com" in scope `url` for a first time:
666
+ # Check for "http://example.com" in `url` scope for the first time:
786
667
  unique?(:url, "http://example.com")
787
668
  # => true
788
669
 
789
- # Try again:
670
+ # Next time:
790
671
  unique?(:url, "http://example.com")
791
672
  # => false
792
673
  ```
@@ -804,44 +685,44 @@ unique?(:id, 324234232)
804
685
  unique?(:custom, "Lorem Ipsum")
805
686
  ```
806
687
 
807
- #### Automatically skip all duplicated requests urls
688
+ #### Automatically skip all duplicate request urls
808
689
 
809
- It is possible to automatically skip all already visited urls while calling `request_to` method, using [@config](#all-available-config-options) option `skip_duplicate_requests: true`. With this option, all already visited urls will be automatically skipped. Also check the [@config](#all-available-config-options) for an additional options of this setting.
690
+ It's possible to automatically skip any previously visited urls when calling the `request_to` method using the `skip_duplicate_requests: true` config option. See [@config](#all-available-config-options) for additional options.
810
691
 
811
692
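+ A minimal sketch of the option in use (urls and xpaths are illustrative):
+
+ ```ruby
+ class ProductsSpider < Kimurai::Base
+   @start_urls = ["https://example.com/"]
+   @config = {
+     # Remember all urls visited via request_to and skip repeats:
+     skip_duplicate_requests: true
+   }
+
+   def parse(response, url:, data: {})
+     response.xpath("//a/@href").map(&:text).each do |product_url|
+       # A second request to the same product url will be skipped automatically:
+       request_to :parse_product, url: product_url
+     end
+   end
+
+   def parse_product(response, url:, data: {})
+     # ...
+   end
+ end
+ ```
+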
  #### `storage` object
812
693
 
813
- `unique?` method it's just an alias for `storage#unique?`. Storage has several methods:
694
+ The `unique?` method is just an alias for `storage#unique?`. Storage has several methods:
814
695
 
815
- * `#all` - display storage hash where keys are existing scopes.
816
- * `#include?(scope, value)` - return `true` if value in the scope exists, and `false` if not
817
- * `#add(scope, value)` - add value to the scope
818
- * `#unique?(scope, value)` - method already described above, will return `false` if value in the scope exists, or return `true` + add value to the scope if value in the scope not exists.
819
- * `#clear!` - reset the whole storage by deleting all values from all scopes.
696
+ * `#all` – returns all scopes
697
+ * `#add(scope, value)` – adds a value to the scope
698
+ * `#include?(scope, value)` – returns `true` if the value exists in the scope, or `false` if it doesn't
699
+ * `#unique?(scope, value)` – returns `false` if the value exists in the scope, otherwise adds the value to the scope and returns `true`
700
+ * `#clear!` – deletes all values from all scopes
820
701
 
821
702
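+ A short sketch using `storage` directly (the scope and xpath are illustrative):
+
+ ```ruby
+ def parse_product(response, url:, data: {})
+   sku = response.xpath("//product/sku/path").text.strip.upcase
+
+   # Equivalent to `return unless unique?(:sku, sku)`,
+   # spelled out with the lower-level storage methods:
+   return if storage.include?(:sku, sku)
+   storage.add(:sku, sku)
+
+   # ...
+ end
+ ```
+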
 
822
- ### Handle request errors
823
- It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
703
+ ### Handling request errors
704
+ It's common while crawling web pages to get response codes other than `200 OK`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
824
705
 
825
706
  #### skip_request_errors
826
- You can automatically skip some of errors while requesting a page using `skip_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught, and request will be skipped. It is a good idea to skip errors like NotFound(404), etc.
707
+ Kimurai can automatically skip certain errors while performing requests using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be skipped. It's a good idea to skip errors like `404 Not Found`, etc.
827
708
 
828
- Format for the option: array where elements are error classes or/and hashes. You can use _hash_ format for more flexibility:
709
+ `skip_request_errors` is an array of error classes and/or hashes. You can use a _hash_ for more flexibility like so:
829
710
 
830
711
  ```
831
712
  @config = {
832
- skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
713
+ skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }, { error: TimeoutError }]
833
714
  }
834
715
  ```
835
- In this case, provided `message:` will be compared with a full error message using `String#include?`. Also you can use regex instead: `{ error: RuntimeError, message: /404|403/ }`.
716
+ In this case, the provided `message:` will be compared with a full error message using `String#include?`. You can also use regex like so: `{ error: RuntimeError, message: /404|403/ }`.
836
717
 
837
718
  #### retry_request_errors
838
- You can automatically retry some of errors with a few attempts while requesting a page using `retry_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught and the request will be processed again within a delay.
719
+ Kimurai can automatically retry requests several times after certain errors with the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be processed again with progressive delay.
839
720
 
840
- There are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If after 3 attempts there is still an exception, then the exception will be raised. It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
721
+ There are 3 attempts with _15 sec_, _30 sec_, and _45 sec_ delays, respectively. If after 3 attempts there is still an exception, then the exception will be raised. It's a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
841
722
 
842
- Format for the option: same like for `skip_request_errors` option.
723
+ The format for `retry_request_errors` is the same as for `skip_request_errors`.
843
724
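+ For example (a sketch; the error list is illustrative):
+
+ ```ruby
+ @config = {
+   retry_request_errors: [Net::ReadTimeout, { error: RuntimeError, message: /502|503/ }]
+ }
+ ```
+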
 
844
- If you would like to skip (not raise) error after all retries gone, you can specify `skip_on_failure: true` option:
725
+ If you would like to skip (not raise) the error after the 3 retries, you can specify `skip_on_failure: true` like so:
845
726
 
846
727
  ```ruby
847
728
  @config = {
@@ -851,7 +732,7 @@ If you would like to skip (not raise) error after all retries gone, you can spec
851
732
 
852
733
  ### Logging custom events
853
734
 
854
- It is possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using `add_event('Some message')` method. This feature helps you to keep track on important things which happened during crawling without checking the whole spider log (in case if you're logging these messages using `logger`). Example:
735
+ It's possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you to keep track of important events during crawling without checking the whole spider log (in case you're logging these messages using `logger`). For example:
855
736
 
856
737
  ```ruby
857
738
  def parse_product(response, url:, data: {})
@@ -872,7 +753,7 @@ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider:
872
753
 
873
754
  ### `open_spider` and `close_spider` callbacks
874
755
 
875
- You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before spider started or after spider has been stopped:
756
+ You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action(s) before or after the spider runs:
876
757
 
877
758
  ```ruby
878
759
  require 'kimurai'
@@ -917,7 +798,7 @@ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider:
917
798
  ```
918
799
  </details><br>
919
800
 
920
- Inside `open_spider` and `close_spider` class methods there is available `run_info` method which contains useful information about spider state:
801
+ The `run_info` method is available from the `open_spider` and `close_spider` class methods. It contains useful information about the spider state:
921
802
 
922
803
  ```ruby
923
804
  11: def self.open_spider
@@ -937,7 +818,7 @@ Inside `open_spider` and `close_spider` class methods there is available `run_in
937
818
  }
938
819
  ```
939
820
 
940
- Inside `close_spider`, `run_info` will be updated:
821
+ By the time `close_spider` runs, `run_info` will be updated:
941
822
 
942
823
  ```ruby
943
824
  15: def self.close_spider
@@ -957,7 +838,7 @@ Inside `close_spider`, `run_info` will be updated:
957
838
  }
958
839
  ```
959
840
 
960
- `run_info[:status]` helps to determine if spider was finished successfully or failed (possible values: `:completed`, `:failed`):
841
+ `run_info[:status]` helps to determine if the spider finished successfully or failed (possible values: `:completed`, `:failed`):
961
842
 
962
843
  ```ruby
963
844
  class ExampleSpider < Kimurai::Base
@@ -1005,12 +886,12 @@ example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMe
1005
886
  ```
1006
887
  </details><br>
1007
888
 
1008
- **Usage example:** if spider finished successfully, send JSON file with scraped items to a remote FTP location, otherwise (if spider failed), skip incompleted results and send email/notification to slack about it:
889
+ **Usage example:** if the spider finished successfully, send a JSON file with scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/notification to Slack about it:
1009
890
 
1010
891
  <details/>
1011
892
  <summary>Example</summary>
1012
893
 
1013
- Also you can use additional methods `completed?` or `failed?`
894
+ You can also use the additional methods `completed?` or `failed?`:
1014
895
 
1015
896
  ```ruby
1016
897
  class Spider < Kimurai::Base
@@ -1047,7 +928,7 @@ end
1047
928
 
1048
929
 
1049
930
  ### `KIMURAI_ENV`
1050
- Kimurai has environments, default is `development`. To provide custom environment pass `KIMURAI_ENV` ENV variable before command: `$ KIMURAI_ENV=production ruby spider.rb`. To access current environment there is `Kimurai.env` method.
931
+ Kimurai supports environments. The default is `development`. To set a custom environment, pass the `KIMURAI_ENV` environment variable before the command: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment, there is a `Kimurai.env` method.
1051
932
 
1052
933
  Usage example:
1053
934
  ```ruby
@@ -1068,7 +949,7 @@ end
1068
949
  ```
1069
950
 
1070
951
  ### Parallel crawling using `in_parallel`
1071
- Kimurai can process web pages concurrently in one single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is a method to process, `urls` is array of urls to crawl and `threads:` is a number of threads:
952
+ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process, `urls` is an array of urls to crawl, and `threads:` is the number of threads:
1072
953
 
1073
954
  ```ruby
1074
955
  # amazon_spider.rb
@@ -1083,7 +964,7 @@ class AmazonSpider < Kimurai::Base
1083
964
  browser.fill_in "field-keywords", with: "Web Scraping Books"
1084
965
  browser.click_on "Go"
1085
966
 
1086
- # Walk through pagination and collect products urls:
967
+ # Walk through pagination and collect product urls:
1087
968
  urls = []
1088
969
  loop do
1089
970
  response = browser.current_response
@@ -1094,7 +975,7 @@ class AmazonSpider < Kimurai::Base
1094
975
  browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
1095
976
  end
1096
977
 
1097
- # Process all collected urls concurrently within 3 threads:
978
+ # Process all collected urls concurrently using 3 threads:
1098
979
  in_parallel(:parse_book_page, urls, threads: 3)
1099
980
  end
1100
981
 
@@ -1117,50 +998,22 @@ AmazonSpider.crawl!
1117
998
  <summary>Run: <code>$ ruby amazon_spider.rb</code></summary>
1118
999
 
1119
1000
  ```
1120
- I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: started: amazon_spider
1121
- D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1122
- I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
1123
- I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
1124
- I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
1125
-
1126
- I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
1127
- D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1128
- I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1129
- D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1130
- I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1131
- D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1132
- I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1133
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1134
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
1135
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
1136
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1137
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
1138
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
1139
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1140
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
1141
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/
1001
+ $ ruby amazon_spider.rb
1142
1002
 
1143
1003
  ...
1144
1004
 
1145
- I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
1146
- I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1147
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
1148
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
1149
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1150
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
1151
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
1152
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1153
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1154
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
1155
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1156
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1157
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
1158
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1159
-
1160
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
1161
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1162
-
1163
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
1005
+ I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 305, responses: 305
1006
+ I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
1007
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
1008
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 306, responses: 306
1009
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
1010
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
1011
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 307, responses: 307
1012
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1013
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: in_parallel: stopped processing 306 urls within 3 threads, total time: 2m, 37s
1014
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1015
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: stopped: {spider_name: "amazon_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 13:45:12.5338 +0300, stop_time: 2025-12-16 13:48:23.526221 +0300, running_time: "3m, 10s", visits: {requests: 307, responses: 307}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
1164
1017
 
1165
1018
  ```
1166
1019
  </details>
@@ -1171,35 +1024,39 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
1171
1024
  ```json
1172
1025
  [
1173
1026
  {
1174
- "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
1175
- "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1176
- "price": "$26.94",
1177
- "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
1027
+ "title": "Web Scraping with Python: Data Extraction from the Modern Web 3rd Edition",
1028
+ "url": "https://www.amazon.com/Web-Scraping-Python-Extraction-Modern/dp/1098145356/",
1029
+ "price": "$27.00",
1030
+ "author": "Ryan Mitchell",
1031
+ "publication_date": "March 26, 2024",
1178
1032
  "position": 1
1179
1033
  },
1180
1034
  {
1181
- "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
1182
- "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
1183
- "price": "$39.99",
1184
- "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
1035
+ "title": "Web Scraping with Python: Collecting More Data from the Modern Web 2nd Edition",
1036
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1037
+ "price": "$13.20 - $38.15",
1038
+ "author": "Ryan Mitchell",
1039
+ "publication_date": "May 8, 2018",
1185
1040
  "position": 2
1186
1041
  },
1187
1042
  {
1188
- "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
1189
- "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
1190
- "price": "$15.75",
1191
- "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
1043
+ "title": "Scripting: Automation with Bash, PowerShell, and Python—Automate Everyday IT Tasks from Backups to Web Scraping in Just a Few Lines of Code (Rheinwerk Computing) First Edition",
1044
+ "url": "https://www.amazon.com/Scripting-Automation-Bash-PowerShell-Python/dp/1493225561/",
1045
+ "price": "$47.02",
1046
+ "author": "Michael Kofler",
1047
+ "publication_date": "February 25, 2024",
1192
1048
  "position": 3
1193
1049
  },
1194
1050
 
1195
- ...
1196
-
1051
+ // ...
1052
+
1197
1053
  {
1198
- "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
1199
- "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
1200
- "price": "$35.82",
1201
- "publisher": "Packt Publishing (2013-08-26) (1896)",
1202
- "position": 52
1054
+ "title": "Introduction to Python Important points for efficient data collection with scraping (Japanese Edition) Kindle Edition",
1055
+ "url": "https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/",
1056
+ "price": "$0.00",
1057
+ "author": "r",
1058
+ "publication_date": "April 24, 2024",
1059
+ "position": 306
1203
1060
  }
1204
1061
  ]
1205
1062
  ```
@@ -1207,11 +1064,12 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
1207
1064
 
1208
1065
  > Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
1209
1066
 
1210
- `in_parallel` can take additional options:
1211
- * `data:` pass with urls custom data hash: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1212
- * `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1213
- * `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
1214
- * `config:` pass custom options to config (see [config section](#crawler-config))
1067
+ `in_parallel` can take additional parameters:
1068
+
1069
+ * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1070
+ * `delay:` – set a delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand(2..5) # => 3`
1070
+ * `engine:` – set a custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
1072
+ * `config:` – set custom [config](#spider-config) options
1215
1073
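+ Putting the parameters together (a sketch; the method name and urls are illustrative):
+
+ ```ruby
+ in_parallel(
+   :parse_product,
+   urls,
+   threads: 3,
+   data: { category: "Scraping" },
+   delay: 2..5,
+   config: { skip_duplicate_requests: true }
+ )
+ ```
+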
 
1216
1074
  ### Active Support included
1217
1075
 
@@ -1219,7 +1077,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
1219
1077
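+ For instance, a few core-ext methods that come in handy while parsing (all standard Active Support; values are illustrative):
+
+ ```ruby
+ "  Web   Scraping  ".squish   # => "Web Scraping"
+ "".presence || "N/A"          # => "N/A"
+ %w(a b c).second              # => "b"
+ 5.minutes.ago                 # => a Time 5 minutes in the past
+ ```
+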
 
1220
1078
  ### Schedule spiders using Cron
1221
1079
 
1222
- 1) Inside spider directory generate [Whenever](https://github.com/javan/whenever) config: `$ kimurai generate schedule`.
1080
+ 1) Inside the spider directory generate a [Whenever](https://github.com/javan/whenever) schedule configuration like so: `$ kimurai generate schedule`.
1223
1081
 
1224
1082
  <details/>
1225
1083
  <summary><code>schedule.rb</code></summary>
@@ -1228,7 +1086,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
1228
1086
  ### Settings ###
1229
1087
  require 'tzinfo'
1230
1088
 
1231
- # Export current PATH to the cron
1089
+ # Export current PATH for cron
1232
1090
  env :PATH, ENV["PATH"]
1233
1091
 
1234
1092
  # Use 24 hour format when using `at:` option
@@ -1236,8 +1094,8 @@ set :chronic_options, hours24: true
1236
1094
 
1237
1095
  # Use local_to_utc helper to setup execution time using your local timezone instead
1238
1096
  # of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).
1239
- # Also maybe you'll want to set same timezone in kimurai as well (use `Kimurai.configuration.time_zone =` for that),
1240
- # to have spiders logs in a specific time zone format.
1097
+ # You should also set the same timezone in kimurai (use `Kimurai.configuration.time_zone =` for that).
1098
+ #
1241
1099
  # Example usage of helper:
1242
1100
  # every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
1243
1101
  # crawl "google_spider.com", output: "log/google_spider.com.log"
@@ -1248,7 +1106,7 @@ end
1248
1106
 
1249
1107
  # Note: by default Whenever exports cron commands with :environment == "production".
1250
1108
  # Note: Whenever can only append log data to a log file (>>). If you want
1251
- # to overwrite (>) log file before each run, pass lambda:
1109
+ # to overwrite (>) a log file before each run, use lambda notation:
1252
1110
  # crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }
1253
1111
 
1254
1112
  # Project job types
@@ -1261,31 +1119,29 @@ job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
1261
1119
  job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"
1262
1120
 
1263
1121
  ### Schedule ###
1264
- # Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
1122
+ # Usage (see examples here https://github.com/javan/whenever#example-schedulerb-file):
1265
1123
  # every 1.day do
1266
1124
  # Example to schedule a single spider in the project:
1267
1125
  # crawl "google_spider.com", output: "log/google_spider.com.log"
1268
1126
 
1269
1127
  # Example to schedule all spiders in the project using runner. Each spider will write
1270
- # it's own output to the `log/spider_name.log` file (handled by a runner itself).
1271
- # Runner output will be written to log/runner.log file.
1272
- # Argument number it's a count of concurrent jobs:
1273
- # runner 3, output:"log/runner.log"
1128
+ # its own output to the `log/spider_name.log` file (handled by the runner itself).
1129
+ # Runner output will be written to `log/runner.log`.
1274
1130
 
1275
- # Example to schedule single spider (without project):
1131
+ # Example to schedule a single spider (without a project):
1276
1132
  # single "single_spider.rb", output: "single_spider.log"
1277
1133
  # end
1278
1134
 
1279
- ### How to set a cron schedule ###
1135
+ ### How to set up a cron schedule ###
1280
1136
  # Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
1281
- # If you don't have whenever command, install the gem: `$ gem install whenever`.
1137
+ # If you don't have the whenever command, install the gem like so: `$ gem install whenever`.
1282
1138
 
1283
1139
  ### How to cancel a schedule ###
1284
1140
  # Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
1285
1141
  ```
1286
1142
  </details><br>
1287
1143
 
1288
- 2) Add at the bottom of `schedule.rb` following code:
1144
+ 2) At the bottom of `schedule.rb`, add the following code:
1289
1145
 
1290
1146
  ```ruby
1291
1147
  every 1.day, at: "7:00" do
@@ -1295,14 +1151,14 @@ end
1295
1151
 
1296
1152
  3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!
1297
1153
 
1298
- You can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1154
+ You can see some [Whenever](https://github.com/javan/whenever) examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel a schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1299
1155
 
1300
1156
  ### Configuration options
1301
- You can configure several options using `configure` block:
1157
+ You can configure several options inside the `configure` block:
1302
1158
 
1303
1159
  ```ruby
1304
1160
  Kimurai.configure do |config|
1305
- # Default logger has colored mode in development.
1161
+ # The default logger has colorized mode enabled in development.
1306
1162
  # If you would like to disable it, set `colorize_logger` to false.
1307
1163
  # config.colorize_logger = false
1308
1164
 
@@ -1323,13 +1179,13 @@ Kimurai.configure do |config|
1323
1179
  end
1324
1180
  ```
1325
1181
 
1326
- ### Using Kimurai inside existing Ruby application
1182
+ ### Using Kimurai inside existing Ruby applications
1327
1183
 
1328
- You can integrate Kimurai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:
1184
+ You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example. See the following sections to understand the process of running spiders:
1329
1185
 
1330
1186
  #### `.crawl!` method
1331
1187
 
1332
+ `.crawl!` (class method) performs a _full run_ of a particular spider. This method will return `run_info` if the run was successful, or raise an exception if something went wrong.
1188
+ `.crawl!` (class method) performs a _full run_ of a particular spider. This method will return run_info if it was successful, or an exception if something went wrong.
1333
1189
 
1334
1190
  ```ruby
1335
1191
  class ExampleSpider < Kimurai::Base
@@ -1346,7 +1202,7 @@ ExampleSpider.crawl!
1346
1202
  # => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
1347
1203
  ```
1348
1204
 
1349
- You can't `.crawl!` spider in different thread if it still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
1205
+ You can't `.crawl!` a spider in a different thread if it's still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
1350
1206
 
1351
1207
  ```ruby
1352
1208
  2.times do |i|
@@ -1360,11 +1216,11 @@ end # =>
1360
1216
  # {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
1361
1217
  ```
1362
1218
 
1363
- So what if you're don't care about stats and just want to process request to a particular spider method and get the returning value from this method? Use `.parse!` instead:
1219
+ So, what if you don't care about stats and just want to process a request with a particular spider method and get the return value from this method? Use `.parse!` instead:
1364
1220
 
1365
1221
  #### `.parse!(:method_name, url:)` method
1366
1222
 
1367
- `.parse!` (class method) creates a new spider instance and performs a request to given method with a given url. Value from the method will be returned back:
1223
+ The `.parse!` (class method) creates a new spider instance and performs a request with the provided method and url. The value from the method will be returned:
1368
1224
 
1369
1225
  ```ruby
1370
1226
  class ExampleSpider < Kimurai::Base
@@ -1381,7 +1237,7 @@ ExampleSpider.parse!(:parse, url: "https://example.com/")
1381
1237
  # => "Example Domain"
1382
1238
  ```
1383
1239
 
1384
- Like `.crawl!`, `.parse!` method takes care of a browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, `.parse!` method can be called from different threads at the same time:
1240
+ Like `.crawl!`, the `.parse!` method creates a browser instance and destroys it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, `.parse!` method can be called from different threads at the same time:
1385
1241
 
1386
1242
  ```ruby
1387
1243
  urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]
@@ -1395,7 +1251,7 @@ end # =>
1395
1251
  # "reddit: the front page of the internetHotHot"
1396
1252
  ```
1397
1253
 
1398
- Keep in mind, that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using `.parse!` method.
1254
+ Keep in mind, that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using the `.parse!` method.
1399
1255
 
1400
1256
  #### `Kimurai.list` and `Kimurai.find_by_name()`
1401
1257
 
@@ -1416,64 +1272,21 @@ end
1416
1272
  Kimurai.list
1417
1273
  # => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}
1418
1274
 
1419
- # To find a particular spider class by it's name:
1275
+ # To find a particular spider class by its name:
1420
1276
  Kimurai.find_by_name("reddit_spider")
1421
1277
  # => RedditSpider
1422
1278
  ```
1423
1279
 
1424
-
1425
- ### Automated sever setup and deployment
1426
- > **EXPERIMENTAL**
1427
-
1428
- #### Setup
1429
- You can automatically setup [required environment](#installation) for Kimurai on the remote server (currently there is only Ubuntu Server 18.04 support) using `$ kimurai setup` command. `setup` will perform installation of: latest Ruby with Rbenv, browsers with webdrivers and in additional databases clients (only clients) for MySQL, Postgres and MongoDB (so you can connect to a remote database from ruby).
1430
-
1431
- > To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
1432
-
1433
- > It's recommended to use regular user to setup the server, not `root`. To create a new user, login to the server `$ ssh root@your_server_ip`, type `$ adduser username` to create a user, and `$ gpasswd -a username sudo` to add new user to a sudo group.
1434
-
1435
- Example:
1436
-
1437
- ```bash
1438
- $ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
1439
- ```
1440
-
1441
- CLI options:
1442
- * `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
1443
- * `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
1444
- * `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
1445
- * `-p port_number` custom port for ssh connection (`-p 2222`)
1446
-
1447
- > You can check setup playbook [here](lib/kimurai/automation/setup.yml)
1448
-
1449
- #### Deploy
1450
-
1451
- After successful `setup` you can deploy a spider to the remote server using `$ kimurai deploy` command. On each deploy there are performing several tasks: 1) pull repo from a remote origin to `~/repo_name` user directory 2) run `bundle install` 3) Update crontab `whenever --update-crontab` (to update spider schedule from schedule.rb file).
1452
-
1453
- Before `deploy` make sure that inside spider directory you have: 1) git repository with remote origin (bitbucket, github, etc.) 2) `Gemfile` 3) schedule.rb inside subfolder `config` (`config/schedule.rb`).
1454
-
1455
- Example:
1456
-
1457
- ```bash
1458
- $ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
1459
- ```
1460
-
1461
- CLI options: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
1462
- * `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
1463
- * `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
1464
-
1465
- > You can check deploy playbook [here](lib/kimurai/automation/deploy.yml)
1466
-
1467
1280
  ## Spider `@config`
1468
1281
 
1469
- Using `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1282
+ Using `@config` you can set several options for a spider, such as proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1470
1283
 
1471
1284
  ```ruby
1472
1285
  class Spider < Kimurai::Base
1473
1286
  USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
1474
1287
  PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
1475
1288
 
1476
- @engine = :poltergeist_phantomjs
1289
+ @engine = :selenium_chrome
1477
1290
  @start_urls = ["https://example.com/"]
1478
1291
  @config = {
1479
1292
  headers: { "custom_header" => "custom_value" },
@@ -1493,7 +1306,7 @@ class Spider < Kimurai::Base
1493
1306
  change_proxy: true,
1494
1307
  # Clear all cookies and set default cookies (if provided) before each request:
1495
1308
  clear_and_set_cookies: true,
1496
- # Process delay before each request:
1309
+ # Set a delay before each request:
1497
1310
  delay: 1..3
1498
1311
  }
1499
1312
  }
@@ -1508,94 +1321,116 @@ end
1508
1321
 
1509
1322
  ```ruby
1510
1323
  @config = {
1511
- # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
1512
- # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow to set/get headers)
1324
+ # Custom headers hash. Example: { "some header" => "some value", "another header" => "another value" }
1325
+ # Works for :mechanize. Selenium doesn't support setting headers.
1513
1326
  headers: {},
1514
1327
 
1515
- # Custom User Agent, format: string or lambda.
1328
+ # Custom User Agent string or lambda
1329
+ #
1516
1330
  # Use lambda if you want to rotate user agents before each run:
1517
- # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1331
+ # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1332
+ #
1518
1333
  # Works for all engines
1519
1334
  user_agent: "Mozilla/5.0 Firefox/61.0",
1520
1335
 
1521
- # Custom cookies, format: array of hashes.
1336
+ # Custom cookies – an array of hashes
1522
1337
  # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
1338
+ #
1523
1339
  # Works for all engines
1524
1340
  cookies: [],
1525
1341
 
1526
- # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
1527
- # `protocol` can be http or socks5. User and password are optional.
1342
+ # Proxy string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
1343
+ # `protocol` can be http or socks5. User and password are optional.
1344
+ #
1528
1345
  # Use lambda if you want to rotate proxies before each run:
1529
- # proxy: -> { ARRAY_OF_PROXIES.sample }
1530
- # Works for all engines, but keep in mind that Selenium drivers doesn't support proxies
1531
- # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)
1346
+ # proxy: -> { ARRAY_OF_PROXIES.sample }
1347
+ #
1348
+ # Works for all engines, but keep in mind that Selenium drivers don't support proxies
1349
+ # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
1532
1350
  proxy: "3.4.5.6:3128:http:user:pass",
1533
1351
 
1534
1352
  # If enabled, browser will ignore any https errors. It's handy while using a proxy
1535
- # with self-signed SSL cert (for example Crawlera or Mitmproxy)
1536
- # Also, it will allow to visit webpages with expires SSL certificate.
1353
+ # with a self-signed SSL cert (for example Crawlera or Mitmproxy). It will allow you to
1354
+ # visit web pages with expired SSL certificates.
1355
+ #
1537
1356
  # Works for all engines
1538
1357
  ignore_ssl_errors: true,
1539
1358
 
1540
1359
  # Custom window size, works for all engines
1541
1360
  window_size: [1366, 768],
1542
1361
 
1543
- # Skip images downloading if true, works for all engines
1362
+ # Skip loading images if true, works for all engines. Speeds up page load time.
1544
1363
  disable_images: true,
1545
1364
 
1546
- # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1547
- # Although native mode has a better performance, virtual display mode
1548
- # sometimes can be useful. For example, some websites can detect (and block)
1549
- # headless chrome, so you can use virtual_display mode instead
1365
+ # For Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1366
+ # Although native mode has better performance, virtual display mode
1367
+ # can sometimes be useful. For example, some websites can detect (and block)
1368
+ # headless chrome, so you can use virtual_display mode instead.
1550
1369
  headless_mode: :native,
1551
1370
 
1552
1371
  # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
1553
- # Format: array of strings. Works only for :selenium_firefox and selenium_chrome
1372
+ # Format: array of strings. Works only for :selenium_firefox and :selenium_chrome.
1554
1373
  proxy_bypass_list: [],
1555
1374
 
1556
- # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
1375
+ # Option to provide a custom SSL certificate. Works only for :mechanize.
1557
1376
  ssl_cert_path: "path/to/ssl_cert",
1558
1377
 
1559
- # Inject some JavaScript code to the browser.
1560
- # Format: array of strings, where each string is a path to JS file.
1561
- # Works only for poltergeist_phantomjs engine (Selenium doesn't support JS code injection)
1378
+ # Inject some JavaScript code into the browser.
1379
+ # Format: array of strings, where each string is a path to a JS file or extension directory.
1380
+ # Selenium doesn't support JS code injection.
1562
1381
  extensions: ["lib/code_to_inject.js"],
1563
1382
 
1564
- # Automatically skip duplicated (already visited) urls when using `request_to` method.
1565
- # Possible values: `true` or `hash` with options.
1566
- # In case of `true`, all visited urls will be added to the storage's scope `:requests_urls`
1567
- # and if url already contains in this scope, request will be skipped.
1383
+ # Automatically skip already visited urls when using `request_to` method
1384
+ #
1385
+ # Possible values: `true` or a hash with options
1386
+ # In case of `true`, all visited urls will be added to the storage scope `:requests_urls`
1387
+ # and if the url already exists in this scope, the request will be skipped.
1388
+ #
1568
1389
  # You can configure this setting by providing additional options as hash:
1569
- # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
1570
- # `scope:` - use custom scope than `:requests_urls`
1571
- # `check_only:` - if true, then scope will be only checked for url, url will not
1572
- # be added to the scope if scope doesn't contains it.
1573
- # works for all drivers
1390
+ # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
1391
+ # `scope:` use a custom scope other than `:requests_urls`
1392
+ # `check_only:` if true, the url will not be added to the scope
1393
+ #
1394
+ # Works for all drivers
1574
1395
  skip_duplicate_requests: true,
1575
1396
 
1576
- # Automatically skip provided errors while requesting a page.
1577
- # If raised error matches one of the errors in the list, then this error will be caught,
1578
- # and request will be skipped.
1579
- # It is a good idea to skip errors like NotFound(404), etc.
1580
- # Format: array where elements are error classes or/and hashes. You can use hash format
1397
+ # Automatically skip provided errors while requesting a page
1398
+ #
1399
+ # If a raised error matches one of the errors in the list, then the error will be caught,
1400
+ # and the request will be skipped. It's a good idea to skip errors like 404 Not Found, etc.
1401
+ #
1402
+ # Format: array where elements are error classes and/or hashes. You can use a hash
1581
1403
  # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
1582
- # Provided `message:` will be compared with a full error message using `String#include?`. Also
1583
- # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
1404
+ #
1405
+ # The provided `message:` will be compared with a full error message using `String#include?`.
1406
+ # You can also use regex: `{ error: "RuntimeError", message: /404|403/ }`.
1584
1407
  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
1585
-
1586
- # Automatically retry provided errors with a few attempts while requesting a page.
1587
- # If raised error matches one of the errors in the list, then this error will be caught
1588
- # and the request will be processed again within a delay. There are 3 attempts:
1589
- # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
1590
- # If after 3 attempts there is still an exception, then the exception will be raised.
1591
- # It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
1592
- # Format: same like for `skip_request_errors` option.
1408
+
1409
+ # Automatically retry requests several times after certain errors
1410
+ #
1411
+ # If a raised error matches one of the errors in the list, the error will be caught,
1412
+ # and the request will be processed again with progressive delay.
1413
+ #
1414
+ # There are 3 attempts with 15 sec, 30 sec, and 45 sec delays, respectively. If after 3
1415
+ # attempts there is still an exception, then the exception will be raised. It's a good idea to
1416
+ # retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
1417
+ #
1418
+ # The format for `retry_request_errors` is the same as for `skip_request_errors`.
1593
1419
  retry_request_errors: [Net::ReadTimeout],
1594
1420
 
1421
+ # Handle page encoding while parsing html response using Nokogiri
1422
+ #
1423
+ # There are two ways to use this option:
1424
+ # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
1425
+ # encoding: "GB2312" # set encoding manually
1426
+ #
1427
+ # This option is not set by default
1428
+ encoding: nil,
1429
+
1595
1430
  # Restart browser if one of the options is true:
1596
1431
  restart_if: {
1597
1432
  # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
1598
- memory_limit: 350_000,
1433
+ memory_limit: 1_500_000,
1599
1434
 
1600
1435
  # Restart browser if provided requests limit is exceeded (works for all engines)
1601
1436
  requests_limit: 100
@@ -1603,26 +1438,25 @@ end
1603
1438
 
1604
1439
  # Perform several actions before each request:
1605
1440
  before_request: {
1606
- # Change proxy before each request. The `proxy:` option above should be presented
1607
- # and has lambda format. Works only for poltergeist and mechanize engines
1608
- # (Selenium doesn't support proxy rotation).
1441
+ # Change proxy before each request. The `proxy:` option above should be set with lambda notation.
1442
+ # Works for :mechanize engine. Selenium doesn't support proxy rotation.
1609
1443
  change_proxy: true,
1610
1444
 
1611
- # Change user agent before each request. The `user_agent:` option above should be presented
1612
- # and has lambda format. Works only for poltergeist and mechanize engines
1613
- # (selenium doesn't support to get/set headers).
1445
+ # Change user agent before each request. The `user_agent:` option above should be set with lambda
1446
+ # notation. Works for :mechanize engine. Selenium doesn't support setting headers.
1614
1447
  change_user_agent: true,
1615
1448
 
1616
- # Clear all cookies before each request, works for all engines
1449
+ # Clear all cookies before each request. Works for all engines.
1617
1450
  clear_cookies: true,
1618
1451
 
1619
+ # If you want to clear all cookies and set custom cookies, the `cookies:` option above should be set.
1620
- # use this option instead (works for all engines)
1452
+ # If you want to clear all cookies and set custom cookies, the `cookies:` option above should be set
1453
+ # Use this option instead of clear_cookies. Works for all engines.
1621
1454
  clear_and_set_cookies: true,
1622
1455
 
1623
- # Global option to set delay between requests.
1456
+ # Global option to set delay between requests
1457
+ #
1624
1458
  # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
1625
- # delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1459
+ # the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
1626
1460
  delay: 1..3
1627
1461
  }
1628
1462
  }
@@ -1635,11 +1469,11 @@ Settings can be inherited:
1635
1469
 
1636
1470
  ```ruby
1637
1471
  class ApplicationSpider < Kimurai::Base
1638
- @engine = :poltergeist_phantomjs
1472
+ @engine = :selenium_chrome
1639
1473
  @config = {
1640
- user_agent: "Firefox",
1474
+ user_agent: "Chrome",
1641
1475
  disable_images: true,
1642
- restart_if: { memory_limit: 350_000 },
1476
+ restart_if: { memory_limit: 1_500_000 },
1643
1477
  before_request: { delay: 1..2 }
1644
1478
  }
1645
1479
  end
@@ -1657,11 +1491,11 @@ class CustomSpider < ApplicationSpider
1657
1491
  end
1658
1492
  ```
1659
1493
 
1660
- Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider` config, so `CustomSpider` will keep all inherited options with only `delay` updated.
1494
+ Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider`'s config. In this example, `CustomSpider` will keep all inherited options with only the `delay` being updated.
1661
1495
 
1662
1496
  ## Project mode
1663
1497
 
1664
- Kimurai can work in project mode ([Like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). To generate a new project, run: `$ kimurai generate project web_spiders` (where `web_spiders` is a name of project).
1498
+ Kimurai can work in project mode. To generate a new project, run: `$ kimurai new web_spiders` (where `web_spiders` is the name of the project).
1665
1499
 
1666
1500
  Structure of the project:
1667
1501
 
@@ -1670,7 +1504,6 @@ Structure of the project:
1670
1504
  ├── config/
1671
1505
  │   ├── initializers/
1672
1506
  │   ├── application.rb
1673
- │   ├── automation.yml
1674
1507
  │   ├── boot.rb
1675
1508
  │   └── schedule.rb
1676
1509
  ├── spiders/
@@ -1693,26 +1526,25 @@ Structure of the project:
1693
1526
  <details/>
1694
1527
  <summary>Description</summary>
1695
1528
 
1696
- * `config/` folder for configutation files
1697
- * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code at start of framework
1698
- * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
1699
- * `config/automation.yml` specify some settings for [setup and deploy](#automated-sever-setup-and-deployment)
1700
- * `config/boot.rb` loads framework and project
1701
- * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1702
- * `spiders/` folder for spiders
1703
- * `spiders/application_spider.rb` Base parent class for all spiders
1704
- * `db/` store here all database files (`sqlite`, `json`, `csv`, etc.)
1705
- * `helpers/` Rails-like helpers for spiders
1706
- * `helpers/application_helper.rb` all methods inside ApplicationHelper module will be available for all spiders
1707
- * `lib/` put here custom Ruby code
1708
- * `log/` folder for logs
1709
- * `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline
1710
- * `pipelines/validator.rb` example pipeline to validate item
1711
- * `pipelines/saver.rb` example pipeline to save item
1712
- * `tmp/` folder for temp. files
1713
- * `.env` file to store ENV variables for project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1714
- * `Gemfile` dependency file
1715
- * `Readme.md` example project readme
1529
+ * `config/` directory for configuration files
1530
+ * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code when the framework initializes
1531
+ * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
1532
+ * `config/boot.rb` loads framework and project
1533
+ * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1534
+ * `spiders/` directory for spiders
1535
+ * `spiders/application_spider.rb` base parent class for all spiders
1536
+ * `db/` directory for database files (`sqlite`, `json`, `csv`, etc.)
1537
+ * `helpers/` Rails-like helpers for spiders
1538
+ * `helpers/application_helper.rb` all methods inside the ApplicationHelper module will be available for all spiders
1539
+ * `lib/` custom Ruby code
1540
+ * `log/` directory for logs
1541
+ * `pipelines/` directory for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines (one file per pipeline)
1542
+ * `pipelines/validator.rb` example pipeline to validate an item
1543
+ * `pipelines/saver.rb` example pipeline to save an item
1544
+ * `tmp/` directory for temp files
1545
+ * `.env` file to store environment variables for a project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1546
+ * `Gemfile` dependency file
1547
+ * `Readme.md` example project readme
1716
1548
  </details>
1717
1549
 
1718
1550
 
@@ -1740,8 +1572,6 @@ end
1740
1572
  ### Crawl
1741
1573
  To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command so the required environment is loaded.
1742
1574
 
1743
- You can provide an additional option `--continue` to use [persistence storage database](#persistence-database-for-the-storage) feature.
1744
-
1745
1575
  ### List
1746
1576
  To list all project spiders, run: `$ bundle exec kimurai list`
1747
1577
 
@@ -1769,7 +1599,7 @@ class Validator < Kimurai::Pipeline
1769
1599
  # Here you can validate item and raise `DropItemError`
1770
1600
  # if one of the validations failed. Examples:
1771
1601
 
1772
- # Drop item if it's category is not "shoe":
1602
+ # Drop item if its category is not "shoe":
1773
1603
  if item[:category] != "shoe"
1774
1604
  raise DropItemError, "Wrong item category"
1775
1605
  end
@@ -1820,6 +1650,7 @@ spiders/application_spider.rb
1820
1650
  ```ruby
1821
1651
  class ApplicationSpider < Kimurai::Base
1822
1652
  @engine = :selenium_chrome
1653
+
1823
1654
  # Define pipelines (in order) for all spiders:
1824
1655
  @pipelines = [:validator, :saver]
1825
1656
  end
@@ -1893,22 +1724,20 @@ end
1893
1724
 
1894
1725
  spiders/github_spider.rb
1895
1726
  ```ruby
1896
- class GithubSpider < ApplicationSpider
1727
+ class GithubSpider < Kimurai::Base
1897
1728
  @name = "github_spider"
1898
1729
  @engine = :selenium_chrome
1899
- @pipelines = [:validator]
1900
- @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
1730
+ @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
1901
1731
  @config = {
1902
- user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
1903
- before_request: { delay: 4..7 }
1732
+ before_request: { delay: 3..5 }
1904
1733
  }
1905
1734
 
1906
1735
  def parse(response, url:, data: {})
1907
- response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
1736
+ response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
1908
1737
  request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
1909
1738
  end
1910
1739
 
1911
- if next_page = response.at_xpath("//a[@class='next_page']")
1740
+ if next_page = response.at_xpath("//a[@rel='next']")
1912
1741
  request_to :parse, url: absolute_url(next_page[:href], base: url)
1913
1742
  end
1914
1743
  end
@@ -1916,17 +1745,17 @@ class GithubSpider < ApplicationSpider
1916
1745
  def parse_repo_page(response, url:, data: {})
1917
1746
  item = {}
1918
1747
 
1919
- item[:owner] = response.xpath("//h1//a[@rel='author']").text
1920
- item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
1748
+ item[:owner] = response.xpath("//a[@rel='author']").text.squish
1749
+ item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
1921
1750
  item[:repo_url] = url
1922
- item[:description] = response.xpath("//span[@itemprop='about']").text.squish
1923
- item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
1924
- item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
1925
- item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
1926
- item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
1927
- item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
1751
+ item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
1752
+ item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
1753
+ item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
1754
+ item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
1755
+ item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
1756
+ item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish
1928
1757
 
1929
- send_item item
1758
+ save_to "results.json", item, format: :pretty_json
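+ # (`save_to` appends each item to the given file; besides :pretty_json, supported
+ # formats include :json, :jsonlines and :csv.)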
1930
1759
  end
1931
1760
  end
1932
1761
  ```
@@ -1934,41 +1763,41 @@ end
1934
1763
  ```
1935
1764
  $ bundle exec kimurai crawl github_spider
1936
1765
 
1937
- I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: started: github_spider
1938
- D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1939
- I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1940
- I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1941
- I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1942
- D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1943
- D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1944
-
1945
- I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1946
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1947
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1948
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1949
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1950
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1951
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 1, processed: 1
1952
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1766
+ I, [2018-08-22 15:56:35 +0400#1358] INFO -- github_spider: Spider: started: github_spider
1767
+ D, [2018-08-22 15:56:35 +0400#1358] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1768
+ I, [2018-08-22 15:56:40 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1769
+ I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1770
+ I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1771
+ D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1772
+ D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1773
+
1774
+ I, [2018-08-22 15:56:49 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1775
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1776
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1777
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1778
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1779
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1780
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: items: sent: 1, processed: 1
1781
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1953
1782
 
1954
1783
  ...
1955
1784
 
1956
- I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1957
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1958
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1959
- D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1785
+ I, [2018-08-22 16:11:50 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1786
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1787
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1788
+ D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1960
1789
 
1961
- D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1962
- E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1790
+ D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1791
+ E, [2018-08-22 16:11:51 +0400#1358] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1963
1792
 
1964
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 127, processed: 12
1793
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: items: sent: 127, processed: 12
1965
1794
 
1966
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1967
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1795
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1796
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1968
1797
  ```
1969
1798
  </details><br>
1970
1799
 
1971
- Also, you can pass custom options to pipeline from a particular spider if you want to change pipeline behavior for this spider:
1800
+ You can also pass custom options to a pipeline from a particular spider if you want to change the pipeline behavior for that spider:
1972
1801
 
1973
1802
  <details>
1974
1803
  <summary>Example</summary>
@@ -2028,7 +1857,7 @@ $ bundle exec kimurai runner -j 3
2028
1857
  <<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
2029
1858
  ```
2030
1859
 
2031
- Each spider runs in a separate process. Spiders logs available at `log/` folder. Pass `-j` option to specify how many spiders should be processed at the same time (default is 1).
1860
+ Each spider runs in a separate process. Spider logs are available in the `log/` directory. Use the `-j` argument to specify how many spiders should be processed at the same time (default is 1).
2032
1861
 
2033
1862
  You can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:
2034
1863
 
@@ -2046,7 +1875,7 @@ You can perform custom actions before runner starts and after runner stops using
2046
1875
 
2047
1876
 
2048
1877
  ## Chat Support and Feedback
2049
- Will be updated
1878
+ Submit an issue on GitHub and we'll try to address it in a timely manner.
2050
1879
 
2051
1880
  ## License
2052
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
1881
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).