kimurai 1.4.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +21 -0
  4. data/Gemfile +2 -2
  5. data/README.md +476 -648
  6. data/Rakefile +6 -6
  7. data/bin/console +3 -4
  8. data/exe/kimurai +0 -1
  9. data/kimurai.gemspec +38 -37
  10. data/lib/kimurai/base/saver.rb +15 -19
  11. data/lib/kimurai/base/storage.rb +1 -1
  12. data/lib/kimurai/base.rb +38 -38
  13. data/lib/kimurai/base_helper.rb +5 -4
  14. data/lib/kimurai/browser_builder/mechanize_builder.rb +121 -119
  15. data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +160 -152
  16. data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +162 -160
  17. data/lib/kimurai/browser_builder.rb +1 -7
  18. data/lib/kimurai/capybara_configuration.rb +1 -1
  19. data/lib/kimurai/capybara_ext/driver/base.rb +50 -46
  20. data/lib/kimurai/capybara_ext/mechanize/driver.rb +51 -50
  21. data/lib/kimurai/capybara_ext/selenium/driver.rb +33 -29
  22. data/lib/kimurai/capybara_ext/session.rb +31 -38
  23. data/lib/kimurai/cli/generator.rb +15 -15
  24. data/lib/kimurai/cli.rb +49 -86
  25. data/lib/kimurai/core_ext/array.rb +2 -2
  26. data/lib/kimurai/core_ext/hash.rb +1 -1
  27. data/lib/kimurai/core_ext/numeric.rb +4 -4
  28. data/lib/kimurai/pipeline.rb +2 -1
  29. data/lib/kimurai/runner.rb +6 -6
  30. data/lib/kimurai/template/Gemfile +2 -2
  31. data/lib/kimurai/template/config/boot.rb +4 -4
  32. data/lib/kimurai/template/config/schedule.rb +15 -15
  33. data/lib/kimurai/template/spiders/application_spider.rb +8 -14
  34. data/lib/kimurai/version.rb +1 -1
  35. data/lib/kimurai.rb +7 -3
  36. metadata +58 -65
  37. data/.travis.yml +0 -5
  38. data/lib/kimurai/automation/deploy.yml +0 -54
  39. data/lib/kimurai/automation/setup/chromium_chromedriver.yml +0 -26
  40. data/lib/kimurai/automation/setup/firefox_geckodriver.yml +0 -20
  41. data/lib/kimurai/automation/setup/phantomjs.yml +0 -33
  42. data/lib/kimurai/automation/setup/ruby_environment.yml +0 -124
  43. data/lib/kimurai/automation/setup.yml +0 -44
  44. data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +0 -175
  45. data/lib/kimurai/capybara_ext/poltergeist/driver.rb +0 -13
  46. data/lib/kimurai/cli/ansible_command_builder.rb +0 -71
  47. data/lib/kimurai/template/config/automation.yml +0 -13
data/README.md CHANGED
@@ -1,25 +1,8 @@
1
- <div align="center">
2
- <a href="https://github.com/vifreefly/kimuraframework">
3
- <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
4
- </a>
1
+ # Kimurai
5
2
 
6
- <h1>Kimurai Scraping Framework</h1>
7
- </div>
3
+ Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
8
4
 
9
- > **Note about v1.0.0 version:**
10
- > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
11
- > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
12
- > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
13
- > * Small changes in design (check the readme again to see what was changed)
14
- > * Stats database with a web dashboard were removed
15
-
16
- <br>
17
-
18
- > Note: this readme is for `1.4.0` gem version. CHANGELOG [here](CHANGELOG.md).
19
-
20
- Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
21
-
22
- Kimurai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
5
+ Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
23
6
 
24
7
  ```ruby
25
8
  # github_spider.rb
@@ -28,18 +11,17 @@ require 'kimurai'
28
11
  class GithubSpider < Kimurai::Base
29
12
  @name = "github_spider"
30
13
  @engine = :selenium_chrome
31
- @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
14
+ @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
32
15
  @config = {
33
- user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
34
- before_request: { delay: 4..7 }
16
+ before_request: { delay: 3..5 }
35
17
  }
36
18
 
37
19
  def parse(response, url:, data: {})
38
- response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
20
+ response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
39
21
  request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
40
22
  end
41
23
 
42
- if next_page = response.at_xpath("//a[@class='next_page']")
24
+ if next_page = response.at_xpath("//a[@rel='next']")
43
25
  request_to :parse, url: absolute_url(next_page[:href], base: url)
44
26
  end
45
27
  end
@@ -47,15 +29,15 @@ class GithubSpider < Kimurai::Base
47
29
  def parse_repo_page(response, url:, data: {})
48
30
  item = {}
49
31
 
50
- item[:owner] = response.xpath("//h1//a[@rel='author']").text
51
- item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
32
+ item[:owner] = response.xpath("//a[@rel='author']").text.squish
33
+ item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
52
34
  item[:repo_url] = url
53
- item[:description] = response.xpath("//span[@itemprop='about']").text.squish
54
- item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
55
- item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
56
- item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
57
- item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
58
- item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
35
+ item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
36
+ item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
37
+ item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
38
+ item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
39
+ item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
40
+ item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish
59
41
 
60
42
  save_to "results.json", item, format: :pretty_json
61
43
  end
@@ -68,33 +50,25 @@ GithubSpider.crawl!
68
50
  <summary>Run: <code>$ ruby github_spider.rb</code></summary>
69
51
 
70
52
  ```
71
- I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: started: github_spider
72
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
73
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
74
- D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
75
- D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
76
- D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
77
- I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
78
- I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
79
- I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 1, responses: 1
80
- D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
81
- D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
82
- I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
83
- I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
84
- I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 2, responses: 2
85
- D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
86
- D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
87
- I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
53
+ $ ruby github_spider.rb
54
+
55
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Spider: started: github_spider
56
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
57
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
58
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Info: visits: requests: 1, responses: 1
59
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
60
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
61
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Info: visits: requests: 2, responses: 2
62
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
63
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
64
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Info: visits: requests: 3, responses: 3
65
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
66
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
67
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Info: visits: requests: 4, responses: 4
68
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
69
+ I, [2025-12-16 12:16:17] INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework
88
70
 
89
71
  ...
90
-
91
- I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
92
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
93
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
94
- D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
95
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
96
-
97
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
98
72
  ```
99
73
  </details>
100
74
 
@@ -104,48 +78,71 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
104
78
  ```json
105
79
  [
106
80
  {
107
- "owner": "lorien",
108
- "repo_name": "awesome-web-scraping",
109
- "repo_url": "https://github.com/lorien/awesome-web-scraping",
110
- "description": "List of libraries, tools and APIs for web scraping and data processing.",
111
- "tags": [
112
- "awesome",
113
- "awesome-list",
114
- "web-scraping",
115
- "data-processing",
116
- "python",
117
- "javascript",
118
- "php",
119
- "ruby"
120
- ],
121
- "watch_count": "159",
122
- "star_count": "2,423",
123
- "fork_count": "358",
124
- "last_commit": "4 days ago",
81
+ "owner": "sparklemotion",
82
+ "repo_name": "mechanize",
83
+ "repo_url": "https://github.com/sparklemotion/mechanize",
84
+ "description": "Mechanize is a ruby library that makes automated web interaction easy.",
85
+ "tags": ["ruby", "web", "scraping"],
86
+ "watch_count": "79",
87
+ "star_count": "4.4k",
88
+ "fork_count": "480",
89
+ "last_commit": "Sep 30, 2025",
125
90
  "position": 1
126
91
  },
127
-
128
- ...
129
-
130
92
  {
131
- "owner": "preston",
132
- "repo_name": "idclight",
133
- "repo_url": "https://github.com/preston/idclight",
134
- "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
135
- "tags": [
136
-
137
- ],
138
- "watch_count": "6",
139
- "star_count": "1",
93
+ "owner": "jaimeiniesta",
94
+ "repo_name": "metainspector",
95
+ "repo_url": "https://github.com/jaimeiniesta/metainspector",
96
+ "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
97
+ "tags": [],
98
+ "watch_count": "20",
99
+ "star_count": "1k",
100
+ "fork_count": "166",
101
+ "last_commit": "Oct 8, 2025",
102
+ "position": 2
103
+ },
104
+ {
105
+ "owner": "Germey",
106
+ "repo_name": "AwesomeWebScraping",
107
+ "repo_url": "https://github.com/Germey/AwesomeWebScraping",
108
+ "description": "List of libraries, tools and APIs for web scraping and data processing.",
109
+ "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
110
+ "watch_count": "5",
111
+ "star_count": "253",
112
+ "fork_count": "33",
113
+ "last_commit": "Apr 5, 2024",
114
+ "position": 3
115
+ },
116
+ {
117
+ "owner": "vifreefly",
118
+ "repo_name": "kimuraframework",
119
+ "repo_url": "https://github.com/vifreefly/kimuraframework",
120
+ "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
121
+ "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
122
+ "watch_count": "28",
123
+ "star_count": "1k",
124
+ "fork_count": "158",
125
+ "last_commit": "Dec 12, 2025",
126
+ "position": 4
127
+ },
128
+ // ...
129
+ {
130
+ "owner": "citixenken",
131
+ "repo_name": "web_scraping_with_ruby",
132
+ "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
133
+ "description": "",
134
+ "tags": [],
135
+ "watch_count": "1",
136
+ "star_count": "0",
140
137
  "fork_count": "0",
141
- "last_commit": "on Apr 12, 2012",
142
- "position": 127
138
+ "last_commit": "Aug 29, 2022",
139
+ "position": 118
143
140
  }
144
141
  ]
145
142
  ```
146
143
  </details><br>
147
144
 
148
- Okay, that was easy. How about javascript rendered websites with dynamic HTML? Lets scrape a page with infinite scroll:
145
+ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
149
146
 
150
147
  ```ruby
151
148
  # infinite_scroll_spider.rb
@@ -169,7 +166,7 @@ class InfiniteScrollSpider < Kimurai::Base
169
166
  logger.info "> Pagination is done" and break
170
167
  else
171
168
  count = new_count
172
- logger.info "> Continue scrolling, current count is #{count}..."
169
+ logger.info "> Continue scrolling, current posts count is #{count}..."
173
170
  end
174
171
  end
175
172
 
@@ -185,49 +182,46 @@ InfiniteScrollSpider.crawl!
185
182
  <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
186
183
 
187
184
  ```
188
- I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
189
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
190
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
191
- I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
192
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
193
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
194
- D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
195
- I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
196
- I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
197
- I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
198
- I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
199
- I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
200
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
201
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
202
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
203
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
185
+ $ ruby infinite_scroll_spider.rb
204
186
 
187
+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
188
+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
189
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
190
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
191
+ I, [2025-12-16 12:47:11] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
192
+ I, [2025-12-16 12:47:13] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
193
+ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
194
+ I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
195
+ I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
196
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
197
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
198
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
199
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
205
200
  ```
206
201
  </details><br>
207
202
 
208
203
 
209
204
  ## Features
210
- * Scrape javascript rendered websites out of box
211
- * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
205
+ * Scrape JavaScript rendered websites out of the box
206
+ * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
212
207
  * Write spider code once, and use it with any supported engine later
213
208
  * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
214
209
  * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
215
- * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
210
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
216
211
  * Automatically [handle requests errors](#handle-request-errors)
217
212
  * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
218
213
  * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
219
214
  * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
220
215
  * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
221
216
  * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
222
- * Automated [server environment setup](#setup) (for ubuntu 18.04) and [deploy](#deploy) using commands `kimurai setup` and `kimurai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
223
- * Command-line [runner](#runner) to run all project spiders one by one or in parallel
217
+ * Command-line [runner](#runner) to run all project spiders one-by-one or in parallel
224
218
 
225
219
  ## Table of Contents
226
220
  * [Kimurai](#kimurai)
227
221
  * [Features](#features)
228
222
  * [Table of Contents](#table-of-contents)
229
223
  * [Installation](#installation)
230
- * [Getting to Know](#getting-to-know)
224
+ * [Getting to know Kimurai](#getting-to-know-kimurai)
231
225
  * [Interactive console](#interactive-console)
232
226
  * [Available engines](#available-engines)
233
227
  * [Minimum required spider structure](#minimum-required-spider-structure)
@@ -236,9 +230,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
236
230
  * [request_to method](#request_to-method)
237
231
  * [save_to helper](#save_to-helper)
238
232
  * [Skip duplicates](#skip-duplicates)
239
- * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
233
+ * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
240
234
  * [Storage object](#storage-object)
241
- * [Handle request errors](#handle-request-errors)
235
+ * [Handling request errors](#handling-request-errors)
242
236
  * [skip_request_errors](#skip_request_errors)
243
237
  * [retry_request_errors](#retry_request_errors)
244
238
  * [Logging custom events](#logging-custom-events)
@@ -248,13 +242,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
248
242
  * [Active Support included](#active-support-included)
249
243
  * [Schedule spiders using Cron](#schedule-spiders-using-cron)
250
244
  * [Configuration options](#configuration-options)
251
- * [Using Kimurai inside existing Ruby application](#using-kimurai-inside-existing-ruby-application)
245
+ * [Using Kimurai inside existing Ruby applications](#using-kimurai-inside-existing-ruby-applications)
252
246
  * [crawl! method](#crawl-method)
253
247
  * [parse! method](#parsemethod_name-url-method)
254
248
  * [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
255
- * [Automated sever setup and deployment](#automated-sever-setup-and-deployment)
256
- * [Setup](#setup)
257
- * [Deploy](#deploy)
258
249
  * [Spider @config](#spider-config)
259
250
  * [All available @config options](#all-available-config-options)
260
251
  * [@config settings inheritance](#config-settings-inheritance)
@@ -271,187 +262,111 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
271
262
 
272
263
 
273
264
  ## Installation
274
- Kimurai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
265
+ Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.
275
266
 
276
- 1) If your system doesn't have appropriate Ruby version, install it:
267
+ 1) If your system doesn't have the appropriate Ruby version, install it:
277
268
 
278
269
  <details/>
279
- <summary>Ubuntu 18.04</summary>
270
+ <summary>Ubuntu 24.04</summary>
280
271
 
281
272
  ```bash
282
- # Install required packages for ruby-build
273
+ # Install required system packages
283
274
  sudo apt update
284
- sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev
285
-
286
- # Install rbenv and ruby-build
287
- cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
288
- echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
289
- echo 'eval "$(rbenv init -)"' >> ~/.bashrc
290
- exec $SHELL
275
+ sudo apt install build-essential rustc libssl-dev libyaml-dev zlib1g-dev libgmp-dev
291
276
 
292
- git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
293
- echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
294
- exec $SHELL
277
+ # Install mise version manager
278
+ curl https://mise.run | sh
279
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.bashrc
280
+ source ~/.bashrc
295
281
 
296
282
  # Install latest Ruby
297
- rbenv install 2.5.3
298
- rbenv global 2.5.3
299
-
300
- gem install bundler
283
+ mise use --global ruby@3
284
+ gem update --system
301
285
  ```
302
286
  </details>
303
287
 
304
288
  <details/>
305
- <summary>Mac OS X</summary>
289
+ <summary>macOS</summary>
306
290
 
307
291
  ```bash
308
- # Install homebrew if you don't have it https://brew.sh/
309
- # Install rbenv and ruby-build:
310
- brew install rbenv ruby-build
292
+ # Install Homebrew if you don't have it https://brew.sh/
293
+ brew install openssl@3 libyaml gmp rust
311
294
 
312
- # Add rbenv to bash so that it loads every time you open a terminal
313
- echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
314
- source ~/.bash_profile
295
+ # Install Mice version manager
296
+ curl https://mise.run | sh
297
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.zshrc
298
+ source ~/.zshrc
315
299
 
316
300
  # Install latest Ruby
317
- rbenv install 2.5.3
318
- rbenv global 2.5.3
319
-
320
- gem install bundler
301
+ mise use --global ruby@3
302
+ gem update --system
321
303
  ```
322
304
  </details>
323
305
 
324
306
  2) Install Kimurai gem: `$ gem install kimurai`
325
307
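If you manage dependencies with Bundler rather than installing the gem globally, the equivalent is a Gemfile entry (a minimal sketch; the project layout is up to you):

```ruby
# Gemfile
source "https://rubygems.org"

gem "kimurai"
```

Then run `$ bundle install`.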
 
326
- 3) Install browsers with webdrivers:
308
+ 3) Install browsers:
327
309
 
328
310
  <details/>
329
- <summary>Ubuntu 18.04</summary>
330
-
331
- Note: for Ubuntu 16.04-18.04 there is available automatic installation using `setup` command:
332
- ```bash
333
- $ kimurai setup localhost --local --ask-sudo
334
- ```
335
- It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/kimurai/automation).
336
-
337
- If you chose automatic installation, you can skip following and go to "Getting To Know" part. In case if you want to install everything manually:
311
+ <summary>Ubuntu 24.04</summary>
338
312
 
339
313
  ```bash
340
314
  # Install basic tools
341
315
  sudo apt install -q -y unzip wget tar openssl
342
316
 
343
- # Install xvfb (for virtual_display headless mode, in additional to native)
317
+ # Install xvfb (for virtual_display headless mode, in addition to native)
344
318
  sudo apt install -q -y xvfb
345
-
346
- # Install chromium-browser and firefox
347
- sudo apt install -q -y chromium-browser firefox
348
-
349
- # Instal chromedriver (2.44 version)
350
- # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
351
- cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
352
- sudo unzip chromedriver_linux64.zip -d /usr/local/bin
353
- rm -f chromedriver_linux64.zip
354
-
355
- # Install geckodriver (0.23.0 version)
356
- # All versions located here https://github.com/mozilla/geckodriver/releases/
357
- cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
358
- sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
359
- rm -f geckodriver-v0.23.0-linux64.tar.gz
360
-
361
- # Install PhantomJS (2.1.1)
362
- # All versions located here http://phantomjs.org/download.html
363
- sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
364
- cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
365
- tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
366
- sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
367
- sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
368
- rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
369
319
  ```
370
320
 
371
- </details>
372
-
373
- <details/>
374
- <summary>Mac OS X</summary>
321
+ The latest automatically installed Selenium drivers don't work well with the Ubuntu Snap versions of Chrome and Firefox, so we need to install the classic .deb versions and make sure they take precedence over the Snap versions:
375
322
 
376
323
  ```bash
377
- # Install chrome and firefox
378
- brew cask install google-chrome firefox
379
-
380
- # Install chromedriver (latest)
381
- brew cask install chromedriver
382
-
383
- # Install geckodriver (latest)
384
- brew install geckodriver
385
-
386
- # Install PhantomJS (latest)
387
- brew install phantomjs
324
+ # Install google chrome
325
+ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
326
+ sudo apt-get install -y ./google-chrome-stable_current_amd64.deb
388
327
  ```
389
- </details><br>
390
328
 
391
- Also, if you want to save scraped items to the database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:
392
-
393
- <details/>
394
- <summary>Ubuntu 18.04</summary>
329
+ ```bash
330
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
331
+ # See https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04
332
+ sudo snap remove firefox
395
333
 
396
- SQlite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.
334
+ sudo install -d -m 0755 /etc/apt/keyrings
335
+ wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg -O- | sudo tee /etc/apt/keyrings/packages.mozilla.org.asc > /dev/null
397
336
 
398
- If you want to connect to a remote database, you don't need database server on a local machine (only client):
399
- ```bash
400
- # Install MySQL client
401
- sudo apt -q -y install mysql-client libmysqlclient-dev
337
+ echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" | sudo tee -a /etc/apt/sources.list.d/mozilla.list > /dev/null
402
338
 
403
- # Install Postgres client
404
- sudo apt install -q -y postgresql-client libpq-dev
339
+ echo '
340
+ Package: *
341
+ Pin: origin packages.mozilla.org
342
+ Pin-Priority: 1000
405
343
 
406
- # Install MongoDB client
407
- sudo apt install -q -y mongodb-clients
408
- ```
344
+ Package: firefox*
345
+ Pin: release o=Ubuntu
346
+ Pin-Priority: -1' | sudo tee /etc/apt/preferences.d/mozilla
409
347
 
410
- But if you want to save items to a local database, database server required as well:
411
- ```bash
412
- # Install MySQL client and server
413
- sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
414
-
415
- # Install Postgres client and server
416
- sudo apt install -q -y postgresql postgresql-contrib libpq-dev
417
-
418
- # Install MongoDB client and server
419
- # version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
420
- sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
421
- # for 16.04:
422
- # echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
423
- # for 18.04:
424
- echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
425
- sudo apt update
426
- sudo apt install -q -y mongodb-org
427
- sudo service mongod start
348
+ sudo apt update && sudo apt remove firefox
349
+ sudo apt install firefox
428
350
  ```
429
351
  </details>
430
352
 
431
353
  <details/>
432
- <summary>Mac OS X</summary>
433
-
434
- SQlite: `$ brew install sqlite3`
354
+ <summary>macOS</summary>
435
355
 
436
356
  ```bash
437
- # Install MySQL client and server
438
- brew install mysql
439
- # Start server if you need it: brew services start mysql
440
-
441
- # Install Postgres client and server
442
- brew install postgresql
443
- # Start server if you need it: brew services start postgresql
444
-
445
- # Install MongoDB client and server
446
- brew install mongodb
447
- # Start server if you need it: brew services start mongodb
357
+ # Install google chrome
358
+ brew install google-chrome
448
359
  ```
449
- </details>
450
360
 
361
+ ```bash
362
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
363
+ brew install firefox
364
+ ```
365
+ </details><br>
451
366
 
452
- ## Getting to Know
367
+ ## Getting to know Kimurai
453
368
  ### Interactive console
454
- Before you get to know all Kimurai features, there is `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
369
+ Before you get to know all of Kimurai's features, there is the `$ kimurai console` command: an interactive console where you can try out and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
455
370
 
456
371
  ```bash
457
372
  $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
@@ -463,76 +378,45 @@ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/ki
463
378
  ```
464
379
  $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
465
380
 
466
- D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
467
- D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
468
- I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
469
- I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
470
- D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701
471
-
472
- From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
473
-
474
- 188: def console(response = nil, url: nil, data: {})
475
- => 189: binding.pry
476
- 190: end
477
-
478
- [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
479
- => "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
480
-
481
- [2] pry(#<Kimurai::Base>)> ls
482
- Kimurai::Base#methods: browser console logger request_to save_to unique?
483
- instance variables: @browser @config @engine @logger @pipelines
484
- locals: _ __ _dir_ _ex_ _file_ _in_ _out_ _pry_ data response url
485
-
486
- [3] pry(#<Kimurai::Base>)> ls response
487
- Nokogiri::XML::PP::Node#methods: inspect pretty_print
488
- Nokogiri::XML::Searchable#methods: % / at at_css at_xpath css search xpath
489
- Enumerable#methods:
490
- all? collect drop each_with_index find_all grep_v lazy member? none? reject slice_when take_while without
491
- any? collect_concat drop_while each_with_object find_index group_by many? min one? reverse_each sort to_a zip
492
- as_json count each_cons entries first include? map min_by partition select sort_by to_h
493
- chunk cycle each_entry exclude? flat_map index_by max minmax pluck slice_after sum to_set
494
- chunk_while detect each_slice find grep inject max_by minmax_by reduce slice_before take uniq
495
- Nokogiri::XML::Node#methods:
496
- <=> append_class classes document? has_attribute? matches? node_name= processing_instruction? to_str
497
- == attr comment? each html? name= node_type read_only? to_xhtml
498
- > attribute content elem? inner_html namespace= parent= remove traverse
499
- [] attribute_nodes content= element? inner_html= namespace_scopes parse remove_attribute unlink
500
- []= attribute_with_ns create_external_subset element_children inner_text namespaced_key? path remove_class values
501
- accept before create_internal_subset elements internal_subset native_content= pointer_id replace write_html_to
502
- add_class blank? css_path encode_special_chars key? next prepend_child set_attribute write_to
503
- add_next_sibling cdata? decorate! external_subset keys next= previous text write_xhtml_to
504
- add_previous_sibling child delete first_element_child lang next_element previous= text? write_xml_to
505
- after children description fragment? lang= next_sibling previous_element to_html xml?
506
- ancestors children= do_xinclude get_attribute last_element_child node_name previous_sibling to_s
507
- Nokogiri::XML::Document#methods:
508
- << canonicalize collect_namespaces create_comment create_entity decorate document encoding errors name remove_namespaces! root= to_java url version
509
- add_child clone create_cdata create_element create_text_node decorators dup encoding= errors= namespaces root slop! to_xml validate
510
- Nokogiri::HTML::Document#methods: fragment meta_encoding meta_encoding= serialize title title= type
511
- instance variables: @decorators @errors @node_cache
512
-
513
- [4] pry(#<Kimurai::Base>)> exit
514
- I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
515
- $
381
+ D, [2025-12-16 13:08:41 +0300#37718] [M: 1208] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
382
+ I, [2025-12-16 13:08:41 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
383
+ I, [2025-12-16 13:08:43 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
384
+
385
+ From: /Users/vic/code/spiders/kimuraframework/lib/kimurai/base.rb:208 Kimurai::Base#console:
386
+
387
+ 207: def console(response = nil, url: nil, data: {})
388
+ => 208: binding.pry
389
+ 209: end
390
+
391
+ [1] pry(#<Kimurai::Base>)> response.css('title').text
392
+ => "GitHub - vifreefly/kimuraframework: Kimurai is a modern Ruby web scraping framework that supports scraping with antidetect Chrome/Firefox as well as HTTP requests"
393
+ [2] pry(#<Kimurai::Base>)> browser.current_url
394
+ => "https://github.com/vifreefly/kimuraframework"
395
+ [3] pry(#<Kimurai::Base>)> browser.visit('https://google.com')
396
+ I, [2025-12-16 13:09:24 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://google.com
397
+ I, [2025-12-16 13:09:26 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://google.com
398
+ => true
399
+ [4] pry(#<Kimurai::Base>)> browser.current_response.title
400
+ => "Google"
516
401
  ```
517
402
  </details><br>
518
403
 
519
- CLI options:
404
+ CLI arguments:
520
405
  * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
521
- * `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
406
+ * `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
522
407
 
523
408
  ### Available engines
524
- Kimurai has support for following engines and mostly can switch between them without need to rewrite any code:
409
+ Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:
525
410
 
526
- * `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render javascript and don't know what DOM is it. It only can parse original HTML code of a page. Because of it, mechanize much faster, takes much less memory and in general much more stable than any real browser. Use mechanize if you can do it, and the website doesn't use javascript to render any meaningful parts of its structure. Still, because mechanize trying to mimic a real browser, it supports almost all Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
527
- * `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage, but Kimurai has [memory control feature](#crawler-config) so you shouldn't consider it as a problem. Also, some websites can recognize PhantomJS and block access to them. Like mechanize (and unlike selenium engines) `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
528
- * `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper javascript rendering.
529
- * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
411
+ * `:mechanize` [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
412
+ * `:selenium_chrome` Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
413
+ * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
530
414
 
531
- **Tip:** add `HEADLESS=false` ENV variable before command (`$ HEADLESS=false ruby spider.rb`) to run browser in normal (not headless) mode and see it's window (only for selenium-like engines). It works for [console](#interactive-console) command as well.
415
+ **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
532
416
 
533
417
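To illustrate switching engines, here is a hedged sketch: the spider class, name and URL are illustrative, and only `@engine` changes while the parsing code stays the same.

```ruby
require 'kimurai'

class TitleSpider < Kimurai::Base
  @name = "title_spider"
  @engine = :mechanize          # plain HTTP, fast, no JavaScript
  # @engine = :selenium_chrome  # swap in headless Chrome when the page needs JS
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # The same Nokogiri-based parsing code works with either engine
    logger.info "> Title: #{response.xpath('//title').text}"
  end
end

TitleSpider.crawl!
```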
 
534
418
  ### Minimum required spider structure
535
- > You can manually create a spider file, or use generator instead: `$ kimurai generate spider simple_spider`
419
+ > You can manually create a spider file, or use the generate command: `$ kimurai generate spider simple_spider`
536
420
 
537
421
  ```ruby
538
422
  require 'kimurai'
@@ -550,10 +434,10 @@ SimpleSpider.crawl!
550
434
  ```
551
435
 
552
436
  Where:
553
- * `@name` name of a spider. You can omit name if use single-file spider
554
- * `@engine` engine for a spider
555
- * `@start_urls` array of start urls to process one by one inside `parse` method
556
- * Method `parse` is the start method, should be always present in spider class
437
+ * `@name` a name for the spider
438
+ * `@engine` engine to use for the spider
439
+ * `@start_urls` array of urls to process one-by-one inside the `parse` method
440
+ * The `parse` method is the entry point, and should always be present in a spider class
557
441
 
558
442
 
559
443
  ### Method arguments `response`, `url` and `data`
@@ -563,14 +447,14 @@ def parse(response, url:, data: {})
563
447
  end
564
448
  ```
565
449
 
566
- * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
567
- * `url` (String) url of a processed webpage
568
- * `data` (Hash) uses to pass data between requests
450
+ * `response` [Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object – contains parsed HTML code of a processed webpage
451
+ * `url` String – url of a processed webpage
452
+ * `data` Hash – used to pass data between requests
569
453
 
570
454
  <details/>
571
- <summary><strong>Example how to use <code>data</code></strong></summary>
455
+ <summary><strong>An example of how to use <code>data</code></strong></summary>
572
456
 
573
- Imagine that there is a product page which doesn't contain product category. Category name present only on category page with pagination. This is the case where we can use `data` to pass category name from `parse` to `parse_product` method:
457
+ Imagine that there is a product page that doesn't contain a category name. The category name is only present on category pages with pagination. This is a case where we can use `data` to pass a category name from `parse` to `parse_product`:
574
458
 
575
459
  ```ruby
576
460
  class ProductsSpider < Kimurai::Base
@@ -580,7 +464,7 @@ class ProductsSpider < Kimurai::Base
580
464
  def parse(response, url:, data: {})
581
465
  category_name = response.xpath("//path/to/category/name").text
582
466
  response.xpath("//path/to/products/urls").each do |product_url|
583
- # Merge category_name with current data hash and pass it next to parse_product method
467
+ # Merge category_name with current data hash and pass it to parse_product
584
468
  request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
585
469
  end
586
470
 
@@ -589,7 +473,7 @@ class ProductsSpider < Kimurai::Base
589
473
 
590
474
  def parse_product(response, url:, data: {})
591
475
  item = {}
592
- # Assign item's category_name from data[:category_name]
476
+ # Assign an item's category_name from data[:category_name]
593
477
  item[:category_name] = data[:category_name]
594
478
 
595
479
  # ...
@@ -600,16 +484,16 @@ end
600
484
  </details><br>
601
485
 
602
486
  **You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:
603
- * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com
604
- * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com
605
- * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info
487
+ * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) ruby.bastardsbook.com
488
+ * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) readysteadycode.com
489
+ * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) rubydoc.info
606
490
 
607
491
 
608
492
  ### `browser` object
609
493
 
610
- From any spider instance method there is available `browser` object, which is [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and uses to process requests and get page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains page response after it was loaded.
494
+ A `browser` object is available from any spider instance method. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and is used to process requests and fetch page responses (the `current_response` method). Usually, you don't need to touch it directly because `response` (see above) already contains the page response after it was loaded.
611
495
 
612
- But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
496
+ But, if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) a `browser` is ready for you:
613
497
 
614
498
  ```ruby
615
499
  class GoogleSpider < Kimurai::Base
@@ -621,7 +505,7 @@ class GoogleSpider < Kimurai::Base
621
505
  browser.fill_in "q", with: "Kimurai web scraping framework"
622
506
  browser.click_button "Google Search"
623
507
 
624
- # Update response to current response after interaction with a browser
508
+ # Update response with current_response after interaction with a browser
625
509
  response = browser.current_response
626
510
 
627
511
  # Collect results
@@ -635,13 +519,13 @@ end
635
519
  ```
636
520
 
637
521
  Check out **Capybara cheat sheets** where you can see all available methods **to interact with browser**:
638
- * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com
639
- * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com
640
- * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info
522
+ * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) cheatrags.com
523
+ * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) thoughtbot.com
524
+ * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) rubydoc.info
641
525
 
642
526
  ### `request_to` method
643
527
 
644
- For making requests to a particular method there is `request_to`. It requires minimum two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what for is it). Example:
528
+ For making requests to a particular method, there is `request_to`. It requires at least two arguments: `:method_name` and `url:`, and, optionally, `data:` (see above). Example:
645
529
 
646
530
  ```ruby
647
531
  class Spider < Kimurai::Base
@@ -659,7 +543,7 @@ class Spider < Kimurai::Base
659
543
  end
660
544
  ```
661
545
 
662
- Under the hood `request_to` simply call [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then required method with arguments:
546
+ Under the hood, `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`), and the provided method with arguments:
663
547
 
664
548
  <details/>
665
549
  <summary>request_to</summary>
@@ -674,10 +558,10 @@ end
674
558
  ```
675
559
  </details><br>
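For readers of this diff who can't expand the collapsed block above, a simplified sketch of the behaviour it describes (not the exact library source):

```ruby
# What `request_to :parse_repo_page, url: some_url, data: {...}` boils down to:
def request_to(handler, url:, data: {})
  browser.visit(url)                                                     # load the page
  public_send(handler, browser.current_response, url: url, data: data)  # call the handler method
end
```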
676
560
 
677
- `request_to` just makes things simpler, and without it we could do something like:
561
+ The `request_to` helper simply makes things easier. Without it, we could do something like this:
678
562
 
679
563
  <details/>
680
- <summary>Check the code</summary>
564
+ <summary>See the code</summary>
681
565
 
682
566
  ```ruby
683
567
  class Spider < Kimurai::Base
@@ -700,7 +584,7 @@ end
700
584
 
701
585
  ### `save_to` helper
702
586
 
703
- Sometimes all that you need is to simply save scraped data to a file format, like JSON or CSV. You can use `save_to` for it:
587
+ Sometimes all you need is to simply save scraped data to a file. You can use the `save_to` helper method like so:
704
588
 
705
589
  ```ruby
706
590
  class ProductsSpider < Kimurai::Base
@@ -716,31 +600,31 @@ class ProductsSpider < Kimurai::Base
716
600
  item[:description] = response.xpath("//desc/path").text.squish
717
601
  item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f
718
602
 
719
- # Add each new item to the `scraped_products.json` file:
603
+ # Append each new item to the `scraped_products.json` file:
720
604
  save_to "scraped_products.json", item, format: :json
721
605
  end
722
606
  end
723
607
  ```
724
608
 
725
609
  Supported formats:
726
- * `:json` JSON
727
- * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
728
- * `:jsonlines` [JSON Lines](http://jsonlines.org/)
729
- * `:csv` CSV
610
+ * `:json` JSON
611
+ * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
612
+ * `:jsonlines` [JSON Lines](http://jsonlines.org/)
613
+ * `:csv` CSV
730
614
 
731
- Note: `save_to` requires data (item to save) to be a `Hash`.
615
+ Note: `save_to` requires the data (the item to save) to be a `Hash`.
732
616
 
733
- By default `save_to` add position key to an item hash. You can disable it with `position: false`: `save_to "scraped_products.json", item, format: :json, position: false`.
617
+ By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`
734
618
 
735
619
  **How helper works:**
736
620
 
737
- Until spider stops, each new item will be appended to a file. At the next run, helper will clear the content of a file first, and then start again appending items to it.
621
+ While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.
738
622
 
739
- > If you don't want file to be cleared before each run, add option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
623
+ > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`
740
624
 
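+ Putting the options together, an illustrative sketch (the file names below are arbitrary):
+
+ ```ruby
+ # Write items as CSV without the extra position key, overwriting the file on each run:
+ save_to "products.csv", item, format: :csv, position: false
+
+ # Keep appending items as JSON Lines across runs instead of clearing the file first:
+ save_to "products.jsonl", item, format: :jsonlines, append: true
+ ```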
741
625
  ### Skip duplicates
742
626
 
743
- It's pretty common when websites have duplicated pages. For example when an e-commerce shop has the same products in different categories. To skip duplicates, there is simple `unique?` helper:
627
+ It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
744
628
 
745
629
  ```ruby
746
630
  class ProductsSpider < Kimurai::Base
@@ -763,11 +647,11 @@ class ProductsSpider < Kimurai::Base
763
647
  end
764
648
  end
765
649
 
766
- # Or/and check products for uniqueness using product sku inside of parse_product:
650
+ # And/or check products for uniqueness using product sku inside of parse_product:
767
651
  def parse_product(response, url:, data: {})
768
652
  item = {}
769
653
  item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
770
- # Don't save product and return from method if there is already saved item with the same sku:
654
+ # Don't save the product if there is already an item with the same sku:
771
655
  return unless unique?(:sku, item[:sku])
772
656
 
773
657
  # ...
@@ -776,14 +660,14 @@ class ProductsSpider < Kimurai::Base
776
660
  end
777
661
  ```
778
662
 
779
- `unique?` helper works pretty simple:
663
+ The `unique?` helper works quite simply:
780
664
 
781
665
  ```ruby
782
- # Check string "http://example.com" in scope `url` for a first time:
666
+ # Check for "http://example.com" in `url` scope for the first time:
783
667
  unique?(:url, "http://example.com")
784
668
  # => true
785
669
 
786
- # Try again:
670
+ # Next time:
787
671
  unique?(:url, "http://example.com")
788
672
  # => false
789
673
  ```
@@ -801,44 +685,44 @@ unique?(:id, 324234232)
801
685
  unique?(:custom, "Lorem Ipsum")
802
686
  ```
803
687
 
804
- #### Automatically skip all duplicated requests urls
688
+ #### Automatically skip all duplicate request urls
805
689
 
806
- It is possible to automatically skip all already visited urls while calling `request_to` method, using [@config](#all-available-config-options) option `skip_duplicate_requests: true`. With this option, all already visited urls will be automatically skipped. Also check the [@config](#all-available-config-options) for an additional options of this setting.
690
+ It's possible to automatically skip any previously visited urls when calling the `request_to` method using the `skip_duplicate_requests: true` config option. See [@config](#all-available-config-options) for additional options.
807
691
 
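+ A minimal config sketch (the `:product_urls` scope name below is just an example):
+
+ ```ruby
+ @config = {
+   # Skip any url that has already been visited via `request_to`:
+   skip_duplicate_requests: true
+   # Or, with options: use a custom scope and only check it, without adding new urls to it:
+   # skip_duplicate_requests: { scope: :product_urls, check_only: true }
+ }
+ ```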
808
692
  #### `storage` object
809
693
 
810
- `unique?` method it's just an alias for `storage#unique?`. Storage has several methods:
694
+ The `unique?` method is just an alias for `storage#unique?`. Storage has several methods:
811
695
 
812
- * `#all` - display storage hash where keys are existing scopes.
813
- * `#include?(scope, value)` - return `true` if value in the scope exists, and `false` if not
814
- * `#add(scope, value)` - add value to the scope
815
- * `#unique?(scope, value)` - method already described above, will return `false` if value in the scope exists, or return `true` + add value to the scope if value in the scope not exists.
816
- * `#clear!` - reset the whole storage by deleting all values from all scopes.
696
+ * `#all` returns the whole storage hash, where the keys are the existing scopes
697
+ * `#add(scope, value)` adds a value to the scope
698
+ * `#include?(scope, value)` returns `true` if the value exists in the scope, or `false` if it doesn't
699
+ * `#unique?(scope, value)` returns `false` if the value exists in the scope, otherwise adds the value to the scope and returns `true`
700
+ * `#clear!` deletes all values from all scopes
817
701
 
818
702
 
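+ For example, a short sketch of using `storage` directly inside a parse method (the `:skus` scope name is arbitrary):
+
+ ```ruby
+ def parse_product(response, url:, data: {})
+   sku = response.xpath("//product/sku/path").text.strip.upcase
+
+   # Skip the product if this sku was seen before, otherwise remember it:
+   return if storage.include?(:skus, sku)
+   storage.add(:skus, sku)
+
+   # The two calls above are equivalent to the alias described earlier:
+   # return unless unique?(:skus, sku)
+
+   # ...
+ end
+ ```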
819
- ### Handle request errors
820
- It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
703
+ ### Handling request errors
704
+ It's common while crawling web pages to get response codes other than `200 OK`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
821
705
 
822
706
  #### skip_request_errors
823
- You can automatically skip some of errors while requesting a page using `skip_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught, and request will be skipped. It is a good idea to skip errors like NotFound(404), etc.
707
+ Kimurai can automatically skip certain errors while performing requests using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be skipped. It's a good idea to skip errors like `404 Not Found`, etc.
824
708
 
825
- Format for the option: array where elements are error classes or/and hashes. You can use _hash_ format for more flexibility:
709
+ `skip_request_errors` is an array of error classes and/or hashes. You can use a _hash_ for more flexibility like so:
826
710
 
827
711
  ```
828
712
  @config = {
829
- skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
713
+ skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }, { error: TimeoutError }]
830
714
  }
831
715
  ```
832
- In this case, provided `message:` will be compared with a full error message using `String#include?`. Also you can use regex instead: `{ error: RuntimeError, message: /404|403/ }`.
716
+ In this case, the provided `message:` will be compared with a full error message using `String#include?`. You can also use regex like so: `{ error: RuntimeError, message: /404|403/ }`.
833
717
 
834
718
  #### retry_request_errors
835
- You can automatically retry some of errors with a few attempts while requesting a page using `retry_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught and the request will be processed again within a delay.
719
+ Kimurai can automatically retry requests several times after certain errors with the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be processed again with a progressive delay.
836
720
 
837
- There are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If after 3 attempts there is still an exception, then the exception will be raised. It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
721
+ There are 3 attempts with _15 sec_, _30 sec_, and _45 sec_ delays, respectively. If after 3 attempts there is still an exception, then the exception will be raised. It's a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
838
722
 
839
- Format for the option: same like for `skip_request_errors` option.
723
+ The format for `retry_request_errors` is the same as for `skip_request_errors`.
840
724
 
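+ For example (the error classes and the message string below are illustrative; pick ones that match your target site):
+
+ ```ruby
+ @config = {
+   retry_request_errors: [Net::ReadTimeout, { error: RuntimeError, message: "502 => Net::HTTPBadGateway" }]
+ }
+ ```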
841
- If you would like to skip (not raise) error after all retries gone, you can specify `skip_on_failure: true` option:
725
+ If you would like to skip (not raise) the error after the 3 retries, you can specify `skip_on_failure: true` like so:
842
726
 
843
727
  ```ruby
844
728
  @config = {
@@ -848,7 +732,7 @@ If you would like to skip (not raise) error after all retries gone, you can spec
848
732
 
849
733
  ### Logging custom events
850
734
 
851
- It is possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using `add_event('Some message')` method. This feature helps you to keep track on important things which happened during crawling without checking the whole spider log (in case if you're logging these messages using `logger`). Example:
735
+ It's possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you to keep track of important events during crawling without checking the whole spider log (in case if you're logging these messages using `logger`). For example:
852
736
 
853
737
  ```ruby
854
738
  def parse_product(response, url:, data: {})
@@ -869,7 +753,7 @@ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider:
869
753
 
870
754
  ### `open_spider` and `close_spider` callbacks
871
755
 
872
- You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before spider started or after spider has been stopped:
756
+ You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action(s) before or after the spider runs:
873
757
 
874
758
  ```ruby
875
759
  require 'kimurai'
@@ -914,7 +798,7 @@ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider:
914
798
  ```
915
799
  </details><br>
916
800
 
917
- Inside `open_spider` and `close_spider` class methods there is available `run_info` method which contains useful information about spider state:
801
+ The `run_info` method is available from the `open_spider` and `close_spider` class methods. It contains useful information about the spider state:
918
802
 
919
803
  ```ruby
920
804
  11: def self.open_spider
@@ -934,7 +818,7 @@ Inside `open_spider` and `close_spider` class methods there is available `run_in
934
818
  }
935
819
  ```
936
820
 
937
- Inside `close_spider`, `run_info` will be updated:
821
+ By the time `close_spider` is called, `run_info` will have been updated:
938
822
 
939
823
  ```ruby
940
824
  15: def self.close_spider
@@ -954,7 +838,7 @@ Inside `close_spider`, `run_info` will be updated:
954
838
  }
955
839
  ```
956
840
 
957
- `run_info[:status]` helps to determine if spider was finished successfully or failed (possible values: `:completed`, `:failed`):
841
+ `run_info[:status]` helps to determine if the spider finished successfully or failed (possible values: `:completed`, `:failed`):
958
842
 
959
843
  ```ruby
960
844
  class ExampleSpider < Kimurai::Base
@@ -1002,12 +886,12 @@ example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMe
1002
886
  ```
1003
887
  </details><br>
1004
888
 
1005
- **Usage example:** if spider finished successfully, send JSON file with scraped items to a remote FTP location, otherwise (if spider failed), skip incompleted results and send email/notification to slack about it:
889
+ **Usage example:** if the spider finished successfully, send a JSON file with scraped items to a remote FTP location, otherwise (if the spider failed), skip incompleted results and send an email/notification to Slack about it:
1006
890
 
1007
891
  <details/>
1008
892
  <summary>Example</summary>
1009
893
 
1010
- Also you can use additional methods `completed?` or `failed?`
894
+ You can also use the additional methods `completed?` or `failed?`
1011
895
 
1012
896
  ```ruby
1013
897
  class Spider < Kimurai::Base
@@ -1044,7 +928,7 @@ end
1044
928
 
1045
929
 
1046
930
  ### `KIMURAI_ENV`
1047
- Kimurai has environments, default is `development`. To provide custom environment pass `KIMURAI_ENV` ENV variable before command: `$ KIMURAI_ENV=production ruby spider.rb`. To access current environment there is `Kimurai.env` method.
931
+ Kimurai supports environments. The default is `development`. To provide a custom environment provide a `KIMURAI_ENV` environment variable like so: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment there is a `Kimurai.env` method.
1048
932
 
1049
933
  Usage example:
1050
934
  ```ruby
@@ -1065,7 +949,7 @@ end
1065
949
  ```
1066
950
 
1067
951
  ### Parallel crawling using `in_parallel`
1068
- Kimurai can process web pages concurrently in one single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is a method to process, `urls` is array of urls to crawl and `threads:` is a number of threads:
952
+ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process the urls with, `urls` is an array of urls to crawl, and `threads:` is the number of threads:
1069
953
 
1070
954
  ```ruby
1071
955
  # amazon_spider.rb
@@ -1080,7 +964,7 @@ class AmazonSpider < Kimurai::Base
1080
964
  browser.fill_in "field-keywords", with: "Web Scraping Books"
1081
965
  browser.click_on "Go"
1082
966
 
1083
- # Walk through pagination and collect products urls:
967
+ # Walk through pagination and collect product urls:
1084
968
  urls = []
1085
969
  loop do
1086
970
  response = browser.current_response
@@ -1091,7 +975,7 @@ class AmazonSpider < Kimurai::Base
1091
975
  browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
1092
976
  end
1093
977
 
1094
- # Process all collected urls concurrently within 3 threads:
978
+ # Process all collected urls concurrently using 3 threads:
1095
979
  in_parallel(:parse_book_page, urls, threads: 3)
1096
980
  end
1097
981
 
@@ -1114,50 +998,22 @@ AmazonSpider.crawl!
1114
998
  <summary>Run: <code>$ ruby amazon_spider.rb</code></summary>
1115
999
 
1116
1000
  ```
1117
- I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: started: amazon_spider
1118
- D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1119
- I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
1120
- I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
1121
- I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
1122
-
1123
- I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
1124
- D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1125
- I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1126
- D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1127
- I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1128
- D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1129
- I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1130
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1131
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
1132
- I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
1133
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1134
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
1135
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
1136
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1137
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
1138
- I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/
1001
+ $ ruby amazon_spider.rb
1139
1002
 
1140
1003
  ...
1141
1004
 
1142
- I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
1143
- I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1144
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
1145
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
1146
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1147
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
1148
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
1149
- I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1150
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1151
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
1152
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1153
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1154
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
1155
- I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1156
-
1157
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
1158
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1159
-
1160
- I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
1005
+ I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 305, responses: 305
1006
+ I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
1007
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
1008
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 306, responses: 306
1009
+ I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
1010
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
1011
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 307, responses: 307
1012
+ I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1013
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: in_parallel: stopped processing 306 urls within 3 threads, total time: 2m, 37s
1014
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1015
+ I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: stopped: {spider_name: "amazon_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 13:45:12.5338 +0300, stop_time: 2025-12-16 13:48:23.526221 +0300, running_time: "3m, 10s", visits: {requests: 307, responses: 307}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
1016
1161
1017
 
1162
1018
  ```
1163
1019
  </details>
@@ -1168,35 +1024,39 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
1168
1024
  ```json
1169
1025
  [
1170
1026
  {
1171
- "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
1172
- "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1173
- "price": "$26.94",
1174
- "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
1027
+ "title": "Web Scraping with Python: Data Extraction from the Modern Web 3rd Edition",
1028
+ "url": "https://www.amazon.com/Web-Scraping-Python-Extraction-Modern/dp/1098145356/",
1029
+ "price": "$27.00",
1030
+ "author": "Ryan Mitchell",
1031
+ "publication_date": "March 26, 2024",
1175
1032
  "position": 1
1176
1033
  },
1177
1034
  {
1178
- "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
1179
- "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
1180
- "price": "$39.99",
1181
- "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
1035
+ "title": "Web Scraping with Python: Collecting More Data from the Modern Web 2nd Edition",
1036
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1037
+ "price": "$13.20 - $38.15",
1038
+ "author": "Ryan Mitchell",
1039
+ "publication_date": "May 8, 2018",
1182
1040
  "position": 2
1183
1041
  },
1184
1042
  {
1185
- "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
1186
- "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
1187
- "price": "$15.75",
1188
- "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
1043
+ "title": "Scripting: Automation with Bash, PowerShell, and Python—Automate Everyday IT Tasks from Backups to Web Scraping in Just a Few Lines of Code (Rheinwerk Computing) First Edition",
1044
+ "url": "https://www.amazon.com/Scripting-Automation-Bash-PowerShell-Python/dp/1493225561/",
1045
+ "price": "$47.02",
1046
+ "author": "Michael Kofler",
1047
+ "publication_date": "February 25, 2024",
1189
1048
  "position": 3
1190
1049
  },
1191
1050
 
1192
- ...
1193
-
1051
+ // ...
1052
+
1194
1053
  {
1195
- "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
1196
- "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
1197
- "price": "$35.82",
1198
- "publisher": "Packt Publishing (2013-08-26) (1896)",
1199
- "position": 52
1054
+ "title": "Introduction to Python Important points for efficient data collection with scraping (Japanese Edition) Kindle Edition",
1055
+ "url": "https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/",
1056
+ "price": "$0.00",
1057
+ "author": "r",
1058
+ "publication_date": "April 24, 2024",
1059
+ "position": 306
1200
1060
  }
1201
1061
  ]
1202
1062
  ```
@@ -1204,11 +1064,12 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
1204
1064
 
1205
1065
  > Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
1206
1066
 
1207
- `in_parallel` can take additional options:
1208
- * `data:` pass with urls custom data hash: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1209
- * `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1210
- * `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
1211
- * `config:` pass custom options to config (see [config section](#crawler-config))
1067
+ `in_parallel` can take additional parameters (they're combined in the example after this list):
1068
+
1069
+ * `data:` pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1070
+ * `delay:` set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
1071
+ * `engine:` set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
1072
+ * `config:` set custom [config](#spider-config) options
1212
1073
 
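+ Putting a few of these together, an illustrative call (the values below are arbitrary):
+
+ ```ruby
+ in_parallel(:parse_book_page, urls, threads: 3,
+   data: { category: "Web Scraping Books" },
+   delay: 2..5,
+   config: { user_agent: "Mozilla/5.0 Firefox/61.0" }
+ )
+ ```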
1213
1074
  ### Active Support included
1214
1075
 
@@ -1216,7 +1077,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
1216
1077
 
1217
1078
  ### Schedule spiders using Cron
1218
1079
 
1219
- 1) Inside spider directory generate [Whenever](https://github.com/javan/whenever) config: `$ kimurai generate schedule`.
1080
+ 1) Inside the spider directory generate a [Whenever](https://github.com/javan/whenever) schedule configuration like so: `$ kimurai generate schedule`.
1220
1081
 
1221
1082
  <details/>
1222
1083
  <summary><code>schedule.rb</code></summary>
@@ -1225,7 +1086,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
1225
1086
  ### Settings ###
1226
1087
  require 'tzinfo'
1227
1088
 
1228
- # Export current PATH to the cron
1089
+ # Export current PATH for cron
1229
1090
  env :PATH, ENV["PATH"]
1230
1091
 
1231
1092
  # Use 24 hour format when using `at:` option
@@ -1233,8 +1094,8 @@ set :chronic_options, hours24: true
1233
1094
 
1234
1095
  # Use local_to_utc helper to setup execution time using your local timezone instead
1235
1096
  # of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).
1236
- # Also maybe you'll want to set same timezone in kimurai as well (use `Kimurai.configuration.time_zone =` for that),
1237
- # to have spiders logs in a specific time zone format.
1097
+ # You should also set the same timezone in kimurai (use `Kimurai.configuration.time_zone =` for that).
1098
+ #
1238
1099
  # Example usage of helper:
1239
1100
  # every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
1240
1101
  # crawl "google_spider.com", output: "log/google_spider.com.log"
@@ -1245,7 +1106,7 @@ end
1245
1106
 
1246
1107
  # Note: by default Whenever exports cron commands with :environment == "production".
1247
1108
  # Note: Whenever can only append log data to a log file (>>). If you want
1248
- # to overwrite (>) log file before each run, pass lambda:
1109
+ # to overwrite (>) a log file before each run, use lambda notation:
1249
1110
  # crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }
1250
1111
 
1251
1112
  # Project job types
@@ -1258,31 +1119,29 @@ job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
1258
1119
  job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"
1259
1120
 
1260
1121
  ### Schedule ###
1261
- # Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
1122
+ # Usage (see examples here https://github.com/javan/whenever#example-schedulerb-file):
1262
1123
  # every 1.day do
1263
1124
  # Example to schedule a single spider in the project:
1264
1125
  # crawl "google_spider.com", output: "log/google_spider.com.log"
1265
1126
 
1266
1127
  # Example to schedule all spiders in the project using runner. Each spider will write
1267
- # it's own output to the `log/spider_name.log` file (handled by a runner itself).
1268
- # Runner output will be written to log/runner.log file.
1269
- # Argument number it's a count of concurrent jobs:
1270
- # runner 3, output:"log/runner.log"
1128
+ # its own output to the `log/spider_name.log` file (handled by runner itself).
1129
+ # Runner output will be written to log/runner.log
1271
1130
 
1272
- # Example to schedule single spider (without project):
1131
+ # Example to schedule a single spider (without a project):
1273
1132
  # single "single_spider.rb", output: "single_spider.log"
1274
1133
  # end
1275
1134
 
1276
- ### How to set a cron schedule ###
1135
+ ### How to set up a cron schedule ###
1277
1136
  # Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
1278
- # If you don't have whenever command, install the gem: `$ gem install whenever`.
1137
+ # If you don't have the whenever command, install the gem like so: `$ gem install whenever`.
1279
1138
 
1280
1139
  ### How to cancel a schedule ###
1281
1140
  # Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
1282
1141
  ```
1283
1142
  </details><br>
1284
1143
 
1285
- 2) Add at the bottom of `schedule.rb` following code:
1144
+ 2) At the bottom of `schedule.rb`, add the following code:
1286
1145
 
1287
1146
  ```ruby
1288
1147
  every 1.day, at: "7:00" do
@@ -1292,14 +1151,14 @@ end
1292
1151
 
1293
1152
  3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!
1294
1153
 
1295
- You can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1154
+ You can see some [Whenever](https://github.com/javan/whenever) examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel a schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1296
1155
 
1297
1156
  ### Configuration options
1298
- You can configure several options using `configure` block:
1157
+ You can configure several options inside the `configure` block:
1299
1158
 
1300
1159
  ```ruby
1301
1160
  Kimurai.configure do |config|
1302
- # Default logger has colored mode in development.
1161
+ # The default logger has colorized mode enabled in development.
1303
1162
  # If you would like to disable it, set `colorize_logger` to false.
1304
1163
  # config.colorize_logger = false
1305
1164
 
@@ -1320,13 +1179,13 @@ Kimurai.configure do |config|
1320
1179
  end
1321
1180
  ```
1322
1181
 
1323
- ### Using Kimurai inside existing Ruby application
1182
+ ### Using Kimurai inside existing Ruby applications
1324
1183
 
1325
- You can integrate Kimurai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:
1184
+ You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example. See the following sections to understand the process of running spiders:
1326
1185
 
1327
1186
  #### `.crawl!` method
1328
1187
 
1329
- `.crawl!` (class method) performs a _full run_ of a particular spider. This method will return run_info if run was successful, or an exception if something went wrong.
1188
+ `.crawl!` (class method) performs a _full run_ of a particular spider. This method will return run_info if it was successful, or an exception if something went wrong.
1330
1189
 
1331
1190
  ```ruby
1332
1191
  class ExampleSpider < Kimurai::Base
@@ -1343,7 +1202,7 @@ ExampleSpider.crawl!
1343
1202
  # => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
1344
1203
  ```
1345
1204
 
1346
- You can't `.crawl!` spider in different thread if it still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
1205
+ You can't `.crawl!` a spider in a different thread if it's still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
1347
1206
 
1348
1207
  ```ruby
1349
1208
  2.times do |i|
@@ -1357,11 +1216,11 @@ end # =>
1357
1216
  # {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
1358
1217
  ```
1359
1218
 
1360
- So what if you're don't care about stats and just want to process request to a particular spider method and get the returning value from this method? Use `.parse!` instead:
1219
+ So, what if you don't care about stats and just want to process a request with a particular spider method and get the return value from this method? Use `.parse!` instead:
1361
1220
 
1362
1221
  #### `.parse!(:method_name, url:)` method
1363
1222
 
1364
- `.parse!` (class method) creates a new spider instance and performs a request to given method with a given url. Value from the method will be returned back:
1223
+ The `.parse!` class method creates a new spider instance and performs a request with the provided method and url. The value from the method will be returned:
1365
1224
 
1366
1225
  ```ruby
1367
1226
  class ExampleSpider < Kimurai::Base
@@ -1378,7 +1237,7 @@ ExampleSpider.parse!(:parse, url: "https://example.com/")
1378
1237
  # => "Example Domain"
1379
1238
  ```
1380
1239
 
1381
- Like `.crawl!`, `.parse!` method takes care of a browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, `.parse!` method can be called from different threads at the same time:
1240
+ Like `.crawl!`, the `.parse!` method creates a browser instance and destroys it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, `.parse!` method can be called from different threads at the same time:
1382
1241
 
1383
1242
  ```ruby
1384
1243
  urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]
@@ -1392,7 +1251,7 @@ end # =>
1392
1251
  # "reddit: the front page of the internetHotHot"
1393
1252
  ```
1394
1253
 
1395
- Keep in mind, that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using `.parse!` method.
1254
+ Keep in mind, that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using the `.parse!` method.
1396
1255
 
1397
1256
  #### `Kimurai.list` and `Kimurai.find_by_name()`
1398
1257
 
@@ -1413,64 +1272,21 @@ end
1413
1272
  Kimurai.list
1414
1273
  # => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}
1415
1274
 
1416
- # To find a particular spider class by it's name:
1275
+ # To find a particular spider class by its name:
1417
1276
  Kimurai.find_by_name("reddit_spider")
1418
1277
  # => RedditSpider
1419
1278
  ```
1420
1279
 
1421
-
1422
- ### Automated sever setup and deployment
1423
- > **EXPERIMENTAL**
1424
-
1425
- #### Setup
1426
- You can automatically setup [required environment](#installation) for Kimurai on the remote server (currently there is only Ubuntu Server 18.04 support) using `$ kimurai setup` command. `setup` will perform installation of: latest Ruby with Rbenv, browsers with webdrivers and in additional databases clients (only clients) for MySQL, Postgres and MongoDB (so you can connect to a remote database from ruby).
1427
-
1428
- > To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
1429
-
1430
- > It's recommended to use regular user to setup the server, not `root`. To create a new user, login to the server `$ ssh root@your_server_ip`, type `$ adduser username` to create a user, and `$ gpasswd -a username sudo` to add new user to a sudo group.
1431
-
1432
- Example:
1433
-
1434
- ```bash
1435
- $ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
1436
- ```
1437
-
1438
- CLI options:
1439
- * `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
1440
- * `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
1441
- * `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
1442
- * `-p port_number` custom port for ssh connection (`-p 2222`)
1443
-
1444
- > You can check setup playbook [here](lib/kimurai/automation/setup.yml)
1445
-
1446
- #### Deploy
1447
-
1448
- After successful `setup` you can deploy a spider to the remote server using `$ kimurai deploy` command. On each deploy there are performing several tasks: 1) pull repo from a remote origin to `~/repo_name` user directory 2) run `bundle install` 3) Update crontab `whenever --update-crontab` (to update spider schedule from schedule.rb file).
1449
-
1450
- Before `deploy` make sure that inside spider directory you have: 1) git repository with remote origin (bitbucket, github, etc.) 2) `Gemfile` 3) schedule.rb inside subfolder `config` (`config/schedule.rb`).
1451
-
1452
- Example:
1453
-
1454
- ```bash
1455
- $ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
1456
- ```
1457
-
1458
- CLI options: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
1459
- * `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
1460
- * `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
1461
-
1462
- > You can check deploy playbook [here](lib/kimurai/automation/deploy.yml)
1463
-
1464
1280
  ## Spider `@config`
1465
1281
 
1466
- Using `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1282
+ Using `@config` you can set several options for a spider; such as proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1467
1283
 
1468
1284
  ```ruby
1469
1285
  class Spider < Kimurai::Base
1470
1286
  USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
1471
1287
  PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
1472
1288
 
1473
- @engine = :poltergeist_phantomjs
1289
+ @engine = :selenium_chrome
1474
1290
  @start_urls = ["https://example.com/"]
1475
1291
  @config = {
1476
1292
  headers: { "custom_header" => "custom_value" },
@@ -1490,7 +1306,7 @@ class Spider < Kimurai::Base
1490
1306
  change_proxy: true,
1491
1307
  # Clear all cookies and set default cookies (if provided) before each request:
1492
1308
  clear_and_set_cookies: true,
1493
- # Process delay before each request:
1309
+ # Set a delay before each request:
1494
1310
  delay: 1..3
1495
1311
  }
1496
1312
  }
@@ -1505,100 +1321,116 @@ end
1505
1321
 
1506
1322
  ```ruby
1507
1323
  @config = {
1508
- # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
1509
- # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow to set/get headers)
1324
+ # Custom headers hash. Example: { "some header" => "some value", "another header" => "another value" }
1325
+ # Works for :mechanize. Selenium doesn't support setting headers.
1510
1326
  headers: {},
1511
1327
 
1512
- # Custom User Agent, format: string or lambda.
1328
+ # Custom User Agent string or lambda
1329
+ #
1513
1330
  # Use lambda if you want to rotate user agents before each run:
1514
- # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1331
+ # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1332
+ #
1515
1333
  # Works for all engines
1516
1334
  user_agent: "Mozilla/5.0 Firefox/61.0",
1517
1335
 
1518
- # Custom cookies, format: array of hashes.
1336
+ # Custom cookies: an array of hashes
1519
1337
  # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
1338
+ #
1520
1339
  # Works for all engines
1521
1340
  cookies: [],
1522
1341
 
1523
- # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
1524
- # `protocol` can be http or socks5. User and password are optional.
1342
+ # Proxy string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
1343
+ # `protocol` can be http or socks5. User and password are optional.
1344
+ #
1525
1345
  # Use lambda if you want to rotate proxies before each run:
1526
- # proxy: -> { ARRAY_OF_PROXIES.sample }
1527
- # Works for all engines, but keep in mind that Selenium drivers doesn't support proxies
1528
- # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)
1346
+ # proxy: -> { ARRAY_OF_PROXIES.sample }
1347
+ #
1348
+ # Works for all engines, but keep in mind that Selenium drivers don't support proxies
1349
+ # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
1529
1350
  proxy: "3.4.5.6:3128:http:user:pass",
1530
1351
 
1531
1352
  # If enabled, browser will ignore any https errors. It's handy while using a proxy
1532
- # with self-signed SSL cert (for example Crawlera or Mitmproxy)
1533
- # Also, it will allow to visit webpages with expires SSL certificate.
1353
+ # with a self-signed SSL cert (for example Crawlera or Mitmproxy). It will allow you to
1354
+ # visit web pages with expired SSL certificates.
1355
+ #
1534
1356
  # Works for all engines
1535
1357
  ignore_ssl_errors: true,
1536
1358
 
1537
1359
  # Custom window size, works for all engines
1538
1360
  window_size: [1366, 768],
1539
1361
 
1540
- # Skip images downloading if true, works for all engines
1362
+ # Skip loading images if true, works for all engines. Speeds up page load time.
1541
1363
  disable_images: true,
1542
1364
 
1543
- # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1544
- # Although native mode has a better performance, virtual display mode
1545
- # sometimes can be useful. For example, some websites can detect (and block)
1546
- # headless chrome, so you can use virtual_display mode instead
1365
+ # For Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1366
+ # Although native mode has better performance, virtual display mode
1367
+ # can sometimes be useful. For example, some websites can detect (and block)
1368
+ # headless chrome, so you can use virtual_display mode instead.
1547
1369
  headless_mode: :native,
1548
1370
 
1549
1371
  # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
1550
- # Format: array of strings. Works only for :selenium_firefox and selenium_chrome
1372
+ # Format: array of strings. Works only for :selenium_firefox and :selenium_chrome.
1551
1373
  proxy_bypass_list: [],
1552
1374
 
1553
- # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
1375
+ # Option to provide custom SSL certificate. Works only for :mechanize.
1554
1376
  ssl_cert_path: "path/to/ssl_cert",
1555
1377
 
1556
- # Inject some JavaScript code to the browser.
1557
- # Format: array of strings, where each string is a path to JS file.
1558
- # Works only for poltergeist_phantomjs engine (Selenium doesn't support JS code injection)
1378
+ # Inject some JavaScript code into the browser.
1379
+ # Format: array of strings, where each string is a path to a JS file or extension directory
1380
+ # Selenium doesn't support JS code injection.
1559
1381
  extensions: ["lib/code_to_inject.js"],
1560
1382
 
1561
- # Automatically skip duplicated (already visited) urls when using `request_to` method.
1562
- # Possible values: `true` or `hash` with options.
1563
- # In case of `true`, all visited urls will be added to the storage's scope `:requests_urls`
1564
- # and if url already contains in this scope, request will be skipped.
1383
+ # Automatically skip already visited urls when using `request_to` method
1384
+ #
1385
+ # Possible values: `true` or a hash with options
1386
+ # In case of `true`, all visited urls will be added to the storage scope `:requests_urls`
1387
+ # and if the url already exists in this scope, the request will be skipped.
1388
+ #
1565
1389
  # You can configure this setting by providing additional options as hash:
1566
- # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
1567
- # `scope:` - use custom scope than `:requests_urls`
1568
- # `check_only:` - if true, then scope will be only checked for url, url will not
1569
- # be added to the scope if scope doesn't contains it.
1570
- # works for all drivers
1390
+ # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
1391
+ # `scope:` use a custom scope other than `:requests_urls`
1392
+ # `check_only:` if true, the url will not be added to the scope
1393
+ #
1394
+ # Works for all drivers
1571
1395
  skip_duplicate_requests: true,
1572
1396
 
1573
- # Automatically skip provided errors while requesting a page.
1574
- # If raised error matches one of the errors in the list, then this error will be caught,
1575
- # and request will be skipped.
1576
- # It is a good idea to skip errors like NotFound(404), etc.
1577
- # Format: array where elements are error classes or/and hashes. You can use hash format
1397
+ # Automatically skip provided errors while requesting a page
1398
+ #
1399
+ # If a raised error matches one of the errors in the list, then the error will be caught,
1400
+ # and the request will be skipped. It's a good idea to skip errors like 404 Not Found, etc.
1401
+ #
1402
+ # Format: array where elements are error classes and/or hashes. You can use a hash
1578
1403
  # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
1579
- # Provided `message:` will be compared with a full error message using `String#include?`. Also
1580
- # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
1404
+ #
1405
+ # The provided `message:` will be compared with a full error message using `String#include?`.
1406
+ # You can also use regex: `{ error: "RuntimeError", message: /404|403/ }`.
1581
1407
  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
1582
-
1583
- # Automatically retry provided errors with a few attempts while requesting a page.
1584
- # If raised error matches one of the errors in the list, then this error will be caught
1585
- # and the request will be processed again within a delay. There are 3 attempts:
1586
- # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
1587
- # If after 3 attempts there is still an exception, then the exception will be raised.
1588
- # It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
1589
- # Format: same like for `skip_request_errors` option.
1408
+
1409
+ # Automatically retry requests several times after certain errors
1410
+ #
1411
+ # If a raised error matches one of the errors in the list, the error will be caught,
1412
+ # and the request will be processed again with progressive delay.
1413
+ #
1414
+ # There are 3 attempts with _15 sec_, _30 sec_, and _45 sec_ delays, respectively. If after 3
1415
+ # attempts there is still an exception, then the exception will be raised. It's a good idea to
1416
+ # retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
1417
+ #
1418
+ # The format for `retry_request_errors` is the same as for `skip_request_errors`.
1590
1419
  retry_request_errors: [Net::ReadTimeout],
1591
1420
 
1592
- # Handle page encoding while parsing html response using Nokogiri. There are two modes:
1593
- # Auto (`:auto`) (try to fetch correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags)
1594
- # Set required encoding manually, example: `encoding: "GB2312"` (Set required encoding manually)
1595
- # Default this option is unset.
1421
+ # Handle page encoding while parsing html response using Nokogiri
1422
+ #
1423
+ # There are two ways to use this option:
1424
+ # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
1425
+ # encoding: "GB2312" # set encoding manually
1426
+ #
1427
+ # This option is not set by default
1596
1428
  encoding: nil,
1597
1429
 
1598
1430
  # Restart browser if one of the options is true:
1599
1431
  restart_if: {
1600
1432
  # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
1601
- memory_limit: 350_000,
1433
+ memory_limit: 1_500_000,
1602
1434
 
1603
1435
  # Restart browser if provided requests limit is exceeded (works for all engines)
1604
1436
  requests_limit: 100
@@ -1606,26 +1438,25 @@ end
1606
1438
 
1607
1439
  # Perform several actions before each request:
1608
1440
  before_request: {
1609
- # Change proxy before each request. The `proxy:` option above should be presented
1610
- # and has lambda format. Works only for poltergeist and mechanize engines
1611
- # (Selenium doesn't support proxy rotation).
1441
+ # Change proxy before each request. The `proxy:` option above should be set with lambda notation.
1442
+ # Works for :mechanize engine. Selenium doesn't support proxy rotation.
1612
1443
  change_proxy: true,
1613
1444
 
1614
- # Change user agent before each request. The `user_agent:` option above should be presented
1615
- # and has lambda format. Works only for poltergeist and mechanize engines
1616
- # (selenium doesn't support to get/set headers).
1445
+ # Change user agent before each request. The `user_agent:` option above should be set with lambda
1446
+ # notation. Works for :mechanize engine. Selenium doesn't support setting headers.
1617
1447
  change_user_agent: true,
1618
1448
 
1619
- # Clear all cookies before each request, works for all engines
1449
+ # Clear all cookies before each request. Works for all engines.
1620
1450
  clear_cookies: true,
1621
1451
 
1622
- # If you want to clear all cookies + set custom cookies (`cookies:` option above should be presented)
1623
- # use this option instead (works for all engines)
1452
+ # If you want to clear all cookies and set custom cookies, the `cookies:` option above should be set
1453
+ # Use this option instead of clear_cookies. Works for all engines.
1624
1454
  clear_and_set_cookies: true,
1625
1455
 
1626
- # Global option to set delay between requests.
1456
+ # Global option to set delay between requests
1457
+ #
1627
1458
  # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
1628
- # delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1459
+ # the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
1629
1460
  delay: 1..3
1630
1461
  }
1631
1462
  }
@@ -1638,11 +1469,11 @@ Settings can be inherited:
1638
1469
 
1639
1470
  ```ruby
1640
1471
  class ApplicationSpider < Kimurai::Base
1641
- @engine = :poltergeist_phantomjs
1472
+ @engine = :selenium_chrome
1642
1473
  @config = {
1643
- user_agent: "Firefox",
1474
+ user_agent: "Chrome",
1644
1475
  disable_images: true,
1645
- restart_if: { memory_limit: 350_000 },
1476
+ restart_if: { memory_limit: 1_500_000 },
1646
1477
  before_request: { delay: 1..2 }
1647
1478
  }
1648
1479
  end
@@ -1660,11 +1491,11 @@ class CustomSpider < ApplicationSpider
1660
1491
  end
1661
1492
  ```
1662
1493
 
1663
- Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider` config, so `CustomSpider` will keep all inherited options with only `delay` updated.
1494
+ Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider`'s config. In this example, `CustomSpider` will keep all inherited options with only the `delay` being updated.
1664
1495
 
1665
1496
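
For reference, the linked `Hash#deep_merge` from Active Support merges nested hashes key by key rather than replacing them wholesale. A minimal sketch of the effect (the `2..5` override is only an assumed value, since `CustomSpider`'s exact config isn't shown here):

```ruby
require "active_support/core_ext/hash/deep_merge"

parent = {
  user_agent: "Chrome",
  disable_images: true,
  restart_if: { memory_limit: 1_500_000 },
  before_request: { delay: 1..2 }
}
child = { before_request: { delay: 2..5 } } # assumed CustomSpider override

parent.deep_merge(child)
# => { user_agent: "Chrome",
#      disable_images: true,
#      restart_if: { memory_limit: 1_500_000 },
#      before_request: { delay: 2..5 } }
```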
  ## Project mode
1666
1497
 
1667
- Kimurai can work in project mode ([Like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). To generate a new project, run: `$ kimurai generate project web_spiders` (where `web_spiders` is a name of project).
1498
+ Kimurai can work in project mode. To generate a new project, run: `$ kimurai new web_spiders` (where `web_spiders` is the name of the project).
1668
1499
 
1669
1500
  Structure of the project:
1670
1501
 
@@ -1673,7 +1504,6 @@ Structure of the project:
1673
1504
  ├── config/
1674
1505
  │   ├── initializers/
1675
1506
  │   ├── application.rb
1676
- │   ├── automation.yml
1677
1507
  │   ├── boot.rb
1678
1508
  │   └── schedule.rb
1679
1509
  ├── spiders/
@@ -1696,26 +1526,25 @@ Structure of the project:
1696
1526
  <details/>
1697
1527
  <summary>Description</summary>
1698
1528
 
1699
- * `config/` folder for configutation files
1700
- * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code at start of framework
1701
- * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
1702
- * `config/automation.yml` specify some settings for [setup and deploy](#automated-sever-setup-and-deployment)
1703
- * `config/boot.rb` loads framework and project
1704
- * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1705
- * `spiders/` folder for spiders
1706
- * `spiders/application_spider.rb` Base parent class for all spiders
1707
- * `db/` store here all database files (`sqlite`, `json`, `csv`, etc.)
1708
- * `helpers/` Rails-like helpers for spiders
1709
- * `helpers/application_helper.rb` all methods inside ApplicationHelper module will be available for all spiders
1710
- * `lib/` put here custom Ruby code
1711
- * `log/` folder for logs
1712
- * `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline
1713
- * `pipelines/validator.rb` example pipeline to validate item
1714
- * `pipelines/saver.rb` example pipeline to save item
1715
- * `tmp/` folder for temp. files
1716
- * `.env` file to store ENV variables for project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1717
- * `Gemfile` dependency file
1718
- * `Readme.md` example project readme
1529
+ * `config/` directory for configuration files
1530
+ * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code when the framework initializes
1531
+ * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block); see the sketch just after this list
1532
+ * `config/boot.rb` loads the framework and project
1533
+ * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1534
+ * `spiders/` directory for spiders
1535
+ * `spiders/application_spider.rb` base parent class for all spiders
1536
+ * `db/` directory for database files (`sqlite`, `json`, `csv`, etc.)
1537
+ * `helpers/` Rails-like helpers for spiders
1538
+ * `helpers/application_helper.rb` all methods inside the ApplicationHelper module will be available for all spiders
1539
+ * `lib/` custom Ruby code
1540
+ * `log/` directory for logs
1541
+ * `pipelines/` directory for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines (one file per pipeline)
1542
+ * `pipelines/validator.rb` example pipeline to validate an item
1543
+ * `pipelines/saver.rb` example pipeline to save an item
1544
+ * `tmp/` directory for temporary files
1545
+ * `.env` file to store environment variables for the project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1546
+ * `Gemfile` dependency file
1547
+ * `Readme.md` example project readme
1719
1548
  </details>
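
Below is a rough sketch of what `config/application.rb` can contain. The option names are carried over from the 1.x configuration block and may have changed in 2.0, so treat them as assumptions rather than a definitive reference:

```ruby
# config/application.rb (hedged sketch; option names assumed from Kimurai 1.x)
Kimurai.configure do |config|
  # Disable colored log output (assumed option):
  # config.colorize_logger = false

  # Provide a custom logger (assumed option):
  # config.logger = Logger.new(STDOUT)

  # Time zone used for log timestamps (assumed option):
  # config.time_zone = "UTC"
end
```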
1720
1549
 
1721
1550
 
@@ -1770,7 +1599,7 @@ class Validator < Kimurai::Pipeline
1770
1599
  # Here you can validate item and raise `DropItemError`
1771
1600
  # if one of the validations failed. Examples:
1772
1601
 
1773
- # Drop item if it's category is not "shoe":
1602
+ # Drop item if its category is not "shoe":
1774
1603
  if item[:category] != "shoe"
1775
1604
  raise DropItemError, "Wrong item category"
1776
1605
  end
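
For context, a hedged sketch of what the full pipeline file can look like. The title check is an invented rule for illustration; the important detail is that `#process_item` returns the item so that (per 1.x behaviour, assumed unchanged here) it is handed on to the next pipeline listed in `@pipelines`:

```ruby
# pipelines/validator.rb (sketch only)
class Validator < Kimurai::Pipeline
  def process_item(item, options: {})
    # Drop the item if a required field is missing (invented example rule):
    if item[:title].to_s.strip.empty?
      raise DropItemError, "Missing title"
    end

    # Return the item so it continues to the next pipeline (e.g. :saver):
    item
  end
end
```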
@@ -1821,6 +1650,7 @@ spiders/application_spider.rb
1821
1650
  ```ruby
1822
1651
  class ApplicationSpider < Kimurai::Base
1823
1652
  @engine = :selenium_chrome
1653
+
1824
1654
  # Define pipelines (by order) for all spiders:
1825
1655
  @pipelines = [:validator, :saver]
1826
1656
  end
@@ -1894,22 +1724,20 @@ end
1894
1724
 
1895
1725
  spiders/github_spider.rb
1896
1726
  ```ruby
1897
- class GithubSpider < ApplicationSpider
1727
+ class GithubSpider < Kimurai::Base
1898
1728
  @name = "github_spider"
1899
1729
  @engine = :selenium_chrome
1900
- @pipelines = [:validator]
1901
- @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
1730
+ @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
1902
1731
  @config = {
1903
- user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
1904
- before_request: { delay: 4..7 }
1732
+ before_request: { delay: 3..5 }
1905
1733
  }
1906
1734
 
1907
1735
  def parse(response, url:, data: {})
1908
- response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
1736
+ response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
1909
1737
  request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
1910
1738
  end
1911
1739
 
1912
- if next_page = response.at_xpath("//a[@class='next_page']")
1740
+ if next_page = response.at_xpath("//a[@rel='next']")
1913
1741
  request_to :parse, url: absolute_url(next_page[:href], base: url)
1914
1742
  end
1915
1743
  end
@@ -1917,17 +1745,17 @@ class GithubSpider < ApplicationSpider
1917
1745
  def parse_repo_page(response, url:, data: {})
1918
1746
  item = {}
1919
1747
 
1920
- item[:owner] = response.xpath("//h1//a[@rel='author']").text
1921
- item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
1748
+ item[:owner] = response.xpath("//a[@rel='author']").text.squish
1749
+ item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
1922
1750
  item[:repo_url] = url
1923
- item[:description] = response.xpath("//span[@itemprop='about']").text.squish
1924
- item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
1925
- item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
1926
- item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
1927
- item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
1928
- item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
1751
+ item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
1752
+ item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
1753
+ item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
1754
+ item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
1755
+ item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
1756
+ item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish
1929
1757
 
1930
- send_item item
1758
+ save_to "results.json", item, format: :pretty_json
1931
1759
  end
1932
1760
  end
1933
1761
  ```
@@ -1935,41 +1763,41 @@ end
1935
1763
  ```
1936
1764
  $ bundle exec kimurai crawl github_spider
1937
1765
 
1938
- I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: started: github_spider
1939
- D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1940
- I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1941
- I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1942
- I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1943
- D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1944
- D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1945
-
1946
- I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1947
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1948
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1949
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1950
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1951
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1952
- I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 1, processed: 1
1953
- D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1766
+ I, [2018-08-22 15:56:35 +0400#1358] INFO -- github_spider: Spider: started: github_spider
1767
+ D, [2018-08-22 15:56:35 +0400#1358] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1768
+ I, [2018-08-22 15:56:40 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1769
+ I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1770
+ I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1771
+ D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1772
+ D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1773
+
1774
+ I, [2018-08-22 15:56:49 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1775
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1776
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1777
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1778
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1779
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1780
+ I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: items: sent: 1, processed: 1
1781
+ D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1954
1782
 
1955
1783
  ...
1956
1784
 
1957
- I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1958
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1959
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1960
- D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1785
+ I, [2018-08-22 16:11:50 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1786
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1787
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1788
+ D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1961
1789
 
1962
- D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1963
- E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1790
+ D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1791
+ E, [2018-08-22 16:11:51 +0400#1358] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1964
1792
 
1965
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 127, processed: 12
1793
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: items: sent: 127, processed: 12
1966
1794
 
1967
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1968
- I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1795
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1796
+ I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1969
1797
  ```
1970
1798
  </details><br>
1971
1799
 
1972
- Also, you can pass custom options to pipeline from a particular spider if you want to change pipeline behavior for this spider:
1800
+ You can also pass custom options from a particular spider to a pipeline if you want to change the pipeline's behavior for that spider:
1973
1801
 
1974
1802
  <details>
1975
1803
  <summary>Example</summary>
@@ -2029,7 +1857,7 @@ $ bundle exec kimurai runner -j 3
2029
1857
  <<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
2030
1858
  ```
2031
1859
 
2032
- Each spider runs in a separate process. Spiders logs available at `log/` folder. Pass `-j` option to specify how many spiders should be processed at the same time (default is 1).
1860
+ Each spider runs in a separate process. Spider logs are available in the `log/` directory. Use the `-j` argument to specify how many spiders should be processed at the same time (default is 1).
2033
1861
 
2034
1862
  You can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:
2035
1863
 
@@ -2047,7 +1875,7 @@ You can perform custom actions before runner starts and after runner stops using
2047
1875
 
2048
1876
 
2049
1877
  ## Chat Support and Feedback
2050
- Will be updated
1878
+ Submit an issue on GitHub and we'll try to address it in a timely manner.
2051
1879
 
2052
1880
  ## License
2053
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
1881
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).