crawlr 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 7962445e19428525184ea2fb8dfcb76612c3143fd2764be3dd376c9bcb65ae69
+   data.tar.gz: b784eb2b27f6b170ac67c4a9c9113fc7e7ed4fb443fcd3145d3be5e24ab1194e
+ SHA512:
+   metadata.gz: bd8296ebd6bdc77bbf7a4200d9f211721a137bb74073e76fb8eae44007e05bcb894abdb5c4cb92efe28af0bd8c14b9d734a5d33420ffada3c7debcd4794027e3
+   data.tar.gz: ba6608820012fada66dbbf1026e7d52a8aa29290a714e658a0a4d904b6f6c7b685bc287353cd9e1319561b4ca990fd6d68542f3747727388130eb390060d0b33
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,9 @@
+ AllCops:
+   TargetRubyVersion: 3.1
+   SuggestExtensions: false
+
+ Style/StringLiterals:
+   EnforcedStyle: double_quotes
+
+ Style/StringLiteralsInInterpolation:
+   EnforcedStyle: double_quotes
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2025-09-29
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2025 Aristotelis Rapai
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,326 @@
+ # crawlr 🕷️
+
+ A powerful, async Ruby web scraping framework designed for respectful and efficient data extraction. Built with modern Ruby practices, crawlr provides a clean API for scraping websites while respecting robots.txt, managing cookies, rotating proxies, and handling complex scraping scenarios.
+
+ [![Gem Version](https://badge.fury.io/rb/crawlr.svg)](https://badge.fury.io/rb/crawlr)
+ [![Ruby](https://github.com/yourusername/crawlr/actions/workflows/ruby.yml/badge.svg)](https://github.com/yourusername/crawlr/actions/workflows/ruby.yml)
+
+ ## ✨ Features
+
+ - 🚀 **Async HTTP requests** with configurable concurrency
+ - 🤖 **Robots.txt compliance** with automatic parsing and rule enforcement
+ - 🍪 **Cookie management** with automatic persistence across requests
+ - 🔄 **Proxy rotation** with round-robin and random strategies
+ - 🎯 **Flexible selectors** supporting both CSS and XPath
+ - 🔧 **Extensible hooks** for request/response lifecycle events
+ - 📊 **Built-in statistics** and monitoring capabilities
+ - 🛡️ **Respectful crawling** with delays, depth limits, and visit tracking
+ - 🧵 **Thread-safe** operations for parallel scraping
+ - 📄 **Comprehensive logging** with configurable levels
+
+ ## 📦 Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'crawlr'
+ ```
+
+ And then execute:
+
+ ```bash
+ $ bundle install
+ ```
+
+ Or install it yourself as:
+
+ ```bash
+ $ gem install crawlr
+ ```
+
+ ## 🚀 Quick Start
+
+ ```ruby
+ require 'crawlr'
+
+ # Create a collector with configuration
+ collector = Crawlr::Collector.new(
+   max_depth: 3,
+   max_parallelism: 5,
+   random_delay: 1.0,
+   timeout: 15
+ )
+
+ # Register callbacks for data extraction
+ collector.on_html(:css, '.article-title') do |node, context|
+   puts "Found title: #{node.text.strip}"
+ end
+
+ collector.on_html(:css, 'a[href]') do |link, context|
+   href = link['href']
+   puts "Found link: #{href}" if href.start_with?('http')
+ end
+
+ # Start scraping
+ collector.visit('https://example.com')
+ ```
+
+ ## 📚 Usage Examples
+
+ ### Basic Web Scraping
+
+ ```ruby
+ collector = Crawlr::Collector.new
+
+ # Extract product information
+ collector.on_html(:css, '.product') do |product, ctx|
+   data = {
+     name: product.css('.product-name').text.strip,
+     price: product.css('.price').text.strip,
+     image: product.css('img')&.first&.[]('src')
+   }
+
+   ctx.products ||= []
+   ctx.products << data
+ end
+
+ collector.visit('https://shop.example.com/products')
+ ```
+
+ ### API Scraping with Pagination
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   max_parallelism: 10,
+   timeout: 30
+ )
+
+ collector.on_xml(:css, 'item') do |item, ctx|
+   ctx.items ||= []
+   ctx.items << {
+     id: item.css('id').text,
+     title: item.css('title').text,
+     published: item.css('published').text
+   }
+ end
+
+ # Automatically handles pagination with ?page=1, ?page=2, etc.
+ collector.paginated_visit(
+   'https://api.example.com/feed',
+   batch_size: 5,
+   start_page: 1
+ )
+ ```
+
+ ### Advanced Configuration
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   # Network settings
+   timeout: 20,
+   max_parallelism: 8,
+   random_delay: 2.0,
+
+   # Crawling behavior
+   max_depth: 5,
+   allow_url_revisit: false,
+   max_visited: 50_000,
+
+   # Proxy rotation
+   proxies: ['proxy1.com:8080', 'proxy2.com:8080'],
+   proxy_strategy: :round_robin,
+
+   # Respectful crawling
+   ignore_robots_txt: false,
+   allow_cookies: true,
+
+   # Error handling
+   max_retries: 3,
+   retry_delay: 1.0,
+   retry_backoff: 2.0
+ )
+ ```
+
+ ### Domain Filtering
+
+ ```ruby
+ # Allow specific domains
+ collector = Crawlr::Collector.new(
+   allowed_domains: ['example.com', 'api.example.com']
+ )
+
+ # Or use glob patterns
+ collector = Crawlr::Collector.new(
+   domain_glob: ['*.example.com', '*.trusted-site.*']
+ )
+ ```
+
+ ### Hooks for Custom Behavior
+
+ ```ruby
+ # Add custom headers before each request
+ collector.hook(:before_visit) do |url, headers|
+   headers['Authorization'] = "Bearer #{get_auth_token()}"
+   headers['X-Custom-Header'] = 'MyBot/1.0'
+   puts "Visiting: #{url}"
+ end
+
+ # Process responses after each request
+ collector.hook(:after_visit) do |url, response|
+   puts "Got #{response.status} from #{url}"
+   log_response_time(url, response.headers['X-Response-Time'])
+ end
+
+ # Handle errors gracefully
+ collector.hook(:on_error) do |url, error|
+   puts "Failed to scrape #{url}: #{error.message}"
+   error_tracker.record(url, error)
+ end
+ ```
+
+ ### XPath Selectors
+
+ ```ruby
+ collector.on_html(:xpath, '//div[@class="content"]//p[position() <= 3]') do |paragraph, ctx|
+   # Extract first 3 paragraphs from content divs
+   ctx.content_paragraphs ||= []
+   ctx.content_paragraphs << paragraph.text.strip
+ end
+
+ collector.on_xml(:xpath, '//item[price > 100]/title') do |title, ctx|
+   # Extract titles of expensive items from XML feeds
+   ctx.expensive_items ||= []
+   ctx.expensive_items << title.text
+ end
+ ```
+
+ ### Session Management with Cookies
+
+ ```ruby
+ collector = Crawlr::Collector.new(allow_cookies: true)
+
+ # Login first
+ collector.on_html(:css, 'form[action="/login"]') do |form, ctx|
+   # Cookies from login will be automatically used in subsequent requests
+ end
+
+ collector.visit('https://site.com/login')
+ collector.visit('https://site.com/protected-content') # Uses login cookies
+ ```
+
+ ### Monitoring and Statistics
+
+ ```ruby
+ collector = Crawlr::Collector.new
+
+ # Get comprehensive statistics
+ stats = collector.stats
+ puts "Visited #{stats[:total_visits]} pages"
+ puts "Active callbacks: #{stats[:callbacks_count]}"
+ puts "Memory usage: #{stats[:visited_count]}/#{stats[:max_visited]} URLs tracked"
+
+ # Clone collectors for different tasks while sharing HTTP connections
+ product_scraper = collector.clone
+ product_scraper.on_html(:css, '.product') { |node, ctx| extract_product(node, ctx) }
+
+ review_scraper = collector.clone
+ review_scraper.on_html(:css, '.review') { |node, ctx| extract_review(node, ctx) }
+ ```
+
+ ## 🏗️ Architecture
+
+ crawlr is built with a modular architecture:
+
+ - **Collector**: Main orchestrator managing the scraping workflow
+ - **HTTPInterface**: Async HTTP client with proxy and cookie support
+ - **Parser**: Document parsing engine using Nokogiri
+ - **Callbacks**: Flexible callback system for data extraction
+ - **Hooks**: Event system for request/response lifecycle customization
+ - **Config**: Centralized configuration management
+ - **Visits**: Thread-safe URL deduplication and visit tracking
+ - **Domains**: Domain filtering and allowlist management
+ - **Robots**: Robots.txt parsing and compliance checking
+
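+ As an illustrative sketch only (the exact wiring is internal to `Crawlr::Collector`), the public API maps onto these components roughly like this:
+
+ ```ruby
+ # Config drives the Collector; Visits, Domains, and Robots gate each URL
+ # before HTTPInterface fetches it; Parser and Callbacks extract the data.
+ collector = Crawlr::Collector.new(max_depth: 2, ignore_robots_txt: false)
+
+ # Callbacks: run for every node matching the selector
+ collector.on_html(:css, 'a[href]') { |link, ctx| puts link['href'] }
+
+ # Hooks: observe the request/response lifecycle
+ collector.hook(:after_visit) { |url, response| puts "#{response.status} #{url}" }
+
+ collector.visit('https://example.com')
+ ```
+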
+ ## 🤝 Respectful Scraping
+
+ crawlr is designed to be a responsible scraping framework:
+
+ - **Robots.txt compliance**: Automatically fetches and respects robots.txt rules
+ - **Rate limiting**: Built-in delays and concurrency controls
+ - **User-Agent identification**: Clear identification in requests
+ - **Error handling**: Graceful handling of failures without overwhelming servers
+ - **Memory management**: Automatic cleanup to prevent resource exhaustion
+
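+ A polite configuration combining the points above might look like this (the options and the `before_visit` hook are documented elsewhere in this README; the User-Agent string is just an example):
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   ignore_robots_txt: false,  # honour robots.txt (the default)
+   max_parallelism: 2,        # keep concurrency low
+   random_delay: 1.5          # wait up to 1.5 seconds between requests
+ )
+
+ # Identify the bot clearly on every request
+ collector.hook(:before_visit) do |url, headers|
+   headers['User-Agent'] = 'MyCrawler/1.0 (+https://example.com/bot-info)'
+ end
+ ```
+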
+ ## 🔧 Configuration Options
+
+ | Option              | Default | Description                              |
+ | ------------------- | ------- | ---------------------------------------- |
+ | `timeout`           | 10      | HTTP request timeout in seconds          |
+ | `max_parallelism`   | 1       | Maximum concurrent requests              |
+ | `max_depth`         | 0       | Maximum crawling depth (0 = unlimited)   |
+ | `random_delay`      | 0       | Maximum random delay between requests    |
+ | `allow_url_revisit` | false   | Allow revisiting previously scraped URLs |
+ | `max_visited`       | 10,000  | Maximum URLs to track before cache reset |
+ | `allow_cookies`     | false   | Enable cookie jar management             |
+ | `ignore_robots_txt` | false   | Skip robots.txt checking                 |
+ | `max_retries`       | nil     | Maximum retry attempts (nil = disabled)  |
+ | `retry_delay`       | 1.0     | Base delay between retries               |
+ | `retry_backoff`     | 2.0     | Exponential backoff multiplier           |
+
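+ Assuming `retry_delay` and `retry_backoff` combine in the usual exponential-backoff way (base delay multiplied by the backoff factor on each successive attempt; this is an illustration, not a statement about crawlr's internals), the defaults would space retries roughly like this:
+
+ ```ruby
+ retry_delay   = 1.0
+ retry_backoff = 2.0
+
+ # Wait before retry attempt n (0-indexed): retry_delay * retry_backoff**n
+ 3.times.map { |n| retry_delay * retry_backoff**n }
+ #=> [1.0, 2.0, 4.0]
+ ```
+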
+ ## 🧪 Testing
+
+ Run the test suite:
+
+ ```bash
+ bundle exec rspec
+ ```
+
+ Run with coverage:
+
+ ```bash
+ COVERAGE=true bundle exec rspec
+ ```
+
+ ## 📖 Documentation
+
+ Generate API documentation:
+
+ ```bash
+ yard doc
+ ```
+
+ View documentation:
+
+ ```bash
+ yard server
+ ```
+
+ ## 🤝 Contributing
+
+ 1. Fork it (https://github.com/yourusername/crawlr/fork)
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
+ 3. Make your changes with tests
+ 4. Ensure all tests pass (`bundle exec rspec`)
+ 5. Commit your changes (`git commit -am 'Add amazing feature'`)
+ 6. Push to the branch (`git push origin feature/amazing-feature`)
+ 7. Create a new Pull Request
+
+ ## 📝 License
+
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+
+ ## 🙏 Acknowledgments
+
+ - Built with [Nokogiri](https://nokogiri.org/) for HTML/XML parsing
+ - Uses [Async](https://github.com/socketry/async) for high-performance concurrency
+ - Inspired by Python's Scrapy framework and modern Ruby practices
+
+ ## 📞 Support
+
+ - 📖 [Documentation](https://yourusername.github.io/crawlr)
+ - 🐛 [Issue Tracker](https://github.com/yourusername/crawlr/issues)
+ - 💬 [Discussions](https://github.com/yourusername/crawlr/discussions)
+
+ ---
+
+ **Happy Scraping! 🕷️✨**
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "rubocop/rake_task"
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
data/lib/crawlr/callbacks.rb ADDED
@@ -0,0 +1,177 @@
+ # frozen_string_literal: true
+
+ module Crawlr
+   # Manages callback registration and execution for document scraping operations.
+   #
+   # The Callbacks class provides a centralized way to register and manage
+   # callbacks that process specific nodes in HTML or XML documents using
+   # CSS or XPath selectors.
+   #
+   # @example Basic usage
+   #   callbacks = Crawlr::Callbacks.new
+   #   callbacks.register(:html, :css, '.title') do |node, context|
+   #     puts node.text
+   #   end
+   #
+   # @example Using XPath selectors
+   #   callbacks.register(:xml, :xpath, '//item[@id]') do |node, context|
+   #     process_item(node, context)
+   #   end
+   #
+   # @since 0.1.0
+   class Callbacks
+     # Supported document formats for scraping
+     # @return [Array<Symbol>] Array of allowed format symbols
+     ALLOWED_FORMATS = %i[html xml].freeze
+
+     # Supported selector types for element selection
+     # @return [Array<Symbol>] Array of allowed selector type symbols
+     ALLOWED_SELECTOR_TYPES = %i[css xpath].freeze
+
+     # Initializes a new Callbacks instance
+     #
+     # @example
+     #   callbacks = Crawlr::Callbacks.new
+     def initialize
+       @callbacks = []
+     end
+
+     # Returns a copy of all registered callbacks
+     #
+     # @return [Array<Hash>] Array of callback hashes containing format, selector_type, selector, and block
+     # @example
+     #   callbacks = instance.all
+     #   puts callbacks.length #=> 3
+     def all
+       @callbacks.dup
+     end
+
+     # Registers a new callback for processing matching nodes
+     #
+     # @param format [Symbol] The document format (:html or :xml)
+     # @param selector_type [Symbol] The selector type (:css or :xpath)
+     # @param selector [String] The selector string to match elements
+     # @param block [Proc] The callback block to execute when elements match
+     # @yieldparam node [Object] The matched DOM node
+     # @yieldparam ctx [Object] The scraping context object
+     # @return [void]
+     # @raise [ArgumentError] When format or selector_type is not supported
+     #
+     # @example Register a CSS selector callback
+     #   register(:html, :css, '.product-title') do |node, ctx|
+     #     ctx.titles << node.text.strip
+     #   end
+     #
+     # @example Register an XPath selector callback
+     #   register(:xml, :xpath, '//item[@price > 100]') do |node, ctx|
+     #     ctx.expensive_items << parse_item(node)
+     #   end
+     def register(format, selector_type, selector, &block)
+       validate_registration(format, selector_type)
+       @callbacks << {
+         format: format,
+         selector_type: selector_type,
+         selector: selector,
+         block: ->(node, ctx) { block.call(node, ctx) }
+       }
+     end
+
+     # Returns basic statistics about registered callbacks
+     #
+     # @return [Hash<Symbol, Integer>] Hash containing callback statistics
+     # @example
+     #   stats = instance.stats
+     #   puts stats[:callbacks_count] #=> 5
+     def stats
+       { callbacks_count: @callbacks.size }
+     end
+
+     # Clears all registered callbacks
+     #
+     # @return [Array] Empty callbacks array
+     # @example
+     #   instance.clear
+     #   puts instance.stats[:callbacks_count] #=> 0
+     def clear
+       @callbacks.clear
+     end
+
+ private
100
+
101
+ # Validates that the format and selector_type are supported
102
+ #
103
+ # @param format [Symbol] The document format to validate
104
+ # @param selector_type [Symbol] The selector type to validate
105
+ # @return [void]
106
+ # @raise [ArgumentError] When format is not in ALLOWED_FORMATS
107
+ # @raise [ArgumentError] When selector_type is not in ALLOWED_SELECTOR_TYPES
108
+ # @api private
109
+ def validate_registration(format, selector_type)
110
+ raise ArgumentError, "Unsupported format: #{format}" unless ALLOWED_FORMATS.include?(format)
111
+ return if ALLOWED_SELECTOR_TYPES.include?(selector_type)
112
+
113
+ raise ArgumentError, "Unsupported selector type: #{selector_type}"
114
+ end
115
+
116
+ # Alternative registration method using formatted input strings
117
+ #
118
+ # @param format [Symbol] The document format (:html or :xml)
119
+ # @param input [String] Formatted input string (e.g., "css@.selector" or "xpath@//element")
120
+ # @param block [Proc] The callback block to execute when elements match
121
+ # @yieldparam node [Object] The matched DOM node
122
+ # @yieldparam ctx [Object] The scraping context object
123
+ # @return [void]
124
+ # @raise [ArgumentError] When format is not supported
125
+ # @raise [ArgumentError] When selector_type parsed from input is not supported
126
+ # @raise [ArgumentError] When input format is invalid
127
+ # @api private
128
+ #
129
+ # @example Using CSS selector input format
130
+ # register_from_input(:html, "css@.product-name") do |node, ctx|
131
+ # # Process node
132
+ # end
133
+ #
134
+ # @example Using XPath selector input format
135
+ # register_from_input(:xml, "xpath@//item[@id]") do |node, ctx|
136
+ # # Process node
137
+ # end
138
+ #
139
+ # @note This is a potential shorthand method that may be exposed in future versions
140
+ def register_from_input(format, input, &block)
141
+ raise ArgumentError, "Unsupported format: #{format}" unless ALLOWED_FORMATS.include?(format)
142
+
143
+ selector_type, selector = parse_input(input)
144
+ unless ALLOWED_SELECTOR_TYPES.include?(selector_type)
145
+ raise ArgumentError, "Unsupported selector type: #{selector_type}"
146
+ end
147
+
148
+ register(format, selector_type, selector, &block)
149
+ end
150
+
151
+ # Parses formatted input strings to extract selector type and selector
152
+ #
153
+ # @param input [String] Formatted input string with type prefix
154
+ # @return [Array<(Symbol, String)>] Tuple of [selector_type, selector]
155
+ # @raise [ArgumentError] When input format doesn't match expected patterns
156
+ # @api private
157
+ #
158
+ # @example Parse CSS selector input
159
+ # parse_input("css@.my-class") #=> [:css, ".my-class"]
160
+ #
161
+ # @example Parse XPath selector input
162
+ # parse_input("xpath@//div[@id='main']") #=> [:xpath, "//div[@id='main']"]
163
+ def parse_input(input)
164
+ if input.start_with?("css@")
165
+ selector_type = :css
166
+ selector = input[4..]
167
+ elsif input.start_with?("xpath@")
168
+ selector_type = :xpath
169
+ selector = input[6..]
170
+ else
171
+ raise ArgumentError, "Unsupported input format: #{input}"
172
+ end
173
+
174
+ [selector_type, selector]
175
+ end
176
+ end
177
+ end