crawlr 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 7962445e19428525184ea2fb8dfcb76612c3143fd2764be3dd376c9bcb65ae69
+   data.tar.gz: b784eb2b27f6b170ac67c4a9c9113fc7e7ed4fb443fcd3145d3be5e24ab1194e
+ SHA512:
+   metadata.gz: bd8296ebd6bdc77bbf7a4200d9f211721a137bb74073e76fb8eae44007e05bcb894abdb5c4cb92efe28af0bd8c14b9d734a5d33420ffada3c7debcd4794027e3
+   data.tar.gz: ba6608820012fada66dbbf1026e7d52a8aa29290a714e658a0a4d904b6f6c7b685bc287353cd9e1319561b4ca990fd6d68542f3747727388130eb390060d0b33
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,9 @@
+ AllCops:
+   TargetRubyVersion: 3.1
+   SuggestExtensions: false
+
+ Style/StringLiterals:
+   EnforcedStyle: double_quotes
+
+ Style/StringLiteralsInInterpolation:
+   EnforcedStyle: double_quotes
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2025-09-29
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2025 Aristotelis Rapai
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,326 @@
+ # crawlr 🕷️
+
+ A powerful, async Ruby web scraping framework designed for respectful and efficient data extraction. Built with modern Ruby practices, crawlr provides a clean API for scraping websites while respecting robots.txt, managing cookies, rotating proxies, and handling complex scraping scenarios.
+
+ [![Gem Version](https://badge.fury.io/rb/crawlr.svg)](https://badge.fury.io/rb/crawlr)
+ [![Ruby](https://github.com/yourusername/crawlr/actions/workflows/ruby.yml/badge.svg)](https://github.com/yourusername/crawlr/actions/workflows/ruby.yml)
+
+ ## ✨ Features
+
+ - 🚀 **Async HTTP requests** with configurable concurrency
+ - 🤖 **Robots.txt compliance** with automatic parsing and rule enforcement
+ - 🍪 **Cookie management** with automatic persistence across requests
+ - 🔄 **Proxy rotation** with round-robin and random strategies
+ - 🎯 **Flexible selectors** supporting both CSS and XPath
+ - 🔧 **Extensible hooks** for request/response lifecycle events
+ - 📊 **Built-in statistics** and monitoring capabilities
+ - 🛡️ **Respectful crawling** with delays, depth limits, and visit tracking
+ - 🧵 **Thread-safe** operations for parallel scraping
+ - 📄 **Comprehensive logging** with configurable levels
+
+ ## 📦 Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'crawlr'
+ ```
+
+ And then execute:
+
+ ```bash
+ $ bundle install
+ ```
+
+ Or install it yourself as:
+
+ ```bash
+ $ gem install crawlr
+ ```
+
+ ## 🚀 Quick Start
+
+ ```ruby
+ require 'crawlr'
+
+ # Create a collector with configuration
+ collector = Crawlr::Collector.new(
+   max_depth: 3,
+   max_parallelism: 5,
+   random_delay: 1.0,
+   timeout: 15
+ )
+
+ # Register callbacks for data extraction
+ collector.on_html(:css, '.article-title') do |node, context|
+   puts "Found title: #{node.text.strip}"
+ end
+
+ collector.on_html(:css, 'a[href]') do |link, context|
+   href = link['href']
+   puts "Found link: #{href}" if href.start_with?('http')
+ end
+
+ # Start scraping
+ collector.visit('https://example.com')
+ ```
+
+ ## 📚 Usage Examples
+
+ ### Basic Web Scraping
+
+ ```ruby
+ collector = Crawlr::Collector.new
+
+ # Extract product information
+ collector.on_html(:css, '.product') do |product, ctx|
+   data = {
+     name: product.css('.product-name').text.strip,
+     price: product.css('.price').text.strip,
+     image: product.css('img')&.first&.[]('src')
+   }
+
+   ctx.products ||= []
+   ctx.products << data
+ end
+
+ collector.visit('https://shop.example.com/products')
+ ```
+
+ ### API Scraping with Pagination
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   max_parallelism: 10,
+   timeout: 30
+ )
+
+ collector.on_xml(:css, 'item') do |item, ctx|
+   ctx.items ||= []
+   ctx.items << {
+     id: item.css('id').text,
+     title: item.css('title').text,
+     published: item.css('published').text
+   }
+ end
+
+ # Automatically handles pagination with ?page=1, ?page=2, etc.
+ collector.paginated_visit(
+   'https://api.example.com/feed',
+   batch_size: 5,
+   start_page: 1
+ )
+ ```
+
+ ### Advanced Configuration
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   # Network settings
+   timeout: 20,
+   max_parallelism: 8,
+   random_delay: 2.0,
+
+   # Crawling behavior
+   max_depth: 5,
+   allow_url_revisit: false,
+   max_visited: 50_000,
+
+   # Proxy rotation
+   proxies: ['proxy1.com:8080', 'proxy2.com:8080'],
+   proxy_strategy: :round_robin,
+
+   # Respectful crawling
+   ignore_robots_txt: false,
+   allow_cookies: true,
+
+   # Error handling
+   max_retries: 3,
+   retry_delay: 1.0,
+   retry_backoff: 2.0
+ )
+ ```
+
+ ### Domain Filtering
+
+ ```ruby
+ # Allow specific domains
+ collector = Crawlr::Collector.new(
+   allowed_domains: ['example.com', 'api.example.com']
+ )
+
+ # Or use glob patterns
+ collector = Crawlr::Collector.new(
+   domain_glob: ['*.example.com', '*.trusted-site.*']
+ )
+ ```
+
+ ### Hooks for Custom Behavior
+
+ ```ruby
+ # Add custom headers before each request
+ collector.hook(:before_visit) do |url, headers|
+   headers['Authorization'] = "Bearer #{get_auth_token()}"
+   headers['X-Custom-Header'] = 'MyBot/1.0'
+   puts "Visiting: #{url}"
+ end
+
+ # Process responses after each request
+ collector.hook(:after_visit) do |url, response|
+   puts "Got #{response.status} from #{url}"
+   log_response_time(url, response.headers['X-Response-Time'])
+ end
+
+ # Handle errors gracefully
+ collector.hook(:on_error) do |url, error|
+   puts "Failed to scrape #{url}: #{error.message}"
+   error_tracker.record(url, error)
+ end
+ ```
+
+ ### XPath Selectors
+
+ ```ruby
+ collector.on_html(:xpath, '//div[@class="content"]//p[position() <= 3]') do |paragraph, ctx|
+   # Extract first 3 paragraphs from content divs
+   ctx.content_paragraphs ||= []
+   ctx.content_paragraphs << paragraph.text.strip
+ end
+
+ collector.on_xml(:xpath, '//item[price > 100]/title') do |title, ctx|
+   # Extract titles of expensive items from XML feeds
+   ctx.expensive_items ||= []
+   ctx.expensive_items << title.text
+ end
+ ```
+
+ ### Session Management with Cookies
+
+ ```ruby
+ collector = Crawlr::Collector.new(allow_cookies: true)
+
+ # Login first
+ collector.on_html(:css, 'form[action="/login"]') do |form, ctx|
+   # Cookies from login will be automatically used in subsequent requests
+ end
+
+ collector.visit('https://site.com/login')
+ collector.visit('https://site.com/protected-content') # Uses login cookies
+ ```
+
+ ### Monitoring and Statistics
+
+ ```ruby
+ collector = Crawlr::Collector.new
+
+ # Get comprehensive statistics
+ stats = collector.stats
+ puts "Visited #{stats[:total_visits]} pages"
+ puts "Active callbacks: #{stats[:callbacks_count]}"
+ puts "Memory usage: #{stats[:visited_count]}/#{stats[:max_visited]} URLs tracked"
+
+ # Clone collectors for different tasks while sharing HTTP connections
+ product_scraper = collector.clone
+ product_scraper.on_html(:css, '.product') { |node, ctx| extract_product(node, ctx) }
+
+ review_scraper = collector.clone
+ review_scraper.on_html(:css, '.review') { |node, ctx| extract_review(node, ctx) }
+ ```
+
+ ## 🏗️ Architecture
+
+ crawlr is built with a modular architecture:
+
+ - **Collector**: Main orchestrator managing the scraping workflow
+ - **HTTPInterface**: Async HTTP client with proxy and cookie support
+ - **Parser**: Document parsing engine using Nokogiri
+ - **Callbacks**: Flexible callback system for data extraction
+ - **Hooks**: Event system for request/response lifecycle customization
+ - **Config**: Centralized configuration management
+ - **Visits**: Thread-safe URL deduplication and visit tracking
+ - **Domains**: Domain filtering and allowlist management
+ - **Robots**: Robots.txt parsing and compliance checking
+
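+ As an illustrative sketch only (the exact wiring is internal to `Crawlr::Collector`), the public API maps onto these components roughly like this:
+
+ ```ruby
+ # Config drives the Collector; Visits, Domains, and Robots gate each URL
+ # before HTTPInterface fetches it; Parser and Callbacks extract the data.
+ collector = Crawlr::Collector.new(max_depth: 2, ignore_robots_txt: false)
+
+ # Callbacks: run for every node matching the selector
+ collector.on_html(:css, 'a[href]') { |link, ctx| puts link['href'] }
+
+ # Hooks: observe the request/response lifecycle
+ collector.hook(:after_visit) { |url, response| puts "#{response.status} #{url}" }
+
+ collector.visit('https://example.com')
+ ```
+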
+ ## 🤝 Respectful Scraping
+
+ crawlr is designed to be a responsible scraping framework:
+
+ - **Robots.txt compliance**: Automatically fetches and respects robots.txt rules
+ - **Rate limiting**: Built-in delays and concurrency controls
+ - **User-Agent identification**: Clear identification in requests
+ - **Error handling**: Graceful handling of failures without overwhelming servers
+ - **Memory management**: Automatic cleanup to prevent resource exhaustion
+
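+ A polite configuration combining the points above might look like this (the options and the `before_visit` hook are documented elsewhere in this README; the User-Agent string is just an example):
+
+ ```ruby
+ collector = Crawlr::Collector.new(
+   ignore_robots_txt: false,  # honour robots.txt (the default)
+   max_parallelism: 2,        # keep concurrency low
+   random_delay: 1.5          # wait up to 1.5 seconds between requests
+ )
+
+ # Identify the bot clearly on every request
+ collector.hook(:before_visit) do |url, headers|
+   headers['User-Agent'] = 'MyCrawler/1.0 (+https://example.com/bot-info)'
+ end
+ ```
+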
+ ## 🔧 Configuration Options
+
+ | Option              | Default | Description                              |
+ | ------------------- | ------- | ---------------------------------------- |
+ | `timeout`           | 10      | HTTP request timeout in seconds          |
+ | `max_parallelism`   | 1       | Maximum concurrent requests              |
+ | `max_depth`         | 0       | Maximum crawling depth (0 = unlimited)   |
+ | `random_delay`      | 0       | Maximum random delay between requests    |
+ | `allow_url_revisit` | false   | Allow revisiting previously scraped URLs |
+ | `max_visited`       | 10,000  | Maximum URLs to track before cache reset |
+ | `allow_cookies`     | false   | Enable cookie jar management             |
+ | `ignore_robots_txt` | false   | Skip robots.txt checking                 |
+ | `max_retries`       | nil     | Maximum retry attempts (nil = disabled)  |
+ | `retry_delay`       | 1.0     | Base delay between retries               |
+ | `retry_backoff`     | 2.0     | Exponential backoff multiplier           |
+
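+ Assuming `retry_delay` and `retry_backoff` combine in the usual exponential-backoff way (base delay multiplied by the backoff factor on each successive attempt; this is an illustration, not a statement about crawlr's internals), the defaults would space retries roughly like this:
+
+ ```ruby
+ retry_delay   = 1.0
+ retry_backoff = 2.0
+
+ # Wait before retry attempt n (0-indexed): retry_delay * retry_backoff**n
+ 3.times.map { |n| retry_delay * retry_backoff**n }
+ #=> [1.0, 2.0, 4.0]
+ ```
+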
+ ## 🧪 Testing
+
+ Run the test suite:
+
+ ```bash
+ bundle exec rspec
+ ```
+
+ Run with coverage:
+
+ ```bash
+ COVERAGE=true bundle exec rspec
+ ```
+
+ ## 📖 Documentation
+
+ Generate API documentation:
+
+ ```bash
+ yard doc
+ ```
+
+ View documentation:
+
+ ```bash
+ yard server
+ ```
+
+ ## 🤝 Contributing
+
+ 1. Fork it (https://github.com/yourusername/crawlr/fork)
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
+ 3. Make your changes with tests
+ 4. Ensure all tests pass (`bundle exec rspec`)
+ 5. Commit your changes (`git commit -am 'Add amazing feature'`)
+ 6. Push to the branch (`git push origin feature/amazing-feature`)
+ 7. Create a new Pull Request
+
+ ## 📝 License
+
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+
+ ## 🙏 Acknowledgments
+
+ - Built with [Nokogiri](https://nokogiri.org/) for HTML/XML parsing
+ - Uses [Async](https://github.com/socketry/async) for high-performance concurrency
+ - Inspired by Python's Scrapy framework and modern Ruby practices
+
+ ## 📞 Support
+
+ - 📖 [Documentation](https://yourusername.github.io/crawlr)
+ - 🐛 [Issue Tracker](https://github.com/yourusername/crawlr/issues)
+ - 💬 [Discussions](https://github.com/yourusername/crawlr/discussions)
+
+ ---
+
+ **Happy Scraping! 🕷️✨**
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "rubocop/rake_task"
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
data/lib/crawlr/callbacks.rb ADDED
@@ -0,0 +1,177 @@
+ # frozen_string_literal: true
+
+ module Crawlr
+   # Manages callback registration and execution for document scraping operations.
+   #
+   # The Callbacks class provides a centralized way to register and manage
+   # callbacks that process specific nodes in HTML or XML documents using
+   # CSS or XPath selectors.
+   #
+   # @example Basic usage
+   #   callbacks = Crawlr::Callbacks.new
+   #   callbacks.register(:html, :css, '.title') do |node, context|
+   #     puts node.text
+   #   end
+   #
+   # @example Using XPath selectors
+   #   callbacks.register(:xml, :xpath, '//item[@id]') do |node, context|
+   #     process_item(node, context)
+   #   end
+   #
+   # @since 0.1.0
+   class Callbacks
+     # Supported document formats for scraping
+     # @return [Array<Symbol>] Array of allowed format symbols
+     ALLOWED_FORMATS = %i[html xml].freeze
+
+     # Supported selector types for element selection
+     # @return [Array<Symbol>] Array of allowed selector type symbols
+     ALLOWED_SELECTOR_TYPES = %i[css xpath].freeze
+
+     # Initializes a new Callbacks instance
+     #
+     # @example
+     #   callbacks = Crawlr::Callbacks.new
+     def initialize
+       @callbacks = []
+     end
+
+     # Returns a copy of all registered callbacks
+     #
+     # @return [Array<Hash>] Array of callback hashes containing format, selector_type, selector, and block
+     # @example
+     #   callbacks = instance.all
+     #   puts callbacks.length #=> 3
+     def all
+       @callbacks.dup
+     end
+
+     # Registers a new callback for processing matching nodes
+     #
+     # @param format [Symbol] The document format (:html or :xml)
+     # @param selector_type [Symbol] The selector type (:css or :xpath)
+     # @param selector [String] The selector string to match elements
+     # @param block [Proc] The callback block to execute when elements match
+     # @yieldparam node [Object] The matched DOM node
+     # @yieldparam ctx [Object] The scraping context object
+     # @return [void]
+     # @raise [ArgumentError] When format or selector_type is not supported
+     #
+     # @example Register a CSS selector callback
+     #   register(:html, :css, '.product-title') do |node, ctx|
+     #     ctx.titles << node.text.strip
+     #   end
+     #
+     # @example Register an XPath selector callback
+     #   register(:xml, :xpath, '//item[@price > 100]') do |node, ctx|
+     #     ctx.expensive_items << parse_item(node)
+     #   end
+     def register(format, selector_type, selector, &block)
+       validate_registration(format, selector_type)
+       @callbacks << {
+         format: format,
+         selector_type: selector_type,
+         selector: selector,
+         block: ->(node, ctx) { block.call(node, ctx) }
+       }
+     end
+
+     # Returns basic statistics about registered callbacks
+     #
+     # @return [Hash<Symbol, Integer>] Hash containing callback statistics
+     # @example
+     #   stats = instance.stats
+     #   puts stats[:callbacks_count] #=> 5
+     def stats
+       { callbacks_count: @callbacks.size }
+     end
+
+     # Clears all registered callbacks
+     #
+     # @return [Array] Empty callbacks array
+     # @example
+     #   instance.clear
+     #   puts instance.stats[:callbacks_count] #=> 0
+     def clear
+       @callbacks.clear
+     end
+
+ private
100
+
101
+ # Validates that the format and selector_type are supported
102
+ #
103
+ # @param format [Symbol] The document format to validate
104
+ # @param selector_type [Symbol] The selector type to validate
105
+ # @return [void]
106
+ # @raise [ArgumentError] When format is not in ALLOWED_FORMATS
107
+ # @raise [ArgumentError] When selector_type is not in ALLOWED_SELECTOR_TYPES
108
+ # @api private
109
+ def validate_registration(format, selector_type)
110
+ raise ArgumentError, "Unsupported format: #{format}" unless ALLOWED_FORMATS.include?(format)
111
+ return if ALLOWED_SELECTOR_TYPES.include?(selector_type)
112
+
113
+ raise ArgumentError, "Unsupported selector type: #{selector_type}"
114
+ end
115
+
116
+ # Alternative registration method using formatted input strings
117
+ #
118
+ # @param format [Symbol] The document format (:html or :xml)
119
+ # @param input [String] Formatted input string (e.g., "css@.selector" or "xpath@//element")
120
+ # @param block [Proc] The callback block to execute when elements match
121
+ # @yieldparam node [Object] The matched DOM node
122
+ # @yieldparam ctx [Object] The scraping context object
123
+ # @return [void]
124
+ # @raise [ArgumentError] When format is not supported
125
+ # @raise [ArgumentError] When selector_type parsed from input is not supported
126
+ # @raise [ArgumentError] When input format is invalid
127
+ # @api private
128
+ #
129
+ # @example Using CSS selector input format
130
+ # register_from_input(:html, "css@.product-name") do |node, ctx|
131
+ # # Process node
132
+ # end
133
+ #
134
+ # @example Using XPath selector input format
135
+ # register_from_input(:xml, "xpath@//item[@id]") do |node, ctx|
136
+ # # Process node
137
+ # end
138
+ #
139
+ # @note This is a potential shorthand method that may be exposed in future versions
140
+ def register_from_input(format, input, &block)
141
+ raise ArgumentError, "Unsupported format: #{format}" unless ALLOWED_FORMATS.include?(format)
142
+
143
+ selector_type, selector = parse_input(input)
144
+ unless ALLOWED_SELECTOR_TYPES.include?(selector_type)
145
+ raise ArgumentError, "Unsupported selector type: #{selector_type}"
146
+ end
147
+
148
+ register(format, selector_type, selector, &block)
149
+ end
150
+
151
+ # Parses formatted input strings to extract selector type and selector
152
+ #
153
+ # @param input [String] Formatted input string with type prefix
154
+ # @return [Array<(Symbol, String)>] Tuple of [selector_type, selector]
155
+ # @raise [ArgumentError] When input format doesn't match expected patterns
156
+ # @api private
157
+ #
158
+ # @example Parse CSS selector input
159
+ # parse_input("css@.my-class") #=> [:css, ".my-class"]
160
+ #
161
+ # @example Parse XPath selector input
162
+ # parse_input("xpath@//div[@id='main']") #=> [:xpath, "//div[@id='main']"]
163
+ def parse_input(input)
164
+ if input.start_with?("css@")
165
+ selector_type = :css
166
+ selector = input[4..]
167
+ elsif input.start_with?("xpath@")
168
+ selector_type = :xpath
169
+ selector = input[6..]
170
+ else
171
+ raise ArgumentError, "Unsupported input format: #{input}"
172
+ end
173
+
174
+ [selector_type, selector]
175
+ end
176
+ end
177
+ end