crawlr 0.1.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/README.md +26 -36
- data/examples/basic_visit.rb +19 -0
- data/examples/nested_visit.rb +36 -0
- data/examples/paginated_visit.rb +27 -0
- data/lib/crawlr/collector.rb +18 -7
- data/lib/crawlr/config.rb +0 -1
- data/lib/crawlr/domains.rb +0 -1
- data/lib/crawlr/http_interface.rb +0 -1
- data/lib/crawlr/parser.rb +0 -1
- data/lib/crawlr/robots.rb +0 -1
- data/lib/crawlr/version.rb +1 -1
- data/lib/crawlr/visits.rb +0 -1
- data/lib/crawlr.rb +3 -1
- metadata +6 -4
- data/sig/crawlr.rbs +0 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe3c5b1d19db6a4fda1bd66a9e2c62a1b2bdb80c361fe06e84023a6bf3f024bb
+  data.tar.gz: 6f26c3350a3cbf7e967899d8f5490312d83caa8ad9223cefcb5ad8423bec1e97
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4c58780044aa20341737127823958728deb6b3574c781cb804db45e5c81971678058f779657b379bfa566c0608d273c72ec8331e2226885f71e3d476af1c0076
+  data.tar.gz: a094872a4ad346cae330a6daa894c6a49a72e7082f9279fa878dd14d09f7fdbccad5617433e683d15309eb4b1f14bcc05aa59cd47f2f7a9c460a5b2728530ad0
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,16 @@
 ## [Unreleased]
 
+## [0.2.1] - 2025-09-30
+
+- Fix paginated_visit to properly handle provided url queries (if present)
+- Update paginated_visit batch size parameter to respect max_depth (if max_depth set > 0)
+
+## [0.2.0] - 2025-09-30
+
+- Tidied up documentation and inline comments
+- Fixed small bugs caused by typos
+- Added a few examples demonstrating usage
+
 ## [0.1.0] - 2025-09-29
 
 - Initial release
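To make the 0.2.1 notes concrete, the sketch below (not part of the released package) calls `paginated_visit` in the same form as the bundled `examples/paginated_visit.rb`, but against a hypothetical start URL that already carries a `sort` query parameter; the expected query handling and the `max_depth` cap on the batch size are inferences from the changelog wording.

```ruby
require "crawlr"

# Hypothetical target URL and selector; per the 0.2.1 entries, the existing
# "sort" parameter should survive pagination, and with max_depth > 0 the
# number of pages scheduled per batch is capped by max_depth.
clct = Crawlr::Collector.new(max_depth: 3)
mu = Mutex.new
titles = []

clct.paginated_visit("https://example.com/articles?sort=newest") do |c|
  c.on_html(:css, "h2.article-title") do |node, _ctx|
    mu.synchronize { titles << node.text.strip }
  end
end

puts "Collected #{titles.size} titles"
```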
data/README.md
CHANGED
@@ -3,7 +3,7 @@
 A powerful, async Ruby web scraping framework designed for respectful and efficient data extraction. Built with modern Ruby practices, crawlr provides a clean API for scraping websites while respecting robots.txt, managing cookies, rotating proxies, and handling complex scraping scenarios.
 
 [](https://badge.fury.io/rb/crawlr)
-[](https://github.com/aristorap/crawlr/actions/workflows/ruby.yml)
 
 ## ✨ Features
 
@@ -71,20 +71,20 @@ collector.visit('https://example.com')
 
 ```ruby
 collector = Crawlr::Collector.new
-
+products = []
 # Extract product information
-collector.
-
-
-
-
-
-
-
-
+collector.visit('https://shop.example.com/products') do |c|
+  c.on_html(:css, '.product') do |product, ctx|
+    data = {
+      name: product.css('.product-name').text.strip,
+      price: product.css('.price').text.strip,
+      image: product.css('img')&.first&.[]('src')
+    }
+
+    products << data
+  end
 end
-
-collector.visit('https://shop.example.com/products')
+# do something with data
 ```
 
 ### API Scraping with Pagination
@@ -94,14 +94,16 @@ collector = Crawlr::Collector.new(
   max_parallelism: 10,
   timeout: 30
 )
+mu = Mutex.new
+items = Array.new
 
-collector.on_xml(:css, 'item') do |item,
-
-  ctx.items << {
+collector.on_xml(:css, 'item') do |item, _ctx|
+  data = {
     id: item.css('id').text,
     title: item.css('title').text,
     published: item.css('published').text
   }
+  mu.synchronize { items << data }
 end
 
 # Automatically handles pagination with ?page=1, ?page=2, etc.
@@ -168,13 +170,11 @@ end
 # Process responses after each request
 collector.hook(:after_visit) do |url, response|
   puts "Got #{response.status} from #{url}"
-  log_response_time(url, response.headers['X-Response-Time'])
 end
 
 # Handle errors gracefully
 collector.hook(:on_error) do |url, error|
   puts "Failed to scrape #{url}: #{error.message}"
-  error_tracker.record(url, error)
 end
 ```
 
@@ -182,15 +182,11 @@ end
 
 ```ruby
 collector.on_html(:xpath, '//div[@class="content"]//p[position() <= 3]') do |paragraph, ctx|
-  #
-  ctx.content_paragraphs ||= []
-  ctx.content_paragraphs << paragraph.text.strip
+  # Do stuff
 end
 
 collector.on_xml(:xpath, '//item[price > 100]/title') do |title, ctx|
-  #
-  ctx.expensive_items ||= []
-  ctx.expensive_items << title.text
+  # Do stuff
 end
 ```
 
@@ -199,16 +195,12 @@ end
 ```ruby
 collector = Crawlr::Collector.new(allow_cookies: true)
 
-#
-collector.on_html(:css, 'form[action="/login"]') do |form, ctx|
-  # Cookies from login will be automatically used in subsequent requests
-end
-
+# First visit will set cookies tor following requests
 collector.visit('https://site.com/login')
 collector.visit('https://site.com/protected-content') # Uses login cookies
 ```
 
-###
+### Stats
 
 ```ruby
 collector = Crawlr::Collector.new
@@ -297,7 +289,7 @@ yard server
 
 ## 🤝 Contributing
 
-1. Fork it (https://github.com/
+1. Fork it (https://github.com/aristorap/crawlr/fork)
 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
 3. Make your changes with tests
 4. Ensure all tests pass (`bundle exec rspec`)
@@ -313,13 +305,11 @@ This gem is available as open source under the terms of the [MIT License](https:
 
 - Built with [Nokogiri](https://nokogiri.org/) for HTML/XML parsing
 - Uses [Async](https://github.com/socketry/async) for high-performance concurrency
-- Inspired by
-
+- Inspired by Golang's [Colly](https://go-colly.org) framework and modern Ruby practices
 ## 📞 Support
 
-- 📖 [Documentation](https://
-- 🐛 [Issue Tracker](https://github.com/
-- 💬 [Discussions](https://github.com/yourusername/crawlr/discussions)
+- 📖 [Documentation TBD](https://aristorap.github.io/crawlr)
+- 🐛 [Issue Tracker](https://github.com/aristorap/crawlr/issues)
 
 ---
 
data/examples/basic_visit.rb
ADDED
@@ -0,0 +1,19 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new
+gems = []
+
+# Visit the RubyGems popular releases page
+clct.visit("https://rubygems.org/releases/popular") do |collector|
+  # Extract gem links using a CSS selector
+  # The callback will be executed for each matched node
+  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
+    link = node["href"]
+    gems << ctx.resolve_url(link) if link
+  end
+end
+
+# Print results
+puts "Found #{gems.size} gems"
+gems.each { |g| puts g }
data/examples/nested_visit.rb
ADDED
@@ -0,0 +1,36 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new(
+  max_depth: 2,       # Limit unbounded crawls
+  random_delay: 1,    # Maximum random delay between requests
+  max_parallelism: 5  # Maximum concurrent requests
+)
+
+# Create a map to store gem metadata
+# Use a thread-safe map due to parallel processing
+gems_meta = Concurrent::Map.new
+
+# Visit the RubyGems popular releases page
+clct.visit("https://rubygems.org/releases/popular") do |c|
+  # Grab main container
+  c.on_html(:css, ".main--interior") do |node, ctx|
+    # Grab all gem links
+    gems = []
+    node.css("a.gems__gem").each do |a|
+      gems << ctx.resolve_url(a["href"])
+    end
+    # Visit each gem page
+    c.visit(gems, ctx.increment_depth) # Use context helper method to set depth for accurate tracking
+  end
+
+  # This callback will be matched on the individual gem pages
+  c.on_html(:css, "h2.gem__downloads__heading:nth-child(1) > span:nth-child(1)") do |node, ctx|
+    gems_meta[ctx.page_url] = node.text
+  end
+end
+
+# Print results
+puts "Found #{gems_meta.size} gems"
+
+gems_meta.each_pair { |k, v| puts "#{k} => #{v}" }
data/examples/paginated_visit.rb
ADDED
@@ -0,0 +1,27 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new(
+  max_depth: 2,
+  random_delay: 1,
+  max_parallelism: 5
+)
+mu = Mutex.new
+gems = []
+
+# Visit the RubyGems popular releases page with pagination
+# Set max depth in collector config to limit crawl depth
+clct.paginated_visit("https://rubygems.org/releases/popular") do |collector|
+  # Extract gem links using a CSS selector
+  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
+    link = node["href"]
+    if link
+      full_link = ctx.resolve_url(link) # Resolve relative URL using context helper method
+      mu.synchronize { gems << full_link }
+    end
+  end
+end
+
+# Print results
+puts "Found #{gems.size} gems"
+gems.each { |g| puts g }
data/lib/crawlr/collector.rb
CHANGED
@@ -53,7 +53,6 @@ module Crawlr
   # puts "Failed to scrape #{url}: #{error.message}"
   # end
   #
-  # @author [Your Name]
   # @since 0.1.0
   class Collector
     # @return [Crawlr::Config] The configuration object for this collector
@@ -214,7 +213,7 @@ module Crawlr
       return unless valid_url?(url)
 
       yield self if block_given?
-
+      fetch_robots_txt(url) unless @config.ignore_robots_txt
       return unless can_visit?(url, @config.headers)
 
       pages_to_visit = build_initial_pages(url, query, batch_size, start_page)
@@ -571,17 +570,19 @@
     end
 
     def build_initial_pages(url, query, batch_size, start_page)
-
+      uri = URI.parse(url)
+      max_batch = @config.max_depth.zero? ? batch_size : [@config.max_depth, batch_size].min
+
       if start_page == 1
-        [url] + (max_batch - 1).times.map { |i|
+        [url] + (max_batch - 1).times.map { |i| build_page_url(uri, query, i + 2) }
       else
-        max_batch.times.map { |i|
+        max_batch.times.map { |i| build_page_url(uri, query, i + start_page) }
       end
     end
 
     def process_page_batches(pages, current_depth, batch_size, query)
       scheduled_depth = current_depth
-      max_batch = [@config.max_depth, batch_size].min
+      max_batch = @config.max_depth.zero? ? batch_size : [@config.max_depth, batch_size].min
 
       loop do
         break if reached_max_depth?(scheduled_depth)
@@ -626,7 +627,17 @@
     end
 
     def generate_next_pages(batch, scheduled_depth, max_batch, query)
-
+      uri = URI.parse(batch.first)
+      (0...max_batch).map { |i| build_page_url(uri, query, i + scheduled_depth + 1) }
+    end
+
+    def build_page_url(uri, query, value)
+      new_uri = uri.dup
+      params = URI.decode_www_form(new_uri.query || "")
+      params.reject! { |k, _| k == query }
+      params << [query, value]
+      new_uri.query = URI.encode_www_form(params)
+      new_uri.to_s
     end
   end
 end
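The URL handling introduced by `build_page_url` can be sanity-checked outside the gem. The snippet below restates the same `URI` logic from the hunk above as a standalone method and runs it on a made-up URL; it is an illustration of the behavior, not code shipped in the package.

```ruby
require "uri"

# Standalone restatement of the build_page_url logic shown above:
# drop any existing occurrence of the pagination parameter, re-append it
# with the requested value, and leave unrelated parameters untouched.
def build_page_url(uri, query, value)
  new_uri = uri.dup
  params = URI.decode_www_form(new_uri.query || "")
  params.reject! { |k, _| k == query }
  params << [query, value]
  new_uri.query = URI.encode_www_form(params)
  new_uri.to_s
end

uri = URI.parse("https://example.com/items?sort=asc&page=9")
puts build_page_url(uri, "page", 3)
# => https://example.com/items?sort=asc&page=3
```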
data/lib/crawlr/config.rb
CHANGED
data/lib/crawlr/domains.rb
CHANGED
data/lib/crawlr/parser.rb
CHANGED
@@ -64,7 +64,6 @@ module Crawlr
   # # HTML content parsed once, all callbacks executed on same document
   # Crawlr::Parser.apply_callbacks(content: html, callbacks: callbacks, context: ctx)
   #
-  # @author [Your Name]
   # @since 0.1.0
   module Parser
     # Applies registered callbacks to parsed document content
data/lib/crawlr/robots.rb
CHANGED
@@ -58,7 +58,6 @@ module Crawlr
   # robots.allowed?('https://example.com/temp/secret.txt', 'Bot') #=> false
   # robots.allowed?('https://example.com/temp/public/file.txt', 'Bot') #=> true
   #
-  # @author [Your Name]
   # @since 0.1.0
   class Robots
     # Represents a robots.txt rule for a specific user-agent
data/lib/crawlr/version.rb
CHANGED
data/lib/crawlr/visits.rb
CHANGED
data/lib/crawlr.rb
CHANGED
@@ -1,9 +1,11 @@
 # frozen_string_literal: true
 
 require_relative "crawlr/version"
+require_relative "crawlr/collector"
+
+require "logger"
 
 # A Ruby scraping framework for parsing HTML and XML documents
-# @author [Your Name]
 # @since 0.1.0
 module Crawlr
   class Error < StandardError; end
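A small consequence of the added requires, inferred from this hunk: loading the gem's top-level file is now enough to reach the collector class, so the quick check below should work without requiring `crawlr/collector` explicitly.

```ruby
# Inference from the diff above: lib/crawlr.rb now pulls in the collector
# and the standard Logger, so one require is sufficient.
require "crawlr"

puts Crawlr::Collector #=> Crawlr::Collector
puts defined?(Logger)  #=> constant
```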
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: crawlr
 version: !ruby/object:Gem::Version
-  version: 0.1
+  version: 0.2.1
 platform: ruby
 authors:
 - Aristotelis Rapai
 bindir: exe
 cert_chain: []
-date:
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: async
@@ -166,6 +166,9 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- examples/basic_visit.rb
+- examples/nested_visit.rb
+- examples/paginated_visit.rb
 - lib/crawlr.rb
 - lib/crawlr/callbacks.rb
 - lib/crawlr/collector.rb
@@ -178,7 +181,6 @@ files:
 - lib/crawlr/robots.rb
 - lib/crawlr/version.rb
 - lib/crawlr/visits.rb
-- sig/crawlr.rbs
 homepage: https://github.com/aristorap/crawlr
 licenses:
 - MIT
@@ -203,7 +205,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.7.2
 specification_version: 4
 summary: A powerful, async Ruby web scraping framework
 test_files: []
data/sig/crawlr.rbs
DELETED