crawlr 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/README.md +26 -36
- data/examples/basic_visit.rb +19 -0
- data/examples/nested_visit.rb +36 -0
- data/examples/paginated_visit.rb +27 -0
- data/lib/crawlr/collector.rb +1 -2
- data/lib/crawlr/config.rb +0 -1
- data/lib/crawlr/domains.rb +0 -1
- data/lib/crawlr/http_interface.rb +0 -1
- data/lib/crawlr/parser.rb +0 -1
- data/lib/crawlr/robots.rb +0 -1
- data/lib/crawlr/version.rb +1 -1
- data/lib/crawlr/visits.rb +0 -1
- data/lib/crawlr.rb +3 -1
- data/rubygems.rb +18 -0
- metadata +5 -2
- data/sig/crawlr.rbs +0 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f4a2b21633eead87fe3b879552db225aecc2975cad779fd86413f2531cd3f079
+  data.tar.gz: 9b20eb81f931b0f514609e9b699a85a8f25d2e64e0fb7f8f6845343d4872a893
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3e5d343dd502ed23343ad0e6bfbe9fbe6b8696954e171a181ae385c8c679d60cfeeeec4d65cdbb9e841731664ae27fa4034405c8265ab12f967470e91321208a
+  data.tar.gz: ecb186a9d6e9a5f34a4e429b1c3971a1eaf7f708537d698557fab3272010c0f4fb953362b64ed498556867a0938387c70cc24f0b7f41563026be7f1157f373a1
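The new checksums can be reproduced against a locally downloaded artifact. A minimal sketch, assuming `gem fetch crawlr --version 0.2.0` has placed `crawlr-0.2.0.gem` in the current directory (a `.gem` file is a tar archive whose entries include `metadata.gz` and `data.tar.gz`):

```ruby
require "digest"
require "rubygems/package"

# Hash the metadata.gz and data.tar.gz entries of the downloaded gem
# and compare the output against the SHA256 values in checksums.yaml.
File.open("crawlr-0.2.0.gem", "rb") do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
      puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
    end
  end
end
```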
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -3,7 +3,7 @@
 A powerful, async Ruby web scraping framework designed for respectful and efficient data extraction. Built with modern Ruby practices, crawlr provides a clean API for scraping websites while respecting robots.txt, managing cookies, rotating proxies, and handling complex scraping scenarios.
 
 [](https://badge.fury.io/rb/crawlr)
-[](https://github.com/aristorap/crawlr/actions/workflows/ruby.yml)
 
 ## ✨ Features
 
@@ -71,20 +71,20 @@ collector.visit('https://example.com')
 
 ```ruby
 collector = Crawlr::Collector.new
-
+products = []
 # Extract product information
-collector.
+collector.visit('https://shop.example.com/products') do |c|
+  c.on_html(:css, '.product') do |product, ctx|
+    data = {
+      name: product.css('.product-name').text.strip,
+      price: product.css('.price').text.strip,
+      image: product.css('img')&.first&.[]('src')
+    }
+
+    products << data
+  end
 end
-
-collector.visit('https://shop.example.com/products')
+# do something with data
 ```
 
 ### API Scraping with Pagination
@@ -94,14 +94,16 @@ collector = Crawlr::Collector.new(
   max_parallelism: 10,
   timeout: 30
 )
+mu = Mutex.new
+items = Array.new
 
-collector.on_xml(:css, 'item') do |item,
-
-ctx.items << {
+collector.on_xml(:css, 'item') do |item, _ctx|
+  data = {
     id: item.css('id').text,
     title: item.css('title').text,
     published: item.css('published').text
   }
+  mu.synchronize { items << data }
 end
 
 # Automatically handles pagination with ?page=1, ?page=2, etc.
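The rewritten example guards the shared `items` array with a `Mutex` because callbacks may run concurrently when `max_parallelism` is above 1. A stdlib `Thread::Queue` is an alternative that is thread-safe without explicit locking; a sketch under the same callback API:

```ruby
collector = Crawlr::Collector.new(max_parallelism: 10, timeout: 30)
queue = Thread::Queue.new # safe for concurrent pushes, no Mutex needed

collector.on_xml(:css, 'item') do |item, _ctx|
  queue << { id: item.css('id').text, title: item.css('title').text }
end

# Once the crawl has finished, drain the queue into a plain array
items = []
items << queue.pop until queue.empty?
```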
@@ -168,13 +170,11 @@ end
 # Process responses after each request
 collector.hook(:after_visit) do |url, response|
   puts "Got #{response.status} from #{url}"
-  log_response_time(url, response.headers['X-Response-Time'])
 end
 
 # Handle errors gracefully
 collector.hook(:on_error) do |url, error|
   puts "Failed to scrape #{url}: #{error.message}"
-  error_tracker.record(url, error)
 end
 ```
 
@@ -182,15 +182,11 @@ end
 
 ```ruby
 collector.on_html(:xpath, '//div[@class="content"]//p[position() <= 3]') do |paragraph, ctx|
-  #
-  ctx.content_paragraphs ||= []
-  ctx.content_paragraphs << paragraph.text.strip
+  # Do stuff
 end
 
 collector.on_xml(:xpath, '//item[price > 100]/title') do |title, ctx|
-  #
-  ctx.expensive_items ||= []
-  ctx.expensive_items << title.text
+  # Do stuff
 end
 ```
 
@@ -199,16 +195,12 @@ end
 ```ruby
 collector = Crawlr::Collector.new(allow_cookies: true)
 
-#
-collector.on_html(:css, 'form[action="/login"]') do |form, ctx|
-  # Cookies from login will be automatically used in subsequent requests
-end
-
+# First visit will set cookies for following requests
 collector.visit('https://site.com/login')
 collector.visit('https://site.com/protected-content') # Uses login cookies
 ```
 
-###
+### Stats
 
 ```ruby
 collector = Crawlr::Collector.new
@@ -297,7 +289,7 @@ yard server
 
 ## 🤝 Contributing
 
-1. Fork it (https://github.com/
+1. Fork it (https://github.com/aristorap/crawlr/fork)
 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
 3. Make your changes with tests
 4. Ensure all tests pass (`bundle exec rspec`)
@@ -313,13 +305,11 @@ This gem is available as open source under the terms of the [MIT License](https:
 
 - Built with [Nokogiri](https://nokogiri.org/) for HTML/XML parsing
 - Uses [Async](https://github.com/socketry/async) for high-performance concurrency
-- Inspired by
-
+- Inspired by Golang's [Colly](https://go-colly.org) framework and modern Ruby practices
 ## 📞 Support
 
-- 📖 [Documentation](https://
-- 🐛 [Issue Tracker](https://github.com/
-- 💬 [Discussions](https://github.com/yourusername/crawlr/discussions)
+- 📖 [Documentation TBD](https://aristorap.github.io/crawlr)
+- 🐛 [Issue Tracker](https://github.com/aristorap/crawlr/issues)
 
 ---
 
data/examples/basic_visit.rb
ADDED
@@ -0,0 +1,19 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new
+gems = []
+
+# Visit the RubyGems popular releases page
+clct.visit("https://rubygems.org/releases/popular") do |collector|
+  # Extract gem links using a CSS selector
+  # The callback will be executed for each matched node
+  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
+    link = node["href"]
+    gems << ctx.resolve_url(link) if link
+  end
+end
+
+# Print results
+puts "Found #{gems.size} gems"
+gems.each { |g| puts g }
data/examples/nested_visit.rb
ADDED
@@ -0,0 +1,36 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new(
+  max_depth: 2,        # Limit unbounded crawls
+  random_delay: 1,     # Maximum random delay between requests
+  max_parallelism: 5   # Maximum concurrent requests
+)
+
+# Create a map to store gem metadata
+# Use a thread-safe map due to parallel processing
+gems_meta = Concurrent::Map.new
+
+# Visit the RubyGems popular releases page
+clct.visit("https://rubygems.org/releases/popular") do |c|
+  # Grab main container
+  c.on_html(:css, ".main--interior") do |node, ctx|
+    # Grab all gem links
+    gems = []
+    node.css("a.gems__gem").each do |a|
+      gems << ctx.resolve_url(a["href"])
+    end
+    # Visit each gem page
+    c.visit(gems, ctx.increment_depth) # Use context helper method to set depth for accurate tracking
+  end
+
+  # This callback will be matched on the individual gem pages
+  c.on_html(:css, "h2.gem__downloads__heading:nth-child(1) > span:nth-child(1)") do |node, ctx|
+    gems_meta[ctx.page_url] = node.text
+  end
+end
+
+# Print results
+puts "Found #{gems_meta.size} gems"
+
+gems_meta.each_pair { |k, v| puts "#{k} => #{v}" }
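`Concurrent::Map` comes from the concurrent-ruby gem; whether crawlr declares that dependency is not visible in this diff. A sketch of the same bookkeeping using only the standard library, a `Hash` guarded by a `Mutex`:

```ruby
require_relative "../lib/crawlr"

mu = Mutex.new
gems_meta = {}

clct = Crawlr::Collector.new(max_parallelism: 5)
clct.visit("https://rubygems.org/releases/popular") do |c|
  c.on_html(:css, "h2.gem__downloads__heading:nth-child(1) > span:nth-child(1)") do |node, ctx|
    # The mutex serializes writes from callbacks running in parallel
    mu.synchronize { gems_meta[ctx.page_url] = node.text }
  end
end
```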
data/examples/paginated_visit.rb
ADDED
@@ -0,0 +1,27 @@
+require_relative "../lib/crawlr"
+
+# Create a new collector instance
+clct = Crawlr::Collector.new(
+  max_depth: 2,
+  random_delay: 1,
+  max_parallelism: 5
+)
+mu = Mutex.new
+gems = []
+
+# Visit the RubyGems popular releases page with pagination
+# Set max depth in collector config to limit crawl depth
+clct.paginated_visit("https://rubygems.org/releases/popular") do |collector|
+  # Extract gem links using a CSS selector
+  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
+    link = node["href"]
+    if link
+      full_link = ctx.resolve_url(link) # Resolve relative URL using context helper method
+      mu.synchronize { gems << full_link }
+    end
+  end
+end
+
+# Print results
+puts "Found #{gems.size} gems"
+gems.each { |g| puts g }
data/lib/crawlr/collector.rb
CHANGED
@@ -53,7 +53,6 @@ module Crawlr
 #   puts "Failed to scrape #{url}: #{error.message}"
 # end
 #
-# @author [Your Name]
 # @since 0.1.0
 class Collector
   # @return [Crawlr::Config] The configuration object for this collector
@@ -214,7 +213,7 @@ module Crawlr
   return unless valid_url?(url)
 
   yield self if block_given?
-
+  fetch_robots_txt(url) unless @config.ignore_robots_txt
   return unless can_visit?(url, @config.headers)
 
   pages_to_visit = build_initial_pages(url, query, batch_size, start_page)
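This change fetches robots.txt ahead of the `can_visit?` check whenever robots handling is enabled. The user-facing switch is presumably the constructor option backing `@config.ignore_robots_txt`; the exact keyword name is an assumption, not confirmed by this diff:

```ruby
# Assumed option name, inferred from @config.ignore_robots_txt above
collector = Crawlr::Collector.new(ignore_robots_txt: true)
collector.visit("https://example.com") # no robots.txt request is made first
```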
data/lib/crawlr/config.rb
CHANGED
data/lib/crawlr/domains.rb
CHANGED
data/lib/crawlr/parser.rb
CHANGED
@@ -64,7 +64,6 @@ module Crawlr
 # # HTML content parsed once, all callbacks executed on same document
 # Crawlr::Parser.apply_callbacks(content: html, callbacks: callbacks, context: ctx)
 #
-# @author [Your Name]
 # @since 0.1.0
 module Parser
   # Applies registered callbacks to parsed document content
data/lib/crawlr/robots.rb
CHANGED
@@ -58,7 +58,6 @@ module Crawlr
 # robots.allowed?('https://example.com/temp/secret.txt', 'Bot') #=> false
 # robots.allowed?('https://example.com/temp/public/file.txt', 'Bot') #=> true
 #
-# @author [Your Name]
 # @since 0.1.0
 class Robots
   # Represents a robots.txt rule for a specific user-agent
data/lib/crawlr/version.rb
CHANGED
data/lib/crawlr/visits.rb
CHANGED
data/lib/crawlr.rb
CHANGED
@@ -1,9 +1,11 @@
 # frozen_string_literal: true
 
 require_relative "crawlr/version"
+require_relative "crawlr/collector"
+
+require "logger"
 
 # A Ruby scraping framework for parsing HTML and XML documents
-# @author [Your Name]
 # @since 0.1.0
 module Crawlr
   class Error < StandardError; end
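With `crawlr/collector` now required from the entry point, a single `require` reaches the main class after installing the gem:

```ruby
require "crawlr"

# Collector is loaded by lib/crawlr.rb itself as of 0.2.0,
# so no additional require is needed.
collector = Crawlr::Collector.new
```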
data/rubygems.rb
ADDED
@@ -0,0 +1,18 @@
+require "lib/crawlr"
+
+clct = Crawlr::Collector.new
+gems = []
+
+clct.visit("https://rubygems.org/releases/popular") do |collector|
+  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
+    link = node["href"]
+    full_link = ctx.resolve_url(link) if link
+    gems << full_link
+  end
+end
+
+puts "Found #{gems.size} gems"
+
+gems.each do |gem|
+  puts gem
+end
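As shipped, this script starts with `require "lib/crawlr"`, which fails on modern Ruby because the current directory is not on the load path, and it pushes `nil` into `gems` whenever a matched node has no `href`. A corrected sketch, assuming it is run from the repository root:

```ruby
require_relative "lib/crawlr" # resolve relative to this file, not the load path

clct = Crawlr::Collector.new
gems = []

clct.visit("https://rubygems.org/releases/popular") do |collector|
  collector.on_html(:css, ".main--interior a.gems__gem") do |node, ctx|
    link = node["href"]
    gems << ctx.resolve_url(link) if link # skip nodes without an href
  end
end

puts "Found #{gems.size} gems"
gems.each { |g| puts g }
```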
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: crawlr
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Aristotelis Rapai
@@ -166,6 +166,9 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- examples/basic_visit.rb
+- examples/nested_visit.rb
+- examples/paginated_visit.rb
 - lib/crawlr.rb
 - lib/crawlr/callbacks.rb
 - lib/crawlr/collector.rb
@@ -178,7 +181,7 @@ files:
 - lib/crawlr/robots.rb
 - lib/crawlr/version.rb
 - lib/crawlr/visits.rb
-- sig/crawlr.rbs
+- rubygems.rb
 homepage: https://github.com/aristorap/crawlr
 licenses:
 - MIT
data/sig/crawlr.rbs
DELETED