super_crawler 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +67 -98
- data/lib/super_crawler/{crawl_site.rb → crawl.rb} +20 -21
- data/lib/super_crawler/{crawl_page.rb → scrap.rb} +14 -25
- data/lib/super_crawler/version.rb +1 -1
- data/lib/super_crawler.rb +2 -2
- data/super_crawler.gemspec +0 -1
- metadata +4 -18
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ffb65575970b9bd45f3ac9fa80b004c96df3492a
+  data.tar.gz: 1a72997367a389dcfb67dcd913554207604ef3e2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 67dd33ed9ee8965a84cdcc22187212ea97c47c6a29bc868818b31246c827e342c486d0b12270d52eac1c1c8071cab51fef4dd441470abd1702905e84c814c31c
+  data.tar.gz: bd45255d527837c177b788f81412a82e0609a015a65e9d9423841e80bef362b62f854c55e1e1637cf8fc45bbf19d256bac4a66da9e804b8850128e0ca6f8d9a8
data/README.md
CHANGED
@@ -6,23 +6,16 @@ Easy (yet efficient) ruby gem to crawl your favorite website.
 
 Open your terminal, then:
 
-
-
+git clone https://github.com/htaidirt/super_crawler
+cd super_crawler
+bundle
+./bin/console
 
-
+Then
 
-
-
-
-```
-
-```ruby
-> sc = SuperCrawler::CrawlSite.new('https://gocardless.com')
-
-> sc.start # => Start crawling the website
-
-> sc.render(5) # => Show first 5 results of the crawling as sitemap
-```
+sc = SuperCrawler::Crawl.new('https://gocardless.com')
+sc.start(10) # => Start crawling the website using 10 threads
+sc.render(5) # => Show the first 5 results of the crawling as sitemap
 
 ## Installation
 
@@ -52,64 +45,57 @@ This gem is an experiment and can't be used for production purposes. Please, use
 
 There are also a lot of limitations that weren't handled due to time. You'll find more information on the limitations below.
 
-SuperCrawler gem was only tested on MRI and
+SuperCrawler gem was only tested on MRI 2.3.1 and Rubinius 2.5.8.
 
 ## Philosophy
 
-Starting from a URL,
+Starting from a given URL, the crawler extracts all the internal links and assets within the page. The links are added to a list of unique links for further exploration. The crawler repeats the exploration, visiting all the links, until no new link is found.
 
-Due to the heavy operations, and the time to access each page content, we will use threads to perform near-parallel processing.
+Due to the heavy operations (thousands of pages) and the network time needed to access each page's content, we use threads to perform near-parallel processing.
 
-In order to keep the code readable and structured,
+In order to keep the code readable and structured, we created two classes:
 
-- `SuperCrawler::
-- `SuperCrawler::
+- `SuperCrawler::Scrap` is responsible for scraping a single page and extracting all relevant information (internal links and assets)
+- `SuperCrawler::Crawl` is responsible for crawling a whole website by collecting and managing links (using `SuperCrawler::Scrap` on every internal link found). This class is also responsible for rendering results.
 
 ## More detailed use
 
 Open your favorite ruby console and require the gem:
 
-
-require 'super_crawler'
-```
+require 'super_crawler'
 
-###
+### Scraping a single web page
 
 Read the following if you would like to crawl a single web page and extract relevant information (internal links and assets).
 
-
-page = SuperCrawler::CrawlPage.new( url )
-```
+page = SuperCrawler::Scrap.new( url )
 
-Where `url` should be the URL of the page you would like to
+Where `url` should be the URL of the page you would like to scrape.
 
-**Nota:**
+**Nota:** If the given URL has a missing scheme (`http://` or `https://`), SuperCrawler will prepend `http://` to the URL.
 
 #### Get the encoded URL
 
 Run
 
-
-
-
-
-to get the encoded URL provided.
+page.url
+
+to get the encoded URL.
 
 #### Get internal links of a page
 
 Run
 
-
-
-
-
-to get a list of internal links within the crawled page. An internal link is a link that _has the same host than the page URL_. Subdomains are rejected.
+page.get_links
+
+to get the list of internal links in the page. An internal link is a link that _has the same scheme and host as the provided URL_. Subdomains are rejected.
 
 This method searches in the `href` attribute of all `<a>` anchor tags.
 
-**Nota:**
+**Nota:**
 
-
+- This method returns an array of absolute URLs (all internal links).
+- Bad links and special links (like mailto and javascript) are discarded.
 
 #### Get images of a page
 
@@ -129,92 +115,75 @@ to get a list of images links within the page. The images links are extracted fr
 
 Run
 
-
-page.get_stylesheets
-```
+page.get_stylesheets
 
-to get a list of
+to get a list of stylesheet links within the page. The links are extracted from the `href="..."` attribute of all `<link rel="stylesheet">` tags.
 
-**Nota:**
+**Nota:**
 
-
+- Inline styling isn't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### Get scripts of a page
 
 Run
 
-
-page.get_scripts
-```
+page.get_scripts
 
-to get a list of
+to get a list of script links within the page. The links are extracted from the `src="..."` attribute of all `<script>` tags.
 
-**Nota:**
+**Nota:**
 
-
+- Inline script isn't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### List all assets of a page
 
 Run
 
-
-page.get_assets
-```
+page.get_assets
 
-to get a list of all assets (images, stylesheets and scripts
+to get a list of all assets (links of images, stylesheets and scripts) as a hash of arrays.
 
 ### Crawling a whole web site
 
-
-
-```ruby
-sc = SuperCrawler::CrawlSite.new(url, count_threads)
-```
+sc = SuperCrawler::Crawl.new(url)
 
-where `url` is the URL of the
+where `url` is the URL of the website to crawl.
 
 Next, start the crawler:
 
-
-
-
+sc.start(number_of_threads)
+
+where `number_of_threads` is the number of threads that will perform the job (10 by default). **This can take some time, depending on the site to crawl.**
 
-
+To access the crawl results, use the following:
 
-
-
-```ruby
-sc.links # The array of internal links
-
-sc.crawl_results # Array of hashes containing links and assets for every link crawled
-```
+sc.links # The array of unique internal links
+sc.crawl_results # Array of hashes containing links and assets for every unique internal link found
 
 To see the crawling as a sitemap, use:
 
-
-sc.render(5) # Will render the sitemap of the first 5 pages
-```
+sc.render(5) # Will render the sitemap of the first 5 pages
 
-
+_TODO: Create a separate and more sophisticated rendering class that can render to files in different formats (HTML, XML, JSON, ...)_
 
 #### Tips on searching assets and links
 
 After `sc.start`, you can access all collected resources (links and assets) using `sc.crawl_results`. This has the following structure:
 
-
-
-
-
-
-
-
-
-
-
-
-
-]
-```
+[
+  {
+    url: 'http://example.com/',
+    links: [...array of internal links...],
+    assets: {
+      images: [...array of images links],
+      stylesheets: [...array of stylesheets links],
+      scripts: [...array of scripts links],
+    }
+  },
+  ...
+]
 
 You can use `sc.crawl_results.select{ |resource| ... }` to select a particular resource.
 
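For example, here is a minimal sketch of filtering those results. The start URL and the exact normalized URL stored in `crawl_results` are illustrative assumptions; only the hash keys (`:url`, `:links`, `:assets`) follow the structure shown above.

```ruby
require 'super_crawler'

sc = SuperCrawler::Crawl.new('https://gocardless.com')
sc.start(10)

# Pick the result of one crawled page (the exact URL string is whatever
# normalized form the crawler stored; adjust it to your own run).
home = sc.crawl_results.find { |resource| resource[:url] == 'https://gocardless.com/' }

if home
  puts home[:links].count            # number of internal links found on that page
  puts home[:assets][:images].first  # first image URL collected on that page
end
```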
@@ -223,12 +192,12 @@ You can use `sc.crawl_results.select{ |resource| ... }` to select a particular r
 
 Currently, the gem has the following limitations:
 
 - Subdomains are not considered as internal links
--
+- A link with the same domain but different scheme is ignored (http -> https, or the opposite)
 - Only links within `<a href="...">` tags are extracted
 - Only images links within `<img src="..."/>` tags are extracted
 - Only stylesheets links within `<link rel="stylesheet" href="..." />` tags are extracted
 - Only scripts links within `<script src="...">` tags are extracted
-- A page that is not accessible (
+- A page that is not accessible (status other than 200) is not checked later
 
 ## Development
 
@@ -238,11 +207,11 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/htaidirt/super_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+Bug reports and pull requests are welcome on GitHub at [https://github.com/htaidirt/super_crawler](https://github.com/htaidirt/super_crawler). This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
-
+Please, follow this process:
 
-1. Fork
+1. Fork the project
 2. Create your feature branch (git checkout -b my-new-feature)
 3. Commit your changes (git commit -am 'Add some feature')
 4. Push to the branch (git push origin my-new-feature)
data/lib/super_crawler/{crawl_site.rb → crawl.rb}
CHANGED
@@ -1,33 +1,32 @@
 require 'thread'
 
-require 'super_crawler/
+require 'super_crawler/scrap'
 
 module SuperCrawler
 
   ###
   # Crawl a whole website
   #
-  class
+  class Crawl
 
     attr_reader :links, :crawl_results
 
-    def initialize start_url,
+    def initialize start_url, options = {}
       @start_url = URI(URI.encode start_url).normalize().to_s # Normalize the given URL
       @links = [@start_url] # Will contain the list of all links found
       @crawl_results = [] # Will contain the crawl results (links and assets), as array of hashes
-      @threads = threads # How many threads to use? Default: 10
 
       @option_debug = options[:debug].nil? ? true : !!(options[:debug]) # Debug by default
     end
 
     ###
     # Start crawling site
-    # Could take a while
+    # Could take a while! Use threads to speed up crawling and log to inform user.
     #
-    def start
+    def start threads_count = 10
 
-      crawling_start_notice # Show message on what will happen
-      threads = [] # Will contain our threads
+      crawling_start_notice( @start_url, threads_count ) # Show message on what will happen
+      threads = [] # Will contain our n-threads
       @links_queue = Queue.new # Will contain the links queue that the threads will use
       @links = [@start_url] # Re-init the links list
       @crawl_results = [] # Re-init the crawling results
@@ -38,12 +37,12 @@ module SuperCrawler
       process_page( @start_url )
 
       # Create threads to handle new links
-
+      threads_count.times do # Create threads_count threads
 
-        threads << Thread.new do #
+        threads << Thread.new do # Instantiate a new thread
           begin
-            while current_link = @links_queue.pop(true) #
-              process_page( current_link ) # Get links and assets
+            while current_link = @links_queue.pop(true) # Pop one link after another
+              process_page( current_link ) # Get links and assets of the popped link
             end
           rescue ThreadError # Stop when empty links queue
           end
@@ -52,7 +51,7 @@ module SuperCrawler
       end
 
       threads.map(&:join) # Activate the threads
-      crawling_summary_notice(start_time, Time.now) if @option_debug # Display crawling summary
+      crawling_summary_notice(start_time, Time.now, threads_count) if @option_debug # Display crawling summary
 
       return true
     end
@@ -64,7 +63,7 @@ module SuperCrawler
     #
     def render max_pages = 10
       draw_line
-      puts "Showing first #{
+      puts "Showing first #{max_pages} crawled pages and their contents:\n\n"
       @crawl_results[0..(max_pages-1)].each_with_index do |result, index|
         puts "[#{index+1}] Content of #{result[:url]}\n"
 
@@ -90,13 +89,13 @@ module SuperCrawler
     # Process a page by extracting information and updating links queue, links list and results.
     #
     def process_page page_url
-      page = SuperCrawler::
+      page = SuperCrawler::Scrap.new(page_url) # Scrap the current page
 
       current_page_links = page.get_links # Get current page internal links
       new_links = current_page_links - @links # Select new links
 
       new_links.each { |link| @links_queue.push(link) } # Add new links to the queue
-      @links += new_links # Add new links to the
+      @links += new_links # Add new links to the links list
       @crawl_results << { # Provide current page crawl result as a hash
         url: page.url, # The crawled page
         links: current_page_links, # Its internal links
@@ -109,11 +108,11 @@ module SuperCrawler
     ###
     # Display a notice when starting a site crawl
     #
-    def crawling_start_notice
+    def crawling_start_notice start_url, threads
       draw_line
-      puts "Start crawling #{
+      puts "Start crawling #{start_url} using #{threads} threads. Crawling rules:"
       puts "1. Keep only internal links"
-      puts "2.
+      puts "2. Links with different scheme are ignored"
       puts "3. Remove the fragment part from the links (#...)"
       puts "4. Keep paths with different parameters (?...)"
       draw_line
@@ -132,11 +131,11 @@ module SuperCrawler
     ###
     # Display final crawling summary after site crawling complete
     #
-    def crawling_summary_notice time_start, time_end
+    def crawling_summary_notice time_start, time_end, threads
      total_time = time_end - time_start
      puts ""
      draw_line
-      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{
+      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{threads} threads."
      puts "Use .crawl_results to access the crawl results as an array of hashes."
      puts "Use .render to see the crawl_results as a sitemap."
      draw_line
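The `start` method above seeds a links `Queue`, processes the start URL, then drains the queue with `threads_count` worker threads that exit on `ThreadError` once the queue is empty. A minimal standalone sketch of that same queue-and-workers pattern (generic string items rather than the gem's API; the `Mutex` around the shared array is an addition for safety in this sketch):

```ruby
require 'thread'

queue   = Queue.new
results = []
mutex   = Mutex.new

20.times { |i| queue.push("page-#{i}") }   # seed the queue with work items

workers = 4.times.map do
  Thread.new do
    begin
      while item = queue.pop(true)         # non-blocking pop
        mutex.synchronize { results << "processed #{item}" }
      end
    rescue ThreadError                     # raised when the queue is empty: stop this worker
    end
  end
end

workers.each(&:join)
puts "#{results.size} items processed by #{workers.size} threads"
```

The gem's version differs in that each worker can also push newly discovered links back onto the queue while draining it.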
data/lib/super_crawler/{crawl_page.rb → scrap.rb}
CHANGED
@@ -1,20 +1,19 @@
 require "open-uri"
-require "open_uri_redirections"
 require "nokogiri"
 
 module SuperCrawler
 
   ###
-  #
+  # Scrap a single HTML page
   # Responsible for extracting all relevant information within a page
+  # (internal links and assets)
   #
-  class
+  class Scrap
 
     attr_reader :url
 
     def initialize url
-      # Normalize the URL, by adding http
-      # NOTA: By default, add http:// scheme to an URL that doesn't have one
+      # Normalize the URL, by adding a scheme (http) if not present in the URL
       @url = URI.encode( !!(url =~ /^(http(s)?:\/\/)/) ? url : ('http://' + url) )
     end
 
@@ -28,7 +27,7 @@ module SuperCrawler
       links = get_doc.css('a').map{ |link| link['href'] }.compact
 
       # Select only internal links (relative links, or absolute links with the same host)
-      links.select!{ |link| URI.parse(URI.encode link).host.nil? ||
+      links.select!{ |link| URI.parse(URI.encode link).host.nil? || link.start_with?( @url ) }
 
       # Reject bad matches links (like mailto, tel and javascript)
       links.reject!{ |link| !!(link =~ /^(mailto:|tel:|javascript:)/) }
@@ -97,9 +96,9 @@ module SuperCrawler
     #
     def get_assets
       {
-        'images'
-        'stylesheets'
-        'scripts'
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -109,10 +108,10 @@ module SuperCrawler
     #
     def get_all
       {
-        'links'
-        'images'
-        'stylesheets'
-        'scripts'
+        :'links' => get_links,
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -131,28 +130,18 @@ module SuperCrawler
     #
     def get_doc
       begin
-        @doc ||= Nokogiri(open( @url
+        @doc ||= Nokogiri(open( @url ))
       rescue Exception => e
         raise "Problem with URL #{@url}: #{e}"
       end
     end
 
-    ###
-    # Extract the base URL (scheme and host only)
-    #
-    # eg:
-    # http://mysite.com/abc -> http://mysite.com
-    # https://dev.mysite.co.uk/mylink -> https://dev.mysite.co.uk
-    def base_url
-      "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}"
-    end
-
     ###
     # Given a URL, return the absolute URL
     #
     def create_absolute_url url
       # Append the base URL (scheme+host) if the provided URL is relative
-      URI.parse(URI.encode url).host.nil? ? (
+      URI.parse(URI.encode url).host.nil? ? "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}#{url}" : url
     end
 
   end
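Putting the renamed class together, a minimal usage sketch (the domain is illustrative; the symbol keys follow the `get_assets`/`get_all` hashes above):

```ruby
require 'super_crawler'

page = SuperCrawler::Scrap.new('example.com')  # no scheme given: 'http://' gets prepended

page.url                   # => "http://example.com" (the encoded URL)
page.get_links             # => array of absolute internal links from <a href="...">
page.get_assets[:images]   # => array of image URLs from <img src="...">
page.get_all.keys          # => [:links, :images, :stylesheets, :scripts]
```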
data/lib/super_crawler.rb
CHANGED
data/super_crawler.gemspec
CHANGED
@@ -28,7 +28,6 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
 
   spec.add_dependency "nokogiri", "~> 1"
-  spec.add_dependency "open_uri_redirections", "~> 0.2"
   spec.add_dependency "thread", "~> 0.2"
 
   spec.add_development_dependency "bundler", "~> 1.10"
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: super_crawler
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Hassen Taidirt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-07-
+date: 2016-07-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -24,20 +24,6 @@ dependencies:
   - - "~>"
     - !ruby/object:Gem::Version
       version: '1'
-- !ruby/object:Gem::Dependency
-  name: open_uri_redirections
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
 - !ruby/object:Gem::Dependency
   name: thread
   requirement: !ruby/object:Gem::Requirement
@@ -113,8 +99,8 @@ files:
 - bin/console
 - bin/setup
 - lib/super_crawler.rb
-- lib/super_crawler/
-- lib/super_crawler/
+- lib/super_crawler/crawl.rb
+- lib/super_crawler/scrap.rb
 - lib/super_crawler/version.rb
 - super_crawler.gemspec
 homepage: https://github.com/htaidirt/super_crawler