super_crawler 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +19 -4
- data/lib/super_crawler/crawl.rb +29 -71
- data/lib/super_crawler/render.rb +89 -0
- data/lib/super_crawler/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef7ae8a5bb3aac480832a0024b70363788e268cb
+  data.tar.gz: c03d33300045b2e2053606d616b8e5f82af2e941
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 23e645e55ad85462fbfb81d0e518961b54e2986e16eaa256799dea2a2fe40759eae3bcbfef74261e87e59e495b3aa5598812622ca87c0e7448d0fb1ebcaf7504
+  data.tar.gz: 82fab161a1147e538b5e4f6324a9cae1a8d2bc2815834d2d7ecedc58aeafd96e2ec7145bdb5c62195374a316d7bc57efbf1157101ab00816c1b25ddb8bbc1026
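The new digests can be checked locally. Below is a minimal verification sketch, assuming the 0.2.1 `.gem` file has been fetched and unpacked in the current directory (a `.gem` is a plain tar archive containing `metadata.gz` and `data.tar.gz`); the printed values should match the `+` lines above.

    # $ gem fetch super_crawler --version 0.2.1
    # $ tar -xf super_crawler-0.2.1.gem
    require 'digest'

    %w(metadata.gz data.tar.gz).each do |file|
      puts "#{file} SHA1:   #{Digest::SHA1.file(file).hexdigest}"
      puts "#{file} SHA512: #{Digest::SHA512.file(file).hexdigest}"
    end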
data/README.md
CHANGED
@@ -79,7 +79,7 @@ Where `url` should be the URL of the page you would like to scrap.
 Run
 
     page.url
-
+
 to get the encoded URL.
 
 #### Get internal links of a page
@@ -87,7 +87,7 @@ to get the encoded URL.
 Run
 
     page.get_links
-
+
 to get the list of internal links in the page. An internal link is a link that _has the same schame and host than the provided URL_. Subdomains are rejected.
 
 This method searches in the `href` attribute of all `<a>` anchor tags.
@@ -154,7 +154,7 @@ where `url` is the URL of the website to crawl.
 Next, start the crawler:
 
     sc.start(number_of_threads)
-
+
 where `number_of_threads` is the number of threads that will perform the job (10 by default.) **This can take some time, depending on the site to crawl.**
 
 To access the crawl results, use the following:
@@ -166,7 +166,7 @@ To see the crawling as a sitemap, use:
 
     sc.render(5) # Will render the sitemap of the first 5 pages
 
-_TODO:
+_TODO: Make more sophisticated rendering methods, that can render within files of different formats (HTML, XML, JSON,...)_
 
 #### Tips on searching assets and links
 
@@ -187,6 +187,21 @@ After `sc.start`, you can access all collected resources (links and assets) using
 
 You can use `sc.crawl_results.select{ |resource| ... }` to select a particular resource.
 
+Example:
+
+    images = sc.crawl_results.map{ |page| page[:assets][:images] }.flatten.uniq
+    # => Returns an array of all unique images found during the crawling
+
+#### Get assets of a whole crawling
+
+You can collect in a single array any assets of a crawling, by using the following:
+
+    images = sc.get_assets :images # => Returns an array of unique images
+    stylesheets = sc.get_assets :stylesheets # => Returns an array of unique stylesheets
+    scripts = sc.get_assets :scripts # => Returns an array of unique scripts
+
+It is important to note that all the given arrays contain unique absolute URLs. As said before, the assets are not necessarily internal assets.
+
 ## Limitations
 
 Actually, the gem has the following limitations:
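Taken together, the README additions describe a small query layer over the crawl results. A short usage sketch combining them (assuming, as in the earlier README sections, that `sc` was built with `SuperCrawler::Crawl.new(url)` and that the crawl has finished; the URL and path fragment are illustrative):

    require 'super_crawler'

    sc = SuperCrawler::Crawl.new('http://example.com/')
    sc.start(10) # crawl with 10 threads

    # Pick a single page's record out of the array of hashes
    about_page = sc.crawl_results.select { |resource| resource[:url].include?('/about') }.first

    # Collect every unique asset of a given type across the whole crawl
    images      = sc.get_assets :images
    stylesheets = sc.get_assets :stylesheets
    scripts     = sc.get_assets :scripts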
data/lib/super_crawler/crawl.rb
CHANGED
@@ -1,11 +1,13 @@
 require 'thread'
 
 require 'super_crawler/scrap'
+require 'super_crawler/render'
 
 module SuperCrawler
 
   ###
   # Crawl a whole website
+  # For each new link detected, scrap the corresponding page.
   #
   class Crawl
 
@@ -25,11 +27,12 @@ module SuperCrawler
     #
     def start threads_count = 10
 
-      crawling_start_notice( @start_url, threads_count ) # Show message on what will happen
-
-
-      @
-      @
+      SuperCrawler::Render.crawling_start_notice( @start_url, threads_count ) if @option_debug # Show message on what will happen
+
+      threads = [] # Will contain our n-threads
+      @links_queue = Queue.new # Will contain the links queue that the threads will use
+      @links = [@start_url] # Re-init the links list
+      @crawl_results = [] # Re-init the crawling results
 
       start_time = Time.now if @option_debug # Start the timer
 
@@ -51,42 +54,40 @@ module SuperCrawler
       end
 
       threads.map(&:join) # Activate the threads
-      crawling_summary_notice(
+      SuperCrawler::Render.crawling_summary_notice(Time.now - start_time, threads_count, @links.count) if @option_debug # Display crawling summary
 
       return true
     end
 
     ###
-    # Render sitemap
-    # Show, for each link, internal links and assets
-    # We will limit pages to display, because some sites have more than 1,000 pages
+    # Render the crawling result as a sitemap in the console
     #
     def render max_pages = 10
-
-
-      @crawl_results[0..(max_pages-1)].each_with_index do |result, index|
-        puts "[#{index+1}] Content of #{result[:url]}\n"
-
-        puts " + Internal links: #{'None' if result[:links].empty?}"
-        result[:links].each { |link| puts " - #{link}" }
-
-        puts " + Internal images: #{'None' if result[:assets][:images].empty?}"
-        result[:assets][:images].each { |link| puts " - #{link}" }
-
-        puts " + Internal stylesheets: #{'None' if result[:assets][:stylesheets].empty?}"
-        result[:assets][:stylesheets].each { |link| puts " - #{link}" }
+      SuperCrawler::Render.console( @crawl_results, max_pages )
+    end
 
-
-
-
+    ###
+    # Get specific assets (images, stylesheets and scripts)
+    #
+    def get_assets asset
+      return [] if @crawl_results.empty? # No crawling yet? Return empty search
+
+      # The asset parameter can only be images, stylesheets or scripts
+      unless %w(images stylesheets scripts).include? asset.to_s
+        # Display error message in this case.
+        SuperCrawler::Render.error "`asset` parameter can only be `images`, `stylesheets` or `scripts`"
+        return [] # Return empty array
       end
-
+
+      # Good! Return flatten array of unique assets
+      return @crawl_results.map{ |cr| cr[:assets][asset.to_sym] }.flatten.uniq
     end
 
     private
 
     ###
-    # Process a page by extracting information and updating links queue,
+    # Process a page by extracting information and updating links queue,
+    # links list and results.
     #
     def process_page page_url
       page = SuperCrawler::Scrap.new(page_url) # Scrap the current page
@@ -102,50 +103,7 @@ module SuperCrawler
         assets: page.get_assets # Its assets
       }
 
-      log_status( page_url ) if @option_debug # Display site crawling status
-    end
-
-    ###
-    # Display a notice when starting a site crawl
-    #
-    def crawling_start_notice start_url, threads
-      draw_line
-      puts "Start crawling #{start_url} using #{threads} threads. Crawling rules:"
-      puts "1. Keep only internal links"
-      puts "2. Links with different scheme are agnored"
-      puts "3. Remove the fragment part from the links (#...)"
-      puts "4. Keep paths with different parameters (?...)"
-      draw_line
-    end
-
-    ###
-    # Log current search status (crawled links / total links)
-    #
-    def log_status url
-      text = "Crawled #{@crawl_results.length.to_s}/#{@links.length.to_s}: #{url}"
-      print "\r#{" "*100}\r" # Clean the previous text
-      print (text.length <= 50) ? text : "#{text[0..46]}..."
-      STDOUT.flush
-    end
-
-    ###
-    # Display final crawling summary after site crawling complete
-    #
-    def crawling_summary_notice time_start, time_end, threads
-      total_time = time_end - time_start
-      puts ""
-      draw_line
-      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{threads} threads."
-      puts "Use .crawl_results to access the crawl results as an array of hashes."
-      puts "Use .render to see the crawl_results as a sitemap."
-      draw_line
-    end
-
-    ###
-    # Draw a line (because readability is also important!!)
-    #
-    def draw_line
-      puts "#{'-' * 80}"
+      SuperCrawler::Render.log_status( page_url, @crawl_results.length, @links.length ) if @option_debug # Display site crawling status
     end
 
   end
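In short, `Crawl` now delegates all console output to the new `SuperCrawler::Render` class and gains a `get_assets` helper that validates its argument. A hedged sketch of the resulting behavior, using only calls visible in the diff above (the constructor call and URL are assumptions for illustration):

    sc = SuperCrawler::Crawl.new('http://example.com/')
    sc.start              # defaults to 10 threads
    sc.render(5)          # now delegates to SuperCrawler::Render.console

    sc.get_assets :images # => array of unique image URLs
    sc.get_assets :fonts  # prints a Render.error message and returns []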
data/lib/super_crawler/render.rb
ADDED
@@ -0,0 +1,89 @@
+module SuperCrawler
+
+  ##
+  # Render crawl results and processing.
+  #
+  class Render
+
+    ###
+    # Display error message in the console.
+    #
+    def self.error message
+      puts "\e[31m[ERROR]\e[0m #{message}"
+    end
+
+    ###
+    # Display a notice when starting a site crawl
+    #
+    def self.crawling_start_notice start_url, threads
+      self.draw_line
+      puts "Start crawling #{start_url} using #{threads} threads. Crawling rules:"
+      puts "1. Consider only links starting with #{start_url}"
+      puts "2. Remove the fragment part from the links (#...)"
+      puts "3. Keep paths with different parameters (?...)"
+      puts "4. Assets can be internal or external to the site"
+      self.draw_line
+    end
+
+    ###
+    # Render sitemap in console
+    # Show, for each link, internal links and assets
+    # We will limit pages to display, because some sites have more than 1,000 pages
+    #
+    def self.console crawl_results, max_pages
+      self.draw_line
+      puts "Showing first #{max_pages} crawled pages and their contents:\n\n"
+      crawl_results[0..(max_pages-1)].each_with_index do |result, index|
+        puts "[#{index+1}] Content of #{result[:url]}\n"
+
+        puts " + Internal links: #{'None' if result[:links].empty?}"
+        result[:links].each { |link| puts " - #{link}" }
+
+        puts " + Internal images: #{'None' if result[:assets][:images].empty?}"
+        result[:assets][:images].each { |link| puts " - #{link}" }
+
+        puts " + Internal stylesheets: #{'None' if result[:assets][:stylesheets].empty?}"
+        result[:assets][:stylesheets].each { |link| puts " - #{link}" }
+
+        puts " + Internal scripts: #{'None' if result[:assets][:scripts].empty?}"
+        result[:assets][:scripts].each { |link| puts " - #{link}" }
+        puts ""
+      end
+      self.draw_line
+    end
+
+    ###
+    # Log current search status (crawled links / total links)
+    #
+    def self.log_status url, crawl_results_length, links_length
+      text = "Crawled #{crawl_results_length.to_s}/#{links_length.to_s}: #{url}"
+      print "\r#{" "*100}\r" # Clean the previous text
+      print (text.length <= 50) ? text : "#{text[0..46]}..."
+      STDOUT.flush
+    end
+
+    ###
+    # Display final crawling summary after site crawling complete
+    #
+    def self.crawling_summary_notice total_time, threads_count, links_count
+      puts
+      self.draw_line
+      puts "\e[33m[SUCCESS]\e[0m Crawled #{links_count} links in #{total_time.to_f.to_s} seconds using #{threads_count} threads."
+      puts "Use .crawl_results to access the crawl results as an array of hashes."
+      puts "Use .render to see the crawl_results as a sitemap."
+      self.draw_line
+    end
+
+    private
+
+    ###
+    # Draw a line (because readability is also important!!)
+    #
+    def self.draw_line
+      puts "#{'-' * 80}"
+    end
+
+
+  end
+
+end
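Because every method on `Render` is a class method, the helpers can also be called directly, which is exactly how `Crawl` uses them above. A small sketch with illustrative values:

    require 'super_crawler/render'

    SuperCrawler::Render.error "`asset` parameter can only be `images`, `stylesheets` or `scripts`"
    # prints "[ERROR] ..." with the tag colored red

    SuperCrawler::Render.log_status('http://example.com/about', 12, 48)
    # prints "Crawled 12/48: http://example.com/about" on a single, refreshed line

    SuperCrawler::Render.crawling_summary_notice(3.2, 10, 48)
    # prints the "[SUCCESS] Crawled 48 links in 3.2 seconds using 10 threads." summary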
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: super_crawler
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
 platform: ruby
 authors:
 - Hassen Taidirt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-07-
+date: 2016-07-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -100,6 +100,7 @@ files:
 - bin/setup
 - lib/super_crawler.rb
 - lib/super_crawler/crawl.rb
+- lib/super_crawler/render.rb
 - lib/super_crawler/scrap.rb
 - lib/super_crawler/version.rb
 - super_crawler.gemspec
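For consumers, picking up the new `render.rb` file and the refactored API is a one-line Bundler change (a standard Gemfile entry, not something specific to this diff):

    gem 'super_crawler', '~> 0.2.1'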