RubyGems - creepy-crawler - Versions diffs - 1.0.0 - Mend

creepy-crawler 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

checksums.yaml +15 -0
data/.gitignore +1 -0
data/.travis.yml +5 -0
data/DOCKER.md +34 -0
data/Gemfile +2 -0
data/Gemfile.lock +53 -0
data/README.md +97 -0
data/Rakefile +6 -0
data/docker/Dockerfile +33 -0
data/docker/neo4j-server.properties +91 -0
data/docker/startup.sh +15 -0
data/examples/crawl.rb +10 -0
data/examples/output_map.png +0 -0
data/lib/creepy-crawler.rb +61 -0
data/lib/creepy-crawler/graph.rb +31 -0
data/lib/creepy-crawler/page.rb +68 -0
data/lib/creepy-crawler/site.rb +115 -0
data/spec/creepy-crawler_spec.rb +17 -0
data/spec/dummypage_helper.rb +13 -0
data/spec/graph_spec.rb +67 -0
data/spec/page_spec.rb +95 -0
data/spec/site_spec.rb +119 -0
data/spec/spec_helper.rb +4 -0
metadata +236 -0

checksums.yaml ADDED

@@ -0,0 +1,15 @@
+---
+!binary "U0hBMQ==":
+  metadata.gz: !binary |-
+    MGQ0ZWRlNjU1NjljMGU5MmI3YWQ4YmU3OTJiNTRkOWU2MTE5YTEzYQ==
+  data.tar.gz: !binary |-
+    MDNkZGI0ZjUxOTliYmY0MzJmNzBkZjgyMzEyMzYzZTVkZDg1MTY3OA==
+SHA512:
+  metadata.gz: !binary |-
+    MDlmMjk0MDNhMGI4NDg5YzkyOGM4YzQ5YjgyMWE2YjljYTBlZjgyMmI5NDBi
+    MTdjOTNjOGZkYzcyMDE2ZmNhOGM5NmVmYWYzYTY4MTI4OTA0YThjMWE5Mjhh
+    ZTkxZTNkYzI3MTdhMmYwOTM4MmIzODhkZDA4YmM5ODdjYjZjM2Y=
+  data.tar.gz: !binary |-
+    ZmEyYzdhYjc3ZjgwNTdjM2EwN2MwOGFjNjFhZGUxZmQyZDhlY2UxNTk1MWNk
+    NWZkMmRlMDFhMGRjYjdmOWVlMGEwODgwOTJiMDhjZGM1OTkzMDM5MzRjNTQ5
+    ZWY1N2RiMjM1OGU1YjU3YWNmMTUyNzY3ODQ0NDdiMWNhNjc1NmM=
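
The `!binary` values in checksums.yaml are Base64-encoded strings. As a minimal sketch (the helper name `decode_field` is hypothetical, and the two constants are simply copied from the hunk above), they can be decoded with Ruby's core `String#unpack1`:

```ruby
# Values copied verbatim from the checksums.yaml hunk above.
ALGORITHM_KEY = "U0hBMQ=="
METADATA_SHA1 = "MGQ0ZWRlNjU1NjljMGU5MmI3YWQ4YmU3OTJiNTRkOWU2MTE5YTEzYQ=="

# decode_field is a hypothetical helper name, not part of RubyGems.
# The "m" directive decodes Base64 without requiring any library.
def decode_field(encoded)
  encoded.unpack1("m")
end

puts decode_field(ALGORITHM_KEY)  # the digest algorithm name
puts decode_field(METADATA_SHA1)  # the hex digest recorded for metadata.gz
```

The first key decodes to the algorithm name and the second to a hex digest, which is why the SHA512 entries span several Base64 lines.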

data/.gitignore ADDED

@@ -0,0 +1 @@
+neo4j/

data/.travis.yml ADDED

@@ -0,0 +1,5 @@
+language: ruby
+rvm:
+  - 1.9.3
+# uncomment this line if your project needs to run something other than `rake`:
+script: bundle exec rspec spec

data/DOCKER.md ADDED

@@ -0,0 +1,34 @@
+creepy-crawler - containerized
+==============
+I have containerized creepy-crawler using [Docker](http://docker.io) for testing, portability and because Docker is awesome
+##Installation
+####Clone
+    git clone https://github.com/udryan10/creepy-crawler.git
+####Build docker image
+    cd creepy-crawler/docker/ && docker build -t "creepy-crawler:1.0" .
+##Run
+    # map neo4j's web interface port 7474 in the container to the host port 7474 for access
+    docker run -i -p 7474:7474 creepy-crawler:1.0
+It takes a bit of time for crawling information to be displayed to stdout when running in docker.
+##Output
+creepy-crawler uses the neo4j graph database to store and display the site map. When the crawl is complete, the docker container is set to loop indefinitely to provide access to the graph data. Without this loop, the container would shut down and the data would not be accessible.
+### Web interface
+View crawl data, stored in neo4j running inside of the container at: <code>http://\<docker_host\>:7474/webadmin</code>
+Instructions on how to get to the graph data exist in the [README](https://github.com/udryan10/creepy-crawler#web-interface)
+###boot2docker
+If you are running docker on a Mac using boot2docker, you will have to instruct VirtualBox to forward the correct port to the docker host:
+    VBoxManage modifyvm "boot2docker-vm" --natpf1 "neo4j,tcp,,7474,,7474"
+##Stopping
+    docker stop <container id>
+**Plans to add additional documentation for those less familiar with Docker

data/Gemfile ADDED

@@ -0,0 +1,2 @@
+source 'http://rubygems.org'
+gemspec

data/Gemfile.lock ADDED

@@ -0,0 +1,53 @@
+PATH
+  remote: .
+  specs:
+    creepy-crawler (1.0.0)
+      addressable (~> 2.3, >= 2.3.6)
+      neography (~> 1.4, >= 1.4.1)
+      nokogiri (~> 1.6, >= 1.6.1)
+      open_uri_redirections (= 0.1.4)
+      rake (~> 0.8)
+      trollop (~> 2.0)
+      webrobots (~> 0.1, >= 0.1.1)
+GEM
+  remote: http://rubygems.org/
+  specs:
+    addressable (2.3.6)
+    diff-lcs (1.2.5)
+    httpclient (2.3.4.1)
+    json (1.8.1)
+    mini_portile (0.5.3)
+    multi_json (1.9.3)
+    neography (1.4.1)
+      httpclient (>= 2.3.3)
+      json (>= 1.7.7)
+      multi_json (>= 1.3.2)
+      os (>= 0.9.6)
+      rake (>= 0.8.7)
+      rubyzip (>= 1.0.0)
+    nokogiri (1.6.1)
+      mini_portile (~> 0.5.0)
+    open_uri_redirections (0.1.4)
+    os (0.9.6)
+    rake (0.9.2.2)
+    rspec (2.14.1)
+      rspec-core (~> 2.14.0)
+      rspec-expectations (~> 2.14.0)
+      rspec-mocks (~> 2.14.0)
+    rspec-core (2.14.8)
+    rspec-expectations (2.14.5)
+      diff-lcs (>= 1.1.3, < 2.0)
+    rspec-mocks (2.14.6)
+    rubyzip (1.1.3)
+    trollop (2.0)
+    webrobots (0.1.1)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  bundler (~> 1.5)
+  creepy-crawler!
+  rspec (~> 2.14)
+  rspec-core (~> 2.14)

data/README.md ADDED

@@ -0,0 +1,97 @@
+creepy-crawler
+==============
+Ruby web crawler that takes a url as input and produces a sitemap using a neo4j graph database - Nothing creepy about it.
+[![Build Status](https://travis-ci.org/udryan10/creepy-crawler.svg?branch=master)](https://travis-ci.org/udryan10/creepy-crawler)
+##Installation
+####Clone
+    git clone https://github.com/udryan10/creepy-crawler.git
+####Install Required Gems
+    bundle install
+####Install graph database
+    rake neo4j:install
+####Start graph database
+    rake neo4j:start
+####Requirements
+1. Gems listed in Gemfile
+2. Ruby 1.9+
+3. neo4j
+4. Oracle jdk7 (for neo4j graph database)
+5. lsof (for neo4j graph database)
+##Usage
+###Code
+####Require
+    require './creepy-crawler'
+####Start a crawl
+    Creepycrawler.crawl("http://example.com")
+####Limit number of pages to crawl
+    Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
+####Extract some (potentially) useful statistics
+    crawler = Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
+    # list of broken links
+    puts crawler.broken_links
+    # list of sites that were visited
+    puts crawler.visited_queue
+    # count of crawled pages
+    puts crawler.page_crawl_count
+####Options
+    DEFAULT_OPTIONS = {
+      # whether to print crawling information
+      :verbose => true,
+      # whether to obey robots.txt
+      :obey_robots => true,
+      # maximum number of pages to crawl, value of nil will attempt to crawl all pages
+      :max_page_crawl => nil,
+      # should pages be written to the database. Likely only used for testing, but may be used if you only wanted to get at the broken_links data
+      :graph_to_neo4j => true
+    }
+####Example
+examples located in <code>examples/</code> directory
+###Command line
+    # Crawl site
+    ruby creepy-crawler.rb --site "http://google.com"
+    # Get command options
+    ruby creepy-crawler.rb --help
+**Note:** If behind a proxy, export your proxy environment variables
+    export http_proxy=<proxy_host>; export https_proxy=<proxy_host>
+###Docker
+For testing, I have included the ability to run the environment and a crawl inside of a [docker container](https://github.com/udryan10/creepy-crawler/blob/master/DOCKER.md)
+##Output
+creepy-crawler uses [neo4j](http://www.neo4j.org/) graph database to store and display the site map.
+### Web interface
+neo4j has a web interface for viewing and interacting with the graph data. When running on local host, visit: [http://localhost:7474/webadmin/](http://localhost:7474/webadmin/)
+1. Click the Data Browser tab
+2. Enter Query to search for nodes (will search all nodes):
+    <code>
+    START root=node(*)
+    RETURN root
+    </code>
+3. Click into a node
+4. Click switch view mode at top right to view a graphical map
+**Note:** to have the map display url names instead of node numbers, you must create a style
+### REST interface
+neo4j also has a full [REST API](http://docs.neo4j.org/chunked/stable/rest-api.html) for programmatic access to the data
+###Example Output Map
+![Output Map](https://raw.githubusercontent.com/udryan10/creepy-crawler/master/examples/output_map.png)
+##TODO
+1. convert to gem
+2. multi-threaded to increase crawl performance

data/Rakefile ADDED

@@ -0,0 +1,6 @@
+require 'neography/tasks'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/docker/Dockerfile ADDED

@@ -0,0 +1,33 @@
+FROM ubuntu:latest
+MAINTAINER Ryan Grothouse <rgrothouse@gmail.com>
+ENV DEBIAN_FRONTEND noninteractive
+ENV NOKOGIRI_USE_SYSTEM_LIBRARIES 1
+RUN apt-get -q update
+RUN apt-get -y upgrade
+# Nokogiri dependencies
+RUN apt-get install -qy --force-yes git ruby ruby-dev build-essential
+RUN apt-get install -qy libxslt1-dev libxml2-dev libssl-dev libyaml-dev
+RUN apt-get install -qy --no-install-recommends openjdk-7-jdk openjdk-7-jre
+# Neo4j dependencies
+RUN apt-get install -qy curl lsof
+RUN gem install bundler
+RUN git clone https://github.com/udryan10/creepy-crawler /opt/creepy-crawler
+WORKDIR /opt/creepy-crawler
+RUN bundle install
+RUN rake neo4j:install
+ADD ./startup.sh /startup.sh
+ADD ./neo4j-server.properties /opt/creepy-crawler/neo4j/conf/neo4j-server.properties
+RUN chmod u+x /startup.sh
+EXPOSE 7474
+ENTRYPOINT ["/startup.sh"]

data/docker/neo4j-server.properties ADDED

@@ -0,0 +1,91 @@
+################################################################
+# Neo4j configuration
+#
+################################################################
+#***************************************************************
+# Server configuration
+#***************************************************************
+# location of the database directory
+org.neo4j.server.database.location=data/graph.db
+# Let the webserver only listen on the specified IP. Default is localhost (only
+# accept local connections). Uncomment to allow any connection. Please see the
+# security section in the neo4j manual before modifying this.
+org.neo4j.server.webserver.address=0.0.0.0
+#
+# HTTP Connector
+#
+# http port (for all data, administrative, and UI access)
+org.neo4j.server.webserver.port=7474
+#
+# HTTPS Connector
+#
+# Turn https-support on/off
+org.neo4j.server.webserver.https.enabled=true
+# https port (for all data, administrative, and UI access)
+org.neo4j.server.webserver.https.port=7473
+# Certificate location (auto generated if the file does not exist)
+org.neo4j.server.webserver.https.cert.location=conf/ssl/snakeoil.cert
+# Private key location (auto generated if the file does not exist)
+org.neo4j.server.webserver.https.key.location=conf/ssl/snakeoil.key
+# Internally generated keystore (don't try to put your own
+# keystore there, it will get deleted when the server starts)
+org.neo4j.server.webserver.https.keystore.location=data/keystore
+#*****************************************************************
+# Administration client configuration
+#*****************************************************************
+# location of the servers round-robin database directory. Possible values:
+# - absolute path like /var/rrd
+# - path relative to the server working directory like data/rrd
+# - commented out, will default to the database data directory.
+org.neo4j.server.webadmin.rrdb.location=data/rrd
+# REST endpoint for the data API
+# Note the / in the end is mandatory
+org.neo4j.server.webadmin.data.uri=/db/data/
+# REST endpoint of the administration API (used by Webadmin)
+org.neo4j.server.webadmin.management.uri=/db/manage/
+# Low-level graph engine tuning file
+org.neo4j.server.db.tuning.properties=conf/neo4j.properties
+# The console services to be enabled
+org.neo4j.server.manage.console_engines=shell
+# Comma separated list of JAX-RS packages containing JAX-RS resources, one
+# package name for each mountpoint. The listed package names will be loaded
+# under the mountpoints specified. Uncomment this line to mount the
+# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
+# neo4j-examples under /examples/unmanaged, resulting in a final URL of
+# http://localhost:7474/examples/unmanaged/helloworld/{nodeId}
+#org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged
+#*****************************************************************
+# HTTP logging configuration
+#*****************************************************************
+# HTTP logging is disabled. HTTP logging can be enabled by setting this
+# property to 'true'.
+org.neo4j.server.http.log.enabled=false
+# Logging policy file that governs how HTTP log output is presented and
+# archived. Note: changing the rollover and retention policy is sensible, but
+# changing the output format is less so, since it is configured to use the
+# ubiquitous common log format
+org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml

data/docker/startup.sh ADDED

@@ -0,0 +1,15 @@
+#!/bin/sh
+MAX_PAGE_CRAWL=50
+CRAWL_URL="http://www.yahoo.com"
+echo "Crawler is set to crawl ${CRAWL_URL}"
+echo "Crawler is set to crawl ${MAX_PAGE_CRAWL} pages"
+cd /opt/creepy-crawler && rake neo4j:start
+ruby /opt/creepy-crawler/lib/creepy-crawler.rb --site $CRAWL_URL --max-page-crawl $MAX_PAGE_CRAWL
+echo "==============================================="
+echo "Crawl is complete!"
+echo "To see graph data visit http://<docker_host>:7474/webadmin"
+echo "Sleeping indefinitely to allow graph data to be viewed"
+echo "To stop container execute: docker stop <container id>"
+while true; do sleep 10000; done

data/examples/crawl.rb ADDED

@@ -0,0 +1,10 @@
+require_relative '../lib/creepy-crawler'
+crawler = Creepycrawler.crawl("http://yahoo.com", :max_page_crawl => 100)
+puts "=" * 40
+puts "broken link list:"
+puts crawler.broken_links
+puts "=" * 40
+puts "number of visited pages:"
+puts crawler.page_crawl_count
+puts "=" * 40

data/examples/output_map.png ADDED

Binary file

data/lib/creepy-crawler.rb ADDED

@@ -0,0 +1,61 @@
+require 'rubygems'
+require 'bundler/setup'
+require 'neography'
+require 'nokogiri'
+require 'open-uri'
+require 'addressable/uri'
+require 'open_uri_redirections'
+require 'webrobots'
+require 'trollop'
+require_relative 'creepy-crawler/site'
+require_relative 'creepy-crawler/page'
+require_relative 'creepy-crawler/graph'
+module Creepycrawler
+  # todo: on my local machine im hitting some openssl bug to where legitimate https sites are not validating the certificate.
+  # For now, keeping this to override the verification but need to investigate further and remove this hack.
+  OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
+  # configure Neography - for now uses all defaults and expects neo4j to be running on localhost
+  Neography.configure do |config|
+    config.protocol           = "http://"
+    config.server             = "localhost"
+    config.port               = 7474
+    config.directory          = ""  # prefix this path with '/'
+    config.cypher_path        = "/cypher"
+    config.gremlin_path       = "/ext/GremlinPlugin/graphdb/execute_script"
+    config.log_file           = "neography.log"
+    config.log_enabled        = false
+    config.slow_log_threshold = 0    # time in ms for query logging
+    config.max_threads        = 20
+    config.authentication     = nil  # 'basic' or 'digest'
+    config.username           = nil
+    config.password           = nil
+    config.parser             = MultiJsonParser
+  end
+  # class method to start a crawl
+  def Creepycrawler.crawl(url, options = {})
+    return Site.new(url, options).crawl
+  end
+end
+# allow the initiating of a crawl from command line
+if __FILE__==$0
+  # setup options
+  opts = Trollop::options do
+    opt :site, "Url of site to crawl", :type => :string  # string --site <s>, default nil
+    opt :obey_robots, "Obey robots.txt disallow list"    # flag --obey-robots, default false
+    opt :verbose, "Whether to print crawling information", :default => true
+    opt :max_page_crawl, "Maximum number of pages to crawl. Defaults to unlimited", :default => 0
+    opt :graph_to_neo4j, "Whether pages should be written to graph database", :default => true
+  end
+  Trollop::die :site, "Must specify a site to crawl" unless opts[:site]
+  opts[:max_page_crawl] = nil if opts[:max_page_crawl] == 0
+  options_hash = {:obey_robots => opts[:obey_robots], :verbose => opts[:verbose], :max_page_crawl => opts[:max_page_crawl], :graph_to_neo4j => opts[:graph_to_neo4j]}
+  # start crawl
+  Creepycrawler.crawl(opts[:site], options_hash)
+end
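
The command-line block above always passes all four keys to `Creepycrawler.crawl`, so the values parsed from the command line override the library defaults listed in the README. The sketch below illustrates that merge under two assumptions: `effective_options` is a hypothetical helper (the gem does this inline), and the `parsed` hash is an assumed parse result for a run that supplies only `--site`, with the `--obey-robots` flag parsing to `false` when omitted.

```ruby
# Defaults as listed under "Options" in the README hunk.
DEFAULT_OPTIONS = {
  :verbose        => true,
  :obey_robots    => true,
  :max_page_crawl => nil,
  :graph_to_neo4j => true
}

# effective_options is a hypothetical helper combining the two steps the
# diff performs separately: mapping 0 to nil (unlimited), then merging
# the caller's options over the defaults.
def effective_options(cli_opts)
  opts = cli_opts.dup
  opts[:max_page_crawl] = nil if opts[:max_page_crawl] == 0
  DEFAULT_OPTIONS.merge(opts)
end

# Assumed parse result when only --site is supplied on the command line.
parsed = { :obey_robots => false, :verbose => true,
           :max_page_crawl => 0, :graph_to_neo4j => true }

p effective_options(parsed)
```

Under those assumptions, a command-line crawl would not obey robots.txt unless `--obey-robots` is passed, whereas a call from Ruby code with no options would.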

data/lib/creepy-crawler/graph.rb ADDED

@@ -0,0 +1,31 @@
+module Creepycrawler
+  # Class that takes care of writing to our graph database (neo4j)
+  class Graph
+    def initialize
+      @neo4j = Neography::Rest.new
+    end
+    # add page to graph database
+    def add_page(url)
+      # if page doesnt exist, add it to neo4j
+      begin
+        node = @neo4j.get_node_index("page", "url", url)
+      rescue Neography::NotFoundException => e
+        node = nil
+      end
+      # node doesnt exist, create it
+      if node.nil?
+        node = @neo4j.create_node("url" => url)
+        @neo4j.add_node_to_index("page", "url", url, node) unless node.nil?
+      end
+      return node
+    end
+    def create_relationship(type,from,to)
+      @neo4j.create_relationship(type, from, to)
+    end
+  end
+end
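
`Graph#add_page` follows a get-or-create pattern: look the URL up in the index and create the node only when the lookup finds nothing. A minimal sketch of the same pattern without neo4j, where `InMemoryGraph` is a hypothetical stand-in that keeps nodes in a Hash (it is not part of the gem and makes no Neography calls):

```ruby
# InMemoryGraph is a hypothetical stand-in for Creepycrawler::Graph,
# useful for seeing the get-or-create behaviour without a database.
class InMemoryGraph
  attr_reader :relationships

  def initialize
    @index = {}          # url => node hash
    @relationships = []  # [type, from_node, to_node] triples
  end

  # Return the existing node for url, or create and index a new one.
  def add_page(url)
    @index[url] ||= { "url" => url }
  end

  def create_relationship(type, from, to)
    @relationships << [type, from, to]
  end

  def node_count
    @index.size
  end
end

graph = InMemoryGraph.new
home  = graph.add_page("http://example.com/")
about = graph.add_page("http://example.com/about")
graph.add_page("http://example.com/")  # second call reuses the first node
graph.create_relationship("links_to", home, about)
puts graph.node_count
```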

data/lib/creepy-crawler/page.rb ADDED

@@ -0,0 +1,68 @@
+module Creepycrawler
+  # Represents a webpage and the methods to extract the details we need for our crawler
+  class Page
+    # page url
+    attr_accessor :url
+    # page html
+    attr_reader :body
+    def initialize(url)
+      @url = Addressable::URI.parse(url).normalize
+      @robotstxt = WebRobots.new("CreepyCrawler")
+    end
+    def body=(body)
+      # convert to Nokogiri object
+      @body = Nokogiri::HTML(body)
+    end
+    # retrieve page
+    def fetch
+      @body = Nokogiri::HTML(open(@url, :allow_redirections => :all))
+    end
+    # return all links on page
+    def links
+      # if we haven't fetched the page, get it
+      fetch if @body.nil?
+      # using nokogiri, find all anchor elements
+      hyperlinks = @body.css('a')
+      # get array of links on page - remove any empty links or links that are invalid
+      @links = hyperlinks.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if do |href|
+        # if href is empty, points to an anchor, mailto or ftp delete
+        invalid = true if href.empty? or /^#/ =~ href or /^mailto:/ =~ href or /^ftp:/ =~ href or /^javascript:/ =~ href
+        # if Addressable throws an exception, we have an invalid link - delete
+        begin
+          Addressable::URI.parse(href)
+        rescue
+          invalid = true
+        end
+        invalid
+      end
+      # map all links to absolute
+      @links.map{|link| relative_to_absolute_link(link)}
+    end
+    def relative_to_absolute_link(link)
+      uri = Addressable::URI.parse(link).normalize
+      # this url was relative, prepend our known domain
+      if uri.host.nil?
+        return (@url + uri.path).to_s
+      else
+        # the url was already absolute - leave as is
+        return uri.to_s
+      end
+    end
+    def robots_disallowed?
+      return @robotstxt.disallowed?(@url)
+    end
+  end
+end
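
`Page#links` filters out empty, anchor, mailto, ftp and javascript hrefs, then resolves what is left against the page URL. The sketch below shows the same two steps using only Ruby's standard `uri` library in place of Addressable, so resolution details may differ from the gem; `absolute_links` is a hypothetical helper name.

```ruby
require 'uri'

# Prefixes that Page#links in the diff treats as invalid.
SKIPPED_PREFIXES = [/\A#/, /\Amailto:/, /\Aftp:/, /\Ajavascript:/]

# absolute_links is a hypothetical helper, not part of the gem.
# It drops unusable hrefs and resolves the rest against base_url.
def absolute_links(base_url, hrefs)
  candidates = hrefs.uniq.sort.reject do |href|
    href.empty? || SKIPPED_PREFIXES.any? { |pattern| pattern =~ href }
  end

  resolved = candidates.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil  # unparseable href: treat as invalid, as the gem does
    end
  end

  resolved.compact
end

puts absolute_links("http://example.com/", ["/1", "foo.html", "#", "", "http://remote.com/3"])
```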

data/lib/creepy-crawler/site.rb ADDED

@@ -0,0 +1,115 @@
+module Creepycrawler
+  # object to handle the discovery of our site through crawling
+  class Site
+    # the site domain
+    attr_reader :domain
+    # url the crawl began with
+    attr_reader :url
+    # hash of additional options to be passed in
+    attr_reader :options
+    # queue used to store discovered pages and crawl the site
+    attr_reader :crawl_queue
+    # queue used to store visited pages
+    attr_reader :visited_queue
+    # number of pages crawled
+    attr_reader :page_crawl_count
+    # holds the root node information
+    attr_reader :root_node
+    # holds dead or broken links
+    attr_reader :broken_links
+    DEFAULT_OPTIONS = {
+      # whether to print crawling information
+      :verbose => true,
+      # whether to obey robots.txt
+      :obey_robots => true,
+      # maximum number of pages to crawl, value of nil will attempt to crawl all pages
+      :max_page_crawl => nil,
+      # should pages be written to the database. Likely only used for testing, but may be used if you only wanted to get at the broken_links data
+      :graph_to_neo4j => true
+    }
+    # create setter methods for each default option
+    DEFAULT_OPTIONS.keys.each do |option|
+      define_method "#{option}=" do |value|
+        @options[option.to_sym] = value
+      end
+    end
+    def initialize(url, options = {})
+      response = open(url, :allow_redirections => :all)
+      url_parsed = Addressable::URI.parse(response.base_uri)
+      @domain = url_parsed.host
+      @url = url_parsed.to_s
+      @page_crawl_count = 0
+      @options = options
+      # add the initial url to our crawl queue
+      @crawl_queue = [@url]
+      @broken_links = []
+      @visited_queue = []
+      @graph = Creepycrawler::Graph.new
+    end
+    def crawl
+      # merge default and passed in options into one hash
+      @options = DEFAULT_OPTIONS.merge(@options)
+      # begin crawl loop
+      loop do
+        # break if we have crawled all sites, or reached :max_page_crawl
+        break if @crawl_queue.empty? or (!options[:max_page_crawl].nil? and @page_crawl_count >= @options[:max_page_crawl])
+        begin
+          # pull next page from crawl_queue and setup page
+          page = Page.new(@crawl_queue.shift)
+          # add url to visited queue to keep track of where we have been
+          @visited_queue.push(page.url.to_s)
+          # respect robots.txt
+          if @options[:obey_robots] and page.robots_disallowed?
+            puts "Not crawling #{page.url} per Robots.txt request" if options[:verbose]
+            next
+          end
+          puts "Crawling and indexing: #{page.url}" if @options[:verbose]
+          # retrieve page
+          page.fetch
+          current_page_node = @graph.add_page(page.url) if @options[:graph_to_neo4j]
+          #todo: fix this. on first run current_page_node is a hash. subsequent is an array of hashes
+          @root_node = current_page_node if @page_crawl_count == 0 and @options[:graph_to_neo4j]
+          # Loop through all links on the current page
+          page.links.each do |link|
+            # add to crawl queue - only push local links, links that do not yet exist in the queue and links that haven't been visited
+            @crawl_queue.push(link) if local? link and !@crawl_queue.include? link and !@visited_queue.include? link.to_s
+            # add link page to graph
+            current_link_node = @graph.add_page(link) if @options[:graph_to_neo4j]
+            # create a links_to relationship from the current page node to link node
+            @graph.create_relationship("links_to", current_page_node, current_link_node) if @options[:graph_to_neo4j]
+          end
+        rescue => e
+          puts "Exception thrown: #{e.message} - Skipping Page" if @options[:verbose]
+          @broken_links.push(page.url)
+          next
+        end
+        @page_crawl_count += 1
+      end # end of loop
+      return self
+    end
+    # is link local to site?
+    def local?(link)
+      uri = Addressable::URI.parse(link)
+      return true if uri.host == @domain
+      return false
+    end
+  end
+end
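
The loop in `Site#crawl` is a breadth-first traversal: take the next URL off the queue, record it as visited, and enqueue any local link that is neither queued nor visited. A simplified sketch follows, assuming a made-up link map (`FAKE_SITE`) instead of real HTTP fetches and a hypothetical function `crawl_order`; robots.txt checks, graph writes and error handling are left out.

```ruby
require 'uri'

# FAKE_SITE is made-up data: each URL maps to the links found on that page.
FAKE_SITE = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b", "http://other.com/"],
  "http://example.com/a" => ["http://example.com/", "http://example.com/b"],
  "http://example.com/b" => []
}

# crawl_order is a hypothetical function returning URLs in visit order.
# max_pages of nil means no limit, mirroring :max_page_crawl => nil.
def crawl_order(start_url, link_map, max_pages = nil)
  domain  = URI.parse(start_url).host
  queue   = [start_url]
  visited = []
  crawled = 0

  until queue.empty? || (max_pages && crawled >= max_pages)
    url = queue.shift
    visited << url

    link_map.fetch(url, []).each do |link|
      local = URI.parse(link).host == domain
      queue << link if local && !queue.include?(link) && !visited.include?(link)
    end

    crawled += 1
  end

  visited
end

puts crawl_order("http://example.com/", FAKE_SITE)
```

Passing a limit, as in `crawl_order("http://example.com/", FAKE_SITE, 2)`, stops the traversal after two pages even though the queue is not yet empty.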

data/spec/creepy-crawler_spec.rb ADDED

@@ -0,0 +1,17 @@
+require 'spec_helper'
+module Creepycrawler
+  describe Creepycrawler do
+    describe "#crawl" do
+      it "should have a crawl convenience method to crawl the site and return a Site object" do
+        result = Creepycrawler.crawl(RSPEC_URL,:graph_to_neo4j => false, :verbose => false)
+        expect(result).to be_an_instance_of Site
+      end
+      it "should have a crawl convenience method that accepts options to crawl the site and return a Site object" do
+        result = Creepycrawler.crawl(RSPEC_URL, :verbose => false, :graph_to_neo4j => false)
+        expect(result).to be_an_instance_of Site
+      end
+    end
+  end
+end

data/spec/dummypage_helper.rb ADDED

@@ -0,0 +1,13 @@
+module Creepycrawler
+  class Dummypage
+    attr_accessor :body
+    def initialize(link_array)
+      @body = "<html><body>"
+      link_array.each do |link|
+        @body += "<a href = '#{link}'> here </a>"
+      end
+      @body += "</body></html>"
+    end
+  end
+end

data/spec/graph_spec.rb ADDED

@@ -0,0 +1,67 @@
+require 'spec_helper'
+module Creepycrawler
+  describe Graph do
+    describe "#new" do
+      it "should return a Graph object" do
+        expect(Graph.new).to be_an_instance_of(Creepycrawler::Graph)
+      end
+    end
+    describe "#add_page" do
+      it "should create a node if the node doesnt exist in the graph (returning nil)" do
+        # mock out Neography::Rest object to allow us to return desired data
+        neography = double(Neography::Rest, :get_node_index => nil, :create_node => [], :add_node_to_index => {})
+        allow(Neography::Rest).to receive(:new) {neography}
+        expect(neography).to receive(:get_node_index).with("page","url",RSPEC_URL)
+        expect(neography).to receive(:create_node).with({"url" => RSPEC_URL})
+        expect(neography).to receive(:add_node_to_index).with("page","url",RSPEC_URL,an_instance_of(Array))
+        Graph.new.add_page RSPEC_URL
+      end
+      it "should create a node if it doesnt exist in the graph (raising exception Neography::NotFoundException)" do
+        # mock out Neography::Rest object to allow us to return desired data
+        neography = double(Neography::Rest, :create_node => [], :add_node_to_index => {})
+        allow(Neography::Rest).to receive(:new) {neography}
+        allow(neography).to receive(:get_node_index).and_raise(Neography::NotFoundException)
+        expect(neography).to receive(:get_node_index).with("page","url",RSPEC_URL)
+        expect(neography).to receive(:create_node).with({"url" => RSPEC_URL})
+        expect(neography).to receive(:add_node_to_index).with("page","url",RSPEC_URL,an_instance_of(Array))
+        Graph.new.add_page RSPEC_URL
+      end
+      it "should not create the node if the node exists" do
+        # mock out Neography::Rest object to allow us to return desired data
+        neography = double(Neography::Rest, :get_node_index => [], :create_node => [], :add_node_to_index => {})
+        allow(Neography::Rest).to receive(:new) {neography}
+        expect(neography).to receive(:get_node_index).with("page","url",RSPEC_URL)
+        expect(neography).to receive(:create_node).never
+        expect(neography).to receive(:add_node_to_index).never
+        Graph.new.add_page RSPEC_URL
+      end
+      it "should return the node array when called" do
+        # mock out Neography::Rest object to allow us to return desired data
+        neography = double(Neography::Rest, :get_node_index => nil, :create_node => [], :add_node_to_index => {})
+        allow(Neography::Rest).to receive(:new) {neography}
+        expect(neography).to receive(:get_node_index).with("page","url",RSPEC_URL)
+        expect(neography).to receive(:create_node).with({"url" => RSPEC_URL})
+        expect(neography).to receive(:add_node_to_index).with("page","url",RSPEC_URL,an_instance_of(Array))
+        expect(Graph.new.add_page RSPEC_URL).to be_an(Array)
+      end
+    end
+    describe "#create_relationship" do
+      it "should create relationship between nodes" do
+        # mock out Neography::Rest object to allow us to return desired data
+        neography = double(Neography::Rest, :create_relationship => {})
+        allow(Neography::Rest).to receive(:new) {neography}
+        expect(neography).to receive(:create_relationship).with("links_to","https://dummy.com",RSPEC_URL)
+        Graph.new.create_relationship("links_to", "https://dummy.com", RSPEC_URL)
+      end
+    end
+  end
+end

data/spec/page_spec.rb ADDED

@@ -0,0 +1,95 @@
+require 'spec_helper'
+module Creepycrawler
+  describe Page do
+    before :each do
+      @page = Page.new(RSPEC_URL)
+    end
+    describe "#new" do
+      it "should accept a url and return a page object" do
+        expect(@page).to be_an_instance_of Creepycrawler::Page
+      end
+      it "should raise error on bad url" do
+        expect { Page.new("http://?bad_uri") }.to raise_error
+      end
+    end
+    describe "#fetch" do
+      it "should return a nokogiri object" do
+        expect(@page.fetch).to be_a(Nokogiri::HTML::Document)
+      end
+    end
+    describe "#links" do
+      it "should return an array" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "http://remote.com/3"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links).to be_an(Array)
+      end
+      it "should return three links" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "http://remote.com/3"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links.length).to equal(3)
+      end
+      it "should not return links to itself or empty links" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "http://remote.com/3",
+          "#",
+          ""
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links.length).to equal(3)
+      end
+      it "should convert relative to absolute links" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "3/4",
+          "foo.html",
+          "http://remote.com/3"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links).to match_array(["#{RSPEC_URL}1", "#{RSPEC_URL}2", "http://remote.com/3", "#{RSPEC_URL}3/4", "#{RSPEC_URL}foo.html"])
+      end
+      it "should not pickup mailto links" do
+        dummy_page_link_array = [
+          "mailto:foo@example.com"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links).to be_empty
+      end
+      it "should not pickup ftp links" do
+        dummy_page_link_array = [
+          "ftp://example.com"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links).to be_empty
+      end
+      it "should not pickup links that execute javascript" do
+        dummy_page_link_array = [
+          "javascript:void(0)"
+        ]
+        @page.body = Dummypage.new(dummy_page_link_array).body
+        expect(@page.links).to be_empty
+      end
+    end
+  end
+end

data/spec/site_spec.rb ADDED

@@ -0,0 +1,119 @@
+require 'spec_helper'
+module Creepycrawler
+  describe Site do
+    before :each do
+      @site = Site.new(RSPEC_URL, :verbose => false, :graph_to_neo4j => false)
+    end
+    it "should accept options" do
+      @site = Site.new(RSPEC_URL, :foo => true)
+      expect(@site.options[:foo]).to be true
+    end
+    it "should allow the changing of default options" do
+      @site = Site.new(RSPEC_URL, :verbose => false)
+      expect(@site.options[:verbose]).to be false
+    end
+    describe "#new" do
+      it "should accept a url and return a site object" do
+        expect(@site).to be_an_instance_of Creepycrawler::Site
+      end
+    end
+    describe "#local?" do
+      it "should correctly recognize a local url" do
+        expect(@site.local?("#{RSPEC_URL}/foo")).to be true
+      end
+      it "should correctly recognize a non-local url" do
+        expect(@site.local?("http://non-local.com/")).to be false
+      end
+    end
+    describe "#crawl" do
+      it "should increment page_crawl_count with every indexed page" do
+        page = double(Creepycrawler::Page, :url => RSPEC_URL, :robots_disallowed? => false, :fetch => "", :links => [])
+        allow(Creepycrawler::Page).to receive(:new) {page}
+        expect(@site.crawl.page_crawl_count).to eq(1)
+      end
+      it "should obey robots.txt when not explicitly ignored" do
+        page = double(Creepycrawler::Page, :url => RSPEC_URL, :robots_disallowed? => true, :fetch => "", :links => [])
+        allow(Creepycrawler::Page).to receive(:new) {page}
+        expect(@site.crawl.page_crawl_count).to eq(0)
+      end
+      it "should add each visited site to visited_queue" do
+        page = double(Creepycrawler::Page, :url => RSPEC_URL, :robots_disallowed? => true, :fetch => "", :links => [])
+        allow(Creepycrawler::Page).to receive(:new) {page}
+        expect(@site.crawl.visited_queue).to match_array([RSPEC_URL])
+      end
+      it "should terminate when max_page_crawl is reached" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "/3",
+        ]
+        dummy_page = Page.new(RSPEC_URL)
+        dummy_page.body = Dummypage.new(dummy_page_link_array).body
+        @site = Site.new(RSPEC_URL, :verbose => false, :max_page_crawl => 2, :graph_to_neo4j => false)
+        page = double(Creepycrawler::Page, :url => RSPEC_URL, :robots_disallowed? => false, :fetch => "", :links => dummy_page.links)
+        allow(Creepycrawler::Page).to receive(:new) {page}
+        expect(@site.crawl.page_crawl_count).to eq(2)
+      end
+      it "should not visit the same page twice" do
+        dummy_page_link_array = [
+          "/1",
+          "/2",
+          "/2",
+        ]
+        dummy_page = Page.new(RSPEC_URL)
+        dummy_page.body = Dummypage.new(dummy_page_link_array).body
+        page = double(Creepycrawler::Page, :robots_disallowed? => false, :fetch => "", :links => dummy_page.links)
+        #allow(Creepycrawler::Page).to receive(:links) {["foo", "baz"]}
+        allow(Creepycrawler::Page).to receive(:new) do |arg|
+          # dynamically stub url to return url passed in initialization
+          allow(page).to receive(:url) {arg}
+          # return mock
+          page
+        end
+        expect(@site.crawl.page_crawl_count).to eq(3)
+        expect(@site.crawl.visited_queue).to match_array([RSPEC_URL, "#{RSPEC_URL}1", "#{RSPEC_URL}2"])
+      end
+      it "should not visit remote sites" do
+        dummy_page_link_array = [
+          "/1",
+          "http://remote.com/"
+        ]
+        dummy_page = Page.new(RSPEC_URL)
+        dummy_page.body = Dummypage.new(dummy_page_link_array).body
+        page = double(Creepycrawler::Page, :robots_disallowed? => false, :fetch => "", :links => dummy_page.links)
+        allow(Creepycrawler::Page).to receive(:new) do |url|
+          # dynamically stub url to return url passed in initialization
+          allow(page).to receive(:url) {url}
+          # return mock
+          page
+        end
+        expect(@site.crawl.page_crawl_count).to eq(2)
+        expect(@site.crawl.visited_queue).to match_array([RSPEC_URL, "#{RSPEC_URL}1"])
+      end
+      it "should add url to broken_links when an exception is thrown" do
+        page = double(Creepycrawler::Page, :url => RSPEC_URL, :robots_disallowed? => false, :fetch => "")
+        allow(Creepycrawler::Page).to receive(:new) {page}
+        allow(page).to receive(:fetch).and_raise('404 site not found')
+        expect(@site.crawl.broken_links).to match_array([RSPEC_URL])
+      end
+    end
+  end
+end
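
The `#crawl` specs above describe three constraints: stop once a page limit is reached, never visit the same URL twice, and never follow links to other hosts. The sketch below shows one way those rules could fit together in a breadth-first loop. It runs over an in-memory hash instead of fetching pages, and `crawl_sketch`, `link_map` and `max_pages` are illustrative names rather than the gem's API. Robots handling and broken-link tracking are left out.

```ruby
require 'set'
require 'uri'

# Hypothetical breadth-first crawl over an in-memory link map
# (url => array of absolute link URLs). Not the gem's Site#crawl.
def crawl_sketch(root_url, link_map, max_pages: nil)
  root_host = URI.parse(root_url).host
  queue     = [root_url]
  visited   = Set.new
  crawled   = 0

  until queue.empty?
    # Stop as soon as the optional page limit is reached.
    break if max_pages && crawled >= max_pages

    url = queue.shift
    next if visited.include?(url) # never process a URL twice

    visited << url
    crawled += 1

    link_map.fetch(url, []).each do |link|
      # Only follow links that stay on the starting host.
      next unless URI.parse(link).host == root_host
      queue << link unless visited.include?(link)
    end
  end

  { crawled: crawled, visited: visited.to_a }
end
```

A `Set` keeps the visited check cheap, and the queue may briefly hold duplicates because they are filtered when dequeued.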

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,4 @@
+require_relative '../lib/creepy-crawler'
+require_relative './dummypage_helper'
+RSPEC_URL = "http://example.com/"

metadata ADDED

@@ -0,0 +1,236 @@
+--- !ruby/object:Gem::Specification
+name: creepy-crawler
+version: !ruby/object:Gem::Version
+  version: 1.0.0
+platform: ruby
+authors:
+- Ryan Grothouse
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2014-05-10 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.5'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.5'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+- !ruby/object:Gem::Dependency
+  name: rspec-core
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.8'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.8'
+- !ruby/object:Gem::Dependency
+  name: neography
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.4'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.4.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.4'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.4.1
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.6.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.6.1
+- !ruby/object:Gem::Dependency
+  name: addressable
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.3'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 2.3.6
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.3'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 2.3.6
+- !ruby/object:Gem::Dependency
+  name: open_uri_redirections
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.1.4
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.1.4
+- !ruby/object:Gem::Dependency
+  name: webrobots
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.1'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 0.1.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.1'
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 0.1.1
+- !ruby/object:Gem::Dependency
+  name: trollop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.0'
+description: web crawler that generates a sitemap to a neo4j database. It will also
+  store broken_links and total number of pages on site
+email: rgrothouse@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .travis.yml
+- DOCKER.md
+- Gemfile
+- Gemfile.lock
+- README.md
+- Rakefile
+- docker/Dockerfile
+- docker/neo4j-server.properties
+- docker/startup.sh
+- examples/crawl.rb
+- examples/output_map.png
+- lib/creepy-crawler.rb
+- lib/creepy-crawler/graph.rb
+- lib/creepy-crawler/page.rb
+- lib/creepy-crawler/site.rb
+- spec/creepy-crawler_spec.rb
+- spec/dummypage_helper.rb
+- spec/graph_spec.rb
+- spec/page_spec.rb
+- spec/site_spec.rb
+- spec/spec_helper.rb
+homepage: https://github.com/udryan10/creepy-crawler
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: 1.9.3
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.2.2
+signing_key:
+specification_version: 4
+summary: web crawler that generates a sitemap
+test_files:
+- spec/creepy-crawler_spec.rb
+- spec/dummypage_helper.rb
+- spec/graph_spec.rb
+- spec/page_spec.rb
+- spec/site_spec.rb
+- spec/spec_helper.rb
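
The dependency entries in the metadata above combine the pessimistic operator `~>` with a `>=` floor (for example `neography` at `~> 1.4` and `>= 1.4.1`) or pin an exact version (`open_uri_redirections` at `= 0.1.4`). As a rough illustration of how RubyGems evaluates such constraints, the sketch below uses `Gem::Requirement` and `Gem::Version`. The `satisfies?` wrapper is a hypothetical convenience, not part of RubyGems.

```ruby
require 'rubygems'

# Hypothetical wrapper: true when `version` meets every constraint.
def satisfies?(constraints, version)
  Gem::Requirement.new(*constraints).satisfied_by?(Gem::Version.new(version))
end

neography = ['~> 1.4', '>= 1.4.1']
satisfies?(neography, '1.4.0') # false: below the >= 1.4.1 floor
satisfies?(neography, '1.9.0') # true: ~> 1.4 allows any 1.x from 1.4 up
satisfies?(neography, '2.0.0') # false: ~> 1.4 excludes 2.0 and later

satisfies?(['= 0.1.4'], '0.1.5') # false: exact pin
```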