robotx 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (10)
  1. checksums.yaml +7 -0
  2. data/.gitignore +34 -0
  3. data/Gemfile +4 -0
  4. data/Gemfile.lock +17 -0
  5. data/LICENSE +21 -0
  6. data/README.md +72 -0
  7. data/Rakefile +2 -0
  8. data/lib/robotx.rb +114 -0
  9. data/robotx.gemspec +22 -0
  10. metadata +80 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 161a4310d0e1b28e499ce5dd6226125c6e345dd6
+   data.tar.gz: 2c33050af6edcdc516611e7eb8e1efc5a497ecf5
+ SHA512:
+   metadata.gz: 6dc47d5c31e4629bb462ed353e31ec5e2b5b98fbf2a56363d87c8e9c9a8ed5a611341d88268f29b66b5b268acf4dce8e7766b7be0d6f189f1696544602d86d89
+   data.tar.gz: b939d2cf78e12054a92f8693ad35e5cc55efe75098df8b2e0c76c4347e3ad3763a15812a51968eba40c72b849c51b5373abddc9f0dfd92d49c3c2b1b85d59e84
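These checksums cover the two archives packed inside the published `.gem` file (a `.gem` is a plain tar archive containing `metadata.gz` and `data.tar.gz`). A minimal sketch of how the SHA512 values could be re-computed locally, assuming `robotx-0.1.0.gem` has already been downloaded to the current directory (the filename and location are assumptions, not part of this diff):

~~~ruby
require 'digest'
require 'rubygems/package'

# Read the downloaded .gem as a tar archive and digest the two entries
# that checksums.yaml describes.
File.open('robotx-0.1.0.gem', 'rb') do |gem_file|
  Gem::Package::TarReader.new(gem_file).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
    puts "#{entry.full_name}: #{Digest::SHA512.hexdigest(entry.read)}"
  end
end
~~~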
data/.gitignore ADDED
@@ -0,0 +1,34 @@
+ *.gem
+ *.rbc
+ /.config
+ /coverage/
+ /InstalledFiles
+ /pkg/
+ /spec/reports/
+ /test/tmp/
+ /test/version_tmp/
+ /tmp/
+
+ ## Specific to RubyMotion:
+ .dat*
+ .repl_history
+ build/
+
+ ## Documentation cache and generated files:
+ /.yardoc/
+ /_yardoc/
+ /doc/
+ /rdoc/
+
+ ## Environment normalisation:
+ /.bundle/
+ /lib/bundler/man/
+
+ # for a library or gem, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # Gemfile.lock
+ # .ruby-version
+ # .ruby-gemset
+
+ # unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
+ .rvmrc
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in robotx.gemspec
+ gemspec
data/Gemfile.lock ADDED
@@ -0,0 +1,17 @@
+ PATH
+   remote: .
+   specs:
+     robotx (0.1.0)
+
+ GEM
+   remote: https://rubygems.org/
+   specs:
+     rake (10.3.2)
+
+ PLATFORMS
+   ruby
+
+ DEPENDENCIES
+   bundler (~> 1.6)
+   rake
+   robotx!
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2014
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,72 @@
+ # Robotx
+ Robotx _(pronounced "robotex")_ is a simple but powerful parser for robots.txt files.
+ It offers a set of features that let you check whether a URL is allowed or disallowed to be visited by a crawler.
+
+
+ ## Features
+
+ - Maintains lists of allowed/disallowed URLs
+ - Simple method to check whether a URL or just a path is allowed to be visited
+ - Show all user agents covered by the robots.txt
+ - Get the 'Crawl-Delay' for a website
+ - Support for sitemap(s)
+
+ ## Installation
+ ### With Bundler
+ Just add it to your Gemfile:
+ ~~~ruby
+ gem 'robotx'
+ ~~~
+
+ ### Without Bundler
+ If you're not using Bundler, just run on your command line:
+ ~~~bash
+ $ gem install robotx
+ ~~~
+
+ ## Usage
+ ### Support for different user agents
+ Robotx can be initialized with a specific user agent. The default user agent is `*`.
+ **Please note:** All method results depend on the user agent Robotx was initialized with.
+ ~~~ruby
+ require 'robotx'
+
+ # Initialize with the default user agent '*'
+ robots_txt = Robotx.new('https://github.com')
+ robots_txt.allowed # => ["/humans.txt"]
+
+ # Initialize with 'googlebot' as user agent
+ robots_txt = Robotx.new('https://github.com', 'googlebot')
+ robots_txt.allowed # => ["/*/*/tree/master", "/*/*/blob/master"]
+ ~~~
+
+ ### Check whether a URL is allowed to be indexed
+ ~~~ruby
+ require 'robotx'
+
+ robots_txt = Robotx.new('https://github.com')
+ robots_txt.allowed?('/humans.txt') # => true
+ robots_txt.allowed?('/') # => false
+ ~~~
+
+ ### Get all allowed/disallowed URLs
+ ~~~ruby
+ require 'robotx'
+
+ robots_txt = Robotx.new('https://github.com')
+ robots_txt.allowed # => ["/humans.txt"]
+ robots_txt.disallowed # => ["/"]
+ ~~~
+
+ ### Get additional information
+ ~~~ruby
+ require 'robotx'
+
+ robots_txt = Robotx.new('https://github.com')
+ robots_txt.sitemap # => []
+ robots_txt.crawl_delay # => 0
+ robots_txt.user_agents # => ["googlebot", "baiduspider", ...]
+ ~~~
+
+ ## Todo
+ - Add tests
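The README above only documents the single-path form of `allowed?`, but the implementation in `data/lib/robotx.rb` (added further below) also accepts an Array or Set of paths and returns a Hash mapping each path to its permission. A minimal sketch, reusing the github.com example from the README; the result values simply mirror the single-path examples above and depend on the live robots.txt:

~~~ruby
require 'robotx'

robots_txt = Robotx.new('https://github.com')

# Passing an Array (or Set) returns a Hash of path => allowed?
robots_txt.allowed?(['/humans.txt', '/']) # => {"/humans.txt"=>true, "/"=>false}
~~~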
data/Rakefile ADDED
@@ -0,0 +1,2 @@
+ require "bundler/gem_tasks"
+
data/lib/robotx.rb ADDED
@@ -0,0 +1,114 @@
+ require 'timeout'
+ require 'stringio'
+ require 'open-uri'
+ require 'uri'
+ require 'set'
+
+ class Robotx
+
+   TIMEOUT = 30 # seconds
+
+   def initialize(uri, user_agent='*')
+     @uri = URI.parse(URI.encode(uri))
+     raise URI::InvalidURIError.new('scheme or host missing') unless @uri.scheme and @uri.host
+
+     @user_agent = user_agent.downcase
+     @robots_data = parse_robots_txt
+   end
+
+   def allowed
+     return disallowed.empty? ? ['/'] : @robots_data.fetch(@user_agent, {}).fetch('allow', ['/'])
+   end
+
+   def disallowed
+     return @robots_data.fetch(@user_agent, {}).fetch('disallow', [])
+   end
+
+   def allowed?(data)
+     if data.is_a?(Array) or data.is_a?(Set)
+       return {}.tap do |hash|
+         data.each do |uri|
+           hash[uri] = check_permission(uri)
+         end
+       end
+     end
+
+     return check_permission(data)
+   end
+
+   def sitemap
+     return @robots_data.fetch('sitemap', [])
+   end
+
+   def crawl_delay
+     return [@robots_data.fetch(@user_agent, {}).fetch('crawl-delay', 0), 0].max
+   end
+
+   def user_agents
+     return @robots_data.keys.delete_if { |agent| agent == 'sitemap' }
+   end
+
+   private
+
+   def load_robots_txt
+     Timeout::timeout(Robotx::TIMEOUT) do
+       if robots_txt_io = URI.join(@uri, 'robots.txt').open('User-Agent' => @user_agent) and robots_txt_io.content_type.downcase == 'text/plain' and robots_txt_io.status == ['200', 'OK']
+         return robots_txt_io
+       end
+       raise OpenURI::HTTPError
+     end
+   rescue
+     return StringIO.new("User-agent: *\nAllow: /\n")
+   end
+
+   def parse_robots_txt
+     agent = '*'
+     {}.tap do |hash|
+       load_robots_txt.each do |line|
+         next if line =~ /^\s*(#.*|$)/
+
+         data = line.split(/:/).map(&:strip)
+         key = data.shift
+         value = data.join
+
+         case key.downcase
+         when 'user-agent'
+           agent = value.downcase
+           hash[agent] ||= {}
+         when 'allow'
+           hash[agent]['allow'] ||= []
+           hash[agent]['allow'] << value.sub(/(\/){2,}$/, '')
+         when 'disallow'
+           # Disallow: '' means Allow: '/'
+           if value.empty?
+             hash[agent]['allow'] ||= []
+             hash[agent]['allow'] << '/'
+           else
+             hash[agent]['disallow'] ||= []
+             hash[agent]['disallow'] << value.sub(/(\/){2,}$/, '')
+           end
+         when 'crawl-delay'
+           hash[agent]['crawl-delay'] = value.to_i
+         when 'sitemap'
+           hash['sitemap'] ||= []
+           hash['sitemap'] << value.sub(/(\/){2,}$/, '')
+         else
+           hash[key] ||= []
+           hash[key] << value.sub(/(\/){2,}$/, '')
+         end
+       end
+     end
+   rescue
+     {}
+   end
+
+   def check_permission(uri)
+     uri = URI.parse(URI.encode(uri))
+     return true unless (@robots_data or @robots_data.any?) or (uri.scheme and uri.host)
+
+     uri_path = uri.path.sub(/(\/){2,}$/, '')
+     pattern = Regexp.compile("(^#{Regexp.escape(uri_path)}[\/]*$)|(^/$)")
+     return (@robots_data.fetch(@user_agent, {}).fetch('disallow', []).grep(pattern).empty? or @robots_data.fetch(@user_agent, {}).fetch('allow', []).grep(pattern).any?)
+   end
+
+ end
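For reference, `parse_robots_txt` above builds a nested Hash keyed by the lowercased user agent (sitemaps go into a top-level `'sitemap'` list). A rough sketch of the structure it would produce for a hypothetical robots.txt; the input and values below are illustrative only, not taken from any real site:

~~~ruby
# Hypothetical robots.txt:
#
#   User-agent: googlebot
#   Allow: /public
#   Disallow: /private
#   Crawl-delay: 2
#
# parse_robots_txt would yield roughly:
{
  'googlebot' => {
    'allow'       => ['/public'],
    'disallow'    => ['/private'],
    'crawl-delay' => 2
  }
}
~~~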
data/robotx.gemspec ADDED
@@ -0,0 +1,22 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+
+ Gem::Specification.new do |spec|
+   spec.name = "robotx"
+   spec.version = "0.1.0"
+   spec.authors = ["Matthias Kalb"]
+   spec.email = ["matthias.kalb@railsmechanic.de"]
+   spec.summary = %q{A parser for the robots.txt file}
+   spec.description = %q{A simple to use parser for the robots.txt file.}
+   spec.homepage = "https://github.com/railsmechanic/robotx"
+   spec.license = "MIT"
+
+   spec.files = `git ls-files -z`.split("\x0")
+   spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
+   spec.require_paths = ["lib"]
+
+   spec.add_development_dependency "bundler", "~> 1.6"
+   spec.add_development_dependency "rake"
+ end
metadata ADDED
@@ -0,0 +1,80 @@
+ --- !ruby/object:Gem::Specification
+ name: robotx
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Matthias Kalb
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2014-07-04 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.6'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.6'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: A simple to use parser for the robots.txt file.
+ email:
+ - matthias.kalb@railsmechanic.de
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .gitignore
+ - Gemfile
+ - Gemfile.lock
+ - LICENSE
+ - README.md
+ - Rakefile
+ - lib/robotx.rb
+ - robotx.gemspec
+ homepage: https://github.com/railsmechanic/robotx
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.2.2
+ signing_key:
+ specification_version: 4
+ summary: A parser for the robots.txt file
+ test_files: []