webinspector 0.5.0 → 1.1.0

This diff compares the contents of two publicly released versions of the package as they appear in their public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
-   metadata.gz: 943a718d012d5b472e7ecdb96d1eb4e6e178bc4a
-   data.tar.gz: c600d322e258a2efd527bde502875655f2819f86
+ SHA256:
+   metadata.gz: df0bf76a03246a803f338a903611f128ee8b6d09329f33a9745a27eeb4e9793b
+   data.tar.gz: adda6867a10d3dc5f7a9fd0ec414d7046e140fe016d945eb20098a84a5176642
  SHA512:
-   metadata.gz: 253f5907d503c96fc19485c58ddd59ec0373a9d4c755ffd5189bfd02bf0851692ffda30c25a7e44c7b2263f6dc300049bdd59b56fcc4f4970aca22d7d6ecb700
-   data.tar.gz: 630e4c3c60de4ef1d057180cea7e09d030de573a4d9dc248d0a61657a7f44049f54b60c633d0f7ebfbe54ee70863f8d615095a4335376d2d6a8ed86527d9ff49
+   metadata.gz: 01ce7c5aab007a3c9ef300c61a990a6f00d00604c14f0a3fdec28fdfb620a1e50ad576055b892cfe4d59de66975236674ce1f81f66e2d60c0ad0e0a0c3f4a951
+   data.tar.gz: ca58cdda149cf3b0cc6dcb29017b8e080b70b4ed881c924acddfba98760246417bb0304bc4eb42242a328d4ade99408181cc3ac10016998cd2b23de8fe34a8bf
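
The archive digests move from SHA1 to SHA256 (the SHA512 entries change simply because the archives themselves changed). As a minimal sketch of how such digests are computed, assuming local copies of the two archives extracted from the packaged gem:

```ruby
# Sketch: recomputing the digests recorded in checksums.yaml.
# Assumes metadata.gz and data.tar.gz sit in the current directory.
require 'digest'

%w[metadata.gz data.tar.gz].each do |name|
  bytes = File.binread(name)
  puts "#{name}:"
  puts "  SHA256: #{Digest::SHA256.hexdigest(bytes)}"
  puts "  SHA512: #{Digest::SHA512.hexdigest(bytes)}"
end
```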
data/Gemfile CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  source 'https://rubygems.org'
 
  # Specify your gem's dependencies in webinspector.gemspec
data/README.md CHANGED
@@ -1,10 +1,10 @@
- # Webinspector
+ # WebInspector
 
- Ruby gem to inspect completely a web page. It scrapes a given URL, and returns you its title, description, meta, links, images and more.
+ Ruby gem to inspect web pages. It scrapes a given URL and returns its title, description, meta tags, links, images, and more.
+
+ <a href="https://codeclimate.com/github/davidesantangelo/webinspector"><img src="https://codeclimate.com/github/davidesantangelo/webinspector/badges/gpa.svg" /></a>
 
- ## See it in action!
 
- You can try WebInspector live at this little demo: [https://scrappet.herokuapp.com](https://scrappet.herokuapp.com)
  ## Installation
 
  Add this line to your application's Gemfile:
@@ -23,50 +23,126 @@ Or install it yourself as:
 
  ## Usage
 
- Initialize a WebInspector instance for an URL, like this:
+ ### Initialize a WebInspector instance
 
  ```ruby
- page = WebInspector.new('http://davidesantangelo.com')
+ page = WebInspector.new('http://example.com')
  ```
 
- ## Accessing response status and headers
+ ### With options
+
+ ```ruby
+ page = WebInspector.new('http://example.com', {
+   timeout: 30, # Request timeout in seconds (default: 30)
+   retries: 3, # Number of retries (default: 3)
+   headers: {'User-Agent': 'Custom UA'} # Custom HTTP headers
+ })
+ ```
 
- You can check the status and headers from the response like this:
+ ### Accessing response status and headers
 
  ```ruby
  page.response.status # 200
- page.response.headers # { "server"=>"apache", "content-type"=>"text/html; charset=utf-8", "cache-control"=>"must-revalidate, private, max-age=0", ... }
+ page.response.headers # { "server"=>"apache", "content-type"=>"text/html; charset=utf-8", ... }
+ page.status_code # 200
+ page.success? # true if the page was loaded successfully
+ page.error_message # returns the error message if any
  ```
 
- ## Accessing inpsected data
-
- You can see the data like this:
+ ### Accessing page data
 
  ```ruby
- page.url # URL of the page
- page.scheme # Scheme of the page (http, https)
- page.host # Hostname of the page (like, davidesantangelo.com, without the scheme)
- page.port # Port of the page
- page.title # title of the page from the head section, as string
- page.description # description of the page
- page.links # every link found
- page.images # every image found
- page.meta # metatags of the page
+ page.url # URL of the page
+ page.scheme # Scheme of the page (http, https)
+ page.host # Hostname of the page (like, example.com, without the scheme)
+ page.port # Port of the page
+ page.title # title of the page from the head section
+ page.description # description of the page
+ page.links # array of all links found on the page (absolute URLs)
+ page.images # array of all images found on the page (absolute URLs)
+ page.meta # meta tags of the page
+ page.favicon # favicon URL if available
  ```
 
- ## Accessing meta tags
+ ### Working with meta tags
 
  ```ruby
- page.meta # metatags of the page
+ page.meta # all meta tags
  page.meta['description'] # meta description
  page.meta['keywords'] # meta keywords
+ page.meta['og:title'] # OpenGraph title
+ ```
+
+ ### Filtering links and images by domain
+
+ ```ruby
+ page.domain_links('example.com') # returns only links pointing to example.com
+ page.domain_images('example.com') # returns only images hosted on example.com
+ ```
+
+ ### Searching for words
+
+ ```ruby
+ page.find(["ruby", "rails"]) # returns [{"ruby"=>3}, {"rails"=>1}]
+ ```
+
+ #### JavaScript and Stylesheets
+
+ ```ruby
+ page.javascripts # array of all JavaScript files (absolute URLs)
+ page.stylesheets # array of all CSS stylesheets (absolute URLs)
+ ```
+
+ #### Language Detection
+
+ ```ruby
+ page.language # detected language code (e.g., "en", "es", "fr")
+ ```
+
+ #### Structured Data
+
+ ```ruby
+ page.structured_data # array of JSON-LD structured data objects
+ page.microdata # array of microdata items
+ page.json_ld # alias for structured_data
+ ```
+
+ #### Security Information
+
+ ```ruby
+ page.security_info # hash with security details: { secure: true, hsts: true, ... }
  ```
 
- ## Find words (as array)
+ #### Performance Metrics
+
+ ```ruby
+ page.load_time # page load time in seconds
+ page.size # page size in bytes
+ ```
+
+ #### Content Type
+
  ```ruby
- page.find(["word1, word2"]) # return {"word1"=>3, "word2"=>1}
+ page.content_type # content type header (e.g., "text/html; charset=utf-8")
  ```
 
+ #### Technology Detection
+
+ ```ruby
+ page.technologies # hash of detected technologies: { jquery: true, react: true, ... }
+ ```
+
+ #### HTML Tag Statistics
+
+ ```ruby
+ page.tag_count # hash with counts of each HTML tag: { "div" => 45, "p" => 12, ... }
+ ```
+
+ ### Export all data to JSON
+
+ ```ruby
+ page.to_hash # returns a hash with all page data
+ ```
 
  ## Contributors
 
@@ -74,13 +150,13 @@ page.find(["word1, word2"]) # return {"word1"=>3, "word2"=>1}
  * Sam Nissen ([@samnissen](https://github.com/samnissen))
 
  ## License
- The webinspector GEM is released under the MIT License.
+
+ The WebInspector gem is released under the MIT License.
 
  ## Contributing
 
- 1. Fork it ( https://github.com/[my-github-username]/webinspector/fork )
+ 1. Fork it ( https://github.com/davidesantangelo/webinspector/fork )
  2. Create your feature branch (`git checkout -b my-new-feature`)
  3. Commit your changes (`git commit -am 'Add some feature'`)
  4. Push to the branch (`git push origin my-new-feature`)
  5. Create a new Pull Request
- >>>>>>> develop
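
Taken together, the rewritten README documents a much larger read-only API. A minimal end-to-end sketch built only from the calls listed above (the URL and option values are placeholders):

```ruby
require 'webinspector'

# Option values are illustrative; each call mirrors a README section above.
page = WebInspector.new('http://example.com', timeout: 10, retries: 2)

if page.success?
  puts page.title
  puts page.description
  puts page.meta['og:title']            # OpenGraph title, if present
  puts page.domain_links('example.com') # links restricted to one domain
  p    page.find(%w[ruby rails])        # e.g. [{"ruby"=>3}, {"rails"=>1}]
else
  warn page.error_message
end
```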
data/Rakefile CHANGED
@@ -1,2 +1,3 @@
- require "bundler/gem_tasks"
+ # frozen_string_literal: true
 
+ require 'bundler/gem_tasks'
data/bin/console CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
+ # frozen_string_literal: true
 
- require "bundler/setup"
- require "webinspector"
+ require 'bundler/setup'
+ require 'webinspector'
 
  # You can add fixtures and/or initialization code here to make experimenting
  # with your gem easier. You can also use a different console, if you like.
@@ -10,5 +11,5 @@ require "webinspector"
  # require "pry"
  # Pry.start
 
- require "irb"
+ require 'irb'
  IRB.start
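
data/Gemfile, data/Rakefile, and data/bin/console all gain the same `# frozen_string_literal: true` magic comment. A short sketch of what the pragma changes (the values here are illustrative):

```ruby
# frozen_string_literal: true

s = 'hello'
s.frozen?       # => true; string literals in this file are frozen
# s << ' world' # would raise FrozenError
t = +'hello'    # unary plus returns a mutable (unfrozen) copy
t << ' world'   # => "hello world"
```

The largest change, shown in the hunk below, is a rewrite of the `WebInspector::Inspector` class.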
@@ -1,144 +1,336 @@
+ # frozen_string_literal: true
+
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta'))
 
  module WebInspector
    class Inspector
+     attr_reader :page, :url, :host, :meta
 
      def initialize(page)
        @page = page
        @meta = WebInspector::Meta.new(page).meta
+       @base_url = nil
+     end
+
+     def set_url(url, host)
+       @url = url
+       @host = host
      end
 
      def title
-       @page.css('title').inner_text.strip rescue nil
+       @page.css('title').inner_text.strip
+     rescue StandardError
+       nil
      end
 
      def description
-       @meta['description'] || snippet
+       @meta['description'] || @meta['og:description'] || snippet
      end
 
      def body
        @page.css('body').to_html
      end
 
-     def meta
-       @meta
-     end
-
+     # Search for specific words in the page content
+     # @param words [Array<String>] List of words to search for
+     # @return [Array<Hash>] Counts of word occurrences
      def find(words)
-       text = @page.at('html').inner_text
+       text = @page.at('html').inner_text
        counter(text.downcase, words)
      end
 
+     # Get all links from the page
+     # @return [Array<String>] Array of URLs
      def links
-       get_new_links unless @links
-       return @links
+       @links ||= begin
+         links = []
+         @page.css('a').each do |a|
+           href = a[:href]
+           next unless href
+
+           # Skip javascript and mailto links
+           next if href.start_with?('javascript:', 'mailto:', 'tel:')
+
+           # Clean and normalize URL
+           href = href.strip
+
+           begin
+             absolute_url = make_absolute_url(href)
+             links << absolute_url if absolute_url
+           rescue URI::InvalidURIError
+             # Skip invalid URLs
+           end
+         end
+         links.uniq
+       end
      end
-
-     def domain_links(user_domain, host)
+
+     # Get links from a specific domain
+     # @param user_domain [String] Domain to filter links by
+     # @param host [String] Current host
+     # @return [Array<String>] Filtered links
+     def domain_links(user_domain, host = nil)
        @host ||= host
-
-       validated_domain_uri = validate_url_domain("http://#{user_domain.downcase.gsub(/\s+/, '')}")
-       raise "Invalid domain provided" unless validated_domain_uri
-
-       domain = validated_domain_uri.domain
-
-       domain_links = []
-
-       links.each do |l|
-
-         u = validate_url_domain(l)
-         next unless u && u.domain
-
-         domain_links.push(l) if domain == u.domain.downcase
+       filter_by_domain(links, user_domain)
+     end
+
+     # Get all images from the page
+     # @return [Array<String>] Array of image URLs
+     def images
+       @images ||= begin
+         images = []
+         @page.css('img').each do |img|
+           src = img[:src]
+           next unless src
+
+           # Clean and normalize URL
+           src = src.strip
+
+           begin
+             absolute_url = make_absolute_url(src)
+             images << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         images.uniq.compact
        end
-
-       return domain_links.compact
      end
-
-     def domain_images(user_domain, host)
+
+     # Get images from a specific domain
+     # @param user_domain [String] Domain to filter images by
+     # @param host [String] Current host
+     # @return [Array<String>] Filtered images
+     def domain_images(user_domain, host = nil)
        @host ||= host
-
-       validated_domain_uri = validate_url_domain("http://#{user_domain.downcase.gsub(/\s+/, '')}")
-       raise "Invalid domain provided" unless validated_domain_uri
-
-       domain = validated_domain_uri.domain
-
-       domain_images = []
-
-       images.each do |img|
-         u = validate_url_domain(img)
-         next unless u && u.domain
-
-         domain_images.push(img) if u.domain.downcase.end_with?(domain)
+       filter_by_domain(images, user_domain)
+     end
+
+     # Get all JavaScript files used by the page
+     # @return [Array<String>] Array of JavaScript file URLs
+     def javascripts
+       @javascripts ||= begin
+         scripts = []
+         @page.css('script[src]').each do |script|
+           src = script[:src]
+           next unless src
+
+           # Clean and normalize URL
+           src = src.strip
+
+           begin
+             absolute_url = make_absolute_url(src)
+             scripts << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         scripts.uniq.compact
        end
-
-       return domain_images.compact
      end
-
-     # Normalize and validate the URLs on the page for comparison
-     def validate_url_domain(u)
-       # Enforce a few bare standards before proceeding
-       u = "#{u}"
-       u = "/" if u.empty?
-
-       begin
-         # Look for evidence of a host. If this is a relative link
-         # like '/contact', add the page host.
-         domained_url = @host + u unless (u.split("/").first || "").match(/(\:|\.)/)
-         domained_url ||= u
-
-         # http the URL if it is missing
-         httpped_url = "http://" + domained_url unless domained_url[0..3] == 'http'
-         httpped_url ||= domained_url
-
-         # Make sure the URL parses
-         uri = URI.parse(httpped_url)
-
-         # Make sure the URL passes ICANN rules.
-         # The PublicSuffix object splits the domain and subdomain
-         # (unlike URI), which allows more liberal URL matching.
-         return PublicSuffix.parse(uri.host)
-       rescue URI::InvalidURIError, PublicSuffix::DomainInvalid => e
-         return false
+
+     # Get stylesheets used by the page
+     # @return [Array<String>] Array of CSS file URLs
+     def stylesheets
+       @stylesheets ||= begin
+         styles = []
+         @page.css('link[rel="stylesheet"]').each do |style|
+           href = style[:href]
+           next unless href
+
+           # Clean and normalize URL
+           href = href.strip
+
+           begin
+             absolute_url = make_absolute_url(href)
+             styles << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         styles.uniq.compact
        end
      end
 
-     def images
-       get_new_images unless @images
-       return @images
+     # Detect the page language
+     # @return [String, nil] Language code if detected, nil otherwise
+     def language
+       # Check for html lang attribute first
+       html_tag = @page.at('html')
+       return html_tag['lang'] if html_tag && html_tag['lang'] && !html_tag['lang'].empty?
+
+       # Then check for language meta tag
+       lang_meta = @meta['content-language']
+       return lang_meta if lang_meta && !lang_meta.empty?
+
+       # Fallback to inspecting content headers if available
+       nil
+     end
+
+     # Extract structured data (JSON-LD) from the page
+     # @return [Array<Hash>] Array of structured data objects
+     def structured_data
+       @structured_data ||= begin
+         data = []
+         @page.css('script[type="application/ld+json"]').each do |script|
+           parsed = JSON.parse(script.text)
+           data << parsed if parsed
+         rescue JSON::ParserError
+           # Skip invalid JSON
+         end
+         data
+       end
+     end
+
+     # Extract microdata from the page
+     # @return [Array<Hash>] Array of microdata items
+     def microdata
+       @microdata ||= begin
+         items = []
+         @page.css('[itemscope]').each do |scope|
+           item = { type: scope['itemtype'] }
+           properties = {}
+
+           scope.css('[itemprop]').each do |prop|
+             name = prop['itemprop']
+             # Extract value based on tag
+             value = case prop.name.downcase
+                     when 'meta'
+                       prop['content']
+                     when 'img', 'audio', 'embed', 'iframe', 'source', 'track', 'video'
+                       make_absolute_url(prop['src'])
+                     when 'a', 'area', 'link'
+                       make_absolute_url(prop['href'])
+                     when 'time'
+                       prop['datetime'] || prop.text.strip
+                     else
+                       prop.text.strip
+                     end
+             properties[name] = value
+           end
+
+           item[:properties] = properties
+           items << item
+         end
+         items
+       end
+     end
+
+     # Count all tag types on the page
+     # @return [Hash] Counts of different HTML elements
+     def tag_count
+       tags = {}
+       @page.css('*').each do |element|
+         tag_name = element.name.downcase
+         tags[tag_name] ||= 0
+         tags[tag_name] += 1
+       end
+       tags
      end
 
      private
-
+
+     # Count occurrences of words in text
+     # @param text [String] Text to search in
+     # @param words [Array<String>] Words to find
+     # @return [Array<Hash>] Count results
      def counter(text, words)
-       results = []
-       hash = Hash.new
+       words.map do |word|
+         { word => text.scan(/#{Regexp.escape(word.downcase)}/).size }
+       end
+     end
+
+     # Validate a URL domain
+     # @param u [String] URL to validate
+     # @return [PublicSuffix::Domain, false] Domain object or false if invalid
+     def validate_url_domain(u)
+       u = u.to_s
+       u = '/' if u.empty?
+
+       begin
+         domained_url = if !(u.split('/').first || '').match(/(:|\.)/)
+                          @host + u
+                        else
+                          u
+                        end
+
+         httpped_url = domained_url.start_with?('http') ? domained_url : "http://#{domained_url}"
+         uri = URI.parse(httpped_url)
 
-       words.each do |word|
-         hash[word] = text.scan(/#{word.downcase}/).size
-         results.push(hash)
-         hash = Hash.new
+         PublicSuffix.parse(uri.host)
+       rescue URI::InvalidURIError, PublicSuffix::DomainInvalid
+         false
        end
-       return results
      end
 
-     def get_new_images
-       @images = []
-       @page.css("img").each do |img|
-         @images.push((img[:src].to_s.start_with? @url.to_s) ? img[:src] : URI.join(url, img[:src]).to_s) if (img and img[:src])
+     # Filter a list of URLs by a given domain.
+     # @param collection [Array<String>] The list of URLs to filter.
+     # @param user_domain [String] The domain to filter by.
+     # @return [Array<String>] The filtered list of URLs.
+     def filter_by_domain(collection, user_domain)
+       return [] if collection.empty?
+
+       # Handle nil user_domain
+       user_domain = @host.to_s if user_domain.nil? || user_domain.empty?
+
+       # Normalize domain for comparison
+       normalized_domain = user_domain.to_s.downcase.gsub(/\s+/, '').sub(/^www\./, '')
+
+       collection.select do |item|
+         uri = URI.parse(item.to_s)
+         next false unless uri.host
+
+         uri_host = uri.host.to_s.downcase.sub(/^www\./, '')
+         uri_host.include?(normalized_domain)
+       rescue URI::InvalidURIError, NoMethodError
+         false
        end
      end
-
-     def get_new_links
-       @links = []
-       @page.css("a").each do |a|
-         @links.push((a[:href].to_s.start_with? @url.to_s) ? a[:href] : URI.join(@url, a[:href]).to_s) if (a and a[:href])
+
+     # Make a URL absolute
+     # @param url [String] URL to make absolute
+     # @return [String, nil] Absolute URL or nil if invalid
+     def make_absolute_url(url)
+       return nil if url.nil? || url.empty?
+
+       # If it's already absolute, return it
+       return url if url.start_with?('http://', 'https://')
+
+       # Get base URL from the page if not already set
+       if @base_url.nil?
+         base_tag = @page.at_css('base[href]')
+         @base_url = base_tag ? base_tag['href'] : ''
        end
+
+       begin
+         # Try joining with base URL first if available
+         return URI.join(@base_url, url).to_s unless @base_url.empty?
+       rescue URI::InvalidURIError, URI::BadURIError
+         # Fall through to next method
+       end
+
+       begin
+         # If we have @url, try to use it
+         return URI.join(@url, url).to_s if @url
+       rescue URI::InvalidURIError, URI::BadURIError
+         # Fall through to next method
+       end
+
+       # For relative URLs, we need to make our best guess
+       return "http://#{@host}#{url}" if url.start_with?('/')
+       return "http://#{@host}/#{url}" if @host
+
+       # Last resort, return the original
+       url
+     rescue URI::InvalidURIError, URI::BadURIError
+       url # Return original instead of nil to be more lenient
      end
 
+     # Extract a snippet from the first long paragraph
+     # @return [String] Text snippet
      def snippet
        first_long_paragraph = @page.search('//p[string-length() >= 120]').first
-       first_long_paragraph ? first_long_paragraph.text : ''
+       first_long_paragraph ? first_long_paragraph.text.strip[0..255] : ''
      end
    end
- end
+ end
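
Worth calling out in the rewrite: the private `make_absolute_url` helper replaces the bare `URI.join(@url, href)` calls from 0.5.0 with a layered fallback, trying the page's `<base href>`, then the page URL, then a host-based guess. A small sketch of that resolution order with stand-in values:

```ruby
require 'uri'

# Stand-in values; the order mirrors make_absolute_url in the hunk above.
base_url = 'http://example.com/docs/'            # from <base href>, when present
page_url = 'http://example.com/docs/index.html'  # the inspected page's URL
host     = 'example.com'

href = 'guide/intro.html'

URI.join(base_url, href).to_s  # => "http://example.com/docs/guide/intro.html"
URI.join(page_url, href).to_s  # same result when there is no <base> tag
"http://#{host}/#{href}"       # last-resort guess for a bare relative path
```

The word counter gets a similar cleanup: it now passes each search term through `Regexp.escape` before scanning, so terms containing regex metacharacters (such as "c++") are counted literally instead of being interpreted as patterns.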