RubyGems - crawlscope - Versions diffs - 0.2.0 → 0.4.0 - Mend

crawlscope 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27)

checksums.yaml +4 -4
data/CHANGELOG.md +67 -0
data/README.md +46 -9
data/lib/crawlscope/cli.rb +5 -0
data/lib/crawlscope/crawl.rb +6 -0
data/lib/crawlscope/document_text.rb +40 -0
data/lib/crawlscope/rule_registry.rb +3 -1
data/lib/crawlscope/rules/content_quality.rb +99 -0
data/lib/crawlscope/rules/indexability.rb +66 -0
data/lib/crawlscope/rules/links.rb +24 -6
data/lib/crawlscope/rules/metadata.rb +57 -11
data/lib/crawlscope/rules/structured_data.rb +47 -0
data/lib/crawlscope/rules/uniqueness.rb +76 -4
data/lib/crawlscope/schemas.rb +52 -1
data/lib/crawlscope/version.rb +1 -1
data/lib/tasks/crawlscope_tasks.rake +11 -1
data/test/crawlscope/cli_test.rb +19 -5
data/test/crawlscope/configuration_test.rb +8 -1
data/test/crawlscope/content_quality_rule_test.rb +68 -0
data/test/crawlscope/crawl_test.rb +23 -3
data/test/crawlscope/indexability_rule_test.rb +96 -0
data/test/crawlscope/links_rule_test.rb +39 -0
data/test/crawlscope/metadata_rule_test.rb +77 -0
data/test/crawlscope/structured_data_rule_test.rb +91 -0
data/test/crawlscope/uniqueness_rule_test.rb +43 -2
data/test/release_task_test.rb +86 -0
metadata +9 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ba21d55a2d9b787d7bb9d4e90f39e655a5fe2a884769dbef6f866d1e5779e076
-  data.tar.gz: b7c6b829412f8e436cd81d2d28bcd5fe22327f0bb9fcc34af307b4b5feac722c
+  metadata.gz: 79e8c8f3993c545bf7647c28b8540d3757c7d9c91eeaf885cde6d55c4935ebb5
+  data.tar.gz: d9b6a987e04546c2d3ee7bb3cc6e1d5510e78963df035cb24d7c8783064afa45
 SHA512:
-  metadata.gz: d4a6e75c44c7cff4e238ff50168b7807fec8542074bbcbe838c50cf5eba02f181576291f1033620f268484b4c75f588215789515bd6c3ee9d7e76e8e5b94ceaf
-  data.tar.gz: 5576d6a31853ebf3e6662e4bbc8f97d4da918a24352e02c4d9c7569e4300ae102d79c9a348e55ce884273c12dfa1717b8b22c16091fa26eb0d69c19b4b7dca36
+  metadata.gz: eb49361b9f26992682db7622796c4b262a12fca37254aca5e1f1c49c85702b7e4fc347a880af0665f10238f5340cb61bc44433060ba7b3fbde0bdd379c85c763
+  data.tar.gz: 5fa53f930ef529279e063bd11f9becd112c8abb266078027486f22ad37e968bad744c5a35c9432ccb170ceb51e45d858e23a47c649c6ede1d4dd89fb331fd9f3

data/CHANGELOG.md CHANGED

@@ -5,6 +5,49 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.4.0] - 2026-05-21
+### Added
+- add indexability and content quality checks
+### Fixed
+- preserve release changelog history
+- scope content ratio to main content
+- harden indexability and uniqueness rules
+## [0.3.0] - 2026-04-28
+### Added
+- add JobPost structured data
+### Documentation
+- fix missing changelog entry
+### Fixed
+- ldjson check now uses the same convention for default URL
 ## [0.2.0] - 2026-04-24
@@ -25,3 +68,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
+## [0.1.0] - 2026-04-23
+### Added
+- add crawlkit release-ready audit gem
+- add standalone validation commands
+- move default schema rules into crawlkit
+### Changed
+- strengthen public API coverage
+- load shared test dependencies
+- rename crawlkit to crawlscope

data/README.md CHANGED

@@ -23,9 +23,11 @@ It works in three modes:
 The default rule set includes:
+- indexability blockers
 - metadata validation
 - structured-data validation
 - uniqueness checks
+- content-quality checks
 - internal-link checks
 ## Installation
@@ -146,11 +148,13 @@ Available tasks:
 ```bash
 bin/rails crawlscope:validate
+bin/rails crawlscope:validate:indexability
 bin/rails crawlscope:validate:metadata
 bin/rails crawlscope:validate:structured_data
 bin/rails crawlscope:validate:uniqueness
+bin/rails crawlscope:validate:content_quality
 bin/rails crawlscope:validate:links
-bin/rails crawlscope:validate:ldjson URL=https://example.com/article
+bin/rails crawlscope:validate:ldjson
 ```
 The same validation surface is also available in the gem repository itself through plain `rake`:
@@ -161,9 +165,9 @@ bundle exec rake crawlscope:validate:metadata URL=https://example.com
 bundle exec rake crawlscope:validate:ldjson URL=https://example.com/article
 ```
-`crawlscope:validate` runs all default sitemap rules: metadata, structured data, uniqueness, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
+`crawlscope:validate` runs all default sitemap rules: indexability, metadata, structured data, uniqueness, content quality, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
-`crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap.
+`crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap. Without `URL`, it checks the configured base URL, falling back to `http://localhost:3000`.
 ### Structured Data URL Audit
@@ -186,11 +190,20 @@ Optional flags:
 Built-in rules:
+- `indexability`
 - `metadata`
 - `structured_data`
 - `uniqueness`
+- `content_quality`
 - `links`
+### Indexability
+Checks:
+- page-level meta robots `noindex`
+- `X-Robots-Tag: noindex`
 ### Metadata
 Checks:
@@ -220,6 +233,19 @@ Checks:
 - duplicate titles
 - duplicate meta descriptions
 - duplicate content fingerprints
+- near-duplicate visible content for up to 250 HTML pages
+For larger crawls, exact duplicate checks still run and Crawlscope reports
+`near_duplicate_scan_skipped`. Configure `Rules::Uniqueness` with
+`max_near_duplicate_pages:` in a custom rule registry to change the limit.
+### Content Quality
+Checks:
+- thin visible text
+- low visible-text-to-HTML ratio
+- low unique-token ratio
 ### Links
@@ -268,7 +294,12 @@ bundle exec rake
 ### Git hooks
-We use [lefthook](https://lefthook.dev/) with the Ruby [commitlint](https://github.com/arandilopez/commitlint) gem to enforce Conventional Commits on every commit. We also use [Standard Ruby](https://standardrb.com/) to keep code style consistent. CI validates commit messages, Standard Ruby, tests, and git-cliff changelog generation on pull requests and pushes to main/master.
+We use [lefthook](https://lefthook.dev/) with the Ruby
+[commitlint](https://github.com/arandilopez/commitlint) gem to enforce
+Conventional Commits on every commit. We also use
+[Standard Ruby](https://standardrb.com/) to keep code style consistent. CI
+validates commit messages, Standard Ruby, tests, and git-cliff changelog
+generation on pull requests and pushes to main/master.
 Run the hook installer once per clone:
@@ -284,11 +315,16 @@ rake install
 ## Release
-Releases are tag-driven and published by GitHub Actions to RubyGems. Local release commands never publish directly.
+Releases are tag-driven and published by GitHub Actions to RubyGems.
+Local release commands never publish directly.
-Install [git-cliff](https://git-cliff.org/) locally before preparing a release. The release task regenerates `CHANGELOG.md` from Conventional Commits.
+Install [git-cliff](https://git-cliff.org/) locally before preparing a
+release. The release task prepends the next `CHANGELOG.md` section from
+Conventional Commits.
-Before preparing a release, make sure you are on `main` or `master` with a clean worktree.
+Before preparing a release, make sure you are on `main` or `master` with a
+clean worktree. If the release contains a breaking public-contract change,
+update `UPGRADE.md` with the host-app migration steps first.
 Then run one of:
@@ -301,12 +337,13 @@ bundle exec rake 'release:prepare[0.1.0]'
 The task will:
-1. Regenerate `CHANGELOG.md` with `git-cliff`.
+1. Prepend the next `CHANGELOG.md` section with `git-cliff`.
 1. Update `lib/crawlscope/version.rb`.
 1. Commit the release changes.
 1. Create and push the `vX.Y.Z` tag.
-The `Release` workflow then runs tests, publishes the gem to RubyGems, and creates the GitHub release from the changelog entry.
+The `Release` workflow then runs tests, publishes the gem to RubyGems,
+and creates the GitHub release from the changelog entry.
 ## Contributing

data/lib/crawlscope/cli.rb CHANGED

@@ -105,6 +105,7 @@ module Crawlscope
       parser.parse!(@argv)
       urls = options[:urls].map(&:strip).reject(&:empty?)
+      urls = default_urls if urls.empty?
       raise ConfigurationError, "Crawlscope URL is not configured" if urls.empty?
       configure_renderer(options[:renderer])
@@ -238,6 +239,10 @@ module Crawlscope
       raw_urls.split(";").map(&:strip).reject(&:empty?)
     end
+    def default_urls
+      [normalized_string(@configuration.base_url) || "http://localhost:3000"]
+    end
     def task
       @task ||= Run.new(configuration: @configuration, reporter: Reporter.new(io: @out))
     end
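
The URL fallback added to `cli.rb` above can be sketched in isolation. This is a minimal illustration, not the gem's code: `CliUrlSketch`, `resolve_urls`, and this version of `normalized_string` are hypothetical stand-ins (the real `normalized_string` helper is not shown in the diff), and only the `http://localhost:3000` fallback and the semicolon splitting are taken from the changed lines.

```ruby
# Sketch of the ldjson URL resolution order, assuming a blank-to-nil helper.
module CliUrlSketch
  FALLBACK_URL = "http://localhost:3000"

  module_function

  # Hypothetical stand-in for the CLI's private helper: blank values become nil.
  def normalized_string(value)
    text = value.to_s.strip
    text.empty? ? nil : text
  end

  # Mirrors the semicolon-separated URL parsing shown in the diff.
  def split_urls(raw_urls)
    raw_urls.to_s.split(";").map(&:strip).reject(&:empty?)
  end

  # Explicit URLs win; otherwise the configured base URL; otherwise localhost.
  def resolve_urls(raw_urls, base_url)
    urls = split_urls(raw_urls)
    return urls unless urls.empty?

    [normalized_string(base_url) || FALLBACK_URL]
  end
end

p CliUrlSketch.resolve_urls("https://a.test/x; https://a.test/y", nil)
p CliUrlSketch.resolve_urls(nil, " https://a.test ")
p CliUrlSketch.resolve_urls("", nil)
```

Because the fallback always yields one URL, the `ConfigurationError` guard that follows it in the diff would presumably only trigger if the fallback itself were changed.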

data/lib/crawlscope/crawl.rb CHANGED

@@ -81,6 +81,8 @@ module Crawlscope
           issues.add(code: :fetch_failed, severity: :error, category: :crawl, url: page.url, message: page.error, details: {})
         elsif !@allowed_statuses.include?(page.status)
           issues.add(code: :unexpected_status, severity: :error, category: :crawl, url: page.url, message: "HTTP #{page.status}", details: {status: page.status})
+        elsif redirected?(page)
+          issues.add(code: :redirected_page, severity: :warning, category: :crawl, url: page.url, message: "redirects to #{page.final_url}", details: {final_url: page.final_url, status: page.status})
         end
       end
     end
@@ -128,5 +130,9 @@ module Crawlscope
         status: page.status
       }
     end
+    def redirected?(page)
+      page.normalized_url.to_s != page.normalized_final_url.to_s
+    end
   end
 end

data/lib/crawlscope/document_text.rb ADDED

@@ -0,0 +1,40 @@
+# frozen_string_literal: true
+module Crawlscope
+  module DocumentText
+    REMOVED_SELECTORS = "script, style, noscript, template, svg"
+    TOKEN_PATTERN = /[[:alnum:]]+/
+    module_function
+    def body_text(doc)
+      text_for(doc, selector: nil)
+    end
+    def html_for(doc, selector: "main")
+      root_for(doc, selector: selector)&.to_html.to_s
+    end
+    def text_for(doc, selector: "main")
+      normalize(root_for(doc, selector: selector)&.text)
+    end
+    def tokens(text)
+      normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
+    end
+    def normalize(text)
+      text.to_s.gsub(/\s+/, " ").strip
+    end
+    def root_for(doc, selector:)
+      return unless doc
+      copy = doc.dup
+      copy.css(REMOVED_SELECTORS).remove
+      root = selector.to_s.empty? ? nil : copy.at_css(selector)
+      root || copy.at_css("body") || copy
+    end
+  end
+end
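
To see what the `normalize` and `tokens` helpers above do to a string, the two methods can be copied into a standalone module that needs no HTML parser. `DocumentTextSketch` is a hypothetical name used only for this illustration; the method bodies follow the added file.

```ruby
# Standalone copy of the text helpers, for experimentation only.
module DocumentTextSketch
  TOKEN_PATTERN = /[[:alnum:]]+/

  module_function

  # Collapse every run of whitespace to one space and trim the ends.
  def normalize(text)
    text.to_s.gsub(/\s+/, " ").strip
  end

  # Lowercase alphanumeric tokens; single-character tokens are dropped.
  def tokens(text)
    normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
  end
end

sample = "  A quick\n\tbrown fox, a QUICK fox!  "
p DocumentTextSketch.normalize(sample) # => "A quick brown fox, a QUICK fox!"
p DocumentTextSketch.tokens(sample)    # => ["quick", "brown", "fox", "quick", "fox"]
```

Note that the length filter removes "a" here, so word counts based on `tokens` will run slightly lower than a plain whitespace split.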

data/lib/crawlscope/rule_registry.rb CHANGED

@@ -12,12 +12,14 @@ module Crawlscope
     def self.default(site_name: nil)
       new(
         rules: [
+          Rules::Indexability.new,
           Rules::Metadata.new(site_name: site_name),
           Rules::StructuredData.new,
           Rules::Uniqueness.new,
+          Rules::ContentQuality.new,
           Rules::Links.new
         ],
-        default_codes: %i[metadata structured_data uniqueness links]
+        default_codes: %i[indexability metadata structured_data uniqueness content_quality links]
       )
     end

data/lib/crawlscope/rules/content_quality.rb ADDED

@@ -0,0 +1,99 @@
+# frozen_string_literal: true
+module Crawlscope
+  module Rules
+    class ContentQuality
+      MIN_VISIBLE_TEXT_RATIO = 0.08
+      MIN_VISIBLE_WORDS = 250
+      MIN_UNIQUE_TOKEN_RATIO = 0.25
+      attr_reader :code
+      def initialize(
+        min_visible_text_ratio: MIN_VISIBLE_TEXT_RATIO,
+        min_visible_words: MIN_VISIBLE_WORDS,
+        min_unique_token_ratio: MIN_UNIQUE_TOKEN_RATIO
+      )
+        @code = :content_quality
+        @min_visible_text_ratio = min_visible_text_ratio
+        @min_visible_words = min_visible_words
+        @min_unique_token_ratio = min_unique_token_ratio
+      end
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          next unless page.html?
+          validate_visible_words(page, issues)
+          validate_visible_text_ratio(page, issues)
+          validate_unique_token_ratio(page, issues)
+        end
+      end
+      private
+      def validate_unique_token_ratio(page, issues)
+        tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+        return if tokens.size < @min_visible_words
+        ratio = tokens.uniq.size.to_f / tokens.size
+        return if ratio >= @min_unique_token_ratio
+        issues.add(
+          code: :low_unique_token_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "visible text has low token variety (#{format_ratio(ratio)})",
+          details: {
+            ratio: ratio.round(3),
+            threshold: @min_unique_token_ratio,
+            token_count: tokens.size,
+            unique_token_count: tokens.uniq.size
+          }
+        )
+      end
+      def validate_visible_text_ratio(page, issues)
+        html_bytes = DocumentText.html_for(page.doc).bytesize
+        return if html_bytes.zero?
+        visible_text = DocumentText.text_for(page.doc)
+        ratio = visible_text.bytesize.to_f / html_bytes
+        return if ratio >= @min_visible_text_ratio
+        issues.add(
+          code: :low_visible_text_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "low visible text to HTML ratio (#{format_ratio(ratio)})",
+          details: {
+            html_bytes: html_bytes,
+            ratio: ratio.round(3),
+            threshold: @min_visible_text_ratio,
+            visible_text_bytes: visible_text.bytesize
+          }
+        )
+      end
+      def validate_visible_words(page, issues)
+        word_count = DocumentText.tokens(DocumentText.text_for(page.doc)).size
+        return if word_count >= @min_visible_words
+        issues.add(
+          code: :thin_visible_text,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "thin visible text (#{word_count} words)",
+          details: {word_count: word_count, minimum: @min_visible_words}
+        )
+      end
+      def format_ratio(value)
+        format("%.2f", value)
+      end
+    end
+  end
+end
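
The unique-token check above only applies once a page has at least `min_visible_words` tokens; shorter pages are left to the thin-text check. A rough sketch of that decision, working on a plain token array rather than a crawled page, might look like this. `ContentQualitySketch` and `low_token_variety?` are hypothetical names; the thresholds are the defaults from the added file.

```ruby
# Sketch of the unique-token-ratio decision, assuming tokens are already extracted.
module ContentQualitySketch
  MIN_VISIBLE_WORDS = 250
  MIN_UNIQUE_TOKEN_RATIO = 0.25

  module_function

  # Share of distinct tokens among all tokens; nil when there is nothing to measure.
  def unique_token_ratio(tokens)
    return nil if tokens.empty?

    tokens.uniq.size.to_f / tokens.size
  end

  # True only for pages long enough to judge whose vocabulary is highly repetitive.
  def low_token_variety?(tokens, min_words: MIN_VISIBLE_WORDS, min_ratio: MIN_UNIQUE_TOKEN_RATIO)
    return false if tokens.size < min_words

    unique_token_ratio(tokens) < min_ratio
  end
end

repetitive = %w[buy now] * 150                    # 300 tokens, 2 distinct
varied = (1..300).map { |i| "word#{i}" }          # 300 tokens, all distinct
p ContentQualitySketch.low_token_variety?(repetitive) # => true
p ContentQualitySketch.low_token_variety?(varied)     # => false
```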

data/lib/crawlscope/rules/indexability.rb ADDED

@@ -0,0 +1,66 @@
+# frozen_string_literal: true
+module Crawlscope
+  module Rules
+    class Indexability
+      ROBOTS_META_SELECTOR = 'meta[name="robots"], meta[name="googlebot"]'
+      X_ROBOTS_TAG_HEADER = "x-robots-tag"
+      attr_reader :code
+      def initialize
+        @code = :indexability
+      end
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          validate_meta_robots(page, issues) if page.html?
+          validate_x_robots_tag(page, issues)
+        end
+      end
+      private
+      def header_value(page, name)
+        page.headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
+      end
+      def noindex?(value)
+        value
+          .split(",")
+          .map { |directive| directive.split(":", 2).last.to_s.strip }
+          .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
+      end
+      def validate_meta_robots(page, issues)
+        page.doc.css(ROBOTS_META_SELECTOR).each do |tag|
+          content = tag["content"].to_s
+          next unless noindex?(content)
+          issues.add(
+            code: :noindex_meta,
+            severity: :error,
+            category: :indexability,
+            url: page.url,
+            message: "robots meta tag prevents indexing",
+            details: {content: content, name: tag["name"].to_s}
+          )
+        end
+      end
+      def validate_x_robots_tag(page, issues)
+        content = header_value(page, X_ROBOTS_TAG_HEADER)
+        return unless noindex?(content)
+        issues.add(
+          code: :noindex_header,
+          severity: :error,
+          category: :indexability,
+          url: page.url,
+          message: "X-Robots-Tag header prevents indexing",
+          details: {content: content}
+        )
+      end
+    end
+  end
+end
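
The directive parsing in `noindex?` above is easiest to follow with concrete header and meta values. The sketch below copies the method body into a hypothetical `IndexabilitySketch` module so it can be run on plain strings; the sample values are illustrative and not taken from the gem's tests.

```ruby
# Standalone copy of the noindex detection, for trying out directive strings.
module IndexabilitySketch
  module_function

  # Split on commas, drop an optional "agent:" prefix, then look for noindex/none.
  def noindex?(value)
    value
      .split(",")
      .map { |directive| directive.split(":", 2).last.to_s.strip }
      .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
  end
end

p IndexabilitySketch.noindex?("noindex, nofollow")  # => true
p IndexabilitySketch.noindex?("googlebot: NoIndex") # => true
p IndexabilitySketch.noindex?("index, follow")      # => false
p IndexabilitySketch.noindex?("")                   # => false
```

Since an absent `X-Robots-Tag` header is converted to an empty string by `header_value`, the last case is the one that keeps pages without the header from being flagged.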

data/lib/crawlscope/rules/links.rb CHANGED

@@ -5,7 +5,7 @@ require "uri"
 module Crawlscope
   module Rules
     class Links
-      CONTEXTUAL_LINK_SELECTORS = "main a[href], article a[href]"
+      LINK_SELECTORS = "a[href]"
       INTERNAL_PATH_PREFIXES_TO_SKIP = ["/rails/", "/cdn-cgi/"].freeze
       LINK_SCHEMES_TO_SKIP = ["mailto:", "tel:", "javascript:", "data:"].freeze
       MAX_SOURCES_IN_ERROR = 3
@@ -33,10 +33,7 @@ module Crawlscope
       private
       def contextual_links(doc)
-        links = doc.css(CONTEXTUAL_LINK_SELECTORS)
-        return links unless links.empty?
-        doc.css("a[href]")
+        doc.css(LINK_SELECTORS)
       end
       def extract_links(pages)
@@ -45,7 +42,7 @@ module Crawlscope
       def page_links(page)
         source_path = Url.path(page.normalized_url)
-        return [] unless crawlable_path?(source_path)
+        return [] unless crawlable_source_path?(source_path)
         contextual_links(page.doc).filter_map do |node|
           link_for(page: page, source_path: source_path, node: node)
@@ -146,6 +143,7 @@ module Crawlscope
             next
           end
+          report_redirect_target(target_url, grouped_links, issues, target) if target.redirect?
           next unless crawlable_path?(target.final_path)
           grouped_links.each do |link|
@@ -156,6 +154,18 @@ module Crawlscope
         resolved_links
       end
+      def report_redirect_target(target_url, grouped_links, issues, target)
+        source_urls = grouped_links.map { |link| link[:source_url] }.uniq.first(MAX_SOURCES_IN_ERROR)
+        issues.add(
+          code: :internal_link_redirects,
+          severity: :warning,
+          category: :links,
+          url: target_url,
+          message: "internal link redirects to #{target.final_url} (sources: #{source_urls.join(", ")})",
+          details: {final_url: target.final_url, source_urls: source_urls, status: target.status}
+        )
+      end
       def resolve_target(target_url)
         resolution = @resolve_target.call(target_url)
         LinkTarget.new(target_url: target_url, resolution: resolution)
@@ -183,11 +193,19 @@ module Crawlscope
           resolution && resolution[:status]
         end
+        def redirect?
+          (status && (300..399).cover?(status.to_i)) || final_url != target_url
+        end
         def unresolved?
           resolution.nil? || (status.nil? && !ignored_error?)
         end
       end
+      def crawlable_source_path?(path)
+        !path.nil? && INTERNAL_PATH_PREFIXES_TO_SKIP.none? { |prefix| path.start_with?(prefix) }
+      end
       def skip_internal_path?(path)
         return true if path == "/"
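
The new `redirect?` predicate above treats a link as redirecting when either the status is in the 3xx range or the final URL differs from the requested one. A small sketch with a `Struct` shows the combinations. `LinkTargetSketch` is a hypothetical stand-in that takes `final_url` and `status` directly; the real `LinkTarget` derives them from a `resolution` hash that is only partly visible in this diff.

```ruby
# Sketch of the redirect predicate, assuming final_url and status are given directly.
LinkTargetSketch = Struct.new(:target_url, :final_url, :status) do
  # A 3xx status or a changed final URL both count as a redirect.
  def redirect?
    (status && (300..399).cover?(status.to_i)) || final_url != target_url
  end
end

same = LinkTargetSketch.new("https://example.com/a", "https://example.com/a", 200)
moved = LinkTargetSketch.new("https://example.com/a", "https://example.com/b", 200)
raw301 = LinkTargetSketch.new("https://example.com/a", "https://example.com/a", 301)

p same.redirect?   # => false
p moved.redirect?  # => true
p raw301.redirect? # => true
```

The second case matters for resolvers that follow redirects and report the final 200 response, where the status alone would hide the hop.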

data/lib/crawlscope/rules/metadata.rb CHANGED

@@ -1,10 +1,14 @@
 # frozen_string_literal: true
+require "uri"
 module Crawlscope
   module Rules
     class Metadata
       TITLE_MAX_LENGTH = 72
+      DESCRIPTION_MIN_LENGTH = 110
       DESCRIPTION_MAX_LENGTH = 160
+      REQUIRED_OPEN_GRAPH_PROPERTIES = %w[og:title og:description og:url og:type og:image].freeze
       attr_reader :code
@@ -21,22 +25,35 @@ module Crawlscope
           validate_title(page, issues)
           validate_description(page, issues)
           validate_canonical(page, issues)
+          validate_open_graph(page, issues)
         end
       end
       private
       def validate_h1(page, issues)
-        return unless page.doc.at_css("h1").nil?
-        issues.add(
-          code: :missing_h1,
-          severity: :warning,
-          category: :metadata,
-          url: page.url,
-          message: "missing <h1>",
-          details: {}
-        )
+        h1s = page.doc.css("h1")
+        return if h1s.one?
+        if h1s.empty?
+          issues.add(
+            code: :missing_h1,
+            severity: :warning,
+            category: :metadata,
+            url: page.url,
+            message: "missing <h1>",
+            details: {}
+          )
+        else
+          issues.add(
+            code: :multiple_h1,
+            severity: :warning,
+            category: :metadata,
+            url: page.url,
+            message: "multiple <h1> tags (#{h1s.size})",
+            details: {count: h1s.size}
+          )
+        end
       end
       def validate_title(page, issues)
@@ -56,6 +73,8 @@ module Crawlscope
         if description.empty?
           issues.add(code: :missing_meta_description, severity: :warning, category: :metadata, url: page.url, message: "missing meta description", details: {})
+        elsif description.length < DESCRIPTION_MIN_LENGTH
+          issues.add(code: :meta_description_too_short, severity: :warning, category: :metadata, url: page.url, message: "meta description too short (#{description.length})", details: {length: description.length, minimum: DESCRIPTION_MIN_LENGTH})
         elsif description.length > DESCRIPTION_MAX_LENGTH
           issues.add(code: :meta_description_too_long, severity: :warning, category: :metadata, url: page.url, message: "meta description too long (#{description.length})", details: {length: description.length})
         end
@@ -71,7 +90,7 @@ module Crawlscope
         normalized_canonical = Url.normalize(canonical, base_url: page.url)
         normalized_page_url = Url.normalize(page.url, base_url: page.url)
-        return if normalized_canonical == normalized_page_url
+        return if canonical_matches_page?(normalized_canonical, normalized_page_url)
         issues.add(
           code: :canonical_mismatch,
@@ -88,6 +107,33 @@ module Crawlscope
         title.split(/[^[:alnum:]]+/).count { |token| token.casecmp?(@site_name) } > 1
       end
+      def validate_open_graph(page, issues)
+        missing = REQUIRED_OPEN_GRAPH_PROPERTIES.reject do |property|
+          page.doc.at_css(%(meta[property="#{property}"][content]))
+        end
+        return if missing.empty?
+        issues.add(
+          code: :incomplete_open_graph_tags,
+          severity: :warning,
+          category: :metadata,
+          url: page.url,
+          message: "Open Graph tags incomplete (missing #{missing.join(", ")})",
+          details: {missing: missing}
+        )
+      end
+      def canonical_matches_page?(canonical, page_url)
+        canonical == page_url || (local_url?(page_url) && Url.path(canonical) == Url.path(page_url))
+      end
+      def local_url?(url)
+        host = URI.parse(url.to_s).host.to_s
+        ["localhost", "127.0.0.1", "0.0.0.0", "::1"].include?(host)
+      rescue URI::InvalidURIError
+        false
+      end
     end
   end
 end
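
The relaxed canonical comparison above lets a page served from a local host match a production canonical as long as the paths agree. The following sketch reproduces that rule with the standard library only. `CanonicalSketch` and `path_of` are hypothetical: `path_of` stands in for the gem's `Url.path`, whose implementation is not part of this diff, and the inputs are assumed to be already normalized absolute URLs.

```ruby
require "uri"

# Sketch of the local-host canonical rule, assuming absolute, normalized URLs.
module CanonicalSketch
  LOCAL_HOSTS = ["localhost", "127.0.0.1", "0.0.0.0", "::1"].freeze

  module_function

  # True when the URL's host is one of the local development hosts.
  def local_url?(url)
    host = URI.parse(url.to_s).host.to_s
    LOCAL_HOSTS.include?(host)
  rescue URI::InvalidURIError
    false
  end

  # Hypothetical replacement for Url.path: the path component, or nil if unparseable.
  def path_of(url)
    URI.parse(url.to_s).path
  rescue URI::InvalidURIError
    nil
  end

  # Exact match always passes; on local hosts a path-only match is enough.
  def canonical_matches_page?(canonical, page_url)
    canonical == page_url || (local_url?(page_url) && path_of(canonical) == path_of(page_url))
  end
end

p CanonicalSketch.canonical_matches_page?("https://example.com/about", "http://localhost:3000/about")
p CanonicalSketch.canonical_matches_page?("https://example.com/about", "https://staging.example.com/about")
```

The first call passes because the page is local and both paths are `/about`; the second fails because a non-local host still requires an exact URL match.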