RubyGems - crawlscope - Versions diffs - 0.3.0 → 0.4.0 - Mend

crawlscope 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

checksums.yaml +4 -4
data/CHANGELOG.md +64 -0
data/README.md +44 -7
data/lib/crawlscope/document_text.rb +40 -0
data/lib/crawlscope/rule_registry.rb +3 -1
data/lib/crawlscope/rules/content_quality.rb +99 -0
data/lib/crawlscope/rules/indexability.rb +66 -0
data/lib/crawlscope/rules/uniqueness.rb +76 -4
data/lib/crawlscope/version.rb +1 -1
data/lib/tasks/crawlscope_tasks.rake +10 -0
data/test/crawlscope/configuration_test.rb +8 -1
data/test/crawlscope/content_quality_rule_test.rb +68 -0
data/test/crawlscope/crawl_test.rb +11 -1
data/test/crawlscope/indexability_rule_test.rb +96 -0
data/test/crawlscope/uniqueness_rule_test.rb +43 -2
data/test/release_task_test.rb +86 -0
metadata +8 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b49aaaa6fdb5f7d5bd4dc63713d8c0090411e7063363645a900d8f59d803aaaa
-  data.tar.gz: 5dfcc35d60745c25db6faf3acaa4344e29e438c758740613d6216e2f47aeac6e
+  metadata.gz: 79e8c8f3993c545bf7647c28b8540d3757c7d9c91eeaf885cde6d55c4935ebb5
+  data.tar.gz: d9b6a987e04546c2d3ee7bb3cc6e1d5510e78963df035cb24d7c8783064afa45
 SHA512:
-  metadata.gz: 9f66627274ce2ea969b5bb9b53a339215718c37baf47393c75bcf3a528c5c73658c6a71903fdbbf9e53796aaf3680be5f99ab4151b834efbf9450e05abbab83b
-  data.tar.gz: 3cf2e2c7f251a6af7b931f00da63436eaa7e09f078d73de112852a10665cf16eefb561c7d61d6bc8b0c3c014ca0db2df217d31c00b9f0ed321565ed554574261
+  metadata.gz: eb49361b9f26992682db7622796c4b262a12fca37254aca5e1f1c49c85702b7e4fc347a880af0665f10238f5340cb61bc44433060ba7b3fbde0bdd379c85c763
+  data.tar.gz: 5fa53f930ef529279e063bd11f9becd112c8abb266078027486f22ad37e968bad744c5a35c9432ccb170ceb51e45d858e23a47c649c6ede1d4dd89fb331fd9f3

data/CHANGELOG.md CHANGED

@@ -5,6 +5,26 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.4.0] - 2026-05-21
+### Added
+- add indexability and content quality checks
+### Fixed
+- preserve release changelog history
+- scope content ratio to main content
+- harden indexability and uniqueness rules
 ## [0.3.0] - 2026-04-28
@@ -28,3 +48,47 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
+## [0.2.0] - 2026-04-24
+### Changed
+- simplify crawl and structured data boundaries
+- harden validation boundaries
+### Fixed
+- handle child sitemaps
+- use URL for sitemap validation
+## [0.1.0] - 2026-04-23
+### Added
+- add crawlkit release-ready audit gem
+- add standalone validation commands
+- move default schema rules into crawlkit
+### Changed
+- strengthen public API coverage
+- load shared test dependencies
+- rename crawlkit to crawlscope

data/README.md CHANGED

@@ -23,9 +23,11 @@ It works in three modes:
 The default rule set includes:
+- indexability blockers
 - metadata validation
 - structured-data validation
 - uniqueness checks
+- content-quality checks
 - internal-link checks
 ## Installation
@@ -146,9 +148,11 @@ Available tasks:
 ```bash
 bin/rails crawlscope:validate
+bin/rails crawlscope:validate:indexability
 bin/rails crawlscope:validate:metadata
 bin/rails crawlscope:validate:structured_data
 bin/rails crawlscope:validate:uniqueness
+bin/rails crawlscope:validate:content_quality
 bin/rails crawlscope:validate:links
 bin/rails crawlscope:validate:ldjson
 ```
@@ -161,7 +165,7 @@ bundle exec rake crawlscope:validate:metadata URL=https://example.com
 bundle exec rake crawlscope:validate:ldjson URL=https://example.com/article
 ```
-`crawlscope:validate` runs all default sitemap rules: metadata, structured data, uniqueness, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
+`crawlscope:validate` runs all default sitemap rules: indexability, metadata, structured data, uniqueness, content quality, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
 `crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap. Without `URL`, it checks the configured base URL, falling back to `http://localhost:3000`.
@@ -186,11 +190,20 @@ Optional flags:
 Built-in rules:
+- `indexability`
 - `metadata`
 - `structured_data`
 - `uniqueness`
+- `content_quality`
 - `links`
+### Indexability
+Checks:
+- page-level meta robots `noindex`
+- `X-Robots-Tag: noindex`
 ### Metadata
 Checks:
@@ -220,6 +233,19 @@ Checks:
 - duplicate titles
 - duplicate meta descriptions
 - duplicate content fingerprints
+- near-duplicate visible content for up to 250 HTML pages
+For larger crawls, exact duplicate checks still run and Crawlscope reports
+`near_duplicate_scan_skipped`. Configure `Rules::Uniqueness` with
+`max_near_duplicate_pages:` in a custom rule registry to change the limit.
+### Content Quality
+Checks:
+- thin visible text
+- low visible-text-to-HTML ratio
+- low unique-token ratio
 ### Links
@@ -268,7 +294,12 @@ bundle exec rake
 ### Git hooks
-We use [lefthook](https://lefthook.dev/) with the Ruby [commitlint](https://github.com/arandilopez/commitlint) gem to enforce Conventional Commits on every commit. We also use [Standard Ruby](https://standardrb.com/) to keep code style consistent. CI validates commit messages, Standard Ruby, tests, and git-cliff changelog generation on pull requests and pushes to main/master.
+We use [lefthook](https://lefthook.dev/) with the Ruby
+[commitlint](https://github.com/arandilopez/commitlint) gem to enforce
+Conventional Commits on every commit. We also use
+[Standard Ruby](https://standardrb.com/) to keep code style consistent. CI
+validates commit messages, Standard Ruby, tests, and git-cliff changelog
+generation on pull requests and pushes to main/master.
 Run the hook installer once per clone:
@@ -284,11 +315,16 @@ rake install
 ## Release
-Releases are tag-driven and published by GitHub Actions to RubyGems. Local release commands never publish directly.
+Releases are tag-driven and published by GitHub Actions to RubyGems.
+Local release commands never publish directly.
-Install [git-cliff](https://git-cliff.org/) locally before preparing a release. The release task regenerates `CHANGELOG.md` from Conventional Commits.
+Install [git-cliff](https://git-cliff.org/) locally before preparing a
+release. The release task prepends the next `CHANGELOG.md` section from
+Conventional Commits.
-Before preparing a release, make sure you are on `main` or `master` with a clean worktree.
+Before preparing a release, make sure you are on `main` or `master` with a
+clean worktree. If the release contains a breaking public-contract change,
+update `UPGRADE.md` with the host-app migration steps first.
 Then run one of:
@@ -301,12 +337,13 @@ bundle exec rake 'release:prepare[0.1.0]'
 The task will:
-1. Regenerate `CHANGELOG.md` with `git-cliff`.
+1. Prepend the next `CHANGELOG.md` section with `git-cliff`.
 1. Update `lib/crawlscope/version.rb`.
 1. Commit the release changes.
 1. Create and push the `vX.Y.Z` tag.
-The `Release` workflow then runs tests, publishes the gem to RubyGems, and creates the GitHub release from the changelog entry.
+The `Release` workflow then runs tests, publishes the gem to RubyGems,
+and creates the GitHub release from the changelog entry.
 ## Contributing

data/lib/crawlscope/document_text.rb ADDED

@@ -0,0 +1,40 @@
+# frozen_string_literal: true
+module Crawlscope
+  module DocumentText
+    REMOVED_SELECTORS = "script, style, noscript, template, svg"
+    TOKEN_PATTERN = /[[:alnum:]]+/
+    module_function
+    def body_text(doc)
+      text_for(doc, selector: nil)
+    end
+    def html_for(doc, selector: "main")
+      root_for(doc, selector: selector)&.to_html.to_s
+    end
+    def text_for(doc, selector: "main")
+      normalize(root_for(doc, selector: selector)&.text)
+    end
+    def tokens(text)
+      normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
+    end
+    def normalize(text)
+      text.to_s.gsub(/\s+/, " ").strip
+    end
+    def root_for(doc, selector:)
+      return unless doc
+      copy = doc.dup
+      copy.css(REMOVED_SELECTORS).remove
+      root = selector.to_s.empty? ? nil : copy.at_css(selector)
+      root || copy.at_css("body") || copy
+    end
+  end
+end
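
The `normalize` and `tokens` helpers above can be tried without Nokogiri. The following is a minimal standalone sketch that copies the two methods out of the module; the top-level method definitions and the `sample` string are assumptions made for illustration, not part of the gem.

```ruby
# frozen_string_literal: true

# Standalone copy of the two string helpers from DocumentText (illustration only).
TOKEN_PATTERN = /[[:alnum:]]+/

# Collapse every run of whitespace into one space and trim the ends.
def normalize(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Lowercase, split into alphanumeric runs, and drop one-character tokens.
def tokens(text)
  normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
end

sample = "  A Quick\n\tbrown fox, v2 -- it's  HERE "

p normalize(sample) # => "A Quick brown fox, v2 -- it's HERE"
p tokens(sample)    # => ["quick", "brown", "fox", "v2", "it", "here"]
```

Note that the apostrophe splits `it's` into `it` and `s`, and the single-character `s` is then rejected along with `a`.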

data/lib/crawlscope/rule_registry.rb CHANGED

@@ -12,12 +12,14 @@ module Crawlscope
     def self.default(site_name: nil)
       new(
         rules: [
+          Rules::Indexability.new,
           Rules::Metadata.new(site_name: site_name),
           Rules::StructuredData.new,
           Rules::Uniqueness.new,
+          Rules::ContentQuality.new,
           Rules::Links.new
         ],
-        default_codes: %i[metadata structured_data uniqueness links]
+        default_codes: %i[indexability metadata structured_data uniqueness content_quality links]
       )
     end

data/lib/crawlscope/rules/content_quality.rb ADDED

@@ -0,0 +1,99 @@
+# frozen_string_literal: true
+module Crawlscope
+  module Rules
+    class ContentQuality
+      MIN_VISIBLE_TEXT_RATIO = 0.08
+      MIN_VISIBLE_WORDS = 250
+      MIN_UNIQUE_TOKEN_RATIO = 0.25
+      attr_reader :code
+      def initialize(
+        min_visible_text_ratio: MIN_VISIBLE_TEXT_RATIO,
+        min_visible_words: MIN_VISIBLE_WORDS,
+        min_unique_token_ratio: MIN_UNIQUE_TOKEN_RATIO
+      )
+        @code = :content_quality
+        @min_visible_text_ratio = min_visible_text_ratio
+        @min_visible_words = min_visible_words
+        @min_unique_token_ratio = min_unique_token_ratio
+      end
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          next unless page.html?
+          validate_visible_words(page, issues)
+          validate_visible_text_ratio(page, issues)
+          validate_unique_token_ratio(page, issues)
+        end
+      end
+      private
+      def validate_unique_token_ratio(page, issues)
+        tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+        return if tokens.size < @min_visible_words
+        ratio = tokens.uniq.size.to_f / tokens.size
+        return if ratio >= @min_unique_token_ratio
+        issues.add(
+          code: :low_unique_token_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "visible text has low token variety (#{format_ratio(ratio)})",
+          details: {
+            ratio: ratio.round(3),
+            threshold: @min_unique_token_ratio,
+            token_count: tokens.size,
+            unique_token_count: tokens.uniq.size
+          }
+        )
+      end
+      def validate_visible_text_ratio(page, issues)
+        html_bytes = DocumentText.html_for(page.doc).bytesize
+        return if html_bytes.zero?
+        visible_text = DocumentText.text_for(page.doc)
+        ratio = visible_text.bytesize.to_f / html_bytes
+        return if ratio >= @min_visible_text_ratio
+        issues.add(
+          code: :low_visible_text_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "low visible text to HTML ratio (#{format_ratio(ratio)})",
+          details: {
+            html_bytes: html_bytes,
+            ratio: ratio.round(3),
+            threshold: @min_visible_text_ratio,
+            visible_text_bytes: visible_text.bytesize
+          }
+        )
+      end
+      def validate_visible_words(page, issues)
+        word_count = DocumentText.tokens(DocumentText.text_for(page.doc)).size
+        return if word_count >= @min_visible_words
+        issues.add(
+          code: :thin_visible_text,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "thin visible text (#{word_count} words)",
+          details: {word_count: word_count, minimum: @min_visible_words}
+        )
+      end
+      def format_ratio(value)
+        format("%.2f", value)
+      end
+    end
+  end
+end
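
The two ratios computed by this rule can be illustrated in isolation. This is a rough sketch under stated assumptions: the helper names `unique_token_ratio` and `visible_text_ratio` are hypothetical, and the inputs are plain strings and arrays rather than parsed pages. The thresholds are the defaults from the diff above.

```ruby
# frozen_string_literal: true

MIN_UNIQUE_TOKEN_RATIO = 0.25 # default from the rule
MIN_VISIBLE_TEXT_RATIO = 0.08 # default from the rule

# Hypothetical helper: share of distinct tokens among all tokens.
def unique_token_ratio(tokens)
  tokens.uniq.size.to_f / tokens.size
end

# Hypothetical helper: visible text bytes divided by HTML bytes.
def visible_text_ratio(visible_text, html)
  visible_text.bytesize.to_f / html.bytesize
end

repetitive = ("hotel location service " * 100).split # 300 tokens, 3 distinct
varied = Array.new(260) { |index| "word#{index}" }   # 260 tokens, all distinct

puts unique_token_ratio(repetitive) < MIN_UNIQUE_TOKEN_RATIO # true (3 / 300 = 0.01)
puts unique_token_ratio(varied) >= MIN_UNIQUE_TOKEN_RATIO    # true (ratio is 1.0)

text = "Short page"
html = "<main>#{text} <div>#{"<span></span>" * 500}</div></main>"
puts visible_text_ratio(text, html) < MIN_VISIBLE_TEXT_RATIO # true, markup dominates
```

The rule itself skips the unique-token check when a page has fewer tokens than `min_visible_words`, and skips the text-ratio check when the HTML is empty, so neither division sees a zero denominator there; the sketch omits those guards.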

data/lib/crawlscope/rules/indexability.rb ADDED

@@ -0,0 +1,66 @@
+# frozen_string_literal: true
+module Crawlscope
+  module Rules
+    class Indexability
+      ROBOTS_META_SELECTOR = 'meta[name="robots"], meta[name="googlebot"]'
+      X_ROBOTS_TAG_HEADER = "x-robots-tag"
+      attr_reader :code
+      def initialize
+        @code = :indexability
+      end
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          validate_meta_robots(page, issues) if page.html?
+          validate_x_robots_tag(page, issues)
+        end
+      end
+      private
+      def header_value(page, name)
+        page.headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
+      end
+      def noindex?(value)
+        value
+          .split(",")
+          .map { |directive| directive.split(":", 2).last.to_s.strip }
+          .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
+      end
+      def validate_meta_robots(page, issues)
+        page.doc.css(ROBOTS_META_SELECTOR).each do |tag|
+          content = tag["content"].to_s
+          next unless noindex?(content)
+          issues.add(
+            code: :noindex_meta,
+            severity: :error,
+            category: :indexability,
+            url: page.url,
+            message: "robots meta tag prevents indexing",
+            details: {content: content, name: tag["name"].to_s}
+          )
+        end
+      end
+      def validate_x_robots_tag(page, issues)
+        content = header_value(page, X_ROBOTS_TAG_HEADER)
+        return unless noindex?(content)
+        issues.add(
+          code: :noindex_header,
+          severity: :error,
+          category: :indexability,
+          url: page.url,
+          message: "X-Robots-Tag header prevents indexing",
+          details: {content: content}
+        )
+      end
+    end
+  end
+end
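
To see how the directive parsing behaves on a few header values, here is a small standalone sketch. It copies `noindex?` and `header_value` from the rule as top-level methods and passes a plain `Hash` instead of a `Crawlscope::Page`; that substitution is an assumption made so the example runs without the gem.

```ruby
# frozen_string_literal: true

# Case-insensitive header lookup; returns "" when the header is absent.
def header_value(headers, name)
  headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
end

# True when any comma-separated directive is "noindex" or "none".
# A leading "useragent:" scope is stripped before the comparison.
def noindex?(value)
  value
    .split(",")
    .map { |directive| directive.split(":", 2).last.to_s.strip }
    .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
end

headers = {"content-type" => "text/html", "X-Robots-Tag" => "googlebot: noindex, nofollow"}

p noindex?(header_value(headers, "x-robots-tag")) # => true
p noindex?("NONE")                                # => true
p noindex?("index, follow")                       # => false
p noindex?(header_value({}, "x-robots-tag"))      # => false
```

Because only the text after the first colon in each comma-separated part is compared, a scoped value such as `googlebot: noindex` is reported the same way as an unscoped `noindex`.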

data/lib/crawlscope/rules/uniqueness.rb CHANGED

@@ -5,10 +5,24 @@ require "digest"
 module Crawlscope
   module Rules
     class Uniqueness
+      MINIMUM_SHINGLES = 10
+      MAX_NEAR_DUPLICATE_PAGES = 250
+      NEAR_DUPLICATE_THRESHOLD = 0.9
+      SHINGLE_SIZE = 5
       attr_reader :code
-      def initialize
+      def initialize(
+        near_duplicate_threshold: NEAR_DUPLICATE_THRESHOLD,
+        max_near_duplicate_pages: MAX_NEAR_DUPLICATE_PAGES,
+        minimum_shingles: MINIMUM_SHINGLES,
+        shingle_size: SHINGLE_SIZE
+      )
         @code = :uniqueness
+        @max_near_duplicate_pages = max_near_duplicate_pages
+        @minimum_shingles = minimum_shingles
+        @near_duplicate_threshold = near_duplicate_threshold
+        @shingle_size = shingle_size
       end
       def call(urls:, pages:, issues:, context:)
@@ -19,14 +33,13 @@ module Crawlscope
         end
         validate_duplicates(page_summaries, issues)
+        validate_near_duplicates(page_summaries, issues)
       end
       private
       def content_fingerprint_digest(doc)
-        text = doc.at_css("main")&.text.to_s
-        text = doc.at_css("body")&.text.to_s if text.empty?
-        normalized = text.gsub(/\s+/, " ").strip
+        normalized = DocumentText.text_for(doc)
         return if normalized.length < 200
         Digest::SHA256.hexdigest(normalized)
@@ -41,9 +54,12 @@ module Crawlscope
       end
       def summary_for(page)
+        tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
         {
           content_fingerprint_digest: content_fingerprint_digest(page.doc),
           description: page.doc.at_css('meta[name="description"]')&.[]("content").to_s.strip,
+          shingles: shingles_for(tokens),
           title: page.doc.at_css("title")&.text.to_s.strip,
           url: page.url
         }
@@ -83,6 +99,62 @@ module Crawlscope
           )
         end
       end
+      def shingles_for(tokens)
+        return [] if tokens.size < @shingle_size
+        tokens.each_cons(@shingle_size).map { |items| items.join(" ") }.uniq
+      end
+      def validate_near_duplicates(page_summaries, issues)
+        if near_duplicate_scan_limit_exceeded?(page_summaries)
+          issues.add(
+            code: :near_duplicate_scan_skipped,
+            severity: :warning,
+            category: :uniqueness,
+            url: nil,
+            message: "near duplicate scan skipped for #{page_summaries.size} pages",
+            details: {max_pages: @max_near_duplicate_pages, page_count: page_summaries.size}
+          )
+          return
+        end
+        page_summaries.combination(2) do |left, right|
+          next if same_content_fingerprint?(left, right)
+          next if left[:shingles].size < @minimum_shingles || right[:shingles].size < @minimum_shingles
+          similarity = shingle_similarity(left[:shingles], right[:shingles])
+          next if similarity < @near_duplicate_threshold
+          urls = [left[:url], right[:url]]
+          issues.add(
+            code: :near_duplicate_content,
+            severity: :warning,
+            category: :uniqueness,
+            url: nil,
+            message: "near duplicate page content (#{format("%.2f", similarity)}) => #{urls.join(", ")}",
+            details: {similarity: similarity.round(3), threshold: @near_duplicate_threshold, urls: urls}
+          )
+        end
+      end
+      def near_duplicate_scan_limit_exceeded?(page_summaries)
+        !@max_near_duplicate_pages.nil? && page_summaries.size > @max_near_duplicate_pages
+      end
+      def same_content_fingerprint?(left, right)
+        !left[:content_fingerprint_digest].nil? &&
+          left[:content_fingerprint_digest] == right[:content_fingerprint_digest]
+      end
+      def shingle_similarity(left, right)
+        intersection_size = (left & right).size
+        smaller_set_size = [left.size, right.size].min
+        return 0.0 if smaller_set_size.zero?
+        intersection_size.to_f / smaller_set_size
+      end
     end
   end
 end
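
The near-duplicate score above divides the shared shingle count by the size of the smaller set, which is the overlap coefficient rather than Jaccard similarity. The sketch below reproduces `shingles_for` and `shingle_similarity` as top-level methods, with `size:` as an assumed keyword argument standing in for `@shingle_size`, and runs them on short made-up token lists.

```ruby
# frozen_string_literal: true

# Distinct runs of `size` consecutive tokens, joined with spaces.
def shingles_for(tokens, size: 5)
  return [] if tokens.size < size

  tokens.each_cons(size).map { |items| items.join(" ") }.uniq
end

# Overlap coefficient: shared shingles divided by the smaller set's size.
def shingle_similarity(left, right)
  smaller_set_size = [left.size, right.size].min
  return 0.0 if smaller_set_size.zero?

  (left & right).size.to_f / smaller_set_size
end

a = shingles_for(%w[one two three four five six seven])
b = shingles_for(%w[one two three four five six eight])
subset = shingles_for(%w[one two three four five six])

p a.size                        # => 3
p shingle_similarity(a, b)      # 2 shared shingles out of 3
p shingle_similarity(a, subset) # => 1.0, the shorter text is fully contained

# Pairwise comparison count at the default 250-page limit.
p 250 * 249 / 2 # => 31125
```

One consequence of this choice is that a short page whose text is entirely contained in a longer page scores 1.0. The rule's `minimum_shingles` guard skips pages with very few shingles, and since `combination(2)` grows quadratically with page count, the `max_near_duplicate_pages` cap bounds the work.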

data/lib/crawlscope/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Crawlscope
-  VERSION = "0.3.0"
+  VERSION = "0.4.0"
 end

data/lib/tasks/crawlscope_tasks.rake CHANGED

@@ -10,6 +10,11 @@ namespace :crawlscope do
       Crawlscope::RakeTasks.ldjson
     end
+    desc "Validate URLs with the indexability rule. ENV: URL, SITEMAP, JS=1"
+    task indexability: :environment do
+      Crawlscope::RakeTasks.validate_rule("indexability")
+    end
     desc "Validate URLs with the metadata rule. ENV: URL, SITEMAP, JS=1"
     task metadata: :environment do
       Crawlscope::RakeTasks.validate_rule("metadata")
@@ -25,6 +30,11 @@ namespace :crawlscope do
       Crawlscope::RakeTasks.validate_rule("uniqueness")
     end
+    desc "Validate URLs with the content_quality rule. ENV: URL, SITEMAP, JS=1"
+    task content_quality: :environment do
+      Crawlscope::RakeTasks.validate_rule("content_quality")
+    end
     desc "Validate URLs with the links rule. ENV: URL, SITEMAP, JS=1"
     task links: :environment do
       Crawlscope::RakeTasks.validate_rule("links")

data/test/crawlscope/configuration_test.rb CHANGED

@@ -20,7 +20,14 @@ class CrawlscopeConfigurationTest < Minitest::Test
     assert_equal "https://example.com", audit.instance_variable_get(:@base_url)
     assert_equal "/tmp/sitemap.xml", audit.instance_variable_get(:@sitemap_path)
     assert_equal 4, audit.instance_variable_get(:@concurrency)
-    assert_equal %i[metadata structured_data uniqueness links], audit.instance_variable_get(:@rules).map(&:code)
+    assert_equal %i[
+      indexability
+      metadata
+      structured_data
+      uniqueness
+      content_quality
+      links
+    ], audit.instance_variable_get(:@rules).map(&:code)
   end
   def test_audit_raises_without_base_url

data/test/crawlscope/content_quality_rule_test.rb ADDED

@@ -0,0 +1,68 @@
+# frozen_string_literal: true
+require "test_helper"
+class CrawlscopeContentQualityRuleTest < Minitest::Test
+  def test_reports_thin_visible_text_and_low_html_text_ratio
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(main: "Short page <div>#{"<span></span>" * 500}</div>")
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+    codes = issues.to_a.map(&:code)
+    assert_includes codes, :thin_visible_text
+    assert_includes codes, :low_visible_text_ratio
+  end
+  def test_visible_text_ratio_ignores_markup_outside_main_content
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      main: Array.new(260) { |index| "word#{index}" }.join(" "),
+      head_markup: "<style>#{"body{}" * 10_000}</style>",
+      extra_markup: "<nav>#{"<a href=\"/\">Navigation</a>" * 500}</nav>"
+    )
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+    refute_includes issues.to_a.map(&:code), :low_visible_text_ratio
+  end
+  def test_reports_low_unique_token_ratio_for_repetitive_content
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(main: ("hotel location service " * 100).strip)
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.find { |item| item.code == :low_unique_token_ratio }
+    assert issue
+    assert_operator issue.details[:ratio], :<, issue.details[:threshold]
+  end
+  private
+  def page_with(main:, extra_markup: "", head_markup: "")
+    body = <<~HTML
+      <html>
+        <head>
+          <title>Content quality</title>
+          #{head_markup}
+        </head>
+        <body>
+          #{extra_markup}
+          <main>#{main}</main>
+        </body>
+      </html>
+    HTML
+    Crawlscope::Page.new(
+      url: "https://example.com/page",
+      normalized_url: "https://example.com/page",
+      final_url: "https://example.com/page",
+      normalized_final_url: "https://example.com/page",
+      status: 200,
+      headers: {"content-type" => "text/html"},
+      body: body,
+      doc: Nokogiri::HTML(body)
+    )
+  end
+end

data/test/crawlscope/crawl_test.rb CHANGED

@@ -45,6 +45,7 @@ class CrawlscopeCrawlTest < Minitest::Test
             <body>
               <main>
                 <h1>Pricing</h1>
+                <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
               </main>
             </body>
           </html>
@@ -100,7 +101,15 @@ class CrawlscopeCrawlTest < Minitest::Test
     ).call
     refute result.ok?
-    assert_equal %i[incomplete_open_graph_tags meta_description_too_long missing_canonical missing_h1 missing_structured_data title_repeats_site_name].sort, result.issues.to_a.map(&:code).uniq.sort
+    assert_equal %i[
+      incomplete_open_graph_tags
+      meta_description_too_long
+      missing_canonical
+      missing_h1
+      missing_structured_data
+      thin_visible_text
+      title_repeats_site_name
+    ].sort, result.issues.to_a.map(&:code).uniq.sort
   end
   def test_uses_browser_when_renderer_is_browser
@@ -147,6 +156,7 @@ class CrawlscopeCrawlTest < Minitest::Test
             <body>
               <main>
                 <h1>Pricing</h1>
+                <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
               </main>
             </body>
           </html>

data/test/crawlscope/indexability_rule_test.rb ADDED

@@ -0,0 +1,96 @@
+# frozen_string_literal: true
+require "test_helper"
+class CrawlscopeIndexabilityRuleTest < Minitest::Test
+  def test_reports_meta_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      body: <<~HTML
+        <html>
+          <head><meta name="robots" content="noindex, follow"></head>
+          <body><main>Visible content</main></body>
+        </html>
+      HTML
+    )
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_meta, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex, follow", issue.details[:content]
+  end
+  def test_reports_x_robots_tag_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "noindex"})
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex", issue.details[:content]
+  end
+  def test_reports_x_robots_tag_noindex_for_non_html_response
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      body: "%PDF-1.7",
+      doc: nil,
+      headers: {"content-type" => "application/pdf", "X-Robots-Tag" => "noindex"}
+    )
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex", issue.details[:content]
+  end
+  def test_reports_scoped_x_robots_tag_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "googlebot: noindex, nofollow"})
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal "googlebot: noindex, nofollow", issue.details[:content]
+  end
+  def test_reports_x_robots_tag_none
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "none"})
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal "none", issue.details[:content]
+  end
+  private
+  def page_with(body: nil, doc: :parse, headers: {"content-type" => "text/html"})
+    body ||= <<~HTML
+      <html>
+        <head><title>Indexable</title></head>
+        <body><main>Visible content</main></body>
+      </html>
+    HTML
+    Crawlscope::Page.new(
+      url: "https://example.com/page",
+      normalized_url: "https://example.com/page",
+      final_url: "https://example.com/page",
+      normalized_final_url: "https://example.com/page",
+      status: 200,
+      headers: headers,
+      body: body,
+      doc: (doc == :parse) ? Nokogiri::HTML(body) : doc
+    )
+  end
+end

data/test/crawlscope/uniqueness_rule_test.rb CHANGED

@@ -16,10 +16,51 @@ class CrawlscopeUniquenessRuleTest < Minitest::Test
     assert_equal %i[duplicate_content_fingerprint duplicate_meta_description duplicate_title].sort, issues.to_a.map(&:code).sort
   end
+  def test_reports_near_duplicate_content
+    issues = Crawlscope::IssueCollection.new
+    rule = Crawlscope::Rules::Uniqueness.new
+    pages = [
+      page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
+      page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
+    ]
+    rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
+    issue = issues.to_a.find { |item| item.code == :near_duplicate_content }
+    assert issue
+    assert_operator issue.details[:similarity], :>=, issue.details[:threshold]
+  end
+  def test_skips_near_duplicate_scan_when_page_count_exceeds_limit
+    issues = Crawlscope::IssueCollection.new
+    rule = Crawlscope::Rules::Uniqueness.new(max_near_duplicate_pages: 1)
+    pages = [
+      page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
+      page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
+    ]
+    rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
+    skip_issue = issues.to_a.find { |item| item.code == :near_duplicate_scan_skipped }
+    refute issues.to_a.any? { |item| item.code == :near_duplicate_content }
+    assert_equal :warning, skip_issue.severity
+    assert_equal({max_pages: 1, page_count: 2}, skip_issue.details)
+  end
   private
-  def page(url:)
-    repeated_text = ("Useful content " * 30).strip
+  def near_duplicate_content(adjective)
+    <<~TEXT.gsub(/\s+/, " ").strip
+      This page summarizes practical hotel review patterns for operators who need #{adjective}
+      service insights across locations. It compares recurring comments about staff, rooms,
+      cleanliness, check-in, breakfast, parking, and amenities so teams can prioritize fixes.
+      The analysis highlights repeat themes, explains why guests mention them, and keeps the
+      wording focused on decisions that improve daily operations.
+    TEXT
+  end
+  def page(url:, content: nil)
+    repeated_text = content || ("Useful content " * 30).strip
     body = <<~HTML
       <html>
         <head>

data/test/release_task_test.rb ADDED

@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+require "test_helper"
+require "rake"
+unless respond_to?(:release_version, true)
+  load File.expand_path("../Rakefile", __dir__)
+end
+class ReleaseTaskTest < Minitest::Test
+  def test_release_version_increments_patch_from_current_version
+    major, minor, patch = Crawlscope::VERSION.split(".").map(&:to_i)
+    assert_equal "#{major}.#{minor}.#{patch + 1}", release_version("patch")
+  end
+  def test_release_version_accepts_explicit_semantic_version
+    assert_equal "0.3.0", release_version("0.3.0")
+  end
+  def test_validate_release_version_rejects_current_version
+    error = assert_raises(ArgumentError) do
+      validate_release_version!("0.2.7", "0.2.7")
+    end
+    assert_equal(
+      "Release version 0.2.7 must be newer than current version 0.2.7.",
+      error.message
+    )
+  end
+  def test_validate_release_version_rejects_existing_local_tag
+    @local_release_tag_exists = true
+    @remote_release_tag_exists = false
+    error = assert_raises(ArgumentError) do
+      validate_release_version!("0.2.8", "0.2.7")
+    end
+    assert_equal "Release tag v0.2.8 already exists locally.", error.message
+  end
+  def test_validate_release_version_rejects_existing_remote_tag
+    @local_release_tag_exists = false
+    @remote_release_tag_exists = true
+    error = assert_raises(ArgumentError) do
+      validate_release_version!("0.2.8", "0.2.7")
+    end
+    assert_equal "Release tag v0.2.8 already exists on origin.", error.message
+  end
+  def test_remote_release_tag_command_asks_git_to_fail_when_no_tag_matches
+    assert_equal(
+      "git ls-remote --exit-code --tags origin refs/tags/v0.2.8",
+      remote_release_tag_command("v0.2.8")
+    )
+  end
+  def test_changelog_command_prepends_the_next_release
+    assert_equal(
+      [
+        "git-cliff",
+        "-c",
+        "cliff.toml",
+        "--unreleased",
+        "--tag",
+        "v0.2.8",
+        "--prepend",
+        "CHANGELOG.md"
+      ],
+      changelog_command("0.2.8")
+    )
+  end
+  private
+  def local_release_tag_exists?(_tag)
+    @local_release_tag_exists || false
+  end
+  def remote_release_tag_exists?(_tag)
+    @remote_release_tag_exists || false
+  end
+end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: crawlscope
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Paulo Fidalgo
@@ -199,6 +199,7 @@ files:
 - lib/crawlscope/context.rb
 - lib/crawlscope/crawl.rb
 - lib/crawlscope/crawler.rb
+- lib/crawlscope/document_text.rb
 - lib/crawlscope/http.rb
 - lib/crawlscope/issue.rb
 - lib/crawlscope/issue_collection.rb
@@ -208,6 +209,8 @@ files:
 - lib/crawlscope/reporter.rb
 - lib/crawlscope/result.rb
 - lib/crawlscope/rule_registry.rb
+- lib/crawlscope/rules/content_quality.rb
+- lib/crawlscope/rules/indexability.rb
 - lib/crawlscope/rules/links.rb
 - lib/crawlscope/rules/metadata.rb
 - lib/crawlscope/rules/structured_data.rb
@@ -228,9 +231,11 @@ files:
 - test/crawlscope/browser_test.rb
 - test/crawlscope/cli_test.rb
 - test/crawlscope/configuration_test.rb
+- test/crawlscope/content_quality_rule_test.rb
 - test/crawlscope/crawl_test.rb
 - test/crawlscope/crawler_test.rb
 - test/crawlscope/http_test.rb
+- test/crawlscope/indexability_rule_test.rb
 - test/crawlscope/links_rule_test.rb
 - test/crawlscope/loader_test.rb
 - test/crawlscope/metadata_rule_test.rb
@@ -247,6 +252,7 @@ files:
 - test/crawlscope/structured_data_writer_test.rb
 - test/crawlscope/uniqueness_rule_test.rb
 - test/crawlscope/url_test.rb
+- test/release_task_test.rb
 - test/test_helper.rb
 homepage: https://www.ethos-link.com/opensource/crawlscope
 licenses:
@@ -275,7 +281,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 4.0.6
+rubygems_version: 4.0.10
 specification_version: 4
 summary: Audit sitemap URLs for metadata, structured data, uniqueness, and links
 test_files: []