crawlscope 0.3.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: b49aaaa6fdb5f7d5bd4dc63713d8c0090411e7063363645a900d8f59d803aaaa
-   data.tar.gz: 5dfcc35d60745c25db6faf3acaa4344e29e438c758740613d6216e2f47aeac6e
+   metadata.gz: 79e8c8f3993c545bf7647c28b8540d3757c7d9c91eeaf885cde6d55c4935ebb5
+   data.tar.gz: d9b6a987e04546c2d3ee7bb3cc6e1d5510e78963df035cb24d7c8783064afa45
  SHA512:
-   metadata.gz: 9f66627274ce2ea969b5bb9b53a339215718c37baf47393c75bcf3a528c5c73658c6a71903fdbbf9e53796aaf3680be5f99ab4151b834efbf9450e05abbab83b
-   data.tar.gz: 3cf2e2c7f251a6af7b931f00da63436eaa7e09f078d73de112852a10665cf16eefb561c7d61d6bc8b0c3c014ca0db2df217d31c00b9f0ed321565ed554574261
+   metadata.gz: eb49361b9f26992682db7622796c4b262a12fca37254aca5e1f1c49c85702b7e4fc347a880af0665f10238f5340cb61bc44433060ba7b3fbde0bdd379c85c763
+   data.tar.gz: 5fa53f930ef529279e063bd11f9becd112c8abb266078027486f22ad37e968bad744c5a35c9432ccb170ceb51e45d858e23a47c649c6ede1d4dd89fb331fd9f3
data/CHANGELOG.md CHANGED
@@ -5,6 +5,26 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [0.4.0] - 2026-05-21
+
+
+ ### Added
+
+ - add indexability and content quality checks
+
+
+
+
+ ### Fixed
+
+ - preserve release changelog history
+
+ - scope content ratio to main content
+
+ - harden indexability and uniqueness rules
+
+
+
  ## [0.3.0] - 2026-04-28
 
 
@@ -28,3 +48,47 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 
 
+ ## [0.2.0] - 2026-04-24
+
+
+ ### Changed
+
+ - simplify crawl and structured data boundaries
+
+ - harden validation boundaries
+
+
+
+
+ ### Fixed
+
+ - handle child sitemaps
+
+ - use URL for sitemap validation
+
+
+
+ ## [0.1.0] - 2026-04-23
+
+
+ ### Added
+
+ - add crawlkit release-ready audit gem
+
+ - add standalone validation commands
+
+ - move default schema rules into crawlkit
+
+
+
+
+ ### Changed
+
+ - strengthen public API coverage
+
+ - load shared test dependencies
+
+ - rename crawlkit to crawlscope
+
+
+
data/README.md CHANGED
@@ -23,9 +23,11 @@ It works in three modes:
 
  The default rule set includes:
 
+ - indexability blockers
  - metadata validation
  - structured-data validation
  - uniqueness checks
+ - content-quality checks
  - internal-link checks
 
  ## Installation
@@ -146,9 +148,11 @@ Available tasks:
 
  ```bash
  bin/rails crawlscope:validate
+ bin/rails crawlscope:validate:indexability
  bin/rails crawlscope:validate:metadata
  bin/rails crawlscope:validate:structured_data
  bin/rails crawlscope:validate:uniqueness
+ bin/rails crawlscope:validate:content_quality
  bin/rails crawlscope:validate:links
  bin/rails crawlscope:validate:ldjson
  ```
@@ -161,7 +165,7 @@ bundle exec rake crawlscope:validate:metadata URL=https://example.com
  bundle exec rake crawlscope:validate:ldjson URL=https://example.com/article
  ```
 
- `crawlscope:validate` runs all default sitemap rules: metadata, structured data, uniqueness, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
+ `crawlscope:validate` runs all default sitemap rules: indexability, metadata, structured data, uniqueness, content quality, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
 
  `crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap. Without `URL`, it checks the configured base URL, falling back to `http://localhost:3000`.
 
@@ -186,11 +190,20 @@ Optional flags:
 
  Built-in rules:
 
+ - `indexability`
  - `metadata`
  - `structured_data`
  - `uniqueness`
+ - `content_quality`
  - `links`
 
+ ### Indexability
+
+ Checks:
+
+ - page-level meta robots `noindex`
+ - `X-Robots-Tag: noindex`
+
  ### Metadata
 
  Checks:
@@ -220,6 +233,19 @@ Checks:
  - duplicate titles
  - duplicate meta descriptions
  - duplicate content fingerprints
+ - near-duplicate visible content for up to 250 HTML pages
+
+ For larger crawls, exact duplicate checks still run and Crawlscope reports
+ `near_duplicate_scan_skipped`. Configure `Rules::Uniqueness` with
+ `max_near_duplicate_pages:` in a custom rule registry to change the limit.
+
+ ### Content Quality
+
+ Checks:
+
+ - thin visible text
+ - low visible-text-to-HTML ratio
+ - low unique-token ratio
 
  ### Links
 
@@ -268,7 +294,12 @@ bundle exec rake
 
  ### Git hooks
 
- We use [lefthook](https://lefthook.dev/) with the Ruby [commitlint](https://github.com/arandilopez/commitlint) gem to enforce Conventional Commits on every commit. We also use [Standard Ruby](https://standardrb.com/) to keep code style consistent. CI validates commit messages, Standard Ruby, tests, and git-cliff changelog generation on pull requests and pushes to main/master.
+ We use [lefthook](https://lefthook.dev/) with the Ruby
+ [commitlint](https://github.com/arandilopez/commitlint) gem to enforce
+ Conventional Commits on every commit. We also use
+ [Standard Ruby](https://standardrb.com/) to keep code style consistent. CI
+ validates commit messages, Standard Ruby, tests, and git-cliff changelog
+ generation on pull requests and pushes to main/master.
 
  Run the hook installer once per clone:
 
@@ -284,11 +315,16 @@ rake install
 
  ## Release
 
- Releases are tag-driven and published by GitHub Actions to RubyGems. Local release commands never publish directly.
+ Releases are tag-driven and published by GitHub Actions to RubyGems.
+ Local release commands never publish directly.
 
- Install [git-cliff](https://git-cliff.org/) locally before preparing a release. The release task regenerates `CHANGELOG.md` from Conventional Commits.
+ Install [git-cliff](https://git-cliff.org/) locally before preparing a
+ release. The release task prepends the next `CHANGELOG.md` section from
+ Conventional Commits.
 
- Before preparing a release, make sure you are on `main` or `master` with a clean worktree.
+ Before preparing a release, make sure you are on `main` or `master` with a
+ clean worktree. If the release contains a breaking public-contract change,
+ update `UPGRADE.md` with the host-app migration steps first.
 
  Then run one of:
 
@@ -301,12 +337,13 @@ bundle exec rake 'release:prepare[0.1.0]'
 
  The task will:
 
- 1. Regenerate `CHANGELOG.md` with `git-cliff`.
+ 1. Prepend the next `CHANGELOG.md` section with `git-cliff`.
  1. Update `lib/crawlscope/version.rb`.
  1. Commit the release changes.
  1. Create and push the `vX.Y.Z` tag.
 
- The `Release` workflow then runs tests, publishes the gem to RubyGems, and creates the GitHub release from the changelog entry.
+ The `Release` workflow then runs tests, publishes the gem to RubyGems,
+ and creates the GitHub release from the changelog entry.
 
  ## Contributing
 
data/lib/crawlscope/document_text.rb ADDED
@@ -0,0 +1,40 @@
+ # frozen_string_literal: true
+
+ module Crawlscope
+   module DocumentText
+     REMOVED_SELECTORS = "script, style, noscript, template, svg"
+     TOKEN_PATTERN = /[[:alnum:]]+/
+
+     module_function
+
+     def body_text(doc)
+       text_for(doc, selector: nil)
+     end
+
+     def html_for(doc, selector: "main")
+       root_for(doc, selector: selector)&.to_html.to_s
+     end
+
+     def text_for(doc, selector: "main")
+       normalize(root_for(doc, selector: selector)&.text)
+     end
+
+     def tokens(text)
+       normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
+     end
+
+     def normalize(text)
+       text.to_s.gsub(/\s+/, " ").strip
+     end
+
+     def root_for(doc, selector:)
+       return unless doc
+
+       copy = doc.dup
+       copy.css(REMOVED_SELECTORS).remove
+
+       root = selector.to_s.empty? ? nil : copy.at_css(selector)
+       root || copy.at_css("body") || copy
+     end
+   end
+ end
data/lib/crawlscope/rule_registry.rb CHANGED
@@ -12,12 +12,14 @@ module Crawlscope
      def self.default(site_name: nil)
        new(
          rules: [
+           Rules::Indexability.new,
            Rules::Metadata.new(site_name: site_name),
            Rules::StructuredData.new,
            Rules::Uniqueness.new,
+           Rules::ContentQuality.new,
            Rules::Links.new
          ],
-         default_codes: %i[metadata structured_data uniqueness links]
+         default_codes: %i[indexability metadata structured_data uniqueness content_quality links]
        )
      end
 
data/lib/crawlscope/rules/content_quality.rb ADDED
@@ -0,0 +1,99 @@
+ # frozen_string_literal: true
+
+ module Crawlscope
+   module Rules
+     class ContentQuality
+       MIN_VISIBLE_TEXT_RATIO = 0.08
+       MIN_VISIBLE_WORDS = 250
+       MIN_UNIQUE_TOKEN_RATIO = 0.25
+
+       attr_reader :code
+
+       def initialize(
+         min_visible_text_ratio: MIN_VISIBLE_TEXT_RATIO,
+         min_visible_words: MIN_VISIBLE_WORDS,
+         min_unique_token_ratio: MIN_UNIQUE_TOKEN_RATIO
+       )
+         @code = :content_quality
+         @min_visible_text_ratio = min_visible_text_ratio
+         @min_visible_words = min_visible_words
+         @min_unique_token_ratio = min_unique_token_ratio
+       end
+
+       def call(urls:, pages:, issues:, context: nil)
+         pages.each do |page|
+           next unless page.html?
+
+           validate_visible_words(page, issues)
+           validate_visible_text_ratio(page, issues)
+           validate_unique_token_ratio(page, issues)
+         end
+       end
+
+       private
+
+       def validate_unique_token_ratio(page, issues)
+         tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+         return if tokens.size < @min_visible_words
+
+         ratio = tokens.uniq.size.to_f / tokens.size
+         return if ratio >= @min_unique_token_ratio
+
+         issues.add(
+           code: :low_unique_token_ratio,
+           severity: :warning,
+           category: :content_quality,
+           url: page.url,
+           message: "visible text has low token variety (#{format_ratio(ratio)})",
+           details: {
+             ratio: ratio.round(3),
+             threshold: @min_unique_token_ratio,
+             token_count: tokens.size,
+             unique_token_count: tokens.uniq.size
+           }
+         )
+       end
+
+       def validate_visible_text_ratio(page, issues)
+         html_bytes = DocumentText.html_for(page.doc).bytesize
+         return if html_bytes.zero?
+
+         visible_text = DocumentText.text_for(page.doc)
+         ratio = visible_text.bytesize.to_f / html_bytes
+         return if ratio >= @min_visible_text_ratio
+
+         issues.add(
+           code: :low_visible_text_ratio,
+           severity: :warning,
+           category: :content_quality,
+           url: page.url,
+           message: "low visible text to HTML ratio (#{format_ratio(ratio)})",
+           details: {
+             html_bytes: html_bytes,
+             ratio: ratio.round(3),
+             threshold: @min_visible_text_ratio,
+             visible_text_bytes: visible_text.bytesize
+           }
+         )
+       end
+
+       def validate_visible_words(page, issues)
+         word_count = DocumentText.tokens(DocumentText.text_for(page.doc)).size
+         return if word_count >= @min_visible_words
+
+         issues.add(
+           code: :thin_visible_text,
+           severity: :warning,
+           category: :content_quality,
+           url: page.url,
+           message: "thin visible text (#{word_count} words)",
+           details: {word_count: word_count, minimum: @min_visible_words}
+         )
+       end
+
+       def format_ratio(value)
+         format("%.2f", value)
+       end
+     end
+   end
+ end
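
Note on the content-quality metrics: the two ratios in the rule above are simple enough to reproduce outside the gem. The following is a minimal standalone sketch, assuming plain strings instead of parsed documents; the helper names `tokens_for`, `unique_token_ratio`, and `visible_text_ratio` are hypothetical and are not part of Crawlscope. The inputs mirror the fixtures used in the content-quality tests.

```ruby
# Standalone sketch; helper names are hypothetical and do not call the gem.
TOKEN_PATTERN = /[[:alnum:]]+/

# Collapse whitespace, lowercase, and keep alphanumeric tokens of 2+ characters.
def tokens_for(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
end

# Share of distinct tokens among all tokens (0.0 when there are no tokens).
def unique_token_ratio(text)
  tokens = tokens_for(text)
  return 0.0 if tokens.empty?

  tokens.uniq.size.to_f / tokens.size
end

# Bytes of visible text divided by bytes of the HTML that contains it.
def visible_text_ratio(visible_text, html)
  return 0.0 if html.bytesize.zero?

  visible_text.bytesize.to_f / html.bytesize
end

repetitive = ("hotel location service " * 100).strip
puts tokens_for(repetitive).size    # 300 tokens
puts unique_token_ratio(repetitive) # 3 distinct / 300 total = 0.01

html = "<main>Short page <div>#{"<span></span>" * 500}</div></main>"
puts visible_text_ratio("Short page", html) < 0.08 # true
```

Unlike this sketch, the rule in the diff skips the unique-token check when a page has fewer tokens than `min_visible_words`, so very short pages are reported only as `thin_visible_text`.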
data/lib/crawlscope/rules/indexability.rb ADDED
@@ -0,0 +1,66 @@
+ # frozen_string_literal: true
+
+ module Crawlscope
+   module Rules
+     class Indexability
+       ROBOTS_META_SELECTOR = 'meta[name="robots"], meta[name="googlebot"]'
+       X_ROBOTS_TAG_HEADER = "x-robots-tag"
+
+       attr_reader :code
+
+       def initialize
+         @code = :indexability
+       end
+
+       def call(urls:, pages:, issues:, context: nil)
+         pages.each do |page|
+           validate_meta_robots(page, issues) if page.html?
+           validate_x_robots_tag(page, issues)
+         end
+       end
+
+       private
+
+       def header_value(page, name)
+         page.headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
+       end
+
+       def noindex?(value)
+         value
+           .split(",")
+           .map { |directive| directive.split(":", 2).last.to_s.strip }
+           .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
+       end
+
+       def validate_meta_robots(page, issues)
+         page.doc.css(ROBOTS_META_SELECTOR).each do |tag|
+           content = tag["content"].to_s
+           next unless noindex?(content)
+
+           issues.add(
+             code: :noindex_meta,
+             severity: :error,
+             category: :indexability,
+             url: page.url,
+             message: "robots meta tag prevents indexing",
+             details: {content: content, name: tag["name"].to_s}
+           )
+         end
+       end
+
+       def validate_x_robots_tag(page, issues)
+         content = header_value(page, X_ROBOTS_TAG_HEADER)
+         return unless noindex?(content)
+
+         issues.add(
+           code: :noindex_header,
+           severity: :error,
+           category: :indexability,
+           url: page.url,
+           message: "X-Robots-Tag header prevents indexing",
+           details: {content: content}
+         )
+       end
+     end
+   end
+ end
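
Note on directive parsing: the rule above treats a robots value as a comma-separated list, drops an optional `agent:` prefix from each entry, and matches `noindex` or `none` case-insensitively. A minimal standalone sketch of that parsing follows, assuming headers arrive as a plain Hash; `blocks_indexing?` is a hypothetical name, and the sample values come from the indexability tests.

```ruby
# Standalone sketch; method names are hypothetical and do not call the gem.

# Case-insensitive lookup of a response header, returning "" when absent.
def header_value(headers, name)
  headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
end

# True when any comma-separated directive is `noindex` or `none`.
# An optional "agent:" prefix (for example "googlebot: noindex") is dropped first.
def blocks_indexing?(value)
  value.to_s
    .split(",")
    .map { |directive| directive.split(":", 2).last.to_s.strip }
    .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
end

headers = {"X-Robots-Tag" => "googlebot: noindex, nofollow"}
puts blocks_indexing?(header_value(headers, "x-robots-tag")) # true
puts blocks_indexing?("index, follow")                       # false
puts blocks_indexing?("NONE")                                # true
puts blocks_indexing?(header_value({}, "x-robots-tag"))      # false
```

Because the agent prefix is discarded, this approach flags a directive scoped to any crawler, not only to a specific one.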
data/lib/crawlscope/rules/uniqueness.rb CHANGED
@@ -5,10 +5,24 @@ require "digest"
  module Crawlscope
    module Rules
      class Uniqueness
+       MINIMUM_SHINGLES = 10
+       MAX_NEAR_DUPLICATE_PAGES = 250
+       NEAR_DUPLICATE_THRESHOLD = 0.9
+       SHINGLE_SIZE = 5
+
        attr_reader :code
 
-       def initialize
+       def initialize(
+         near_duplicate_threshold: NEAR_DUPLICATE_THRESHOLD,
+         max_near_duplicate_pages: MAX_NEAR_DUPLICATE_PAGES,
+         minimum_shingles: MINIMUM_SHINGLES,
+         shingle_size: SHINGLE_SIZE
+       )
          @code = :uniqueness
+         @max_near_duplicate_pages = max_near_duplicate_pages
+         @minimum_shingles = minimum_shingles
+         @near_duplicate_threshold = near_duplicate_threshold
+         @shingle_size = shingle_size
        end
 
        def call(urls:, pages:, issues:, context:)
@@ -19,14 +33,13 @@ module Crawlscope
          end
 
          validate_duplicates(page_summaries, issues)
+         validate_near_duplicates(page_summaries, issues)
        end
 
        private
 
        def content_fingerprint_digest(doc)
-         text = doc.at_css("main")&.text.to_s
-         text = doc.at_css("body")&.text.to_s if text.empty?
-         normalized = text.gsub(/\s+/, " ").strip
+         normalized = DocumentText.text_for(doc)
          return if normalized.length < 200
 
          Digest::SHA256.hexdigest(normalized)
@@ -41,9 +54,12 @@ module Crawlscope
        end
 
        def summary_for(page)
+         tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+
          {
            content_fingerprint_digest: content_fingerprint_digest(page.doc),
            description: page.doc.at_css('meta[name="description"]')&.[]("content").to_s.strip,
+           shingles: shingles_for(tokens),
            title: page.doc.at_css("title")&.text.to_s.strip,
            url: page.url
          }
@@ -83,6 +99,62 @@ module Crawlscope
            )
          end
        end
+
+       def shingles_for(tokens)
+         return [] if tokens.size < @shingle_size
+
+         tokens.each_cons(@shingle_size).map { |items| items.join(" ") }.uniq
+       end
+
+       def validate_near_duplicates(page_summaries, issues)
+         if near_duplicate_scan_limit_exceeded?(page_summaries)
+           issues.add(
+             code: :near_duplicate_scan_skipped,
+             severity: :warning,
+             category: :uniqueness,
+             url: nil,
+             message: "near duplicate scan skipped for #{page_summaries.size} pages",
+             details: {max_pages: @max_near_duplicate_pages, page_count: page_summaries.size}
+           )
+           return
+         end
+
+         page_summaries.combination(2) do |left, right|
+           next if same_content_fingerprint?(left, right)
+           next if left[:shingles].size < @minimum_shingles || right[:shingles].size < @minimum_shingles
+
+           similarity = shingle_similarity(left[:shingles], right[:shingles])
+           next if similarity < @near_duplicate_threshold
+
+           urls = [left[:url], right[:url]]
+
+           issues.add(
+             code: :near_duplicate_content,
+             severity: :warning,
+             category: :uniqueness,
+             url: nil,
+             message: "near duplicate page content (#{format("%.2f", similarity)}) => #{urls.join(", ")}",
+             details: {similarity: similarity.round(3), threshold: @near_duplicate_threshold, urls: urls}
+           )
+         end
+       end
+
+       def near_duplicate_scan_limit_exceeded?(page_summaries)
+         !@max_near_duplicate_pages.nil? && page_summaries.size > @max_near_duplicate_pages
+       end
+
+       def same_content_fingerprint?(left, right)
+         !left[:content_fingerprint_digest].nil? &&
+           left[:content_fingerprint_digest] == right[:content_fingerprint_digest]
+       end
+
+       def shingle_similarity(left, right)
+         intersection_size = (left & right).size
+         smaller_set_size = [left.size, right.size].min
+         return 0.0 if smaller_set_size.zero?
+
+         intersection_size.to_f / smaller_set_size
+       end
      end
    end
  end
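
Note on the similarity measure: `shingle_similarity` divides the shared shingles by the size of the smaller set, which is the overlap coefficient rather than the Jaccard index. The standalone sketch below illustrates the difference on a tiny input and shows why a page limit is useful for an all-pairs scan. All names are hypothetical and nothing here calls the gem; `jaccard_similarity` is included only for comparison and does not appear in the diff.

```ruby
# Standalone sketch; names are hypothetical and do not call the gem.

# Build the list of distinct word n-grams ("shingles") from a token list.
def shingles_for(tokens, size: 5)
  return [] if tokens.size < size

  tokens.each_cons(size).map { |items| items.join(" ") }.uniq
end

# Overlap coefficient: shared shingles divided by the smaller set's size.
def overlap_similarity(left, right)
  smaller = [left.size, right.size].min
  return 0.0 if smaller.zero?

  (left & right).size.to_f / smaller
end

# Jaccard index, shown only for comparison: shared shingles over the union.
def jaccard_similarity(left, right)
  union = (left | right).size
  return 0.0 if union.zero?

  (left & right).size.to_f / union
end

a = shingles_for(%w[one two three four five six seven])
b = shingles_for(%w[one two three four five six eight])

puts overlap_similarity(a, b).round(3) # 0.667 (2 shared / 3 in the smaller set)
puts jaccard_similarity(a, b).round(3) # 0.5   (2 shared / 4 in the union)

# Pairwise comparisons grow quadratically with page count.
pages = 250
puts pages * (pages - 1) / 2 # 31125
```

With the overlap coefficient, a short page whose text is almost entirely contained in a longer page can still score near 1.0, which the Jaccard index would not report.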
data/lib/crawlscope/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Crawlscope
-   VERSION = "0.3.0"
+   VERSION = "0.4.0"
  end
@@ -10,6 +10,11 @@ namespace :crawlscope do
        Crawlscope::RakeTasks.ldjson
      end
 
+     desc "Validate URLs with the indexability rule. ENV: URL, SITEMAP, JS=1"
+     task indexability: :environment do
+       Crawlscope::RakeTasks.validate_rule("indexability")
+     end
+
      desc "Validate URLs with the metadata rule. ENV: URL, SITEMAP, JS=1"
      task metadata: :environment do
        Crawlscope::RakeTasks.validate_rule("metadata")
@@ -25,6 +30,11 @@ namespace :crawlscope do
        Crawlscope::RakeTasks.validate_rule("uniqueness")
      end
 
+     desc "Validate URLs with the content_quality rule. ENV: URL, SITEMAP, JS=1"
+     task content_quality: :environment do
+       Crawlscope::RakeTasks.validate_rule("content_quality")
+     end
+
      desc "Validate URLs with the links rule. ENV: URL, SITEMAP, JS=1"
      task links: :environment do
        Crawlscope::RakeTasks.validate_rule("links")
data/test/crawlscope/configuration_test.rb CHANGED
@@ -20,7 +20,14 @@ class CrawlscopeConfigurationTest < Minitest::Test
      assert_equal "https://example.com", audit.instance_variable_get(:@base_url)
      assert_equal "/tmp/sitemap.xml", audit.instance_variable_get(:@sitemap_path)
      assert_equal 4, audit.instance_variable_get(:@concurrency)
-     assert_equal %i[metadata structured_data uniqueness links], audit.instance_variable_get(:@rules).map(&:code)
+     assert_equal %i[
+       indexability
+       metadata
+       structured_data
+       uniqueness
+       content_quality
+       links
+     ], audit.instance_variable_get(:@rules).map(&:code)
    end
 
    def test_audit_raises_without_base_url
data/test/crawlscope/content_quality_rule_test.rb ADDED
@@ -0,0 +1,68 @@
+ # frozen_string_literal: true
+
+ require "test_helper"
+
+ class CrawlscopeContentQualityRuleTest < Minitest::Test
+   def test_reports_thin_visible_text_and_low_html_text_ratio
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(main: "Short page <div>#{"<span></span>" * 500}</div>")
+
+     Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     codes = issues.to_a.map(&:code)
+     assert_includes codes, :thin_visible_text
+     assert_includes codes, :low_visible_text_ratio
+   end
+
+   def test_visible_text_ratio_ignores_markup_outside_main_content
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(
+       main: Array.new(260) { |index| "word#{index}" }.join(" "),
+       head_markup: "<style>#{"body{}" * 10_000}</style>",
+       extra_markup: "<nav>#{"<a href=\"/\">Navigation</a>" * 500}</nav>"
+     )
+
+     Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     refute_includes issues.to_a.map(&:code), :low_visible_text_ratio
+   end
+
+   def test_reports_low_unique_token_ratio_for_repetitive_content
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(main: ("hotel location service " * 100).strip)
+
+     Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.find { |item| item.code == :low_unique_token_ratio }
+     assert issue
+     assert_operator issue.details[:ratio], :<, issue.details[:threshold]
+   end
+
+   private
+
+   def page_with(main:, extra_markup: "", head_markup: "")
+     body = <<~HTML
+       <html>
+         <head>
+           <title>Content quality</title>
+           #{head_markup}
+         </head>
+         <body>
+           #{extra_markup}
+           <main>#{main}</main>
+         </body>
+       </html>
+     HTML
+
+     Crawlscope::Page.new(
+       url: "https://example.com/page",
+       normalized_url: "https://example.com/page",
+       final_url: "https://example.com/page",
+       normalized_final_url: "https://example.com/page",
+       status: 200,
+       headers: {"content-type" => "text/html"},
+       body: body,
+       doc: Nokogiri::HTML(body)
+     )
+   end
+ end
data/test/crawlscope/crawl_test.rb CHANGED
@@ -45,6 +45,7 @@ class CrawlscopeCrawlTest < Minitest::Test
            <body>
              <main>
                <h1>Pricing</h1>
+               <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
              </main>
            </body>
          </html>
@@ -100,7 +101,15 @@ class CrawlscopeCrawlTest < Minitest::Test
      ).call
 
      refute result.ok?
-     assert_equal %i[incomplete_open_graph_tags meta_description_too_long missing_canonical missing_h1 missing_structured_data title_repeats_site_name].sort, result.issues.to_a.map(&:code).uniq.sort
+     assert_equal %i[
+       incomplete_open_graph_tags
+       meta_description_too_long
+       missing_canonical
+       missing_h1
+       missing_structured_data
+       thin_visible_text
+       title_repeats_site_name
+     ].sort, result.issues.to_a.map(&:code).uniq.sort
    end
 
    def test_uses_browser_when_renderer_is_browser
@@ -147,6 +156,7 @@ class CrawlscopeCrawlTest < Minitest::Test
            <body>
              <main>
                <h1>Pricing</h1>
+               <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
              </main>
            </body>
          </html>
data/test/crawlscope/indexability_rule_test.rb ADDED
@@ -0,0 +1,96 @@
+ # frozen_string_literal: true
+
+ require "test_helper"
+
+ class CrawlscopeIndexabilityRuleTest < Minitest::Test
+   def test_reports_meta_noindex
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(
+       body: <<~HTML
+         <html>
+           <head><meta name="robots" content="noindex, follow"></head>
+           <body><main>Visible content</main></body>
+         </html>
+       HTML
+     )
+
+     Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.fetch(0)
+     assert_equal :noindex_meta, issue.code
+     assert_equal :error, issue.severity
+     assert_equal "noindex, follow", issue.details[:content]
+   end
+
+   def test_reports_x_robots_tag_noindex
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(headers: {"X-Robots-Tag" => "noindex"})
+
+     Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.fetch(0)
+     assert_equal :noindex_header, issue.code
+     assert_equal :error, issue.severity
+     assert_equal "noindex", issue.details[:content]
+   end
+
+   def test_reports_x_robots_tag_noindex_for_non_html_response
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(
+       body: "%PDF-1.7",
+       doc: nil,
+       headers: {"content-type" => "application/pdf", "X-Robots-Tag" => "noindex"}
+     )
+
+     Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.fetch(0)
+     assert_equal :noindex_header, issue.code
+     assert_equal :error, issue.severity
+     assert_equal "noindex", issue.details[:content]
+   end
+
+   def test_reports_scoped_x_robots_tag_noindex
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(headers: {"X-Robots-Tag" => "googlebot: noindex, nofollow"})
+
+     Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.fetch(0)
+     assert_equal :noindex_header, issue.code
+     assert_equal "googlebot: noindex, nofollow", issue.details[:content]
+   end
+
+   def test_reports_x_robots_tag_none
+     issues = Crawlscope::IssueCollection.new
+     page = page_with(headers: {"X-Robots-Tag" => "none"})
+
+     Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+     issue = issues.to_a.fetch(0)
+     assert_equal :noindex_header, issue.code
+     assert_equal "none", issue.details[:content]
+   end
+
+   private
+
+   def page_with(body: nil, doc: :parse, headers: {"content-type" => "text/html"})
+     body ||= <<~HTML
+       <html>
+         <head><title>Indexable</title></head>
+         <body><main>Visible content</main></body>
+       </html>
+     HTML
+
+     Crawlscope::Page.new(
+       url: "https://example.com/page",
+       normalized_url: "https://example.com/page",
+       final_url: "https://example.com/page",
+       normalized_final_url: "https://example.com/page",
+       status: 200,
+       headers: headers,
+       body: body,
+       doc: (doc == :parse) ? Nokogiri::HTML(body) : doc
+     )
+   end
+ end
data/test/crawlscope/uniqueness_rule_test.rb CHANGED
@@ -16,10 +16,51 @@ class CrawlscopeUniquenessRuleTest < Minitest::Test
      assert_equal %i[duplicate_content_fingerprint duplicate_meta_description duplicate_title].sort, issues.to_a.map(&:code).sort
    end
 
+   def test_reports_near_duplicate_content
+     issues = Crawlscope::IssueCollection.new
+     rule = Crawlscope::Rules::Uniqueness.new
+     pages = [
+       page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
+       page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
+     ]
+
+     rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
+
+     issue = issues.to_a.find { |item| item.code == :near_duplicate_content }
+     assert issue
+     assert_operator issue.details[:similarity], :>=, issue.details[:threshold]
+   end
+
+   def test_skips_near_duplicate_scan_when_page_count_exceeds_limit
+     issues = Crawlscope::IssueCollection.new
+     rule = Crawlscope::Rules::Uniqueness.new(max_near_duplicate_pages: 1)
+     pages = [
+       page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
+       page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
+     ]
+
+     rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
+
+     skip_issue = issues.to_a.find { |item| item.code == :near_duplicate_scan_skipped }
+     refute issues.to_a.any? { |item| item.code == :near_duplicate_content }
+     assert_equal :warning, skip_issue.severity
+     assert_equal({max_pages: 1, page_count: 2}, skip_issue.details)
+   end
+
    private
 
-   def page(url:)
-     repeated_text = ("Useful content " * 30).strip
+   def near_duplicate_content(adjective)
+     <<~TEXT.gsub(/\s+/, " ").strip
+       This page summarizes practical hotel review patterns for operators who need #{adjective}
+       service insights across locations. It compares recurring comments about staff, rooms,
+       cleanliness, check-in, breakfast, parking, and amenities so teams can prioritize fixes.
+       The analysis highlights repeat themes, explains why guests mention them, and keeps the
+       wording focused on decisions that improve daily operations.
+     TEXT
+   end
+
+   def page(url:, content: nil)
+     repeated_text = content || ("Useful content " * 30).strip
      body = <<~HTML
        <html>
          <head>
data/test/release_task_test.rb ADDED
@@ -0,0 +1,86 @@
+ # frozen_string_literal: true
+
+ require "test_helper"
+ require "rake"
+
+ unless respond_to?(:release_version, true)
+   load File.expand_path("../Rakefile", __dir__)
+ end
+
+ class ReleaseTaskTest < Minitest::Test
+   def test_release_version_increments_patch_from_current_version
+     major, minor, patch = Crawlscope::VERSION.split(".").map(&:to_i)
+
+     assert_equal "#{major}.#{minor}.#{patch + 1}", release_version("patch")
+   end
+
+   def test_release_version_accepts_explicit_semantic_version
+     assert_equal "0.3.0", release_version("0.3.0")
+   end
+
+   def test_validate_release_version_rejects_current_version
+     error = assert_raises(ArgumentError) do
+       validate_release_version!("0.2.7", "0.2.7")
+     end
+
+     assert_equal(
+       "Release version 0.2.7 must be newer than current version 0.2.7.",
+       error.message
+     )
+   end
+
+   def test_validate_release_version_rejects_existing_local_tag
+     @local_release_tag_exists = true
+     @remote_release_tag_exists = false
+
+     error = assert_raises(ArgumentError) do
+       validate_release_version!("0.2.8", "0.2.7")
+     end
+
+     assert_equal "Release tag v0.2.8 already exists locally.", error.message
+   end
+
+   def test_validate_release_version_rejects_existing_remote_tag
+     @local_release_tag_exists = false
+     @remote_release_tag_exists = true
+
+     error = assert_raises(ArgumentError) do
+       validate_release_version!("0.2.8", "0.2.7")
+     end
+
+     assert_equal "Release tag v0.2.8 already exists on origin.", error.message
+   end
+
+   def test_remote_release_tag_command_asks_git_to_fail_when_no_tag_matches
+     assert_equal(
+       "git ls-remote --exit-code --tags origin refs/tags/v0.2.8",
+       remote_release_tag_command("v0.2.8")
+     )
+   end
+
+   def test_changelog_command_prepends_the_next_release
+     assert_equal(
+       [
+         "git-cliff",
+         "-c",
+         "cliff.toml",
+         "--unreleased",
+         "--tag",
+         "v0.2.8",
+         "--prepend",
+         "CHANGELOG.md"
+       ],
+       changelog_command("0.2.8")
+     )
+   end
+
+   private
+
+   def local_release_tag_exists?(_tag)
+     @local_release_tag_exists || false
+   end
+
+   def remote_release_tag_exists?(_tag)
+     @remote_release_tag_exists || false
+   end
+ end
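
Note on the release helpers: the Rakefile that defines `release_version` and `validate_release_version!` is not part of this diff, so their implementation is unknown. The sketch below is only one plausible way to get the behaviour these tests describe (a `patch` bump, an explicit `X.Y.Z`, and rejection of a version that is not newer). The names `next_version` and `ensure_newer!` are hypothetical, and the tag-existence checks are left out.

```ruby
# Hypothetical sketch of the behaviour the tests describe; the gem's Rakefile
# helpers are not shown in this diff and may be implemented differently.
require "rubygems"

# Resolve "patch" or an explicit "X.Y.Z" against the current version.
def next_version(current, request)
  major, minor, patch = current.split(".").map(&:to_i)

  case request
  when "patch" then "#{major}.#{minor}.#{patch + 1}"
  when /\A\d+\.\d+\.\d+\z/ then request
  else raise ArgumentError, "Unsupported release version: #{request}"
  end
end

# Reject a release that is not strictly newer than the current version.
def ensure_newer!(version, current)
  return if Gem::Version.new(version) > Gem::Version.new(current)

  raise ArgumentError, "Release version #{version} must be newer than current version #{current}."
end

puts next_version("0.3.0", "patch") # 0.3.1
puts next_version("0.3.0", "0.4.0") # 0.4.0

begin
  ensure_newer!("0.2.7", "0.2.7")
rescue ArgumentError => error
  puts error.message
end
```

Comparing with `Gem::Version` rather than plain strings avoids ordering mistakes such as treating `0.10.0` as older than `0.9.0`.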
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: crawlscope
  version: !ruby/object:Gem::Version
-   version: 0.3.0
+   version: 0.4.0
  platform: ruby
  authors:
  - Paulo Fidalgo
@@ -199,6 +199,7 @@ files:
  - lib/crawlscope/context.rb
  - lib/crawlscope/crawl.rb
  - lib/crawlscope/crawler.rb
+ - lib/crawlscope/document_text.rb
  - lib/crawlscope/http.rb
  - lib/crawlscope/issue.rb
  - lib/crawlscope/issue_collection.rb
@@ -208,6 +209,8 @@ files:
  - lib/crawlscope/reporter.rb
  - lib/crawlscope/result.rb
  - lib/crawlscope/rule_registry.rb
+ - lib/crawlscope/rules/content_quality.rb
+ - lib/crawlscope/rules/indexability.rb
  - lib/crawlscope/rules/links.rb
  - lib/crawlscope/rules/metadata.rb
  - lib/crawlscope/rules/structured_data.rb
@@ -228,9 +231,11 @@ files:
  - test/crawlscope/browser_test.rb
  - test/crawlscope/cli_test.rb
  - test/crawlscope/configuration_test.rb
+ - test/crawlscope/content_quality_rule_test.rb
  - test/crawlscope/crawl_test.rb
  - test/crawlscope/crawler_test.rb
  - test/crawlscope/http_test.rb
+ - test/crawlscope/indexability_rule_test.rb
  - test/crawlscope/links_rule_test.rb
  - test/crawlscope/loader_test.rb
  - test/crawlscope/metadata_rule_test.rb
@@ -247,6 +252,7 @@ files:
  - test/crawlscope/structured_data_writer_test.rb
  - test/crawlscope/uniqueness_rule_test.rb
  - test/crawlscope/url_test.rb
+ - test/release_task_test.rb
  - test/test_helper.rb
  homepage: https://www.ethos-link.com/opensource/crawlscope
  licenses:
@@ -275,7 +281,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 4.0.6
+ rubygems_version: 4.0.10
  specification_version: 4
  summary: Audit sitemap URLs for metadata, structured data, uniqueness, and links
  test_files: []