crawlscope 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +64 -0
- data/README.md +44 -7
- data/lib/crawlscope/document_text.rb +40 -0
- data/lib/crawlscope/rule_registry.rb +3 -1
- data/lib/crawlscope/rules/content_quality.rb +99 -0
- data/lib/crawlscope/rules/indexability.rb +66 -0
- data/lib/crawlscope/rules/uniqueness.rb +76 -4
- data/lib/crawlscope/version.rb +1 -1
- data/lib/tasks/crawlscope_tasks.rake +10 -0
- data/test/crawlscope/configuration_test.rb +8 -1
- data/test/crawlscope/content_quality_rule_test.rb +68 -0
- data/test/crawlscope/crawl_test.rb +11 -1
- data/test/crawlscope/indexability_rule_test.rb +96 -0
- data/test/crawlscope/uniqueness_rule_test.rb +43 -2
- data/test/release_task_test.rb +86 -0
- metadata +8 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 79e8c8f3993c545bf7647c28b8540d3757c7d9c91eeaf885cde6d55c4935ebb5
+  data.tar.gz: d9b6a987e04546c2d3ee7bb3cc6e1d5510e78963df035cb24d7c8783064afa45
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eb49361b9f26992682db7622796c4b262a12fca37254aca5e1f1c49c85702b7e4fc347a880af0665f10238f5340cb61bc44433060ba7b3fbde0bdd379c85c763
+  data.tar.gz: 5fa53f930ef529279e063bd11f9becd112c8abb266078027486f22ad37e968bad744c5a35c9432ccb170ceb51e45d858e23a47c649c6ede1d4dd89fb331fd9f3
data/CHANGELOG.md
CHANGED
@@ -5,6 +5,26 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.4.0] - 2026-05-21
+
+
+### Added
+
+- add indexability and content quality checks
+
+
+
+
+### Fixed
+
+- preserve release changelog history
+
+- scope content ratio to main content
+
+- harden indexability and uniqueness rules
+
+
+
 ## [0.3.0] - 2026-04-28


@@ -28,3 +48,47 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0



+## [0.2.0] - 2026-04-24
+
+
+### Changed
+
+- simplify crawl and structured data boundaries
+
+- harden validation boundaries
+
+
+
+
+### Fixed
+
+- handle child sitemaps
+
+- use URL for sitemap validation
+
+
+
+## [0.1.0] - 2026-04-23
+
+
+### Added
+
+- add crawlkit release-ready audit gem
+
+- add standalone validation commands
+
+- move default schema rules into crawlkit
+
+
+
+
+### Changed
+
+- strengthen public API coverage
+
+- load shared test dependencies
+
+- rename crawlkit to crawlscope
+
+
+
data/README.md
CHANGED
@@ -23,9 +23,11 @@ It works in three modes:

 The default rule set includes:

+- indexability blockers
 - metadata validation
 - structured-data validation
 - uniqueness checks
+- content-quality checks
 - internal-link checks

 ## Installation
@@ -146,9 +148,11 @@ Available tasks:

 ```bash
 bin/rails crawlscope:validate
+bin/rails crawlscope:validate:indexability
 bin/rails crawlscope:validate:metadata
 bin/rails crawlscope:validate:structured_data
 bin/rails crawlscope:validate:uniqueness
+bin/rails crawlscope:validate:content_quality
 bin/rails crawlscope:validate:links
 bin/rails crawlscope:validate:ldjson
 ```
@@ -161,7 +165,7 @@ bundle exec rake crawlscope:validate:metadata URL=https://example.com
 bundle exec rake crawlscope:validate:ldjson URL=https://example.com/article
 ```

-`crawlscope:validate` runs all default sitemap rules: metadata, structured data, uniqueness, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.
+`crawlscope:validate` runs all default sitemap rules: indexability, metadata, structured data, uniqueness, content quality, and links. `URL` is the site base. Without `SITEMAP`, Crawlscope uses `/sitemap.xml`. With `SITEMAP`, Crawlscope uses `URL` as the site base and validates URLs from that sitemap. `SITEMAP` may be a full URL or a local file path.

 `crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap. Without `URL`, it checks the configured base URL, falling back to `http://localhost:3000`.

@@ -186,11 +190,20 @@ Optional flags:

 Built-in rules:

+- `indexability`
 - `metadata`
 - `structured_data`
 - `uniqueness`
+- `content_quality`
 - `links`

+### Indexability
+
+Checks:
+
+- page-level meta robots `noindex`
+- `X-Robots-Tag: noindex`
+
 ### Metadata

 Checks:
@@ -220,6 +233,19 @@ Checks:
 - duplicate titles
 - duplicate meta descriptions
 - duplicate content fingerprints
+- near-duplicate visible content for up to 250 HTML pages
+
+For larger crawls, exact duplicate checks still run and Crawlscope reports
+`near_duplicate_scan_skipped`. Configure `Rules::Uniqueness` with
+`max_near_duplicate_pages:` in a custom rule registry to change the limit.
+
+### Content Quality
+
+Checks:
+
+- thin visible text
+- low visible-text-to-HTML ratio
+- low unique-token ratio

 ### Links

@@ -268,7 +294,12 @@ bundle exec rake

 ### Git hooks

-We use [lefthook](https://lefthook.dev/) with the Ruby
+We use [lefthook](https://lefthook.dev/) with the Ruby
+[commitlint](https://github.com/arandilopez/commitlint) gem to enforce
+Conventional Commits on every commit. We also use
+[Standard Ruby](https://standardrb.com/) to keep code style consistent. CI
+validates commit messages, Standard Ruby, tests, and git-cliff changelog
+generation on pull requests and pushes to main/master.

 Run the hook installer once per clone:

@@ -284,11 +315,16 @@ rake install

 ## Release

-Releases are tag-driven and published by GitHub Actions to RubyGems.
+Releases are tag-driven and published by GitHub Actions to RubyGems.
+Local release commands never publish directly.

-Install [git-cliff](https://git-cliff.org/) locally before preparing a
+Install [git-cliff](https://git-cliff.org/) locally before preparing a
+release. The release task prepends the next `CHANGELOG.md` section from
+Conventional Commits.

-Before preparing a release, make sure you are on `main` or `master` with a
+Before preparing a release, make sure you are on `main` or `master` with a
+clean worktree. If the release contains a breaking public-contract change,
+update `UPGRADE.md` with the host-app migration steps first.

 Then run one of:

@@ -301,12 +337,13 @@ bundle exec rake 'release:prepare[0.1.0]'

 The task will:

-1.
+1. Prepend the next `CHANGELOG.md` section with `git-cliff`.
 1. Update `lib/crawlscope/version.rb`.
 1. Commit the release changes.
 1. Create and push the `vX.Y.Z` tag.

-The `Release` workflow then runs tests, publishes the gem to RubyGems,
+The `Release` workflow then runs tests, publishes the gem to RubyGems,
+and creates the GitHub release from the changelog entry.

 ## Contributing

data/lib/crawlscope/document_text.rb
ADDED
@@ -0,0 +1,40 @@
+# frozen_string_literal: true
+
+module Crawlscope
+  module DocumentText
+    REMOVED_SELECTORS = "script, style, noscript, template, svg"
+    TOKEN_PATTERN = /[[:alnum:]]+/
+
+    module_function
+
+    def body_text(doc)
+      text_for(doc, selector: nil)
+    end
+
+    def html_for(doc, selector: "main")
+      root_for(doc, selector: selector)&.to_html.to_s
+    end
+
+    def text_for(doc, selector: "main")
+      normalize(root_for(doc, selector: selector)&.text)
+    end
+
+    def tokens(text)
+      normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
+    end
+
+    def normalize(text)
+      text.to_s.gsub(/\s+/, " ").strip
+    end
+
+    def root_for(doc, selector:)
+      return unless doc
+
+      copy = doc.dup
+      copy.css(REMOVED_SELECTORS).remove
+
+      root = selector.to_s.empty? ? nil : copy.at_css(selector)
+      root || copy.at_css("body") || copy
+    end
+  end
+end
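To make the tokenization in `DocumentText.tokens` concrete, here is a minimal standalone sketch. It copies `TOKEN_PATTERN`, `normalize`, and `tokens` from the module above into top-level methods so it runs without Nokogiri or the gem; the sample string is made up for illustration.

```ruby
# frozen_string_literal: true

# Standalone copy of the tokenization logic, for illustration only.
TOKEN_PATTERN = /[[:alnum:]]+/

# Collapse whitespace runs to single spaces and trim the ends.
def normalize(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Lowercase, split into alphanumeric runs, and drop one-character tokens.
def tokens(text)
  normalize(text).downcase.scan(TOKEN_PATTERN).reject { |token| token.length < 2 }
end

p tokens("  A quick  check-in\nat the Hotel's desk ")
# => ["quick", "check", "in", "at", "the", "hotel", "desk"]
```

Hyphens and apostrophes split words, so "check-in" counts as two tokens and the stray "s" from "Hotel's" is dropped by the length filter.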
data/lib/crawlscope/rule_registry.rb
CHANGED
@@ -12,12 +12,14 @@ module Crawlscope
     def self.default(site_name: nil)
       new(
         rules: [
+          Rules::Indexability.new,
           Rules::Metadata.new(site_name: site_name),
           Rules::StructuredData.new,
           Rules::Uniqueness.new,
+          Rules::ContentQuality.new,
           Rules::Links.new
         ],
-        default_codes: %i[metadata structured_data uniqueness links]
+        default_codes: %i[indexability metadata structured_data uniqueness content_quality links]
       )
     end

data/lib/crawlscope/rules/content_quality.rb
ADDED
@@ -0,0 +1,99 @@
+# frozen_string_literal: true
+
+module Crawlscope
+  module Rules
+    class ContentQuality
+      MIN_VISIBLE_TEXT_RATIO = 0.08
+      MIN_VISIBLE_WORDS = 250
+      MIN_UNIQUE_TOKEN_RATIO = 0.25
+
+      attr_reader :code
+
+      def initialize(
+        min_visible_text_ratio: MIN_VISIBLE_TEXT_RATIO,
+        min_visible_words: MIN_VISIBLE_WORDS,
+        min_unique_token_ratio: MIN_UNIQUE_TOKEN_RATIO
+      )
+        @code = :content_quality
+        @min_visible_text_ratio = min_visible_text_ratio
+        @min_visible_words = min_visible_words
+        @min_unique_token_ratio = min_unique_token_ratio
+      end
+
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          next unless page.html?
+
+          validate_visible_words(page, issues)
+          validate_visible_text_ratio(page, issues)
+          validate_unique_token_ratio(page, issues)
+        end
+      end
+
+      private
+
+      def validate_unique_token_ratio(page, issues)
+        tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+        return if tokens.size < @min_visible_words
+
+        ratio = tokens.uniq.size.to_f / tokens.size
+        return if ratio >= @min_unique_token_ratio
+
+        issues.add(
+          code: :low_unique_token_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "visible text has low token variety (#{format_ratio(ratio)})",
+          details: {
+            ratio: ratio.round(3),
+            threshold: @min_unique_token_ratio,
+            token_count: tokens.size,
+            unique_token_count: tokens.uniq.size
+          }
+        )
+      end
+
+      def validate_visible_text_ratio(page, issues)
+        html_bytes = DocumentText.html_for(page.doc).bytesize
+        return if html_bytes.zero?
+
+        visible_text = DocumentText.text_for(page.doc)
+        ratio = visible_text.bytesize.to_f / html_bytes
+        return if ratio >= @min_visible_text_ratio
+
+        issues.add(
+          code: :low_visible_text_ratio,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "low visible text to HTML ratio (#{format_ratio(ratio)})",
+          details: {
+            html_bytes: html_bytes,
+            ratio: ratio.round(3),
+            threshold: @min_visible_text_ratio,
+            visible_text_bytes: visible_text.bytesize
+          }
+        )
+      end
+
+      def validate_visible_words(page, issues)
+        word_count = DocumentText.tokens(DocumentText.text_for(page.doc)).size
+        return if word_count >= @min_visible_words
+
+        issues.add(
+          code: :thin_visible_text,
+          severity: :warning,
+          category: :content_quality,
+          url: page.url,
+          message: "thin visible text (#{word_count} words)",
+          details: {word_count: word_count, minimum: @min_visible_words}
+        )
+      end
+
+      def format_ratio(value)
+        format("%.2f", value)
+      end
+    end
+  end
+end
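The three `ContentQuality` checks can be tried without a crawl. The sketch below is a simplification under stated assumptions, not the gem's API: `simple_tokens` and `content_quality_codes` are hypothetical helpers that take the visible text and the HTML as plain strings, apply the default thresholds shown above, and return the issue codes that would fire.

```ruby
# frozen_string_literal: true

# Default thresholds copied from the rule above.
MIN_VISIBLE_TEXT_RATIO = 0.08
MIN_VISIBLE_WORDS = 250
MIN_UNIQUE_TOKEN_RATIO = 0.25

# Hypothetical helper: same token rule as DocumentText.tokens.
def simple_tokens(text)
  text.to_s.downcase.scan(/[[:alnum:]]+/).reject { |token| token.length < 2 }
end

# Hypothetical helper: returns the codes the rule would report.
def content_quality_codes(visible_text:, html:)
  codes = []
  words = simple_tokens(visible_text)

  # Too few words in the visible text.
  codes << :thin_visible_text if words.size < MIN_VISIBLE_WORDS

  # Visible bytes are a small share of the markup bytes.
  unless html.empty?
    ratio = visible_text.bytesize.to_f / html.bytesize
    codes << :low_visible_text_ratio if ratio < MIN_VISIBLE_TEXT_RATIO
  end

  # Token variety is only judged once the page is long enough.
  if words.size >= MIN_VISIBLE_WORDS
    unique_ratio = words.uniq.size.to_f / words.size
    codes << :low_unique_token_ratio if unique_ratio < MIN_UNIQUE_TOKEN_RATIO
  end

  codes
end

repetitive = ("hotel location service " * 100).strip
p content_quality_codes(visible_text: repetitive, html: "<main>#{repetitive}</main>")
# => [:low_unique_token_ratio]
```

Because the unique-token check is skipped for pages below the word minimum, a short page is reported as thin rather than as repetitive.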
data/lib/crawlscope/rules/indexability.rb
ADDED
@@ -0,0 +1,66 @@
+# frozen_string_literal: true
+
+module Crawlscope
+  module Rules
+    class Indexability
+      ROBOTS_META_SELECTOR = 'meta[name="robots"], meta[name="googlebot"]'
+      X_ROBOTS_TAG_HEADER = "x-robots-tag"
+
+      attr_reader :code
+
+      def initialize
+        @code = :indexability
+      end
+
+      def call(urls:, pages:, issues:, context: nil)
+        pages.each do |page|
+          validate_meta_robots(page, issues) if page.html?
+          validate_x_robots_tag(page, issues)
+        end
+      end
+
+      private
+
+      def header_value(page, name)
+        page.headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
+      end
+
+      def noindex?(value)
+        value
+          .split(",")
+          .map { |directive| directive.split(":", 2).last.to_s.strip }
+          .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
+      end
+
+      def validate_meta_robots(page, issues)
+        page.doc.css(ROBOTS_META_SELECTOR).each do |tag|
+          content = tag["content"].to_s
+          next unless noindex?(content)
+
+          issues.add(
+            code: :noindex_meta,
+            severity: :error,
+            category: :indexability,
+            url: page.url,
+            message: "robots meta tag prevents indexing",
+            details: {content: content, name: tag["name"].to_s}
+          )
+        end
+      end
+
+      def validate_x_robots_tag(page, issues)
+        content = header_value(page, X_ROBOTS_TAG_HEADER)
+        return unless noindex?(content)
+
+        issues.add(
+          code: :noindex_header,
+          severity: :error,
+          category: :indexability,
+          url: page.url,
+          message: "X-Robots-Tag header prevents indexing",
+          details: {content: content}
+        )
+      end
+    end
+  end
+end
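The directive parsing in `noindex?` is compact, so a standalone sketch may help when reading it. The two methods below mirror `noindex?` and `header_value` from the rule above, except that `header_value` here takes a plain headers hash instead of a page object (an assumption made so the sketch runs without the gem).

```ruby
# frozen_string_literal: true

# Case-insensitive header lookup over a plain Hash; "" when the header is absent.
def header_value(headers, name)
  headers.find { |key, _value| key.to_s.casecmp?(name) }&.last.to_s
end

# Split on commas, drop an optional "useragent:" prefix from each directive,
# then look for "noindex" or "none" in any letter case.
def noindex?(value)
  value
    .split(",")
    .map { |directive| directive.split(":", 2).last.to_s.strip }
    .any? { |directive| directive.casecmp?("noindex") || directive.casecmp?("none") }
end

headers = {"X-Robots-Tag" => "googlebot: noindex, nofollow"}
p noindex?(header_value(headers, "x-robots-tag")) # => true
p noindex?("index, follow")                       # => false
```

Splitting each directive on the first colon is what lets a scoped value such as `googlebot: noindex` match, and a missing header yields an empty string, which contains no directives.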
data/lib/crawlscope/rules/uniqueness.rb
CHANGED
@@ -5,10 +5,24 @@ require "digest"
 module Crawlscope
   module Rules
     class Uniqueness
+      MINIMUM_SHINGLES = 10
+      MAX_NEAR_DUPLICATE_PAGES = 250
+      NEAR_DUPLICATE_THRESHOLD = 0.9
+      SHINGLE_SIZE = 5
+
       attr_reader :code

-      def initialize
+      def initialize(
+        near_duplicate_threshold: NEAR_DUPLICATE_THRESHOLD,
+        max_near_duplicate_pages: MAX_NEAR_DUPLICATE_PAGES,
+        minimum_shingles: MINIMUM_SHINGLES,
+        shingle_size: SHINGLE_SIZE
+      )
         @code = :uniqueness
+        @max_near_duplicate_pages = max_near_duplicate_pages
+        @minimum_shingles = minimum_shingles
+        @near_duplicate_threshold = near_duplicate_threshold
+        @shingle_size = shingle_size
       end

       def call(urls:, pages:, issues:, context:)
@@ -19,14 +33,13 @@ module Crawlscope
         end

         validate_duplicates(page_summaries, issues)
+        validate_near_duplicates(page_summaries, issues)
       end

       private

       def content_fingerprint_digest(doc)
-
-        text = doc.at_css("body")&.text.to_s if text.empty?
-        normalized = text.gsub(/\s+/, " ").strip
+        normalized = DocumentText.text_for(doc)
         return if normalized.length < 200

         Digest::SHA256.hexdigest(normalized)
@@ -41,9 +54,12 @@ module Crawlscope
       end

       def summary_for(page)
+        tokens = DocumentText.tokens(DocumentText.text_for(page.doc))
+
         {
           content_fingerprint_digest: content_fingerprint_digest(page.doc),
           description: page.doc.at_css('meta[name="description"]')&.[]("content").to_s.strip,
+          shingles: shingles_for(tokens),
           title: page.doc.at_css("title")&.text.to_s.strip,
           url: page.url
         }
@@ -83,6 +99,62 @@ module Crawlscope
           )
         end
       end
+
+      def shingles_for(tokens)
+        return [] if tokens.size < @shingle_size
+
+        tokens.each_cons(@shingle_size).map { |items| items.join(" ") }.uniq
+      end
+
+      def validate_near_duplicates(page_summaries, issues)
+        if near_duplicate_scan_limit_exceeded?(page_summaries)
+          issues.add(
+            code: :near_duplicate_scan_skipped,
+            severity: :warning,
+            category: :uniqueness,
+            url: nil,
+            message: "near duplicate scan skipped for #{page_summaries.size} pages",
+            details: {max_pages: @max_near_duplicate_pages, page_count: page_summaries.size}
+          )
+          return
+        end
+
+        page_summaries.combination(2) do |left, right|
+          next if same_content_fingerprint?(left, right)
+          next if left[:shingles].size < @minimum_shingles || right[:shingles].size < @minimum_shingles
+
+          similarity = shingle_similarity(left[:shingles], right[:shingles])
+          next if similarity < @near_duplicate_threshold
+
+          urls = [left[:url], right[:url]]
+
+          issues.add(
+            code: :near_duplicate_content,
+            severity: :warning,
+            category: :uniqueness,
+            url: nil,
+            message: "near duplicate page content (#{format("%.2f", similarity)}) => #{urls.join(", ")}",
+            details: {similarity: similarity.round(3), threshold: @near_duplicate_threshold, urls: urls}
+          )
+        end
+      end
+
+      def near_duplicate_scan_limit_exceeded?(page_summaries)
+        !@max_near_duplicate_pages.nil? && page_summaries.size > @max_near_duplicate_pages
+      end
+
+      def same_content_fingerprint?(left, right)
+        !left[:content_fingerprint_digest].nil? &&
+          left[:content_fingerprint_digest] == right[:content_fingerprint_digest]
+      end
+
+      def shingle_similarity(left, right)
+        intersection_size = (left & right).size
+        smaller_set_size = [left.size, right.size].min
+        return 0.0 if smaller_set_size.zero?
+
+        intersection_size.to_f / smaller_set_size
+      end
     end
   end
 end
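The near-duplicate check rests on two small pieces: word shingles and a set-overlap score. The sketch below reproduces `shingles_for` and `shingle_similarity` from the rule above as top-level methods, with the shingle size passed as a keyword argument (an adaptation for this example; the rule reads it from `@shingle_size`). The token arrays are made up for illustration.

```ruby
# frozen_string_literal: true

# Build unique word n-grams ("shingles") from a token list.
def shingles_for(tokens, size: 5)
  return [] if tokens.size < size

  tokens.each_cons(size).map { |items| items.join(" ") }.uniq
end

# Shared shingles divided by the size of the smaller set. This is the
# overlap coefficient, not Jaccard similarity (which divides by the union).
def shingle_similarity(left, right)
  smaller_set_size = [left.size, right.size].min
  return 0.0 if smaller_set_size.zero?

  (left & right).size.to_f / smaller_set_size
end

left = shingles_for(%w[a b c d e f g])   # three shingles
right = shingles_for(%w[a b c d e f x])  # three shingles, two shared with left
p shingle_similarity(left, right).round(2) # => 0.67
```

Dividing by the smaller set means a short page fully contained in a longer one scores 1.0. The scan compares every pair via `combination(2)`, so the work grows quadratically: at the default cap of 250 pages that is 250 × 249 / 2 = 31,125 comparisons, which is presumably why larger crawls skip it.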
data/lib/crawlscope/version.rb
CHANGED
data/lib/tasks/crawlscope_tasks.rake
CHANGED
@@ -10,6 +10,11 @@ namespace :crawlscope do
       Crawlscope::RakeTasks.ldjson
     end

+    desc "Validate URLs with the indexability rule. ENV: URL, SITEMAP, JS=1"
+    task indexability: :environment do
+      Crawlscope::RakeTasks.validate_rule("indexability")
+    end
+
     desc "Validate URLs with the metadata rule. ENV: URL, SITEMAP, JS=1"
     task metadata: :environment do
       Crawlscope::RakeTasks.validate_rule("metadata")
@@ -25,6 +30,11 @@ namespace :crawlscope do
       Crawlscope::RakeTasks.validate_rule("uniqueness")
     end

+    desc "Validate URLs with the content_quality rule. ENV: URL, SITEMAP, JS=1"
+    task content_quality: :environment do
+      Crawlscope::RakeTasks.validate_rule("content_quality")
+    end
+
     desc "Validate URLs with the links rule. ENV: URL, SITEMAP, JS=1"
     task links: :environment do
       Crawlscope::RakeTasks.validate_rule("links")
data/test/crawlscope/configuration_test.rb
CHANGED
@@ -20,7 +20,14 @@ class CrawlscopeConfigurationTest < Minitest::Test
     assert_equal "https://example.com", audit.instance_variable_get(:@base_url)
     assert_equal "/tmp/sitemap.xml", audit.instance_variable_get(:@sitemap_path)
     assert_equal 4, audit.instance_variable_get(:@concurrency)
-    assert_equal %i[
+    assert_equal %i[
+      indexability
+      metadata
+      structured_data
+      uniqueness
+      content_quality
+      links
+    ], audit.instance_variable_get(:@rules).map(&:code)
   end

   def test_audit_raises_without_base_url
data/test/crawlscope/content_quality_rule_test.rb
ADDED
@@ -0,0 +1,68 @@
+# frozen_string_literal: true
+
+require "test_helper"
+
+class CrawlscopeContentQualityRuleTest < Minitest::Test
+  def test_reports_thin_visible_text_and_low_html_text_ratio
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(main: "Short page <div>#{"<span></span>" * 500}</div>")
+
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    codes = issues.to_a.map(&:code)
+    assert_includes codes, :thin_visible_text
+    assert_includes codes, :low_visible_text_ratio
+  end
+
+  def test_visible_text_ratio_ignores_markup_outside_main_content
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      main: Array.new(260) { |index| "word#{index}" }.join(" "),
+      head_markup: "<style>#{"body{}" * 10_000}</style>",
+      extra_markup: "<nav>#{"<a href=\"/\">Navigation</a>" * 500}</nav>"
+    )
+
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    refute_includes issues.to_a.map(&:code), :low_visible_text_ratio
+  end
+
+  def test_reports_low_unique_token_ratio_for_repetitive_content
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(main: ("hotel location service " * 100).strip)
+
+    Crawlscope::Rules::ContentQuality.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.find { |item| item.code == :low_unique_token_ratio }
+    assert issue
+    assert_operator issue.details[:ratio], :<, issue.details[:threshold]
+  end
+
+  private
+
+  def page_with(main:, extra_markup: "", head_markup: "")
+    body = <<~HTML
+      <html>
+        <head>
+          <title>Content quality</title>
+          #{head_markup}
+        </head>
+        <body>
+          #{extra_markup}
+          <main>#{main}</main>
+        </body>
+      </html>
+    HTML
+
+    Crawlscope::Page.new(
+      url: "https://example.com/page",
+      normalized_url: "https://example.com/page",
+      final_url: "https://example.com/page",
+      normalized_final_url: "https://example.com/page",
+      status: 200,
+      headers: {"content-type" => "text/html"},
+      body: body,
+      doc: Nokogiri::HTML(body)
+    )
+  end
+end
data/test/crawlscope/crawl_test.rb
CHANGED
@@ -45,6 +45,7 @@ class CrawlscopeCrawlTest < Minitest::Test
         <body>
           <main>
             <h1>Pricing</h1>
+            <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
           </main>
         </body>
       </html>
@@ -100,7 +101,15 @@ class CrawlscopeCrawlTest < Minitest::Test
     ).call

     refute result.ok?
-    assert_equal %i[
+    assert_equal %i[
+      incomplete_open_graph_tags
+      meta_description_too_long
+      missing_canonical
+      missing_h1
+      missing_structured_data
+      thin_visible_text
+      title_repeats_site_name
+    ].sort, result.issues.to_a.map(&:code).uniq.sort
   end

   def test_uses_browser_when_renderer_is_browser
@@ -147,6 +156,7 @@ class CrawlscopeCrawlTest < Minitest::Test
         <body>
           <main>
             <h1>Pricing</h1>
+            <p>#{Array.new(260) { |index| "pricing#{index}" }.join(" ")}</p>
           </main>
         </body>
       </html>
data/test/crawlscope/indexability_rule_test.rb
ADDED
@@ -0,0 +1,96 @@
+# frozen_string_literal: true
+
+require "test_helper"
+
+class CrawlscopeIndexabilityRuleTest < Minitest::Test
+  def test_reports_meta_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      body: <<~HTML
+        <html>
+          <head><meta name="robots" content="noindex, follow"></head>
+          <body><main>Visible content</main></body>
+        </html>
+      HTML
+    )
+
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_meta, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex, follow", issue.details[:content]
+  end
+
+  def test_reports_x_robots_tag_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "noindex"})
+
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex", issue.details[:content]
+  end
+
+  def test_reports_x_robots_tag_noindex_for_non_html_response
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(
+      body: "%PDF-1.7",
+      doc: nil,
+      headers: {"content-type" => "application/pdf", "X-Robots-Tag" => "noindex"}
+    )
+
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal :error, issue.severity
+    assert_equal "noindex", issue.details[:content]
+  end
+
+  def test_reports_scoped_x_robots_tag_noindex
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "googlebot: noindex, nofollow"})
+
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal "googlebot: noindex, nofollow", issue.details[:content]
+  end
+
+  def test_reports_x_robots_tag_none
+    issues = Crawlscope::IssueCollection.new
+    page = page_with(headers: {"X-Robots-Tag" => "none"})
+
+    Crawlscope::Rules::Indexability.new.call(urls: [page.url], pages: [page], issues: issues)
+
+    issue = issues.to_a.fetch(0)
+    assert_equal :noindex_header, issue.code
+    assert_equal "none", issue.details[:content]
+  end
+
+  private
+
+  def page_with(body: nil, doc: :parse, headers: {"content-type" => "text/html"})
+    body ||= <<~HTML
+      <html>
+        <head><title>Indexable</title></head>
+        <body><main>Visible content</main></body>
+      </html>
+    HTML
+
+    Crawlscope::Page.new(
+      url: "https://example.com/page",
+      normalized_url: "https://example.com/page",
+      final_url: "https://example.com/page",
+      normalized_final_url: "https://example.com/page",
+      status: 200,
+      headers: headers,
+      body: body,
+      doc: (doc == :parse) ? Nokogiri::HTML(body) : doc
+    )
+  end
+end
@@ -16,10 +16,51 @@ class CrawlscopeUniquenessRuleTest < Minitest::Test
|
|
|
16
16
|
assert_equal %i[duplicate_content_fingerprint duplicate_meta_description duplicate_title].sort, issues.to_a.map(&:code).sort
|
|
17
17
|
end
|
|
18
18
|
|
|
19
|
+
def test_reports_near_duplicate_content
|
|
20
|
+
issues = Crawlscope::IssueCollection.new
|
|
21
|
+
rule = Crawlscope::Rules::Uniqueness.new
|
|
22
|
+
pages = [
|
|
23
|
+
page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
|
|
24
|
+
page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
|
|
25
|
+
]
|
|
26
|
+
|
|
27
|
+
rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
|
|
28
|
+
|
|
29
|
+
issue = issues.to_a.find { |item| item.code == :near_duplicate_content }
|
|
30
|
+
assert issue
|
|
31
|
+
assert_operator issue.details[:similarity], :>=, issue.details[:threshold]
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
def test_skips_near_duplicate_scan_when_page_count_exceeds_limit
|
|
35
|
+
issues = Crawlscope::IssueCollection.new
|
|
36
|
+
rule = Crawlscope::Rules::Uniqueness.new(max_near_duplicate_pages: 1)
|
|
37
|
+
pages = [
|
|
38
|
+
page(url: "https://example.com/a", content: near_duplicate_content("reliable")),
|
|
39
|
+
page(url: "https://example.com/b", content: near_duplicate_content("dependable"))
|
|
40
|
+
]
|
|
41
|
+
|
|
42
|
+
rule.call(urls: pages.map(&:url), pages: pages, issues: issues, context: {})
|
|
43
|
+
|
|
44
|
+
skip_issue = issues.to_a.find { |item| item.code == :near_duplicate_scan_skipped }
|
|
45
|
+
refute issues.to_a.any? { |item| item.code == :near_duplicate_content }
|
|
46
|
+
assert_equal :warning, skip_issue.severity
|
|
47
|
+
assert_equal({max_pages: 1, page_count: 2}, skip_issue.details)
|
|
48
|
+
end
|
|
49
|
+
|
|
19
50
|
private
|
|
20
51
|
|
|
21
|
-
def
|
|
22
|
-
|
|
52
|
+
def near_duplicate_content(adjective)
|
|
53
|
+
<<~TEXT.gsub(/\s+/, " ").strip
|
|
54
|
+
This page summarizes practical hotel review patterns for operators who need #{adjective}
|
|
55
|
+
service insights across locations. It compares recurring comments about staff, rooms,
|
|
56
|
+
cleanliness, check-in, breakfast, parking, and amenities so teams can prioritize fixes.
|
|
57
|
+
The analysis highlights repeat themes, explains why guests mention them, and keeps the
|
|
58
|
+
wording focused on decisions that improve daily operations.
|
|
59
|
+
TEXT
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def page(url:, content: nil)
|
|
63
|
+
repeated_text = content || ("Useful content " * 30).strip
|
|
23
64
|
body = <<~HTML
|
|
24
65
|
<html>
|
|
25
66
|
<head>
|
|
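Note on the near-duplicate tests above: they only assert that a `:near_duplicate_content` issue carries `similarity` and `threshold` details, and that the scan is skipped when the page count exceeds `max_near_duplicate_pages`. The scoring itself lives in `lib/crawlscope/rules/uniqueness.rb`, which is not shown in this hunk. The following is a minimal sketch of one common way to get such a score, assuming Jaccard similarity over three-word shingles. The method names `shingles`, `jaccard_similarity`, and `near_duplicate_pairs`, and the default values, are hypothetical and are not taken from the gem.

```ruby
require "set"

# Hypothetical helper: split text into overlapping word triples ("shingles").
def shingles(text, size: 3)
  words = text.downcase.scan(/[a-z0-9]+/)
  return Set.new if words.length < size

  words.each_cons(size).map { |group| group.join(" ") }.to_set
end

# Jaccard similarity: shared shingles divided by all distinct shingles.
def jaccard_similarity(left_text, right_text)
  left = shingles(left_text)
  right = shingles(right_text)
  union_size = (left | right).size
  return 0.0 if union_size.zero?

  (left & right).size.to_f / union_size
end

# Compare every pair of pages; give up early when there are too many pages,
# because the pairwise comparison grows quadratically.
# `texts` is assumed to be a Hash of url => visible text.
def near_duplicate_pairs(texts, threshold: 0.8, max_pages: 500)
  return :skipped if texts.size > max_pages

  texts.keys.combination(2).filter_map do |left_url, right_url|
    similarity = jaccard_similarity(texts[left_url], texts[right_url])
    next unless similarity >= threshold

    {urls: [left_url, right_url], similarity: similarity, threshold: threshold}
  end
end
```

The early `:skipped` return plays the role of the `near_duplicate_scan_skipped` warning in the test: a page limit keeps the pairwise comparison from dominating the audit on large sitemaps.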
@@ -0,0 +1,86 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "test_helper"
|
|
4
|
+
require "rake"
|
|
5
|
+
|
|
6
|
+
unless respond_to?(:release_version, true)
|
|
7
|
+
load File.expand_path("../Rakefile", __dir__)
|
|
8
|
+
end
|
|
9
|
+
|
|
10
|
+
class ReleaseTaskTest < Minitest::Test
|
|
11
|
+
def test_release_version_increments_patch_from_current_version
|
|
12
|
+
major, minor, patch = Crawlscope::VERSION.split(".").map(&:to_i)
|
|
13
|
+
|
|
14
|
+
assert_equal "#{major}.#{minor}.#{patch + 1}", release_version("patch")
|
|
15
|
+
end
|
|
16
|
+
|
|
17
|
+
def test_release_version_accepts_explicit_semantic_version
|
|
18
|
+
assert_equal "0.3.0", release_version("0.3.0")
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
def test_validate_release_version_rejects_current_version
|
|
22
|
+
error = assert_raises(ArgumentError) do
|
|
23
|
+
validate_release_version!("0.2.7", "0.2.7")
|
|
24
|
+
end
|
|
25
|
+
|
|
26
|
+
assert_equal(
|
|
27
|
+
"Release version 0.2.7 must be newer than current version 0.2.7.",
|
|
28
|
+
error.message
|
|
29
|
+
)
|
|
30
|
+
end
|
|
31
|
+
|
|
32
|
+
def test_validate_release_version_rejects_existing_local_tag
|
|
33
|
+
@local_release_tag_exists = true
|
|
34
|
+
@remote_release_tag_exists = false
|
|
35
|
+
|
|
36
|
+
error = assert_raises(ArgumentError) do
|
|
37
|
+
validate_release_version!("0.2.8", "0.2.7")
|
|
38
|
+
end
|
|
39
|
+
|
|
40
|
+
assert_equal "Release tag v0.2.8 already exists locally.", error.message
|
|
41
|
+
end
|
|
42
|
+
|
|
43
|
+
def test_validate_release_version_rejects_existing_remote_tag
|
|
44
|
+
@local_release_tag_exists = false
|
|
45
|
+
@remote_release_tag_exists = true
|
|
46
|
+
|
|
47
|
+
error = assert_raises(ArgumentError) do
|
|
48
|
+
validate_release_version!("0.2.8", "0.2.7")
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
assert_equal "Release tag v0.2.8 already exists on origin.", error.message
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
def test_remote_release_tag_command_asks_git_to_fail_when_no_tag_matches
|
|
55
|
+
assert_equal(
|
|
56
|
+
"git ls-remote --exit-code --tags origin refs/tags/v0.2.8",
|
|
57
|
+
remote_release_tag_command("v0.2.8")
|
|
58
|
+
)
|
|
59
|
+
end
|
|
60
|
+
|
|
61
|
+
def test_changelog_command_prepends_the_next_release
|
|
62
|
+
assert_equal(
|
|
63
|
+
[
|
|
64
|
+
"git-cliff",
|
|
65
|
+
"-c",
|
|
66
|
+
"cliff.toml",
|
|
67
|
+
"--unreleased",
|
|
68
|
+
"--tag",
|
|
69
|
+
"v0.2.8",
|
|
70
|
+
"--prepend",
|
|
71
|
+
"CHANGELOG.md"
|
|
72
|
+
],
|
|
73
|
+
changelog_command("0.2.8")
|
|
74
|
+
)
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
private
|
|
78
|
+
|
|
79
|
+
def local_release_tag_exists?(_tag)
|
|
80
|
+
@local_release_tag_exists || false
|
|
81
|
+
end
|
|
82
|
+
|
|
83
|
+
def remote_release_tag_exists?(_tag)
|
|
84
|
+
@remote_release_tag_exists || false
|
|
85
|
+
end
|
|
86
|
+
end
|
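Note on `test/release_task_test.rb` above: the Rakefile helpers it exercises (`release_version`, `validate_release_version!`) are not part of this diff. As a rough sketch of the behavior the assertions describe, the logic could look like the following. The names `next_release_version` and `check_release_version!`, the keyword arguments, and the `"major"`/`"minor"` branches are assumptions made for illustration. Only the `"patch"` bump, the explicit version pass-through, and the three error messages come from the tests.

```ruby
require "rubygems" # provides Gem::Version for version comparison

SEMVER_PATTERN = /\A\d+\.\d+\.\d+\z/

# Hypothetical stand-in for release_version: accept an explicit x.y.z
# string as-is, otherwise bump the requested part of the current version.
def next_release_version(current, bump)
  return bump if bump.match?(SEMVER_PATTERN)

  major, minor, patch = current.split(".").map(&:to_i)
  case bump
  when "patch" then "#{major}.#{minor}.#{patch + 1}"
  when "minor" then "#{major}.#{minor + 1}.0" # assumed, not covered by the tests
  when "major" then "#{major + 1}.0.0"        # assumed, not covered by the tests
  else raise ArgumentError, "Unknown release bump: #{bump}"
  end
end

# Hypothetical stand-in for validate_release_version!: the tag checks are
# passed in as booleans here instead of shelling out to git.
def check_release_version!(version, current, local_tag: false, remote_tag: false)
  unless Gem::Version.new(version) > Gem::Version.new(current)
    raise ArgumentError, "Release version #{version} must be newer than current version #{current}."
  end
  raise ArgumentError, "Release tag v#{version} already exists locally." if local_tag
  raise ArgumentError, "Release tag v#{version} already exists on origin." if remote_tag

  version
end
```

Passing the tag checks in as booleans mirrors how the test overrides `local_release_tag_exists?` and `remote_release_tag_exists?`, so the validation can be exercised without touching a real repository.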
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: crawlscope
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.4.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Paulo Fidalgo
|
|
@@ -199,6 +199,7 @@ files:
|
|
|
199
199
|
- lib/crawlscope/context.rb
|
|
200
200
|
- lib/crawlscope/crawl.rb
|
|
201
201
|
- lib/crawlscope/crawler.rb
|
|
202
|
+
- lib/crawlscope/document_text.rb
|
|
202
203
|
- lib/crawlscope/http.rb
|
|
203
204
|
- lib/crawlscope/issue.rb
|
|
204
205
|
- lib/crawlscope/issue_collection.rb
|
|
@@ -208,6 +209,8 @@ files:
|
|
|
208
209
|
- lib/crawlscope/reporter.rb
|
|
209
210
|
- lib/crawlscope/result.rb
|
|
210
211
|
- lib/crawlscope/rule_registry.rb
|
|
212
|
+
- lib/crawlscope/rules/content_quality.rb
|
|
213
|
+
- lib/crawlscope/rules/indexability.rb
|
|
211
214
|
- lib/crawlscope/rules/links.rb
|
|
212
215
|
- lib/crawlscope/rules/metadata.rb
|
|
213
216
|
- lib/crawlscope/rules/structured_data.rb
|
|
@@ -228,9 +231,11 @@ files:
|
|
|
228
231
|
- test/crawlscope/browser_test.rb
|
|
229
232
|
- test/crawlscope/cli_test.rb
|
|
230
233
|
- test/crawlscope/configuration_test.rb
|
|
234
|
+
- test/crawlscope/content_quality_rule_test.rb
|
|
231
235
|
- test/crawlscope/crawl_test.rb
|
|
232
236
|
- test/crawlscope/crawler_test.rb
|
|
233
237
|
- test/crawlscope/http_test.rb
|
|
238
|
+
- test/crawlscope/indexability_rule_test.rb
|
|
234
239
|
- test/crawlscope/links_rule_test.rb
|
|
235
240
|
- test/crawlscope/loader_test.rb
|
|
236
241
|
- test/crawlscope/metadata_rule_test.rb
|
|
@@ -247,6 +252,7 @@ files:
|
|
|
247
252
|
- test/crawlscope/structured_data_writer_test.rb
|
|
248
253
|
- test/crawlscope/uniqueness_rule_test.rb
|
|
249
254
|
- test/crawlscope/url_test.rb
|
|
255
|
+
- test/release_task_test.rb
|
|
250
256
|
- test/test_helper.rb
|
|
251
257
|
homepage: https://www.ethos-link.com/opensource/crawlscope
|
|
252
258
|
licenses:
|
|
@@ -275,7 +281,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
275
281
|
- !ruby/object:Gem::Version
|
|
276
282
|
version: '0'
|
|
277
283
|
requirements: []
|
|
278
|
-
rubygems_version: 4.0.
|
|
284
|
+
rubygems_version: 4.0.10
|
|
279
285
|
specification_version: 4
|
|
280
286
|
summary: Audit sitemap URLs for metadata, structured data, uniqueness, and links
|
|
281
287
|
test_files: []
|