RubyGems - rubycrawl - Versions diffs - 0.2.0 → 0.3.0 - Mend

rubycrawl 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/README.md +25 -15
data/lib/rubycrawl/browser/extraction.rb +34 -12
data/lib/rubycrawl/browser/readability.js +2786 -0
data/lib/rubycrawl/browser.rb +1 -1
data/lib/rubycrawl/version.rb +1 -1
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 9fb671708a756ff233448c5d18ac0bd815eb44ecf8d0a8c893fcf8107f95c182
-  data.tar.gz: 66898cb3f978494123441044319105e027f670557cd61430a184e0e1d105a3ed
+  metadata.gz: 56d56f2c264e3febc0f1b22badabb739393332e0948a0f4e4ceb534a68127604
+  data.tar.gz: ebcadd14ba65b12870f6069f898658240073ba7233eab63468e0502871a7a408
 SHA512:
-  metadata.gz: d715984a47719f7c022c512bf136c7bd01050b67ea1dba44f9ea2e24b93196525f46c31e95402c18ba2afbceba02a734f8f214ee686522260557589fca7fea01
-  data.tar.gz: 621e0c5ee326f2757c5459ca763bffd24bfa70baa9c33d820003d7bae0da0ce20529e1bff95136893b846fd80efc70129873ee87ccdb7c2bef5926879d01e6ba
+  metadata.gz: 182e8c771358324d256b38a42a236f634a113b18e16c716da891543ddb43a90ea68242bbc1655639781485e05802a203b64d7ea874eb4cb98900c2e771b85ec0
+  data.tar.gz: f150a6394fb2279b1f872c4074ef9b9df489f19266a7a35886b1e9fbd57e3d4d3761e0519a015270a5259c698d163e2ca223cf0106b6db6471c7911d65c12a29
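
The values above are hex digests of the two archive members packed inside the `.gem` file. The following is a minimal sketch of how one of them could be checked locally, assuming `metadata.gz` or `data.tar.gz` has already been unpacked to a path on disk. The helper name `checksum_matches?` is hypothetical and not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper (not a RubyGems API): returns true when the file's
# digest equals the expected hex string from checksums.yaml.
def checksum_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call, assuming data.tar.gz from the 0.3.0 gem sits in the
# current directory:
#   checksum_matches?(
#     "data.tar.gz",
#     "ebcadd14ba65b12870f6069f898658240073ba7233eab63468e0502871a7a408"
#   )
```

Passing `algorithm: Digest::SHA512` would check the longer digests in the same way.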

data/README.md CHANGED

@@ -16,6 +16,7 @@ RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ru
 - ✅ **Production-ready** — Auto-retry, error handling, resource optimization
 - ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
 - ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
+- ✅ **Readability-powered** — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages
 ```ruby
 # One line to crawl any JavaScript-heavy site
@@ -35,7 +36,7 @@ result.metadata       # Title, description, OG tags, etc.
 - **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
 - **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
 - **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
-- **Content extraction**: HTML, plain text, clean HTML, Markdown (lazy), links, metadata
+- **Content extraction**: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality `clean_html`, `clean_text`, `clean_markdown`, links, metadata
 - **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
 - **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
 - **Rails integration**: First-class Rails support with generators and initializers
@@ -102,14 +103,15 @@ require "rubycrawl"
 result = RubyCrawl.crawl("https://example.com")
 # Access extracted content
-result.final_url      # Final URL after redirects
-result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
-result.clean_html     # Noise-stripped HTML (same noise removed as clean_text)
-result.raw_text       # Full body.innerText (unfiltered)
-result.html           # Full raw HTML content
-result.links          # Extracted links with url, text, title, rel
-result.metadata       # Title, description, OG tags, etc.
-result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
+result.final_url                   # Final URL after redirects
+result.clean_text                  # Noise-stripped plain text (no nav/footer/ads)
+result.clean_html                  # Noise-stripped HTML (same noise removed as clean_text)
+result.raw_text                    # Full body.innerText (unfiltered)
+result.html                        # Full raw HTML content
+result.links                       # Extracted links with url, text, title, rel
+result.metadata                    # Title, description, OG tags, etc.
+result.metadata['extractor']       # "readability" or "heuristic" — which extractor ran
+result.clean_markdown              # Markdown converted from clean_html (lazy — first access only)
 ```
 ## Use Cases
@@ -318,7 +320,8 @@ result.metadata
 #   "twitter_image"       => "https://...",
 #   "canonical"           => "https://...",
 #   "lang"                => "en",
-#   "charset"             => "UTF-8"
+#   "charset"             => "UTF-8",
+#   "extractor"           => "readability"  # or "heuristic"
 # }
 ```
@@ -473,16 +476,21 @@ RubyCrawl uses a single-process architecture:
 ```
 RubyCrawl (public API)
   ↓
-Browser (lib/rubycrawl/browser.rb)  ← Ferrum wrapper
+Browser (lib/rubycrawl/browser.rb)       ← Ferrum wrapper
   ↓
-Ferrum::Browser                     ← Chrome DevTools Protocol (pure Ruby)
+Ferrum::Browser                          ← Chrome DevTools Protocol (pure Ruby)
   ↓
-Chromium                            ← headless browser
+Chromium                                 ← headless browser
+  ↓
+Readability.js → heuristic fallback      ← content extraction (inside browser)
 ```
 - Chrome launches once lazily and is reused across all crawls
 - Each crawl gets an isolated page context (own cookies/storage)
-- JS extraction runs inside the browser via `page.evaluate()`
+- Content extraction runs inside the browser via `page.evaluate()`:
+  - **Primary**: Mozilla Readability.js — article-quality extraction for blogs, docs, news
+  - **Fallback**: link-density heuristic — covers marketing pages, homepages, SPAs
+- `result.metadata['extractor']` tells you which path was used (`"readability"` or `"heuristic"`)
 - No separate processes, no HTTP boundary, no Node.js
 ## Performance
@@ -528,7 +536,9 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
 Built with [Ferrum](https://github.com/rubycdp/ferrum) — pure Ruby Chrome DevTools Protocol client.
-Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
+Content extraction powered by [Mozilla Readability.js](https://github.com/mozilla/readability) — the algorithm behind Firefox Reader View.
+Markdown conversion powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored output.
 ## Support
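
The README changes above add a `metadata['extractor']` key whose value is `"readability"` or `"heuristic"`. A caller could branch on it as sketched below. `FakeResult` is a stand-in used only so the snippet runs without a browser, and the helper name `extraction_quality` and its return symbols are illustrative assumptions, not part of the gem.

```ruby
# Stand-in for the object returned by RubyCrawl.crawl (assumption: only the
# `metadata` hash is needed for this sketch).
FakeResult = Struct.new(:metadata, keyword_init: true)

# Hypothetical helper: maps the extractor name to a rough quality label.
def extraction_quality(result)
  case result.metadata["extractor"]
  when "readability" then :article      # Readability.js found an article body
  when "heuristic"   then :best_effort  # link-density fallback was used
  else :unknown                         # key missing, e.g. an older gem version
  end
end

article_page = FakeResult.new(metadata: { "extractor" => "readability" })
landing_page = FakeResult.new(metadata: { "extractor" => "heuristic" })

puts extraction_quality(article_page)
puts extraction_quality(landing_page)
```

With the real gem, the same helper would presumably accept the value returned by `RubyCrawl.crawl(url)`, since the README shows `result.metadata` as a string-keyed hash.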

data/lib/rubycrawl/browser/extraction.rb CHANGED

@@ -3,13 +3,10 @@
 class RubyCrawl
   class Browser
     # JavaScript extraction constants, evaluated inside Chromium via page.evaluate().
-    # Ported verbatim from node/src/index.js — logic is unchanged.
-    # NOISE_SELECTORS is interpolated directly into EXTRACT_CONTENT_JS (no need to
-    # pass as a JS argument as the Node version did).
+    # All constants are IIFEs — Ferrum's page.evaluate() evaluates an expression,
+    # it does NOT call function definitions. Wrapping as (() => { ... })() ensures
+    # the function is immediately invoked and its return value is captured.
     module Extraction
-      # All constants are IIFEs — Ferrum's page.evaluate() evaluates an expression,
-      # it does NOT call function definitions. Wrapping as (() => { ... })() ensures
-      # the function is immediately invoked and its return value is captured.
       EXTRACT_METADATA_JS = <<~JS
         (() => {
           const getMeta = (name) => {
@@ -54,8 +51,7 @@ class RubyCrawl
         (() => (document.body?.innerText || "").trim())()
       JS
-      # Semantic noise selectors — covers standard HTML5 elements and ARIA roles.
-      # Interpolated directly into EXTRACT_CONTENT_JS as a string literal.
+      # Semantic noise selectors — used by the heuristic fallback.
       NOISE_SELECTORS = [
         'nav', 'header', 'footer', 'aside',
         '[role="navigation"]', '[role="banner"]', '[role="contentinfo"]',
@@ -64,11 +60,37 @@ class RubyCrawl
         'script', 'style', 'noscript', 'iframe'
       ].join(', ').freeze
-      # Removes semantic noise (nav/header/footer/aside + ARIA roles) and high
-      # link-density containers, then returns both clean plain text and clean HTML.
-      # DOM mutations are reversed after extraction so the page is unchanged.
+      # Mozilla Readability.js v0.6.0 — vendored source, read once at load time.
+      # Embedded inside EXTRACT_CONTENT_JS's outer IIFE so Readability is defined
+      # and used within the same Runtime.evaluate expression (Ferrum evaluates a
+      # single expression — separate evaluate calls have separate scopes).
+      READABILITY_JS = File.read(File.join(__dir__, 'readability.js')).freeze
+      # Extracts clean article HTML using Mozilla Readability (primary) with a
+      # link-density heuristic as fallback when Readability returns no content.
+      # Everything is wrapped in one outer IIFE so page.evaluate gets a single
+      # expression and Readability is in scope for the extraction logic.
+      # DOM mutations from the fallback path are reversed after extraction.
       EXTRACT_CONTENT_JS = <<~JS.freeze
         (() => {
+          // Mozilla Readability.js v0.6.0 — defined in this IIFE's scope.
+          #{READABILITY_JS}
+          // Primary: Mozilla Readability — article-quality extraction.
+          let readabilityDebug = null;
+          try {
+            const docClone = document.cloneNode(true);
+            const reader = new Readability(docClone, { charThreshold: 100 });
+            const article = reader.parse();
+            if (article && article.textContent && article.textContent.trim().length > 200) {
+              return { cleanHtml: article.content, extractor: "readability" };
+            }
+            readabilityDebug = article ? `returned ${article.textContent?.trim().length ?? 0} text chars (below threshold)` : "returned null (no article detected)";
+          } catch (e) {
+            readabilityDebug = `error: ${e.message}`;
+          }
+          // Fallback: link-density heuristic (works on nav-heavy / non-article pages).
           const noiseSelectors = #{NOISE_SELECTORS.to_json};
           function linkDensity(el) {
             const total = (el.innerText || "").trim().length;
@@ -98,7 +120,7 @@ class RubyCrawl
           }
           const cleanHtml = document.body.innerHTML;
           removed.reverse().forEach(({ el, parent, next }) => parent.insertBefore(el, next));
-          return { cleanHtml };
+          return { cleanHtml, extractor: "heuristic", debug: readabilityDebug };
         })()
       JS
     end
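
The control flow added to `EXTRACT_CONTENT_JS` is: try Readability first, accept its result only when the trimmed text is longer than 200 characters, and otherwise fall through to the heuristic while recording why. The Ruby sketch below mirrors that decision outside the browser. The two lambdas are stand-ins for the JavaScript extractors, and `extract_content`, `MIN_TEXT_CHARS`, and the hash keys are hypothetical names chosen for illustration.

```ruby
# Mirrors the `> 200` text-length check shown in the diff.
MIN_TEXT_CHARS = 200

# Hypothetical sketch of the primary/fallback decision.
# `primary` returns nil or a hash with :content and :text_content;
# `fallback` returns cleaned HTML as a string.
def extract_content(primary:, fallback:)
  debug = nil
  begin
    article = primary.call
    text = article && article[:text_content].to_s.strip
    if text && text.length > MIN_TEXT_CHARS
      return { clean_html: article[:content], extractor: "readability" }
    end
    debug = if article
              "returned #{text.length} text chars (below threshold)"
            else
              "returned null (no article detected)"
            end
  rescue StandardError => e
    # Any failure in the primary path is recorded and the fallback still runs.
    debug = "error: #{e.message}"
  end
  { clean_html: fallback.call, extractor: "heuristic", debug: debug }
end

short_article = -> { { content: "<p>hi</p>", text_content: "hi" } }
heuristic     = -> { "<body>fallback</body>" }

p extract_content(primary: short_article, fallback: heuristic)
```

As in the diff, the `debug` value is attached only on the fallback path, so a caller can tell whether the primary extractor found too little text, found nothing, or raised.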