nous 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +52 -0
- data/README.md +76 -7
- data/lib/nous/crawler/async_page_fetcher.rb +43 -11
- data/lib/nous/crawler/recursive_page_fetcher.rb +12 -12
- data/lib/nous/crawler/single_page_fetcher.rb +49 -9
- data/lib/nous/extractor/default/client.rb +30 -12
- data/lib/nous/extractor/default.rb +2 -2
- data/lib/nous/extractor/jina.rb +2 -2
- data/lib/nous/fetcher/extraction_runner.rb +5 -5
- data/lib/nous/fetcher/page_extractor.rb +14 -8
- data/lib/nous/fetcher.rb +33 -7
- data/lib/nous/primitives/configuration.rb +1 -0
- data/lib/nous/primitives/fetch_record.rb +26 -0
- data/lib/nous/primitives/fetch_result.rb +21 -0
- data/lib/nous/primitives/page.rb +1 -1
- data/lib/nous/serializer.rb +9 -1
- data/lib/nous/version.rb +1 -1
- data/lib/nous.rb +2 -2
- metadata +3 -2
- data/lib/nous/primitives/raw_page.rb +0 -5
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c73c21d427c9bb99cc148e089ed5899e7aa9e3ca86a4825540380d41771354d2
+  data.tar.gz: 4b361a7aed3c0dfb28a6a650b0813d371622c82b910b8880502281047152a739
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: aacbc4777dc1e5bd66513ddc3bc5a1f667276ac2e89ff781f7b43762133ae640d68bc1191c11b61ee00ac7911a128fcf6ad80653f86d369d284223e830f09120
+  data.tar.gz: 90fc8f0cf3c30c6e06bf6aeebd2790539ff80d0d63469a367c4a09b008c9ceed3dedc0d18528e5f6fab3c4a02103b4d2c7c9b3788b1708c6a1a46790d9f5cbab
data/CHANGELOG.md
CHANGED

@@ -1,5 +1,51 @@
 ## [Unreleased]
 
+## [0.4.0] - 2026-04-11
+
+### Added
+
+- **New `details: true` option for `Nous.fetch`** - Returns a `FetchResult` object containing both successful pages and failed fetch/extraction attempts. This enables explicit failure handling without exceptions.
+  ```ruby
+  result = Nous.fetch("https://example.com", details: true)
+  result.pages    # Array<Page> - successfully extracted
+  result.failures # [{requested_url:, error:}, ...]
+  ```
+
+- **Page metadata** - Every extracted page now includes provenance information:
+  - `extractor`: Which extractor backend was used (e.g., "Nous::Extractor::Default")
+  - `requested_url`: The original URL before any redirects
+  - `content_type`: HTTP Content-Type header from the response
+  - `redirected`: Boolean indicating if redirects occurred
+
+- **FetchRecord internal primitive** - Unified fetch result representation that captures both success and failure cases with full provenance tracking. Replaces the previous `RawPage`, which only handled successful fetches.
+
+- **Configuration#single_page? helper** - Convenience predicate for checking whether the current configuration is in single-page (non-recursive) mode.
+
+### Changed
+
+- **Improved title extraction** - Title extraction now uses a fallback chain: readability-extracted title → HTML `<title>` tag → first `<h1>` element. This significantly improves title reliability on pages where readability fails to identify the title.
+
+- **Reduced aggressive DOM stripping** - The default extractor now preserves more content before readability processing. Previously removed elements (`header`, `img`, `video`, `svg`, `link`) are now retained, providing better context for readability scoring and preserving useful content like captions and bylines.
+
+- **Unified fetch contract** - Both single-page and recursive crawling now use the same internal `FetchRecord` structure, ensuring consistent provenance tracking and failure handling across all fetch modes.
+
+- **Serializer schema updated** - Both text and JSON output formats now include:
+  - `pathname`: URL path component
+  - `extractor`: Which extractor processed the page
+  - Full metadata object (JSON only)
+
+### Fixed
+
+- **JSON serialization** - The JSON output now correctly includes the `pathname` field that was documented but missing in previous versions.
+
+- **Extraction failure visibility** - Previously, extraction failures were only visible with debug logging enabled. The new `FetchResult` structure makes failures programmatically accessible.
+
+### Internal Changes
+
+- **Duck-typed extractor interface** - Extractors now receive the full `FetchRecord` object and access only the fields they need (`Default` uses `record.html`, `Jina` uses `record.final_url`).
+
+- **Removed `RawPage` primitive** - Superseded by the richer `FetchRecord`, which handles both success and failure uniformly.
+
 ## [0.3.0] - 2026-02-23
 
 - Remove `Nous::Error` base hierarchy; colocated errors inherit directly from `StandardError` with descriptive names

@@ -29,3 +75,9 @@
 ## [0.1.0] - 2026-02-21
 
 - Initial release
+
+[Unreleased]: https://github.com/danfrenette/nous/compare/v0.4.0...HEAD
+[0.4.0]: https://github.com/danfrenette/nous/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/danfrenette/nous/compare/v0.2.0...v0.3.0
+[0.2.0]: https://github.com/danfrenette/nous/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/danfrenette/nous/releases/tag/v0.1.0
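The title-extraction fallback chain described in the 0.4.0 changelog entry can be sketched in a few lines. This is a minimal standalone approximation: the real extractor uses readability and Nokogiri, while this illustration substitutes simple regexes so it runs without gems.

```ruby
# Hedged sketch of the 0.4.0 title fallback chain:
# readability title -> HTML <title> tag -> first <h1>.
# (Illustrative only; nous itself parses with Nokogiri, not regexes.)
def resolve_title(readability_title, html)
  title = readability_title.to_s.strip
  # Fall back to the original <title> tag when readability found nothing.
  title = html[%r{<title>(.*?)</title>}m, 1].to_s.strip if title.empty?
  # Last resort: the first <h1> in the document.
  title = html[%r{<h1[^>]*>(.*?)</h1>}m, 1].to_s.strip if title.empty?
  title
end
```

Each step only fires when the previous one produced an empty string, so a page with a usable readability title is unaffected by the fallbacks.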
data/README.md
CHANGED

@@ -64,13 +64,15 @@ nous https://example.com -d
 
 ## Ruby API
 
+### Basic Usage
+
 ```ruby
 require "nous"
 
 # Fetch pages with the default extractor
 pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)
 
-# Each page is a Nous::Page with title, url, pathname, content
+# Each page is a Nous::Page with title, url, pathname, content, metadata
 pages.each do |page|
   puts "#{page.title} (#{page.url})"
   puts page.content

@@ -89,11 +91,70 @@ pages = Nous.fetch("https://spa-site.com",
 )
 ```
 
+### Detailed Results
+
+Use the `details: true` option to receive full fetch results including failures:
+
+```ruby
+result = Nous.fetch("https://example.com", details: true)
+
+result.pages           # Array<Nous::Page> - successfully extracted pages
+result.failures        # Array<{requested_url:, error:}> - failed fetches
+result.total_requested # Integer - total URLs attempted
+result.all_succeeded?  # Boolean - true if no failures
+result.any_succeeded?  # Boolean - true if at least one page extracted
+```
+
+This is useful when you need to handle failures explicitly:
+
+```ruby
+result = Nous.fetch("https://example.com/api-docs", details: true)
+
+if result.failures.any?
+  puts "Failed to fetch:"
+  result.failures.each do |failure|
+    puts "  #{failure[:requested_url]}: #{failure[:error]}"
+  end
+end
+
+result.pages.each do |page|
+  puts "Successfully extracted: #{page.title}"
+end
+```
+
+### Page Structure
+
+Each extracted page contains:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `title` | String | Page title (fallback chain: readability → `<title>` tag → `<h1>`) |
+| `url` | String | Final URL after redirects |
+| `pathname` | String | URL path component |
+| `content` | String | Extracted content as Markdown |
+| `metadata` | Hash | Provenance information (see below) |
+
+### Page Metadata
+
+```ruby
+page.metadata # => {
+#   extractor: "Nous::Extractor::Default",     # Which extractor was used
+#   requested_url: "https://example.com/blog", # Original URL before redirects
+#   content_type: "text/html; charset=utf-8",  # HTTP Content-Type header
+#   redirected: true                           # Whether redirects occurred
+# }
+```
+
 ## Extraction Backends
 
 ### Default (ruby-readability)
 
-Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (
+Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (script, style, nav, footer), and converts to Markdown via [reverse_markdown](https://github.com/xijo/reverse_markdown). Fast and requires no external services, but cannot extract content from JS-rendered pages.
+
+Title extraction uses a fallback chain:
+1. Readability's extracted title
+2. Original `<title>` tag from HTML
+3. First `<h1>` from extracted content
 
 ### Jina Reader API
 

@@ -107,13 +168,15 @@ XML-tagged output designed for LLM context windows:
 
 ```xml
 <page>
-<title>Page Title</title>
-<url>https://example.com/page</url>
-<
+  <title>Page Title</title>
+  <url>https://example.com/page</url>
+  <pathname>/page</pathname>
+  <extractor>Nous::Extractor::Default</extractor>
+  <content>
 # Heading
 
 Extracted markdown content...
-</content>
+  </content>
 </page>
 ```
 

@@ -125,7 +188,13 @@ Extracted markdown content...
     "title": "Page Title",
     "url": "https://example.com/page",
     "pathname": "/page",
-    "content": "# Heading\n\nExtracted markdown content..."
+    "content": "# Heading\n\nExtracted markdown content...",
+    "metadata": {
+      "extractor": "Nous::Extractor::Default",
+      "requested_url": "https://example.com/page",
+      "content_type": "text/html; charset=utf-8",
+      "redirected": false
+    }
   }
 ]
 ```
data/lib/nous/crawler/async_page_fetcher.rb
CHANGED

@@ -13,20 +13,30 @@ module Nous
     def fetch(url)
       Async::Task.current.with_timeout(config.timeout) do
         result = RedirectFollower.call(client:, seed_host:, url:)
-        return
+        return build_failed_record(url, result.error.message) if result.failure?
 
         response, final_url = result.payload
-
-
+        content_type = response.headers["content-type"].to_s
+        redirected = final_url.to_s != url
 
-
+        return build_failed_record(url, "status #{response.status}") unless response.status == 200
+        return build_failed_record(url, "non-html content") unless html?(content_type)
+
+        build_success_record(
+          url: url,
+          final_url: final_url.to_s,
+          pathname: final_url.path,
+          html: response.read,
+          content_type: content_type,
+          redirected: redirected
+        )
       ensure
         response&.close
       end
     rescue Async::TimeoutError
-
+      build_failed_record(url, "timeout after #{config.timeout}s")
     rescue IOError, SocketError, Errno::ECONNREFUSED => e
-
+      build_failed_record(url, e.message)
     end
 
     private

@@ -37,14 +47,36 @@ module Nous
       Nous.configuration
     end
 
-    def html?(
-      content_type = response.headers["content-type"].to_s
+    def html?(content_type)
       HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
     end
 
-    def
-
-
+    def build_success_record(url:, final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(url, error)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      ).tap do |record|
+        warn("[nous] skip #{url}: #{error}") if config.debug?
+      end
     end
   end
 end
data/lib/nous/crawler/recursive_page_fetcher.rb
CHANGED

@@ -9,7 +9,7 @@ module Nous
     def initialize(seed_url:, http_client: nil)
       @seed_uri = Url.new(seed_url)
       @http_client = http_client
-      @
+      @records = []
       @queue = [url_filter.canonicalize(seed_uri)]
       @seen = Set.new(queue)
     end

@@ -21,12 +21,12 @@ module Nous
         crawl(client)
       end
 
-      success(payload:
+      success(payload: records)
     end
 
     private
 
-    attr_reader :seed_uri, :http_client, :
+    attr_reader :seed_uri, :http_client, :records, :queue, :seen
 
     def config
       Nous.configuration

@@ -37,13 +37,13 @@ module Nous
     end
 
     def fetch_and_enqueue(batch, client)
-      fetch_batch(batch, client).each do |
-        next unless
+      fetch_batch(batch, client).each do |record|
+        next unless record.ok
         break unless within_limit?
 
-
-        seen <<
-        enqueue_links(
+        records << record
+        seen << record.final_url
+        enqueue_links(record)
       end
     end
 

@@ -59,8 +59,8 @@ module Nous
       tasks.map(&:wait)
     end
 
-    def enqueue_links(
-      link_extractor.extract(
+    def enqueue_links(record)
+      link_extractor.extract(record.final_url, record.html).each do |url|
        next if seen.include?(url)
 
        seen << url

@@ -69,7 +69,7 @@ module Nous
     end
 
     def within_limit?
-
+      records.count(&:ok) < config.limit
     end
 
     def open_connection

@@ -91,7 +91,7 @@ module Nous
     end
 
     def link_extractor
-      @
+      @link_filter ||= LinkExtractor.new(url_filter:)
     end
 
     def suppress_async_warnings
data/lib/nous/crawler/single_page_fetcher.rb
CHANGED

@@ -20,16 +20,26 @@ module Nous
     def call
       response = connection.get(url)
       final_url = resolve_final_url(response)
+      content_type = response.headers["content-type"].to_s
+      redirected = final_url.to_s != url
+
+      record = build_record(
+        final_url: final_url.to_s,
+        pathname: final_url.path,
+        html: response.body,
+        content_type: content_type,
+        redirected: redirected
+      )
 
-
-      validate_html!(response)
+      validate!(record)
 
-
-      success(payload: [raw_page])
+      success(payload: [record])
     rescue FetchError => e
-
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
     rescue Faraday::Error => e
-
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
     end
 
     private

@@ -40,19 +50,49 @@ module Nous
       Nous.configuration
     end
 
+    def build_record(final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(error:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      )
+    end
+
+    def validate!(record)
+      validate_host!(record.final_url)
+      validate_html!(record.content_type)
+    end
+
     def resolve_final_url(response)
       location = response.env.url.to_s
       Url.new(location)
     end
 
     def validate_host!(final_url)
-      return if final_url.host == seed_host
+      return if Url.new(final_url).host == seed_host
 
       raise FetchError, "redirected to #{final_url} outside #{seed_host}"
     end
 
-    def validate_html!(
-      content_type = response.headers["content-type"].to_s
+    def validate_html!(content_type)
       return if HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
 
       raise FetchError, "non-html content: #{content_type}"
data/lib/nous/extractor/default/client.rb
CHANGED

@@ -8,7 +8,7 @@ module Nous
     class Client < Command
       class ExtractionError < StandardError; end
 
-      NOISY_TAGS = %w[script style
+      NOISY_TAGS = %w[script style nav footer].freeze
 
       def initialize(html:, selector: nil)
         @html = html

@@ -16,34 +16,52 @@ module Nous
       end
 
       def call
-
-        doc = scope_to_selector(doc) if selector
-        strip_noisy_tags(doc)
+        readable = ::Readability::Document.new(prepared_html)
 
-        readable = ::Readability::Document.new(doc.to_html)
         text = Nokogiri::HTML(readable.content).text.strip
-
         return failure(ExtractionError.new("readability returned no content")) if text.empty?
 
-
+        title = resolve_title(readable)
+        success(payload: {title: title, content: readable.content})
       end
 
       private
 
       attr_reader :html, :selector
 
-      def
+      def prepared_html
+        doc = Nokogiri::HTML(html)
+        original_title(doc)
+        doc = scope(doc, selector) if selector
+        strip_tags(doc)
+        doc.to_html
+      end
+
+      def original_title(doc)
+        @original_title ||= doc.at_css("title")&.text.to_s.strip
+      end
+
+      def scope(doc, selector)
        scoped = doc.at_css(selector)
        return doc unless scoped
 
-
-        fragment.root = scoped
-        fragment
+        Nokogiri::HTML.fragment(scoped.to_html)
       end
 
-      def
+      def strip_tags(doc)
        NOISY_TAGS.each { |tag| doc.css(tag).each(&:remove) }
       end
+
+      def resolve_title(readable)
+        title = readable.title.to_s.strip
+        title = @original_title if title.empty?
+        title = title_from_content(readable.content) if title.empty?
+        title
+      end
+
+      def title_from_content(content)
+        Nokogiri::HTML(content).at_css("h1")&.text.to_s.strip
+      end
     end
   end
 end
data/lib/nous/extractor/default.rb
CHANGED

@@ -9,8 +9,8 @@ module Nous
       @selector = selector
     end
 
-    def extract(
-      extracted = extract_content(
+    def extract(record)
+      extracted = extract_content(record.html)
       markdown = convert_to_markdown(extracted[:content])
 
       success(payload: ExtractedContent.new(title: extracted[:title], content: markdown))
data/lib/nous/extractor/jina.rb
CHANGED

@@ -7,8 +7,8 @@ module Nous
       @client = Client.new(api_key: api_key || ENV["JINA_API_KEY"], timeout:, **client_options)
     end
 
-    def extract(
-      body = client.get(
+    def extract(record)
+      body = client.get(record.final_url)
 
       success(payload: ExtractedContent.new(
         title: body.dig("data", "title") || "",
data/lib/nous/fetcher/extraction_runner.rb
CHANGED

@@ -5,14 +5,14 @@ module Nous
     class ExtractionRunner < Command
       class ExtractionError < StandardError; end
 
-      def initialize(
-        @
+      def initialize(records:, extractor:)
+        @records = records
         @extractor = extractor
       end
 
       def call
-        pages =
-          threads = batch.map { |
+        pages = records.each_slice(Nous.configuration.concurrency).each_with_object([]) do |batch, results|
+          threads = batch.map { |record| Thread.new { PageExtractor.call(extractor:, record:) } }
 
          threads.each do |thread|
            result = thread.value

@@ -25,7 +25,7 @@ module Nous
 
       private
 
-      attr_reader :
+      attr_reader :records, :extractor
     end
   end
 end
data/lib/nous/fetcher/page_extractor.rb
CHANGED

@@ -3,24 +3,30 @@
 module Nous
   class Fetcher < Command
     class PageExtractor < Command
-      def initialize(extractor:,
+      def initialize(extractor:, record:)
         @extractor = extractor
-        @
+        @record = record
       end
 
       def call
-        result = extractor.extract(
+        result = extractor.extract(record)
 
         unless result.success?
-          warn("[nous] extract skip #{
+          warn("[nous] extract skip #{record.final_url}: #{result.error.message}") if Nous.configuration.debug?
           return failure(result.error)
         end
 
         page = Page.new(
           title: result.payload.title,
-          url:
-          pathname:
-          content: result.payload.content
+          url: record.final_url,
+          pathname: record.pathname,
+          content: result.payload.content,
+          metadata: {
+            extractor: extractor.class.name,
+            requested_url: record.requested_url,
+            content_type: record.content_type,
+            redirected: record.redirected
+          }
         )
 
         success(payload: page)

@@ -28,7 +34,7 @@ module Nous
 
       private
 
-      attr_reader :extractor, :
+      attr_reader :extractor, :record
     end
   end
 end
data/lib/nous/fetcher.rb
CHANGED

@@ -4,21 +4,38 @@ module Nous
   class Fetcher < Command
     class FetchError < StandardError; end
 
-    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil)
+    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil, details: false)
       @seed_url = seed_url
       @extractor = extractor
       @http_client = http_client
+      @single_page = Nous.configuration.single_page?
+      @details = details
     end
 
     def call
-
-
-
+      records = crawl
+      successful_records, failed_records = records.partition(&:ok)
+
+      if single_page && !details && successful_records.empty?
+        raise FetchError, failed_records.first&.error || "fetch failed"
+      end
+
+      pages = extract(successful_records)
+
+      if details
+        success(payload: FetchResult.new(
+          pages: pages,
+          failures: build_failures(failed_records),
+          total_requested: records.length
+        ))
+      else
+        success(payload: pages)
+      end
     end
 
     private
 
-    attr_reader :seed_url, :extractor, :http_client
+    attr_reader :seed_url, :extractor, :http_client, :single_page, :details
 
     def crawl
       result = Crawler.call(seed_url:, http_client:)

@@ -27,11 +44,20 @@ module Nous
       result.payload
     end
 
-    def extract(
-      result = ExtractionRunner.call(
+    def extract(records)
+      result = ExtractionRunner.call(records:, extractor:)
       raise FetchError, result.error.message if result.failure?
 
       result.payload
     end
+
+    def build_failures(records)
+      records.map do |record|
+        {
+          requested_url: record.requested_url,
+          error: record.error
+        }
+      end
+    end
   end
 end
data/lib/nous/primitives/fetch_record.rb
ADDED

@@ -0,0 +1,26 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchRecord = Data.define(
+    :requested_url,
+    :final_url,
+    :pathname,
+    :html,
+    :content_type,
+    :ok,
+    :error,
+    :redirected
+  ) do
+    def initialize(
+      requested_url:,
+      pathname:, final_url: nil,
+      html: nil,
+      content_type: nil,
+      ok: true,
+      error: nil,
+      redirected: false
+    )
+      super
+    end
+  end
+end
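A minimal sketch of how a `FetchRecord`-style value object represents both outcomes of a fetch. This assumes Ruby 3.2+ for `Data.define`; the field list mirrors the primitive added in 0.4.0, but the example values are hypothetical.

```ruby
# Sketch only: same fields as nous's FetchRecord, illustrative data.
FetchRecord = Data.define(:requested_url, :final_url, :pathname, :html,
                          :content_type, :ok, :error, :redirected)

# A successful fetch carries the final URL, HTML body, and content type.
fetched = FetchRecord.new(
  requested_url: "https://example.com", final_url: "https://example.com/",
  pathname: "/", html: "<html></html>", content_type: "text/html; charset=utf-8",
  ok: true, error: nil, redirected: true
)

# A failed fetch keeps provenance (requested_url, pathname) plus the error.
failed = FetchRecord.new(
  requested_url: "https://example.com/missing", final_url: nil,
  pathname: "/missing", html: nil, content_type: nil,
  ok: false, error: "status 404", redirected: false
)

# Callers can split a batch by outcome in one pass:
ok_records, bad_records = [fetched, failed].partition(&:ok)
```

Because success and failure share one shape, downstream code like the fetcher's `partition(&:ok)` needs no special cases.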
data/lib/nous/primitives/fetch_result.rb
ADDED

@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchResult = Data.define(:pages, :failures, :total_requested) do
+    def succeeded
+      pages.length
+    end
+
+    def failed
+      failures.length
+    end
+
+    def all_succeeded?
+      failures.empty?
+    end
+
+    def any_succeeded?
+      pages.any?
+    end
+  end
+end
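The `FetchResult` predicates are thin wrappers over the two arrays, which makes their behavior easy to check in isolation. A standalone sketch (assuming Ruby 3.2+ for `Data.define`; the page strings and failure hash below are made-up values, not nous output):

```ruby
# Sketch mirroring FetchResult's predicate methods on illustrative data.
FetchResult = Data.define(:pages, :failures, :total_requested) do
  def all_succeeded?
    failures.empty?
  end

  def any_succeeded?
    pages.any?
  end
end

result = FetchResult.new(
  pages: ["extracted page"],
  failures: [{requested_url: "https://example.com/broken", error: "timeout after 10s"}],
  total_requested: 2
)
```

With one success and one failure, `any_succeeded?` is true while `all_succeeded?` is false, which is exactly the partial-success case the `details: true` API exists to surface.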
data/lib/nous/primitives/page.rb
CHANGED
data/lib/nous/serializer.rb
CHANGED

@@ -43,6 +43,8 @@ module Nous
       <page>
       <title>#{page.title}</title>
       <url>#{page.url}</url>
+      <pathname>#{page.pathname}</pathname>
+      <extractor>#{page.metadata[:extractor]}</extractor>
       <content>
       #{page.content}
       </content>

@@ -51,7 +53,13 @@ module Nous
     end
 
     def json_page(page)
-      {
+      {
+        title: page.title,
+        url: page.url,
+        pathname: page.pathname,
+        content: page.content,
+        metadata: page.metadata
+      }
     end
   end
 end
data/lib/nous/version.rb
CHANGED
data/lib/nous.rb
CHANGED

@@ -18,10 +18,10 @@ module Nous
     @configuration = nil
   end
 
-  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, **options)
+  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, details: false, **options)
     configure(**options)
 
-    result = Fetcher.call(seed_url:, extractor:, http_client:)
+    result = Fetcher.call(seed_url:, extractor:, http_client:, details:)
     raise result.error if result.failure?
 
     result.payload
metadata
CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nous
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.4.0
 platform: ruby
 authors:
 - Dan Frenette

@@ -243,8 +243,9 @@ files:
 - lib/nous/fetcher/page_extractor.rb
 - lib/nous/primitives/configuration.rb
 - lib/nous/primitives/extracted_content.rb
+- lib/nous/primitives/fetch_record.rb
+- lib/nous/primitives/fetch_result.rb
 - lib/nous/primitives/page.rb
-- lib/nous/primitives/raw_page.rb
 - lib/nous/primitives/url.rb
 - lib/nous/serializer.rb
 - lib/nous/url_resolver.rb