nous 0.3.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7636d207654dbf38a64aeec480164c0e57b3c8bf98ac8373e576f692896fb3a3
-  data.tar.gz: 62ae3b01ec837d71caf104710c42bde82df6d50e6c7acc50252f2902ef9b2046
+  metadata.gz: c73c21d427c9bb99cc148e089ed5899e7aa9e3ca86a4825540380d41771354d2
+  data.tar.gz: 4b361a7aed3c0dfb28a6a650b0813d371622c82b910b8880502281047152a739
 SHA512:
-  metadata.gz: af52f527a8720d46cd00f3a42814d432730d105aed05ebb2435f1546afb2140bd88fd1e2c6f4e75c0226afd5ef6c9072c6919518bae366047eb022f24b30ffcd
-  data.tar.gz: '049b133f406f694771617c34d3adfef2aa64ef3aa5608d89d98230b2f596e4cdef831e53f9ee039aa247a6c588900fc51cfd07a7b456dec3ee524e83abe08b93'
+  metadata.gz: aacbc4777dc1e5bd66513ddc3bc5a1f667276ac2e89ff781f7b43762133ae640d68bc1191c11b61ee00ac7911a128fcf6ad80653f86d369d284223e830f09120
+  data.tar.gz: 90fc8f0cf3c30c6e06bf6aeebd2790539ff80d0d63469a367c4a09b008c9ceed3dedc0d18528e5f6fab3c4a02103b4d2c7c9b3788b1708c6a1a46790d9f5cbab
data/CHANGELOG.md CHANGED
@@ -1,5 +1,51 @@
 ## [Unreleased]
 
+## [0.4.0] - 2026-04-11
+
+### Added
+
+- **New `details: true` option for `Nous.fetch`** - Returns a `FetchResult` object containing both successful pages and failed fetch/extraction attempts. This enables explicit failure handling without exceptions.
+  ```ruby
+  result = Nous.fetch("https://example.com", details: true)
+  result.pages    # Array<Page> - successfully extracted
+  result.failures # [{requested_url:, error:}, ...]
+  ```
+
+- **Page metadata** - Every extracted page now includes provenance information:
+  - `extractor`: which extractor backend was used (e.g., "Nous::Extractor::Default")
+  - `requested_url`: the original URL before any redirects
+  - `content_type`: the HTTP Content-Type header from the response
+  - `redirected`: boolean indicating whether redirects occurred
+
+- **`FetchRecord` internal primitive** - Unified fetch result representation that captures both success and failure cases with full provenance tracking. Replaces the previous `RawPage`, which only handled successful fetches.
+
+- **`Configuration#single_page?` helper** - Convenience predicate for checking whether the current configuration is in single-page (non-recursive) mode.
+
+### Changed
+
+- **Improved title extraction** - Title extraction now uses a fallback chain: readability-extracted title → HTML `<title>` tag → first `<h1>` element. This significantly improves title reliability on pages where readability fails to identify the title.
+
+- **Reduced aggressive DOM stripping** - The default extractor now preserves more content before readability processing. Previously removed elements (`header`, `img`, `video`, `svg`, `link`) are now retained, providing better context for readability scoring and preserving useful content such as captions and bylines.
+
+- **Unified fetch contract** - Both single-page and recursive crawling now use the same internal `FetchRecord` structure, ensuring consistent provenance tracking and failure handling across all fetch modes.
+
+- **Serializer schema updated** - Both text and JSON output formats now include:
+  - `pathname`: URL path component
+  - `extractor`: which extractor processed the page
+  - full metadata object (JSON only)
+
+### Fixed
+
+- **JSON serialization** - The JSON output now correctly includes the `pathname` field, which was documented but missing in previous versions.
+
+- **Extraction failure visibility** - Previously, extraction failures were only visible with debug logging enabled. The new `FetchResult` structure makes failures programmatically accessible.
+
+### Internal Changes
+
+- **Duck-typed extractor interface** - Extractors now receive the full `FetchRecord` object and access only the fields they need (`Default` uses `record.html`; `Jina` uses `record.final_url`).
+
+- **Removed `RawPage` primitive** - Superseded by the richer `FetchRecord`, which handles both success and failure uniformly.
+
 ## [0.3.0] - 2026-02-23
 
 - Remove `Nous::Error` base hierarchy; colocated errors inherit directly from `StandardError` with descriptive names
@@ -29,3 +75,9 @@
 ## [0.1.0] - 2026-02-21
 
 - Initial release
+
+[Unreleased]: https://github.com/danfrenette/nous/compare/v0.4.0...HEAD
+[0.4.0]: https://github.com/danfrenette/nous/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/danfrenette/nous/compare/v0.2.0...v0.3.0
+[0.2.0]: https://github.com/danfrenette/nous/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/danfrenette/nous/releases/tag/v0.1.0
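As a rough sketch of how a caller might consume the failures array the changelog entry describes, here hand-built hashes stand in for a real `Nous.fetch(..., details: true)` call, so no network or gem install is assumed:

```ruby
# Stand-in for the failures array a FetchResult would carry:
# each entry pairs the requested URL with an error string.
failures = [
  {requested_url: "https://example.com/missing", error: "status 404"},
  {requested_url: "https://example.com/feed.xml", error: "non-html content"}
]

# Group failures by error for a quick report, as a caller might
# after a details: true fetch.
report = failures
  .group_by { |f| f[:error] }
  .transform_values { |fs| fs.map { |f| f[:requested_url] } }

report.each { |error, urls| puts "#{error}: #{urls.join(", ")}" }
```

Because failures are plain hashes with `:requested_url` and `:error` keys, they compose with ordinary `Enumerable` methods; no exception handling is needed.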
data/README.md CHANGED
@@ -64,13 +64,15 @@ nous https://example.com -d
 
 ## Ruby API
 
+### Basic Usage
+
 ```ruby
 require "nous"
 
 # Fetch pages with the default extractor
 pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)
 
-# Each page is a Nous::Page with title, url, pathname, content
+# Each page is a Nous::Page with title, url, pathname, content, metadata
 pages.each do |page|
   puts "#{page.title} (#{page.url})"
   puts page.content
@@ -89,11 +91,70 @@ pages = Nous.fetch("https://spa-site.com",
 )
 ```
 
+### Detailed Results
+
+Use the `details: true` option to receive full fetch results, including failures:
+
+```ruby
+result = Nous.fetch("https://example.com", details: true)
+
+result.pages           # Array<Nous::Page> - successfully extracted pages
+result.failures        # Array<{requested_url:, error:}> - failed fetches
+result.total_requested # Integer - total URLs attempted
+result.all_succeeded?  # Boolean - true if no failures
+result.any_succeeded?  # Boolean - true if at least one page extracted
+```
+
+This is useful when you need to handle failures explicitly:
+
+```ruby
+result = Nous.fetch("https://example.com/api-docs", details: true)
+
+if result.failures.any?
+  puts "Failed to fetch:"
+  result.failures.each do |failure|
+    puts "  #{failure[:requested_url]}: #{failure[:error]}"
+  end
+end
+
+result.pages.each do |page|
+  puts "Successfully extracted: #{page.title}"
+end
+```
+
+### Page Structure
+
+Each extracted page contains:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `title` | String | Page title (fallback chain: readability → `<title>` tag → `<h1>`) |
+| `url` | String | Final URL after redirects |
+| `pathname` | String | URL path component |
+| `content` | String | Extracted content as Markdown |
+| `metadata` | Hash | Provenance information (see below) |
+
+### Page Metadata
+
+```ruby
+page.metadata # => {
+#   extractor: "Nous::Extractor::Default",     # Which extractor was used
+#   requested_url: "https://example.com/blog", # Original URL before redirects
+#   content_type: "text/html; charset=utf-8",  # HTTP Content-Type header
+#   redirected: true                           # Whether redirects occurred
+# }
+```
+
 ## Extraction Backends
 
 ### Default (ruby-readability)
 
-Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (nav, footer, script, header), and converts to Markdown via [reverse_markdown](https://github.com/xijo/reverse_markdown). Fast and requires no external services, but cannot extract content from JS-rendered pages.
+Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (script, style, nav, footer), and converts to Markdown via [reverse_markdown](https://github.com/xijo/reverse_markdown). Fast and requires no external services, but cannot extract content from JS-rendered pages.
+
+Title extraction uses a fallback chain:
+1. Readability's extracted title
+2. Original `<title>` tag from the HTML
+3. First `<h1>` from the extracted content
 
 ### Jina Reader API
 
@@ -107,13 +168,15 @@ XML-tagged output designed for LLM context windows:
 
 ```xml
 <page>
-  <title>Page Title</title>
-  <url>https://example.com/page</url>
-  <content>
+  <title>Page Title</title>
+  <url>https://example.com/page</url>
+  <pathname>/page</pathname>
+  <extractor>Nous::Extractor::Default</extractor>
+  <content>
 # Heading
 
 Extracted markdown content...
-  </content>
+  </content>
 </page>
 ```
 
@@ -125,7 +188,13 @@ Extracted markdown content...
     "title": "Page Title",
     "url": "https://example.com/page",
     "pathname": "/page",
-    "content": "# Heading\n\nExtracted markdown content..."
+    "content": "# Heading\n\nExtracted markdown content...",
+    "metadata": {
+      "extractor": "Nous::Extractor::Default",
+      "requested_url": "https://example.com/page",
+      "content_type": "text/html; charset=utf-8",
+      "redirected": false
+    }
   }
 ]
 ```
@@ -13,20 +13,30 @@ module Nous
     def fetch(url)
       Async::Task.current.with_timeout(config.timeout) do
         result = RedirectFollower.call(client:, seed_host:, url:)
-        return skip(url, result.error.message) if result.failure?
+        return build_failed_record(url, result.error.message) if result.failure?
 
         response, final_url = result.payload
-        return skip(url, "status #{response.status}") unless response.status == 200
-        return skip(url, "non-html content") unless html?(response)
+        content_type = response.headers["content-type"].to_s
+        redirected = final_url.to_s != url
 
-        RawPage.new(url: final_url.to_s, pathname: final_url.path, html: response.read)
+        return build_failed_record(url, "status #{response.status}") unless response.status == 200
+        return build_failed_record(url, "non-html content") unless html?(content_type)
+
+        build_success_record(
+          url: url,
+          final_url: final_url.to_s,
+          pathname: final_url.path,
+          html: response.read,
+          content_type: content_type,
+          redirected: redirected
+        )
       ensure
         response&.close
       end
    rescue Async::TimeoutError
-      skip(url, "timeout after #{config.timeout}s")
+      build_failed_record(url, "timeout after #{config.timeout}s")
    rescue IOError, SocketError, Errno::ECONNREFUSED => e
-      skip(url, e.message)
+      build_failed_record(url, e.message)
    end
 
    private
@@ -37,14 +47,36 @@ module Nous
       Nous.configuration
     end
 
-    def html?(response)
-      content_type = response.headers["content-type"].to_s
+    def html?(content_type)
       HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
     end
 
-    def skip(url, reason)
-      warn("[nous] skip #{url}: #{reason}") if config.debug?
-      nil
+    def build_success_record(url:, final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(url, error)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      ).tap do |record|
+        warn("[nous] skip #{url}: #{error}") if config.debug?
+      end
    end
  end
end
@@ -9,7 +9,7 @@ module Nous
    def initialize(seed_url:, http_client: nil)
      @seed_uri = Url.new(seed_url)
      @http_client = http_client
-      @pages = []
+      @records = []
      @queue = [url_filter.canonicalize(seed_uri)]
      @seen = Set.new(queue)
    end
@@ -21,12 +21,12 @@ module Nous
        crawl(client)
      end
 
-      success(payload: pages)
+      success(payload: records)
    end
 
    private
 
-    attr_reader :seed_uri, :http_client, :pages, :queue, :seen
+    attr_reader :seed_uri, :http_client, :records, :queue, :seen
 
    def config
      Nous.configuration
@@ -37,13 +37,13 @@ module Nous
    end
 
    def fetch_and_enqueue(batch, client)
-      fetch_batch(batch, client).each do |page|
-        next unless page
+      fetch_batch(batch, client).each do |record|
+        next unless record.ok
        break unless within_limit?
 
-        pages << page
-        seen << page.url
-        enqueue_links(page)
+        records << record
+        seen << record.final_url
+        enqueue_links(record)
      end
    end
 
@@ -59,8 +59,8 @@ module Nous
      tasks.map(&:wait)
    end
 
-    def enqueue_links(page)
-      link_extractor.extract(page.url, page.html).each do |url|
+    def enqueue_links(record)
+      link_extractor.extract(record.final_url, record.html).each do |url|
        next if seen.include?(url)
 
        seen << url
@@ -69,7 +69,7 @@ module Nous
      end
    end
 
    def within_limit?
-      pages.length < config.limit
+      records.count(&:ok) < config.limit
    end
 
    def open_connection
@@ -91,7 +91,7 @@ module Nous
    end
 
    def link_extractor
-      @link_extractor ||= LinkExtractor.new(url_filter:)
+      @link_filter ||= LinkExtractor.new(url_filter:)
    end
 
    def suppress_async_warnings
@@ -20,16 +20,26 @@ module Nous
    def call
      response = connection.get(url)
      final_url = resolve_final_url(response)
+      content_type = response.headers["content-type"].to_s
+      redirected = final_url.to_s != url
+
+      record = build_record(
+        final_url: final_url.to_s,
+        pathname: final_url.path,
+        html: response.body,
+        content_type: content_type,
+        redirected: redirected
+      )
 
-      validate_host!(final_url)
-      validate_html!(response)
+      validate!(record)
 
-      raw_page = RawPage.new(url: final_url.to_s, pathname: final_url.path, html: response.body)
-      success(payload: [raw_page])
+      success(payload: [record])
    rescue FetchError => e
-      failure(e)
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
    rescue Faraday::Error => e
-      failure(FetchError.new(e.message))
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
    end
 
    private
@@ -40,19 +50,49 @@ module Nous
      Nous.configuration
    end
 
+    def build_record(final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(error:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      )
+    end
+
+    def validate!(record)
+      validate_host!(record.final_url)
+      validate_html!(record.content_type)
+    end
+
    def resolve_final_url(response)
      location = response.env.url.to_s
      Url.new(location)
    end
 
    def validate_host!(final_url)
-      return if final_url.host == seed_host
+      return if Url.new(final_url).host == seed_host
 
      raise FetchError, "redirected to #{final_url} outside #{seed_host}"
    end
 
-    def validate_html!(response)
-      content_type = response.headers["content-type"].to_s
+    def validate_html!(content_type)
      return if HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
 
      raise FetchError, "non-html content: #{content_type}"
@@ -8,7 +8,7 @@ module Nous
    class Client < Command
      class ExtractionError < StandardError; end
 
-      NOISY_TAGS = %w[script style link nav header footer img video svg].freeze
+      NOISY_TAGS = %w[script style nav footer].freeze
 
      def initialize(html:, selector: nil)
        @html = html
@@ -16,34 +16,52 @@ module Nous
      end
 
      def call
-        doc = Nokogiri::HTML(html)
-        doc = scope_to_selector(doc) if selector
-        strip_noisy_tags(doc)
+        readable = ::Readability::Document.new(prepared_html)
 
-        readable = ::Readability::Document.new(doc.to_html)
        text = Nokogiri::HTML(readable.content).text.strip
-
        return failure(ExtractionError.new("readability returned no content")) if text.empty?
 
-        success(payload: {title: readable.title, content: readable.content})
+        title = resolve_title(readable)
+        success(payload: {title: title, content: readable.content})
      end
 
      private
 
      attr_reader :html, :selector
 
-      def scope_to_selector(doc)
+      def prepared_html
+        doc = Nokogiri::HTML(html)
+        original_title(doc)
+        doc = scope(doc, selector) if selector
+        strip_tags(doc)
+        doc.to_html
+      end
+
+      def original_title(doc)
+        @original_title ||= doc.at_css("title")&.text.to_s.strip
+      end
+
+      def scope(doc, selector)
        scoped = doc.at_css(selector)
        return doc unless scoped
 
-        fragment = Nokogiri::HTML::Document.new
-        fragment.root = scoped
-        fragment
+        Nokogiri::HTML.fragment(scoped.to_html)
      end
 
-      def strip_noisy_tags(doc)
+      def strip_tags(doc)
        NOISY_TAGS.each { |tag| doc.css(tag).each(&:remove) }
      end
+
+      def resolve_title(readable)
+        title = readable.title.to_s.strip
+        title = @original_title if title.empty?
+        title = title_from_content(readable.content) if title.empty?
+        title
+      end
+
+      def title_from_content(content)
+        Nokogiri::HTML(content).at_css("h1")&.text.to_s.strip
+      end
    end
  end
end
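The fallback chain in `resolve_title` above can be sketched in isolation. This stand-in swaps Nokogiri for naive regexes purely to show the ordering (readability title, then `<title>`, then first `<h1>`), so it is an illustration rather than the gem's implementation:

```ruby
# Simplified sketch of the title fallback chain: each step only
# runs if the previous one produced an empty string.
def resolve_title(readability_title, html, content)
  title = readability_title.to_s.strip
  title = html[%r{<title>(.*?)</title>}m, 1].to_s.strip if title.empty?
  title = content[%r{<h1>(.*?)</h1>}m, 1].to_s.strip if title.empty?
  title
end

resolve_title("From Readability", "<title>Doc</title>", "<h1>H</h1>") # => "From Readability"
resolve_title("", "<title>Doc</title>", "<h1>H</h1>")                 # => "Doc"
resolve_title("", "<html></html>", "<h1>Fallback Heading</h1>")       # => "Fallback Heading"
```

The real extractor captures `@original_title` before stripping tags, which is why the `<title>` fallback still works after the DOM has been rewritten for readability.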
@@ -9,8 +9,8 @@ module Nous
      @selector = selector
    end
 
-    def extract(raw_page)
-      extracted = extract_content(raw_page.html)
+    def extract(record)
+      extracted = extract_content(record.html)
      markdown = convert_to_markdown(extracted[:content])
 
      success(payload: ExtractedContent.new(title: extracted[:title], content: markdown))
@@ -7,8 +7,8 @@ module Nous
      @client = Client.new(api_key: api_key || ENV["JINA_API_KEY"], timeout:, **client_options)
    end
 
-    def extract(raw_page)
-      body = client.get(raw_page.url)
+    def extract(record)
+      body = client.get(record.final_url)
 
      success(payload: ExtractedContent.new(
        title: body.dig("data", "title") || "",
@@ -5,14 +5,14 @@ module Nous
  class ExtractionRunner < Command
    class ExtractionError < StandardError; end
 
-    def initialize(raw_pages:, extractor:)
-      @raw_pages = raw_pages
+    def initialize(records:, extractor:)
+      @records = records
      @extractor = extractor
    end
 
    def call
-      pages = raw_pages.each_slice(Nous.configuration.concurrency).each_with_object([]) do |batch, results|
-        threads = batch.map { |raw_page| Thread.new { PageExtractor.call(extractor:, raw_page:) } }
+      pages = records.each_slice(Nous.configuration.concurrency).each_with_object([]) do |batch, results|
+        threads = batch.map { |record| Thread.new { PageExtractor.call(extractor:, record:) } }
 
        threads.each do |thread|
          result = thread.value
@@ -25,7 +25,7 @@ module Nous
 
    private
 
-    attr_reader :raw_pages, :extractor
+    attr_reader :records, :extractor
  end
end
@@ -3,24 +3,30 @@
 module Nous
  class Fetcher < Command
    class PageExtractor < Command
-      def initialize(extractor:, raw_page:)
+      def initialize(extractor:, record:)
        @extractor = extractor
-        @raw_page = raw_page
+        @record = record
      end
 
      def call
-        result = extractor.extract(raw_page)
+        result = extractor.extract(record)
 
        unless result.success?
-          warn("[nous] extract skip #{raw_page.url}: #{result.error.message}") if Nous.configuration.debug?
+          warn("[nous] extract skip #{record.final_url}: #{result.error.message}") if Nous.configuration.debug?
          return failure(result.error)
        end
 
        page = Page.new(
          title: result.payload.title,
-          url: raw_page.url,
-          pathname: raw_page.pathname,
-          content: result.payload.content
+          url: record.final_url,
+          pathname: record.pathname,
+          content: result.payload.content,
+          metadata: {
+            extractor: extractor.class.name,
+            requested_url: record.requested_url,
+            content_type: record.content_type,
+            redirected: record.redirected
+          }
        )
 
        success(payload: page)
@@ -28,7 +34,7 @@
 
      private
 
-      attr_reader :extractor, :raw_page
+      attr_reader :extractor, :record
    end
  end
end
data/lib/nous/fetcher.rb CHANGED
@@ -4,21 +4,38 @@ module Nous
  class Fetcher < Command
    class FetchError < StandardError; end
 
-    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil)
+    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil, details: false)
      @seed_url = seed_url
      @extractor = extractor
      @http_client = http_client
+      @single_page = Nous.configuration.single_page?
+      @details = details
    end
 
    def call
-      raw_pages = crawl
-      pages = extract(raw_pages)
-      success(payload: pages)
+      records = crawl
+      successful_records, failed_records = records.partition(&:ok)
+
+      if single_page && !details && successful_records.empty?
+        raise FetchError, failed_records.first&.error || "fetch failed"
+      end
+
+      pages = extract(successful_records)
+
+      if details
+        success(payload: FetchResult.new(
+          pages: pages,
+          failures: build_failures(failed_records),
+          total_requested: records.length
+        ))
+      else
+        success(payload: pages)
+      end
    end
 
    private
 
-    attr_reader :seed_url, :extractor, :http_client
+    attr_reader :seed_url, :extractor, :http_client, :single_page, :details
 
    def crawl
      result = Crawler.call(seed_url:, http_client:)
@@ -27,11 +44,20 @@ module Nous
      result.payload
    end
 
-    def extract(raw_pages)
-      result = ExtractionRunner.call(raw_pages:, extractor:)
+    def extract(records)
+      result = ExtractionRunner.call(records:, extractor:)
      raise FetchError, result.error.message if result.failure?
 
      result.payload
    end
+
+    def build_failures(records)
+      records.map do |record|
+        {
+          requested_url: record.requested_url,
+          error: record.error
+        }
+      end
+    end
  end
end
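The `partition(&:ok)` step in `Fetcher#call` above can be sketched with a plain `Struct` standing in for `FetchRecord`: records flagged ok go on to extraction, the rest become the failures array.

```ruby
# Minimal stand-in for FetchRecord, keeping only the fields the
# partition step cares about.
Record = Struct.new(:requested_url, :ok, :error, keyword_init: true)

records = [
  Record.new(requested_url: "https://example.com/", ok: true, error: nil),
  Record.new(requested_url: "https://example.com/broken", ok: false, error: "timeout after 10s")
]

# Split into successes and failures in one pass, as the fetcher does.
successful, failed = records.partition(&:ok)

# Failures are reshaped into the {requested_url:, error:} hashes
# that FetchResult#failures exposes.
failures = failed.map { |r| {requested_url: r.requested_url, error: r.error} }
```

Because `partition` preserves order within each half, `failures` lists failed URLs in the order they were attempted.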
@@ -12,5 +12,6 @@ module Nous
  ) do
    def debug? = debug
    def recursive? = recursive
+    def single_page? = !recursive
  end
end
@@ -0,0 +1,26 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchRecord = Data.define(
+    :requested_url,
+    :final_url,
+    :pathname,
+    :html,
+    :content_type,
+    :ok,
+    :error,
+    :redirected
+  ) do
+    def initialize(
+      requested_url:,
+      pathname:, final_url: nil,
+      html: nil,
+      content_type: nil,
+      ok: true,
+      error: nil,
+      redirected: false
+    )
+      super
+    end
+  end
+end
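The defaulted initializer above means failure records can be built from just the required fields. A minimal stand-in (the `Example` class here is hypothetical, using the Ruby 3.2+ `Data` class) shows the behavior:

```ruby
# Data.define with a custom initialize: keyword defaults fill in
# any members the caller omits, and `super` forwards them all.
Example = Data.define(:requested_url, :pathname, :ok, :error) do
  def initialize(requested_url:, pathname:, ok: true, error: nil)
    super
  end
end

record = Example.new(requested_url: "https://example.com/a", pathname: "/a")
record.ok    # => true
record.error # => nil
```

This is why `FetchRecord.new(requested_url: ..., pathname: ...)` alone yields a well-formed success record, while the builder methods override `ok:`, `error:`, and the rest explicitly.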
@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchResult = Data.define(:pages, :failures, :total_requested) do
+    def succeeded
+      pages.length
+    end
+
+    def failed
+      failures.length
+    end
+
+    def all_succeeded?
+      failures.empty?
+    end
+
+    def any_succeeded?
+      pages.any?
+    end
+  end
+end
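The predicates above reduce to simple collection checks; a stand-in with the same shape (the `FetchResultish` name is hypothetical, Ruby 3.2+ `Data`) shows the intended truth table for a mixed result:

```ruby
# Mirror of the FetchResult predicates over plain arrays.
FetchResultish = Data.define(:pages, :failures, :total_requested) do
  def all_succeeded? = failures.empty?
  def any_succeeded? = pages.any?
end

mixed = FetchResultish.new(
  pages: [:page],
  failures: [{requested_url: "https://example.com/x", error: "status 500"}],
  total_requested: 2
)

mixed.all_succeeded? # => false (one failure present)
mixed.any_succeeded? # => true  (one page extracted)
```

With both predicates, callers can distinguish total failure, partial success, and clean success without counting arrays themselves.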
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module Nous
-  Page = Data.define(:title, :url, :pathname, :content)
+  Page = Data.define(:title, :url, :pathname, :content, :metadata)
 end
@@ -43,6 +43,8 @@ module Nous
      <page>
      <title>#{page.title}</title>
      <url>#{page.url}</url>
+      <pathname>#{page.pathname}</pathname>
+      <extractor>#{page.metadata[:extractor]}</extractor>
      <content>
      #{page.content}
      </content>
@@ -51,7 +53,13 @@ module Nous
    end
 
    def json_page(page)
-      {title: page.title, url: page.url, content: page.content}
+      {
+        title: page.title,
+        url: page.url,
+        pathname: page.pathname,
+        content: page.content,
+        metadata: page.metadata
+      }
    end
  end
end
data/lib/nous/version.rb CHANGED
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module Nous
-  VERSION = "0.3.0"
+  VERSION = "0.4.0"
 end
data/lib/nous.rb CHANGED
@@ -18,10 +18,10 @@ module Nous
    @configuration = nil
  end
 
-  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, **options)
+  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, details: false, **options)
    configure(**options)
 
-    result = Fetcher.call(seed_url:, extractor:, http_client:)
+    result = Fetcher.call(seed_url:, extractor:, http_client:, details:)
    raise result.error if result.failure?
 
    result.payload
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nous
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Dan Frenette
@@ -243,8 +243,9 @@ files:
 - lib/nous/fetcher/page_extractor.rb
 - lib/nous/primitives/configuration.rb
 - lib/nous/primitives/extracted_content.rb
+- lib/nous/primitives/fetch_record.rb
+- lib/nous/primitives/fetch_result.rb
 - lib/nous/primitives/page.rb
-- lib/nous/primitives/raw_page.rb
 - lib/nous/primitives/url.rb
 - lib/nous/serializer.rb
 - lib/nous/url_resolver.rb
@@ -1,5 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module Nous
4
- RawPage = Data.define(:url, :pathname, :html)
5
- end