nous 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +52 -0
- data/README.md +76 -7
- data/lib/nous/crawler/async_page_fetcher.rb +43 -11
- data/lib/nous/crawler/recursive_page_fetcher.rb +12 -12
- data/lib/nous/crawler/single_page_fetcher.rb +49 -9
- data/lib/nous/extractor/default/client.rb +30 -12
- data/lib/nous/extractor/default.rb +2 -2
- data/lib/nous/extractor/jina.rb +2 -2
- data/lib/nous/fetcher/extraction_runner.rb +5 -5
- data/lib/nous/fetcher/page_extractor.rb +14 -8
- data/lib/nous/fetcher.rb +33 -7
- data/lib/nous/primitives/configuration.rb +1 -0
- data/lib/nous/primitives/fetch_record.rb +26 -0
- data/lib/nous/primitives/fetch_result.rb +21 -0
- data/lib/nous/primitives/page.rb +1 -1
- data/lib/nous/serializer.rb +9 -1
- data/lib/nous/version.rb +1 -1
- data/lib/nous.rb +2 -2
- metadata +3 -2
- data/lib/nous/primitives/raw_page.rb +0 -5
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c73c21d427c9bb99cc148e089ed5899e7aa9e3ca86a4825540380d41771354d2
+  data.tar.gz: 4b361a7aed3c0dfb28a6a650b0813d371622c82b910b8880502281047152a739
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: aacbc4777dc1e5bd66513ddc3bc5a1f667276ac2e89ff781f7b43762133ae640d68bc1191c11b61ee00ac7911a128fcf6ad80653f86d369d284223e830f09120
+  data.tar.gz: 90fc8f0cf3c30c6e06bf6aeebd2790539ff80d0d63469a367c4a09b008c9ceed3dedc0d18528e5f6fab3c4a02103b4d2c7c9b3788b1708c6a1a46790d9f5cbab
data/CHANGELOG.md
CHANGED

@@ -1,5 +1,51 @@
 ## [Unreleased]
 
+## [0.4.0] - 2026-04-11
+
+### Added
+
+- **New `details: true` option for `Nous.fetch`** - Returns a `FetchResult` object containing both successful pages and failed fetch/extraction attempts. This enables explicit failure handling without exceptions.
+  ```ruby
+  result = Nous.fetch("https://example.com", details: true)
+  result.pages    # Array<Page> - successfully extracted
+  result.failures # [{requested_url:, error:}, ...]
+  ```
+
+- **Page metadata** - Every extracted page now includes provenance information:
+  - `extractor`: Which extractor backend was used (e.g., "Nous::Extractor::Default")
+  - `requested_url`: The original URL before any redirects
+  - `content_type`: HTTP Content-Type header from the response
+  - `redirected`: Boolean indicating if redirects occurred
+
+- **FetchRecord internal primitive** - Unified fetch result representation that captures both success and failure cases with full provenance tracking. Replaces the previous `RawPage`, which only handled successful fetches.
+
+- **Configuration#single_page? helper** - Convenience predicate for checking whether the current configuration is in single-page (non-recursive) mode.
+
+### Changed
+
+- **Improved title extraction** - Title extraction now uses a fallback chain: readability-extracted title → HTML `<title>` tag → first `<h1>` element. This significantly improves title reliability on pages where readability fails to identify the title.
+
+- **Reduced aggressive DOM stripping** - The default extractor now preserves more content before readability processing. Previously removed elements (`header`, `img`, `video`, `svg`, `link`) are now retained, providing better context for readability scoring and preserving useful content like captions and bylines.
+
+- **Unified fetch contract** - Both single-page and recursive crawling now use the same internal `FetchRecord` structure, ensuring consistent provenance tracking and failure handling across all fetch modes.
+
+- **Serializer schema updated** - Both text and JSON output formats now include:
+  - `pathname`: URL path component
+  - `extractor`: Which extractor processed the page
+  - Full metadata object (JSON only)
+
+### Fixed
+
+- **JSON serialization** - The JSON output now correctly includes the `pathname` field that was documented but missing in previous versions.
+
+- **Extraction failure visibility** - Previously, extraction failures were only visible with debug logging enabled. The new `FetchResult` structure makes failures programmatically accessible.
+
+### Internal Changes
+
+- **Duck-typed extractor interface** - Extractors now receive the full `FetchRecord` object and access only the fields they need (`Default` uses `record.html`, `Jina` uses `record.final_url`).
+
+- **Removed `RawPage` primitive** - Superseded by the richer `FetchRecord`, which handles both success and failure uniformly.
+
 ## [0.3.0] - 2026-02-23
 
 - Remove `Nous::Error` base hierarchy; colocated errors inherit directly from `StandardError` with descriptive names

@@ -29,3 +75,9 @@
 ## [0.1.0] - 2026-02-21
 
 - Initial release
+
+[Unreleased]: https://github.com/danfrenette/nous/compare/v0.4.0...HEAD
+[0.4.0]: https://github.com/danfrenette/nous/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/danfrenette/nous/compare/v0.2.0...v0.3.0
+[0.2.0]: https://github.com/danfrenette/nous/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/danfrenette/nous/releases/tag/v0.1.0
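The title-extraction fallback chain described in the 0.4.0 changelog entry can be sketched in a few lines. This is a minimal standalone approximation: the real extractor uses readability and Nokogiri, while this illustration substitutes simple regexes so it runs without gems.

```ruby
# Hedged sketch of the 0.4.0 title fallback chain:
# readability title -> HTML <title> tag -> first <h1>.
# (Illustrative only; nous itself parses with Nokogiri, not regexes.)
def resolve_title(readability_title, html)
  title = readability_title.to_s.strip
  # Fall back to the original <title> tag when readability found nothing.
  title = html[%r{<title>(.*?)</title>}m, 1].to_s.strip if title.empty?
  # Last resort: the first <h1> in the document.
  title = html[%r{<h1[^>]*>(.*?)</h1>}m, 1].to_s.strip if title.empty?
  title
end
```

Each step only fires when the previous one produced an empty string, so a page with a usable readability title is unaffected by the fallbacks.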
data/README.md
CHANGED

@@ -64,13 +64,15 @@ nous https://example.com -d
 
 ## Ruby API
 
+### Basic Usage
+
 ```ruby
 require "nous"
 
 # Fetch pages with the default extractor
 pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)
 
-# Each page is a Nous::Page with title, url, pathname, content
+# Each page is a Nous::Page with title, url, pathname, content, metadata
 pages.each do |page|
   puts "#{page.title} (#{page.url})"
   puts page.content

@@ -89,11 +91,70 @@ pages = Nous.fetch("https://spa-site.com",
 )
 ```
 
+### Detailed Results
+
+Use the `details: true` option to receive full fetch results including failures:
+
+```ruby
+result = Nous.fetch("https://example.com", details: true)
+
+result.pages           # Array<Nous::Page> - successfully extracted pages
+result.failures        # Array<{requested_url:, error:}> - failed fetches
+result.total_requested # Integer - total URLs attempted
+result.all_succeeded?  # Boolean - true if no failures
+result.any_succeeded?  # Boolean - true if at least one page extracted
+```
+
+This is useful when you need to handle failures explicitly:
+
+```ruby
+result = Nous.fetch("https://example.com/api-docs", details: true)
+
+if result.failures.any?
+  puts "Failed to fetch:"
+  result.failures.each do |failure|
+    puts "  #{failure[:requested_url]}: #{failure[:error]}"
+  end
+end
+
+result.pages.each do |page|
+  puts "Successfully extracted: #{page.title}"
+end
+```
+
+### Page Structure
+
+Each extracted page contains:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `title` | String | Page title (fallback chain: readability → `<title>` tag → `<h1>`) |
+| `url` | String | Final URL after redirects |
+| `pathname` | String | URL path component |
+| `content` | String | Extracted content as Markdown |
+| `metadata` | Hash | Provenance information (see below) |
+
+### Page Metadata
+
+```ruby
+page.metadata # => {
+#   extractor: "Nous::Extractor::Default",     # Which extractor was used
+#   requested_url: "https://example.com/blog", # Original URL before redirects
+#   content_type: "text/html; charset=utf-8",  # HTTP Content-Type header
+#   redirected: true                           # Whether redirects occurred
+# }
+```
+
 ## Extraction Backends
 
 ### Default (ruby-readability)
 
-Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (
+Parses static HTML using [ruby-readability](https://github.com/cantino/ruby-readability), strips noisy elements (script, style, nav, footer), and converts to Markdown via [reverse_markdown](https://github.com/xijo/reverse_markdown). Fast and requires no external services, but cannot extract content from JS-rendered pages.
+
+Title extraction uses a fallback chain:
+1. Readability's extracted title
+2. Original `<title>` tag from HTML
+3. First `<h1>` from extracted content
 
 ### Jina Reader API
 

@@ -107,13 +168,15 @@ XML-tagged output designed for LLM context windows:
 
 ```xml
 <page>
-<title>Page Title</title>
-<url>https://example.com/page</url>
-<
+  <title>Page Title</title>
+  <url>https://example.com/page</url>
+  <pathname>/page</pathname>
+  <extractor>Nous::Extractor::Default</extractor>
+  <content>
 # Heading
 
 Extracted markdown content...
-</content>
+  </content>
 </page>
 ```
 

@@ -125,7 +188,13 @@ Extracted markdown content...
     "title": "Page Title",
     "url": "https://example.com/page",
     "pathname": "/page",
-    "content": "# Heading\n\nExtracted markdown content..."
+    "content": "# Heading\n\nExtracted markdown content...",
+    "metadata": {
+      "extractor": "Nous::Extractor::Default",
+      "requested_url": "https://example.com/page",
+      "content_type": "text/html; charset=utf-8",
+      "redirected": false
+    }
   }
 ]
 ```
data/lib/nous/crawler/async_page_fetcher.rb
CHANGED

@@ -13,20 +13,30 @@ module Nous
     def fetch(url)
       Async::Task.current.with_timeout(config.timeout) do
         result = RedirectFollower.call(client:, seed_host:, url:)
-        return
+        return build_failed_record(url, result.error.message) if result.failure?
 
         response, final_url = result.payload
-
-
+        content_type = response.headers["content-type"].to_s
+        redirected = final_url.to_s != url
 
-
+        return build_failed_record(url, "status #{response.status}") unless response.status == 200
+        return build_failed_record(url, "non-html content") unless html?(content_type)
+
+        build_success_record(
+          url: url,
+          final_url: final_url.to_s,
+          pathname: final_url.path,
+          html: response.read,
+          content_type: content_type,
+          redirected: redirected
+        )
       ensure
         response&.close
       end
     rescue Async::TimeoutError
-
+      build_failed_record(url, "timeout after #{config.timeout}s")
     rescue IOError, SocketError, Errno::ECONNREFUSED => e
-
+      build_failed_record(url, e.message)
     end
 
     private

@@ -37,14 +47,36 @@ module Nous
       Nous.configuration
     end
 
-    def html?(
-      content_type = response.headers["content-type"].to_s
+    def html?(content_type)
       HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
     end
 
-    def
-
-
+    def build_success_record(url:, final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(url, error)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      ).tap do |record|
+        warn("[nous] skip #{url}: #{error}") if config.debug?
+      end
     end
   end
 end
data/lib/nous/crawler/recursive_page_fetcher.rb
CHANGED

@@ -9,7 +9,7 @@ module Nous
     def initialize(seed_url:, http_client: nil)
       @seed_uri = Url.new(seed_url)
       @http_client = http_client
-      @
+      @records = []
       @queue = [url_filter.canonicalize(seed_uri)]
       @seen = Set.new(queue)
     end

@@ -21,12 +21,12 @@ module Nous
         crawl(client)
       end
 
-      success(payload:
+      success(payload: records)
     end
 
     private
 
-    attr_reader :seed_uri, :http_client, :
+    attr_reader :seed_uri, :http_client, :records, :queue, :seen
 
     def config
       Nous.configuration

@@ -37,13 +37,13 @@ module Nous
     end
 
     def fetch_and_enqueue(batch, client)
-      fetch_batch(batch, client).each do |
-        next unless
+      fetch_batch(batch, client).each do |record|
+        next unless record.ok
         break unless within_limit?
 
-
-        seen <<
-        enqueue_links(
+        records << record
+        seen << record.final_url
+        enqueue_links(record)
       end
     end
 

@@ -59,8 +59,8 @@ module Nous
       tasks.map(&:wait)
     end
 
-    def enqueue_links(
-      link_extractor.extract(
+    def enqueue_links(record)
+      link_extractor.extract(record.final_url, record.html).each do |url|
        next if seen.include?(url)
 
        seen << url

@@ -69,7 +69,7 @@ module Nous
     end
 
     def within_limit?
-
+      records.count(&:ok) < config.limit
     end
 
     def open_connection

@@ -91,7 +91,7 @@ module Nous
     end
 
     def link_extractor
-      @
+      @link_filter ||= LinkExtractor.new(url_filter:)
     end
 
     def suppress_async_warnings
data/lib/nous/crawler/single_page_fetcher.rb
CHANGED

@@ -20,16 +20,26 @@ module Nous
     def call
       response = connection.get(url)
       final_url = resolve_final_url(response)
+      content_type = response.headers["content-type"].to_s
+      redirected = final_url.to_s != url
+
+      record = build_record(
+        final_url: final_url.to_s,
+        pathname: final_url.path,
+        html: response.body,
+        content_type: content_type,
+        redirected: redirected
+      )
 
-
-      validate_html!(response)
+      validate!(record)
 
-
-      success(payload: [raw_page])
+      success(payload: [record])
     rescue FetchError => e
-
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
     rescue Faraday::Error => e
-
+      record = build_failed_record(error: e.message)
+      success(payload: [record])
     end
 
     private

@@ -40,19 +50,49 @@ module Nous
       Nous.configuration
     end
 
+    def build_record(final_url:, pathname:, html:, content_type:, redirected:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: final_url,
+        pathname: pathname,
+        html: html,
+        content_type: content_type,
+        ok: true,
+        error: nil,
+        redirected: redirected
+      )
+    end
+
+    def build_failed_record(error:)
+      FetchRecord.new(
+        requested_url: url,
+        final_url: nil,
+        pathname: Url.new(url).path,
+        html: nil,
+        content_type: nil,
+        ok: false,
+        error: error,
+        redirected: false
+      )
+    end
+
+    def validate!(record)
+      validate_host!(record.final_url)
+      validate_html!(record.content_type)
+    end
+
     def resolve_final_url(response)
       location = response.env.url.to_s
       Url.new(location)
     end
 
     def validate_host!(final_url)
-      return if final_url.host == seed_host
+      return if Url.new(final_url).host == seed_host
 
       raise FetchError, "redirected to #{final_url} outside #{seed_host}"
     end
 
-    def validate_html!(
-      content_type = response.headers["content-type"].to_s
+    def validate_html!(content_type)
       return if HTML_CONTENT_TYPES.any? { |type| content_type.include?(type) }
 
       raise FetchError, "non-html content: #{content_type}"
data/lib/nous/extractor/default/client.rb
CHANGED

@@ -8,7 +8,7 @@ module Nous
     class Client < Command
       class ExtractionError < StandardError; end
 
-      NOISY_TAGS = %w[script style
+      NOISY_TAGS = %w[script style nav footer].freeze
 
       def initialize(html:, selector: nil)
         @html = html

@@ -16,34 +16,52 @@ module Nous
       end
 
       def call
-
-        doc = scope_to_selector(doc) if selector
-        strip_noisy_tags(doc)
+        readable = ::Readability::Document.new(prepared_html)
 
-        readable = ::Readability::Document.new(doc.to_html)
         text = Nokogiri::HTML(readable.content).text.strip
-
         return failure(ExtractionError.new("readability returned no content")) if text.empty?
 
-
+        title = resolve_title(readable)
+        success(payload: {title: title, content: readable.content})
       end
 
       private
 
       attr_reader :html, :selector
 
-      def
+      def prepared_html
+        doc = Nokogiri::HTML(html)
+        original_title(doc)
+        doc = scope(doc, selector) if selector
+        strip_tags(doc)
+        doc.to_html
+      end
+
+      def original_title(doc)
+        @original_title ||= doc.at_css("title")&.text.to_s.strip
+      end
+
+      def scope(doc, selector)
        scoped = doc.at_css(selector)
        return doc unless scoped
 
-
-        fragment.root = scoped
-        fragment
+        Nokogiri::HTML.fragment(scoped.to_html)
       end
 
-      def
+      def strip_tags(doc)
        NOISY_TAGS.each { |tag| doc.css(tag).each(&:remove) }
       end
+
+      def resolve_title(readable)
+        title = readable.title.to_s.strip
+        title = @original_title if title.empty?
+        title = title_from_content(readable.content) if title.empty?
+        title
+      end
+
+      def title_from_content(content)
+        Nokogiri::HTML(content).at_css("h1")&.text.to_s.strip
+      end
     end
   end
 end
data/lib/nous/extractor/default.rb
CHANGED

@@ -9,8 +9,8 @@ module Nous
       @selector = selector
     end
 
-    def extract(
-      extracted = extract_content(
+    def extract(record)
+      extracted = extract_content(record.html)
       markdown = convert_to_markdown(extracted[:content])
 
       success(payload: ExtractedContent.new(title: extracted[:title], content: markdown))
data/lib/nous/extractor/jina.rb
CHANGED

@@ -7,8 +7,8 @@ module Nous
       @client = Client.new(api_key: api_key || ENV["JINA_API_KEY"], timeout:, **client_options)
     end
 
-    def extract(
-      body = client.get(
+    def extract(record)
+      body = client.get(record.final_url)
 
       success(payload: ExtractedContent.new(
         title: body.dig("data", "title") || "",
data/lib/nous/fetcher/extraction_runner.rb
CHANGED

@@ -5,14 +5,14 @@ module Nous
     class ExtractionRunner < Command
       class ExtractionError < StandardError; end
 
-      def initialize(
-        @
+      def initialize(records:, extractor:)
+        @records = records
         @extractor = extractor
       end
 
       def call
-        pages =
-          threads = batch.map { |
+        pages = records.each_slice(Nous.configuration.concurrency).each_with_object([]) do |batch, results|
+          threads = batch.map { |record| Thread.new { PageExtractor.call(extractor:, record:) } }
 
          threads.each do |thread|
            result = thread.value

@@ -25,7 +25,7 @@ module Nous
 
       private
 
-      attr_reader :
+      attr_reader :records, :extractor
     end
   end
 end
data/lib/nous/fetcher/page_extractor.rb
CHANGED

@@ -3,24 +3,30 @@
 module Nous
   class Fetcher < Command
     class PageExtractor < Command
-      def initialize(extractor:,
+      def initialize(extractor:, record:)
         @extractor = extractor
-        @
+        @record = record
       end
 
       def call
-        result = extractor.extract(
+        result = extractor.extract(record)
 
         unless result.success?
-          warn("[nous] extract skip #{
+          warn("[nous] extract skip #{record.final_url}: #{result.error.message}") if Nous.configuration.debug?
           return failure(result.error)
         end
 
         page = Page.new(
           title: result.payload.title,
-          url:
-          pathname:
-          content: result.payload.content
+          url: record.final_url,
+          pathname: record.pathname,
+          content: result.payload.content,
+          metadata: {
+            extractor: extractor.class.name,
+            requested_url: record.requested_url,
+            content_type: record.content_type,
+            redirected: record.redirected
+          }
         )
 
         success(payload: page)

@@ -28,7 +34,7 @@ module Nous
 
       private
 
-      attr_reader :extractor, :
+      attr_reader :extractor, :record
     end
   end
 end
data/lib/nous/fetcher.rb
CHANGED

@@ -4,21 +4,38 @@ module Nous
   class Fetcher < Command
     class FetchError < StandardError; end
 
-    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil)
+    def initialize(seed_url:, extractor: Extractor::Default.new, http_client: nil, details: false)
       @seed_url = seed_url
       @extractor = extractor
       @http_client = http_client
+      @single_page = Nous.configuration.single_page?
+      @details = details
     end
 
     def call
-
-
-
+      records = crawl
+      successful_records, failed_records = records.partition(&:ok)
+
+      if single_page && !details && successful_records.empty?
+        raise FetchError, failed_records.first&.error || "fetch failed"
+      end
+
+      pages = extract(successful_records)
+
+      if details
+        success(payload: FetchResult.new(
+          pages: pages,
+          failures: build_failures(failed_records),
+          total_requested: records.length
+        ))
+      else
+        success(payload: pages)
+      end
     end
 
     private
 
-    attr_reader :seed_url, :extractor, :http_client
+    attr_reader :seed_url, :extractor, :http_client, :single_page, :details
 
     def crawl
       result = Crawler.call(seed_url:, http_client:)

@@ -27,11 +44,20 @@ module Nous
       result.payload
     end
 
-    def extract(
-      result = ExtractionRunner.call(
+    def extract(records)
+      result = ExtractionRunner.call(records:, extractor:)
       raise FetchError, result.error.message if result.failure?
 
       result.payload
     end
+
+    def build_failures(records)
+      records.map do |record|
+        {
+          requested_url: record.requested_url,
+          error: record.error
+        }
+      end
+    end
   end
 end
data/lib/nous/primitives/fetch_record.rb
ADDED

@@ -0,0 +1,26 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchRecord = Data.define(
+    :requested_url,
+    :final_url,
+    :pathname,
+    :html,
+    :content_type,
+    :ok,
+    :error,
+    :redirected
+  ) do
+    def initialize(
+      requested_url:,
+      pathname:, final_url: nil,
+      html: nil,
+      content_type: nil,
+      ok: true,
+      error: nil,
+      redirected: false
+    )
+      super
+    end
+  end
+end
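A minimal sketch of how a `FetchRecord`-style value object represents both outcomes of a fetch. This assumes Ruby 3.2+ for `Data.define`; the field list mirrors the primitive added in 0.4.0, but the example values are hypothetical.

```ruby
# Sketch only: same fields as nous's FetchRecord, illustrative data.
FetchRecord = Data.define(:requested_url, :final_url, :pathname, :html,
                          :content_type, :ok, :error, :redirected)

# A successful fetch carries the final URL, HTML body, and content type.
fetched = FetchRecord.new(
  requested_url: "https://example.com", final_url: "https://example.com/",
  pathname: "/", html: "<html></html>", content_type: "text/html; charset=utf-8",
  ok: true, error: nil, redirected: true
)

# A failed fetch keeps provenance (requested_url, pathname) plus the error.
failed = FetchRecord.new(
  requested_url: "https://example.com/missing", final_url: nil,
  pathname: "/missing", html: nil, content_type: nil,
  ok: false, error: "status 404", redirected: false
)

# Callers can split a batch by outcome in one pass:
ok_records, bad_records = [fetched, failed].partition(&:ok)
```

Because success and failure share one shape, downstream code like the fetcher's `partition(&:ok)` needs no special cases.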
data/lib/nous/primitives/fetch_result.rb
ADDED

@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+module Nous
+  FetchResult = Data.define(:pages, :failures, :total_requested) do
+    def succeeded
+      pages.length
+    end
+
+    def failed
+      failures.length
+    end
+
+    def all_succeeded?
+      failures.empty?
+    end
+
+    def any_succeeded?
+      pages.any?
+    end
+  end
+end
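The `FetchResult` predicates are thin wrappers over the two arrays, which makes their behavior easy to check in isolation. A standalone sketch (assuming Ruby 3.2+ for `Data.define`; the page strings and failure hash below are made-up values, not nous output):

```ruby
# Sketch mirroring FetchResult's predicate methods on illustrative data.
FetchResult = Data.define(:pages, :failures, :total_requested) do
  def all_succeeded?
    failures.empty?
  end

  def any_succeeded?
    pages.any?
  end
end

result = FetchResult.new(
  pages: ["extracted page"],
  failures: [{requested_url: "https://example.com/broken", error: "timeout after 10s"}],
  total_requested: 2
)
```

With one success and one failure, `any_succeeded?` is true while `all_succeeded?` is false, which is exactly the partial-success case the `details: true` API exists to surface.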
data/lib/nous/primitives/page.rb
CHANGED
data/lib/nous/serializer.rb
CHANGED

@@ -43,6 +43,8 @@ module Nous
       <page>
       <title>#{page.title}</title>
       <url>#{page.url}</url>
+      <pathname>#{page.pathname}</pathname>
+      <extractor>#{page.metadata[:extractor]}</extractor>
       <content>
       #{page.content}
       </content>

@@ -51,7 +53,13 @@ module Nous
     end
 
     def json_page(page)
-      {
+      {
+        title: page.title,
+        url: page.url,
+        pathname: page.pathname,
+        content: page.content,
+        metadata: page.metadata
+      }
     end
   end
 end
data/lib/nous/version.rb
CHANGED
data/lib/nous.rb
CHANGED

@@ -18,10 +18,10 @@ module Nous
     @configuration = nil
   end
 
-  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, **options)
+  def fetch(seed_url, extractor: Extractor::Default.new, http_client: nil, details: false, **options)
     configure(**options)
 
-    result = Fetcher.call(seed_url:, extractor:, http_client:)
+    result = Fetcher.call(seed_url:, extractor:, http_client:, details:)
     raise result.error if result.failure?
 
     result.payload
metadata
CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nous
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.4.0
 platform: ruby
 authors:
 - Dan Frenette

@@ -243,8 +243,9 @@ files:
 - lib/nous/fetcher/page_extractor.rb
 - lib/nous/primitives/configuration.rb
 - lib/nous/primitives/extracted_content.rb
+- lib/nous/primitives/fetch_record.rb
+- lib/nous/primitives/fetch_result.rb
 - lib/nous/primitives/page.rb
-- lib/nous/primitives/raw_page.rb
 - lib/nous/primitives/url.rb
 - lib/nous/serializer.rb
 - lib/nous/url_resolver.rb