coelacanth 0.3.10 → 0.4.1
- checksums.yaml +4 -4
- data/.env.example +5 -0
- data/CHANGELOG.md +3 -8
- data/Gemfile +1 -1
- data/README.md +146 -52
- data/compose.yml +5 -2
- data/config/coelacanth.yml +5 -3
- data/lib/coelacanth/client/ferrum.rb +12 -2
- data/lib/coelacanth/client/screenshot_one.rb +31 -9
- data/lib/coelacanth/configure.rb +6 -1
- data/lib/coelacanth/dom.rb +8 -2
- data/lib/coelacanth/extractor/fallback_probe.rb +34 -0
- data/lib/coelacanth/extractor/heuristic_probe.rb +175 -0
- data/lib/coelacanth/extractor/image_collector.rb +19 -0
- data/lib/coelacanth/extractor/markdown_listing_collector.rb +108 -0
- data/lib/coelacanth/extractor/markdown_renderer.rb +132 -0
- data/lib/coelacanth/extractor/metadata_probe.rb +121 -0
- data/lib/coelacanth/extractor/normalizer.rb +47 -0
- data/lib/coelacanth/extractor/utilities.rb +145 -0
- data/lib/coelacanth/extractor/weak_ml_probe.rb +136 -0
- data/lib/coelacanth/extractor.rb +67 -0
- data/lib/coelacanth/http.rb +72 -0
- data/lib/coelacanth/redirect.rb +6 -1
- data/lib/coelacanth/robots.rb +150 -0
- data/lib/coelacanth/version.rb +1 -1
- data/lib/coelacanth.rb +16 -1
- metadata +14 -2
- data/Gemfile.lock +0 -103
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 9097f36247caad8f0764313b306398e3707290eca25d5cfffc05f61e97784884
|
|
4
|
+
data.tar.gz: 19e093800bcb9ae663e0f36a1ad18d15472ff2e76f34f6ddf65c396440aea2e0
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 32c30afcf5316814e42f2ed2f5b94efe49daecdd78b7acdf1d10d64ed2c8253a86a0a0be3b66ed1824cfbefb32c5654f3abf0fb08c41411d736281f480375400
|
|
7
|
+
data.tar.gz: 4e6683d596d50535dd13df26d2c6c4339fe543013d37250c8e55f7605dc441aa3ed888fcf46f9550540583e3234bd52667cd28d1b9e098a18e3eb5faae354bed
|
data/.env.example
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
1
|
+
# Copy this file to .env and fill in the values to configure Coelacanth.
|
|
2
|
+
# Optional: only set when the remote browser requires authentication.
|
|
3
|
+
COELACANTH_REMOTE_CLIENT_AUTHORIZATION=
|
|
4
|
+
COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
|
|
5
|
+
COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
|
data/CHANGELOG.md
CHANGED
|
@@ -4,13 +4,8 @@ All notable changes to this project will be documented in this file.
|
|
|
4
4
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
5
5
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
6
6
|
|
|
7
|
-
## [v0.
|
|
8
|
-
### :bug: Bug Fixes
|
|
9
|
-
- [`f297847`](https://github.com/slidict/coelacanth/commit/f29784715aba9453599ad7fd6467d4d2d3e9e82c) - improve error handling in Ferrum client and bump version to 0.3.10 *(commit by [@yubele](https://github.com/yubele))*
|
|
10
|
-
|
|
7
|
+
## [v0.4.1] - 2025-11-03
|
|
11
8
|
### :wrench: Chores
|
|
12
|
-
- [`
|
|
13
|
-
- [`6f1b766`](https://github.com/slidict/coelacanth/commit/6f1b766ea8f65d2f9c0f1d8772cbd6f1c840a43b) - **deps**: Bump rake from 13.2.1 to 13.3.0 *(commit by [@dependabot[bot]](https://github.com/apps/dependabot))*
|
|
14
|
-
- [`0c9a6b2`](https://github.com/slidict/coelacanth/commit/0c9a6b2b9f28f686665914054016c90c49407e69) - **deps**: Bump rubocop from 1.75.7 to 1.76.1 *(commit by [@dependabot[bot]](https://github.com/apps/dependabot))*
|
|
9
|
+
- [`41e89b7`](https://github.com/slidict/coelacanth/commit/41e89b799573f6cfaf0a12e7abc5c260f1905aec) - Bump version from 0.4.0 to 0.4.1 *(commit by [@yubele](https://github.com/yubele))*
|
|
15
10
|
|
|
16
|
-
[v0.
|
|
11
|
+
[v0.4.1]: https://github.com/slidict/coelacanth/compare/v0.4.0...v0.4.1
|
data/Gemfile
CHANGED
data/README.md
CHANGED
|
@@ -1,94 +1,188 @@
|
|
|
1
|
-
#
|
|
1
|
+
# Coelacanth
|
|
2
2
|
|
|
3
3
|
[](https://badge.fury.io/rb/coelacanth)
|
|
4
4
|
[](https://github.com/slidict/coelacanth/actions)
|
|
5
5
|
[](https://opensource.org/licenses/MIT)
|
|
6
6
|
|
|
7
|
-
|
|
7
|
+
Coelacanth is a Ruby gem for extracting high-quality article content, metadata, and screenshots from arbitrary web pages. It is
|
|
8
|
+
built to power content ingestion pipelines that have to withstand layout experiments, CMS redesigns, and inconsistent markup
|
|
9
|
+
while remaining easy to extend.
|
|
10
|
+
|
|
11
|
+
It is the successor to [`web_stat`](https://rubygems.org/gems/web_stat) and continues the same goal of reliable article
|
|
12
|
+
extraction under the `slidict` umbrella. Compared to [`web_stat`](https://github.com/slidict/web_stat/) the gem has been
|
|
13
|
+
re-architected with a modern extractor pipeline, built-in screenshot capture, and a clearer configuration story so you can drop
|
|
14
|
+
it into contemporary ingestion stacks without bespoke glue code.
|
|
15
|
+
|
|
16
|
+
## Table of contents
|
|
17
|
+
- [Features](#features)
|
|
18
|
+
- [Requirements](#requirements)
|
|
19
|
+
- [Installation](#installation)
|
|
20
|
+
- [Quick start](#quick-start)
|
|
21
|
+
- [Extractor pipeline](#extractor-pipeline)
|
|
22
|
+
- [Configuration](#configuration)
|
|
23
|
+
- [Development workflow](#development-workflow)
|
|
24
|
+
- [Testing](#testing)
|
|
25
|
+
- [Contributing](#contributing)
|
|
26
|
+
- [License](#license)
|
|
8
27
|
|
|
9
|
-
##
|
|
10
|
-
|
|
11
|
-
|
|
28
|
+
## Features
|
|
29
|
+
- **Layout-resilient extraction** – Multi-stage extractor falls back from structured metadata to heuristics and lightweight
|
|
30
|
+
machine learning so you continue to get clean article bodies even when markup drifts.
|
|
31
|
+
- **UTF-8 normalization** – HTML responses are normalized into UTF-8 before parsing to play nicely with Japanese and other
|
|
32
|
+
multi-byte sources.
|
|
33
|
+
- **Screenshot capture** – Fetches full-page PNGs via a configurable browser client so you can archive visual context alongside
|
|
34
|
+
the extracted text.
|
|
35
|
+
- **Redirect resolution** – Follows HTTP redirects and long redirect chains to guarantee the extractor works on the final
|
|
36
|
+
landing page.
|
|
37
|
+
- **Configurable HTTP headers** – Inject custom headers (user agent, authorization, etc.) into the remote browser session for
|
|
38
|
+
authenticated or geo-targeted crawling.
|
|
39
|
+
|
|
40
|
+
### What's new compared to web_stat?
|
|
41
|
+
|
|
42
|
+
- **Multi-stage pipeline** – `web_stat` relied on a single-pass heuristic extractor, whereas Coelacanth layers metadata,
|
|
43
|
+
heuristic, and optional ML probes that graduate based on confidence thresholds.
|
|
44
|
+
- **First-class screenshots** – Capture full-page PNGs alongside the extracted text without writing a separate headless browser
|
|
45
|
+
integration.
|
|
46
|
+
- **Environment-aware configuration** – Manage remote browser credentials, HTTP headers, and client selection through
|
|
47
|
+
`config/coelacanth.yml` instead of hand-tuned initializer code.
|
|
48
|
+
- **Markdown-first output** – Get both Markdown and raw DOM representations from `Coelacanth.analyze` so you can publish the
|
|
49
|
+
same payload to static-site builders, CMS importers, or downstream summarizers.
|
|
50
|
+
|
|
51
|
+
## Requirements
|
|
52
|
+
- Ruby **3.4 or newer**
|
|
53
|
+
- [Bundler](https://bundler.io/) for dependency management
|
|
54
|
+
- A remote Chrome-compatible WebSocket endpoint when using the default Ferrum client (see [Configuration](#configuration))
|
|
12
55
|
|
|
56
|
+
## Installation
|
|
57
|
+
Add the gem to your application:
|
|
13
58
|
|
|
14
59
|
```ruby
|
|
15
|
-
gem
|
|
60
|
+
gem "coelacanth"
|
|
16
61
|
```
|
|
17
62
|
|
|
18
|
-
|
|
63
|
+
Install the dependencies:
|
|
19
64
|
|
|
20
65
|
```bash
|
|
21
|
-
|
|
66
|
+
bundle install
|
|
22
67
|
```
|
|
23
68
|
|
|
24
|
-
Or install
|
|
69
|
+
Or install the gem directly:
|
|
25
70
|
|
|
26
71
|
```bash
|
|
27
|
-
|
|
72
|
+
gem install coelacanth
|
|
28
73
|
```
|
|
29
74
|
|
|
30
|
-
|
|
75
|
+
## Quick start
|
|
76
|
+
```ruby
|
|
77
|
+
require "coelacanth"
|
|
31
78
|
|
|
32
|
-
|
|
79
|
+
result = Coelacanth.analyze("https://example.com/article")
|
|
33
80
|
|
|
34
|
-
|
|
35
|
-
|
|
81
|
+
result[:extraction] # => article metadata and body markdown
|
|
82
|
+
result[:dom] # => Oga DOM representation for downstream processing
|
|
83
|
+
result[:screenshot] # => PNG screenshot as a binary string
|
|
36
84
|
```
|
|
37
85
|
|
|
38
|
-
|
|
86
|
+
The returned hash includes:
|
|
87
|
+
|
|
88
|
+
- `:extraction` – output from `Coelacanth::Extractor`, including title, Markdown body (`body_markdown` and
|
|
89
|
+
`body_markdown_list`), images, listings, published date, and the probe source and confidence score.
|
|
90
|
+
- `:dom` – a parsed Oga DOM if you need to traverse the document manually.
|
|
91
|
+
- `:screenshot` – raw PNG data that you can persist or feed to other systems.
|
|
92
|
+
|
|
93
|
+
## Extractor pipeline
|
|
94
|
+
Coelacanth ships with a multi-stage extractor that tries increasingly involved probes until one meets its confidence target:
|
|
95
|
+
|
|
96
|
+
1. **MetadataProbe** (threshold `0.85`) pulls `schema.org` JSON-LD, Open Graph, Twitter Cards, or semantic containers such as
|
|
97
|
+
`<main>`/`<article>` when available.
|
|
98
|
+
2. **HeuristicProbe** (threshold `0.75`) scores block-level nodes using text length, link density, punctuation density, DOM
|
|
99
|
+
depth, and sibling variance, then greedily attaches surrounding headers and media.
|
|
100
|
+
3. **WeakMlProbe** (threshold `0.70`) optionally boosts accuracy with a lightweight classifier that combines heuristic features
|
|
101
|
+
with class and id tokens (e.g., `article-body`, `post`, `content`).
|
|
102
|
+
4. **FallbackProbe** acts as a safety net by following AMP/print links or summarizing the whole document when the previous
|
|
103
|
+
probes fail.
|
|
104
|
+
|
|
105
|
+
Markdown-based listings are generated from the extracted body so lists such as "Latest news" blocks can be stored alongside the
|
|
106
|
+
article without scanning the rest of the page layout.
|
|
107
|
+
|
|
108
|
+
## Configuration
|
|
109
|
+
Runtime configuration is stored in `config/coelacanth.yml`. Environments inherit from the `development` section by default.
|
|
110
|
+
|
|
111
|
+
```yaml
|
|
112
|
+
development:
|
|
113
|
+
client: "ferrum" # Options: "ferrum", "screenshot_one"
|
|
114
|
+
remote_client:
|
|
115
|
+
ws_url: "ws://chrome:3000/chrome"
|
|
116
|
+
timeout: 10
|
|
117
|
+
headers:
|
|
118
|
+
<% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
|
|
119
|
+
Authorization: "<%= auth %>"
|
|
120
|
+
<% end %>
|
|
121
|
+
User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
|
|
122
|
+
screenshot_one:
|
|
123
|
+
key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
|
|
124
|
+
```
|
|
39
125
|
|
|
40
|
-
|
|
126
|
+
- **Ferrum client** – Requires a running Chrome instance that exposes the DevTools protocol via WebSocket. Configure the URL,
|
|
127
|
+
timeout, and any headers to inject.
|
|
128
|
+
- **ScreenshotOne client** – Supply an API key to offload screenshot capture to [ScreenshotOne](https://screenshotone.com/).
|
|
129
|
+
- Configuration is environment-aware: set `RAILS_ENV`/`RACK_ENV` or use Rails' built-in environment handling when the gem is
|
|
130
|
+
used inside a Rails project.
|
|
41
131
|
|
|
42
|
-
|
|
43
|
-
To use coelacanth, first require it.
|
|
132
|
+
### Environment variables
|
|
44
133
|
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
```
|
|
134
|
+
Configuration values that would otherwise contain credentials are loaded from environment variables. Set the following
|
|
135
|
+
variables in your shell (or `dotenv` file) before running the gem:
|
|
48
136
|
|
|
49
|
-
|
|
137
|
+
```bash
|
|
138
|
+
# Optional: only set when the remote browser requires authentication.
|
|
139
|
+
export COELACANTH_REMOTE_CLIENT_AUTHORIZATION="Bearer <token>"
|
|
50
140
|
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
stats = Coelacanth.analyze(url)
|
|
141
|
+
export COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
|
|
142
|
+
export COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
|
|
54
143
|
```
|
|
55
144
|
|
|
56
|
-
|
|
145
|
+
If `COELACANTH_REMOTE_CLIENT_AUTHORIZATION` is omitted or left blank, the `Authorization` header is not injected into the
|
|
146
|
+
remote browser session.
|
|
57
147
|
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
```
|
|
148
|
+
When using Docker Compose, you can create a `.env` file or export the variables in your environment so the `app` service picks
|
|
149
|
+
them up automatically.
|
|
61
150
|
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
- Get screenshot
|
|
151
|
+
If you are working inside Docker, make sure the `UID` environment variable matches your host user by exporting it in your shell
|
|
152
|
+
startup file:
|
|
65
153
|
|
|
66
|
-
|
|
154
|
+
```bash
|
|
155
|
+
export UID=${UID}
|
|
156
|
+
```
|
|
67
157
|
|
|
68
|
-
|
|
158
|
+
## Development workflow
|
|
159
|
+
Clone the repository and install dependencies:
|
|
69
160
|
|
|
70
|
-
|
|
161
|
+
```bash
|
|
162
|
+
git clone https://github.com/slidict/coelacanth.git
|
|
163
|
+
cd coelacanth
|
|
164
|
+
bundle install
|
|
165
|
+
```
|
|
71
166
|
|
|
72
|
-
|
|
167
|
+
You can open an interactive console with the gem loaded via:
|
|
73
168
|
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
- `style: fix code style issues`
|
|
81
|
-
- `refactor: refactor code for better readability`
|
|
82
|
-
- `perf: improve performance of data processing`
|
|
83
|
-
- `test: add new tests for URL parsing module`
|
|
169
|
+
```bash
|
|
170
|
+
bin/console
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
## Testing
|
|
174
|
+
Run the test suite with RSpec:
|
|
84
175
|
|
|
85
|
-
|
|
176
|
+
```bash
|
|
177
|
+
bundle exec rspec
|
|
178
|
+
```
|
|
86
179
|
|
|
87
180
|
## Contributing
|
|
88
|
-
Bug reports and pull requests are welcome on GitHub at
|
|
181
|
+
Bug reports and pull requests are welcome on GitHub at
|
|
182
|
+
[https://github.com/slidict/coelacanth](https://github.com/slidict/coelacanth). Please follow the
|
|
183
|
+
[Conventional Commits](https://www.conventionalcommits.org/) specification so we can keep the changelog automation healthy.
|
|
89
184
|
|
|
90
|
-
|
|
91
|
-
The gem is available as open-source under the terms of the MIT License.
|
|
185
|
+
By participating in this project you agree to abide by the [Contributor Covenant](CODE_OF_CONDUCT.md).
|
|
92
186
|
|
|
93
|
-
##
|
|
94
|
-
|
|
187
|
+
## License
|
|
188
|
+
Coelacanth is available as open source under the terms of the [MIT License](LICENSE.txt).
|
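The "Extractor pipeline" section of the README above describes probes that are tried in order until one meets its confidence threshold. The following is a minimal sketch of that cascade, not the gem's actual `Coelacanth::Extractor` code (which is not shown in this part of the diff). `Stage`, `run_pipeline`, and the lambda probes are hypothetical stand-ins; the thresholds come from the README and the fallback confidence of `0.35` from `fallback_probe.rb`.

```ruby
# Hypothetical sketch of a confidence-gated probe cascade.
# Stage, run_pipeline, and the lambda probes are illustrative names only.
Stage = Struct.new(:name, :threshold, :probe)

# Try each stage in order; accept the first result whose confidence
# reaches that stage's threshold, otherwise use the fallback.
def run_pipeline(stages, fallback, doc)
  stages.each do |stage|
    result = stage.probe.call(doc)
    next unless result
    return result.merge(source: stage.name) if result[:confidence] >= stage.threshold
  end
  fallback.call(doc).merge(source: :fallback)
end

# Thresholds as listed in the README; the probes simply look up
# pre-computed results in a Hash to keep the example self-contained.
STAGES = [
  Stage.new(:metadata, 0.85, ->(doc) { doc[:metadata] }),
  Stage.new(:heuristic, 0.75, ->(doc) { doc[:heuristic] }),
  Stage.new(:weak_ml, 0.70, ->(doc) { doc[:weak_ml] })
].freeze

# The fallback always produces a low-confidence result.
FALLBACK = ->(_doc) { { confidence: 0.35 } }

doc = { metadata: { confidence: 0.6 }, heuristic: { confidence: 0.8 } }
puts run_pipeline(STAGES, FALLBACK, doc)[:source]
```

In this sketch the metadata result is rejected (0.6 is below 0.85), so the heuristic result is accepted. When no stage qualifies, the fallback result is returned regardless of its confidence.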
data/compose.yml
CHANGED
|
@@ -3,8 +3,11 @@ networks:
|
|
|
3
3
|
driver: bridge
|
|
4
4
|
services:
|
|
5
5
|
app:
|
|
6
|
-
environment:
|
|
7
|
-
- UID=${UID}
|
|
6
|
+
environment:
|
|
7
|
+
- UID=${UID}
|
|
8
|
+
- COELACANTH_REMOTE_CLIENT_AUTHORIZATION=${COELACANTH_REMOTE_CLIENT_AUTHORIZATION:-}
|
|
9
|
+
- COELACANTH_REMOTE_CLIENT_USER_AGENT=${COELACANTH_REMOTE_CLIENT_USER_AGENT:-}
|
|
10
|
+
- COELACANTH_SCREENSHOT_ONE_API_KEY=${COELACANTH_SCREENSHOT_ONE_API_KEY:-}
|
|
8
11
|
tty: true
|
|
9
12
|
stdin_open: true
|
|
10
13
|
build:
|
data/config/coelacanth.yml
CHANGED
|
@@ -4,10 +4,12 @@ development: &development
|
|
|
4
4
|
ws_url: "ws://chrome:3000/chrome"
|
|
5
5
|
timeout: 10 # seconds
|
|
6
6
|
headers:
|
|
7
|
-
|
|
8
|
-
|
|
7
|
+
<% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
|
|
8
|
+
Authorization: "<%= auth %>"
|
|
9
|
+
<% end %>
|
|
10
|
+
User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
|
|
9
11
|
screenshot_one:
|
|
10
|
-
key: "your_screenshot_one_api_key_here"
|
|
12
|
+
key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
|
|
11
13
|
test:
|
|
12
14
|
<<: *development
|
|
13
15
|
production:
|
data/lib/coelacanth/client/ferrum.rb
CHANGED
|
@@ -16,7 +16,7 @@ module Coelacanth::Client
|
|
|
16
16
|
body = remote_client.body
|
|
17
17
|
body
|
|
18
18
|
rescue => e
|
|
19
|
-
raise
|
|
19
|
+
raise sanitized_remote_client_error(e)
|
|
20
20
|
end
|
|
21
21
|
|
|
22
22
|
def get_screenshot
|
|
@@ -26,11 +26,21 @@ module Coelacanth::Client
|
|
|
26
26
|
File.read(tempfile.path)
|
|
27
27
|
rescue => e
|
|
28
28
|
tempfile.close
|
|
29
|
-
raise
|
|
29
|
+
raise sanitized_remote_client_error(e)
|
|
30
30
|
end
|
|
31
31
|
|
|
32
32
|
private
|
|
33
33
|
|
|
34
|
+
def sanitized_remote_client_error(error)
|
|
35
|
+
"#{error.class}: #{error.message} RemoteClient: #{sanitized_remote_client_identifier}"
|
|
36
|
+
end
|
|
37
|
+
|
|
38
|
+
def sanitized_remote_client_identifier
|
|
39
|
+
return "nil" unless @remote_client
|
|
40
|
+
|
|
41
|
+
"#{@remote_client.class.name}(object_id=#{@remote_client.object_id})"
|
|
42
|
+
end
|
|
43
|
+
|
|
34
44
|
def remote_client
|
|
35
45
|
return @remote_client if @remote_client
|
|
36
46
|
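In the hunks above, both `rescue` clauses pass the String returned by `sanitized_remote_client_error` to `raise`. In Ruby, raising a String produces a `RuntimeError`, so callers no longer see the original exception class directly. The sketch below illustrates that behaviour in isolation; `sanitized_error_message` and `fetch_with_sanitized_error` are hypothetical names, and the client label `"nil"` mirrors what the diff returns when no remote client exists.

```ruby
# Builds a message in the same shape as the diff's
# sanitized_remote_client_error (helper name is illustrative).
def sanitized_error_message(error, client_label)
  "#{error.class}: #{error.message} RemoteClient: #{client_label}"
end

# Simulates a failing remote call whose error is re-raised as a String.
def fetch_with_sanitized_error
  raise ArgumentError, "bad url"
rescue => e
  # raise with a String argument creates a RuntimeError.
  raise sanitized_error_message(e, "nil")
end

begin
  fetch_with_sanitized_error
rescue => wrapped
  puts wrapped.class        # class of the re-raised error
  puts wrapped.message      # original class and message are embedded in the text
  puts wrapped.cause.class  # Ruby records the original error as the cause
end
```

Code that needs to branch on the original error type can inspect `cause`, since Ruby sets it automatically when an exception is raised inside a `rescue` block.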
|
data/lib/coelacanth/client/screenshot_one.rb
CHANGED
|
@@ -1,19 +1,29 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
-
require "open-uri"
|
|
4
3
|
require "ferrum"
|
|
4
|
+
require_relative "ferrum"
|
|
5
|
+
require_relative "../http"
|
|
5
6
|
|
|
6
7
|
module Coelacanth::Client
|
|
7
8
|
# Coelacanth::Client
|
|
8
9
|
class ScreenshotOne < Coelacanth::Client::Base
|
|
9
10
|
def get_response
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
11
|
+
uri = URI.parse(@url)
|
|
12
|
+
response = Coelacanth::HTTP.get_response(
|
|
13
|
+
uri,
|
|
14
|
+
open_timeout: Coelacanth::HTTP::DEFAULT_OPEN_TIMEOUT,
|
|
15
|
+
read_timeout: Coelacanth::HTTP::DEFAULT_READ_TIMEOUT
|
|
16
|
+
)
|
|
17
|
+
@origin_response = response
|
|
18
|
+
@status_code = response.code.to_i
|
|
19
|
+
|
|
20
|
+
return response.body if response.is_a?(Net::HTTPSuccess)
|
|
21
|
+
|
|
22
|
+
Coelacanth::HTTP.raise_http_error(uri, response)
|
|
23
|
+
rescue Coelacanth::TimeoutError
|
|
24
|
+
fallback_response = fallback_client.get_response
|
|
25
|
+
@status_code = fallback_client.instance_variable_get(:@status_code)
|
|
26
|
+
fallback_response
|
|
17
27
|
end
|
|
18
28
|
|
|
19
29
|
def get_screenshot
|
|
@@ -34,10 +44,22 @@ module Coelacanth::Client
|
|
|
34
44
|
}
|
|
35
45
|
uri.query = URI.encode_www_form(params)
|
|
36
46
|
|
|
37
|
-
response =
|
|
47
|
+
response = Coelacanth::HTTP.get_response(
|
|
48
|
+
uri,
|
|
49
|
+
open_timeout: Coelacanth::HTTP::DEFAULT_OPEN_TIMEOUT,
|
|
50
|
+
read_timeout: 30
|
|
51
|
+
)
|
|
38
52
|
raise "Failed to fetch screenshot: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
|
|
39
53
|
|
|
40
54
|
response.body
|
|
55
|
+
rescue Coelacanth::TimeoutError
|
|
56
|
+
fallback_client.get_screenshot
|
|
57
|
+
end
|
|
58
|
+
|
|
59
|
+
private
|
|
60
|
+
|
|
61
|
+
def fallback_client
|
|
62
|
+
@fallback_client ||= Coelacanth::Client::Ferrum.new(@url, @config)
|
|
41
63
|
end
|
|
42
64
|
end
|
|
43
65
|
end
|
data/lib/coelacanth/configure.rb
CHANGED
|
@@ -15,7 +15,12 @@ module Coelacanth
|
|
|
15
15
|
end
|
|
16
16
|
|
|
17
17
|
def yaml
|
|
18
|
-
@yaml ||= YAML.
|
|
18
|
+
@yaml ||= YAML.safe_load(
|
|
19
|
+
ERB.new(File.read(file)).result,
|
|
20
|
+
permitted_classes: [],
|
|
21
|
+
permitted_symbols: [],
|
|
22
|
+
aliases: true
|
|
23
|
+
)[env]
|
|
19
24
|
end
|
|
20
25
|
|
|
21
26
|
private
|
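The `yaml` method above now renders the configuration file through ERB before handing it to `YAML.safe_load` with aliases enabled. A minimal, self-contained sketch of that loading step follows. The inline `TEMPLATE` is a shortened, assumed stand-in for `config/coelacanth.yml`, and `load_config` is a hypothetical helper rather than the gem's API.

```ruby
require "erb"
require "yaml"

# Shortened stand-in for config/coelacanth.yml (assumed shape).
TEMPLATE = <<~'YAML'
  development: &development
    remote_client:
      headers:
  <% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
        Authorization: "<%= auth %>"
  <% end %>
        User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
  test:
    <<: *development
YAML

# Render ERB first, then parse with safe_load. aliases: true is needed
# because the test section reuses the &development anchor.
def load_config(template, env)
  rendered = ERB.new(template).result
  YAML.safe_load(
    rendered,
    permitted_classes: [],
    permitted_symbols: [],
    aliases: true
  )[env]
end

config = load_config(TEMPLATE, "test")
puts config["remote_client"]["headers"].keys.inspect
```

With `COELACANTH_REMOTE_CLIENT_AUTHORIZATION` unset or blank, the rendered YAML contains no `Authorization` key at all, which matches the behaviour the README describes.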
data/lib/coelacanth/dom.rb
CHANGED
|
@@ -1,12 +1,18 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
3
|
require "oga"
|
|
4
|
+
require_relative "http"
|
|
4
5
|
|
|
5
6
|
module Coelacanth
|
|
6
7
|
# Coelacanth::Dom
|
|
7
8
|
class Dom
|
|
8
|
-
def oga(url)
|
|
9
|
-
|
|
9
|
+
def oga(url, html: nil)
|
|
10
|
+
html ||= begin
|
|
11
|
+
Coelacanth::HTTP.get_response(URI.parse(url)).body
|
|
12
|
+
rescue Coelacanth::TimeoutError
|
|
13
|
+
""
|
|
14
|
+
end
|
|
15
|
+
Oga.parse_xml(html.to_s)
|
|
10
16
|
end
|
|
11
17
|
end
|
|
12
18
|
end
|
data/lib/coelacanth/extractor/fallback_probe.rb
ADDED
|
@@ -0,0 +1,34 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "oga"
|
|
4
|
+
|
|
5
|
+
require_relative "utilities"
|
|
6
|
+
|
|
7
|
+
module Coelacanth
|
|
8
|
+
class Extractor
|
|
9
|
+
# Attempts final recovery strategies when all other probes fail.
|
|
10
|
+
class FallbackProbe
|
|
11
|
+
Result = Struct.new(
|
|
12
|
+
:title,
|
|
13
|
+
:node,
|
|
14
|
+
:published_at,
|
|
15
|
+
:byline,
|
|
16
|
+
:source_tag,
|
|
17
|
+
:confidence,
|
|
18
|
+
keyword_init: true
|
|
19
|
+
)
|
|
20
|
+
|
|
21
|
+
def call(doc:, url: nil)
|
|
22
|
+
body = doc.at_css("body") || doc
|
|
23
|
+
Result.new(
|
|
24
|
+
title: doc.at_css("title")&.text&.strip,
|
|
25
|
+
node: body,
|
|
26
|
+
published_at: nil,
|
|
27
|
+
byline: nil,
|
|
28
|
+
source_tag: :fallback,
|
|
29
|
+
confidence: 0.35
|
|
30
|
+
)
|
|
31
|
+
end
|
|
32
|
+
end
|
|
33
|
+
end
|
|
34
|
+
end
|
data/lib/coelacanth/extractor/heuristic_probe.rb
ADDED
|
@@ -0,0 +1,175 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "oga"
|
|
4
|
+
|
|
5
|
+
require_relative "utilities"
|
|
6
|
+
|
|
7
|
+
module Coelacanth
|
|
8
|
+
class Extractor
|
|
9
|
+
# Scores DOM nodes based on simple heuristics to locate the primary article body.
|
|
10
|
+
class HeuristicProbe
|
|
11
|
+
Result = Struct.new(
|
|
12
|
+
:title,
|
|
13
|
+
:node,
|
|
14
|
+
:published_at,
|
|
15
|
+
:byline,
|
|
16
|
+
:source_tag,
|
|
17
|
+
:confidence,
|
|
18
|
+
keyword_init: true
|
|
19
|
+
)
|
|
20
|
+
|
|
21
|
+
BLOCK_SELECTOR = "article, main, section, div".freeze
|
|
22
|
+
TAG_WEIGHTS = Hash.new(0).merge(
|
|
23
|
+
"article" => 80,
|
|
24
|
+
"main" => 60,
|
|
25
|
+
"section" => 30,
|
|
26
|
+
"div" => 10
|
|
27
|
+
).freeze
|
|
28
|
+
NEGATIVE_TOKENS = %w[nav footer header sidebar related share menu].freeze
|
|
29
|
+
POSITIVE_TOKENS = %w[content article body post entry text].freeze
|
|
30
|
+
|
|
31
|
+
def call(doc:, url: nil)
|
|
32
|
+
candidates = doc.css(BLOCK_SELECTOR).map do |node|
|
|
33
|
+
score_candidate(node)
|
|
34
|
+
end.compact
|
|
35
|
+
|
|
36
|
+
return if candidates.empty?
|
|
37
|
+
|
|
38
|
+
best = candidates.max_by { |candidate| candidate[:score] }
|
|
39
|
+
return if best[:score] < minimum_score
|
|
40
|
+
|
|
41
|
+
Result.new(
|
|
42
|
+
title: title_from_meta(doc),
|
|
43
|
+
node: expand(best[:node]),
|
|
44
|
+
published_at: published_at_from_meta(doc),
|
|
45
|
+
byline: byline_from_meta(doc),
|
|
46
|
+
source_tag: :heuristic,
|
|
47
|
+
confidence: confidence(best[:score])
|
|
48
|
+
)
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
private
|
|
52
|
+
|
|
53
|
+
def score_candidate(node)
|
|
54
|
+
text_length = Utilities.text_length(node)
|
|
55
|
+
return if text_length < 80
|
|
56
|
+
|
|
57
|
+
link_density = Utilities.link_density(node)
|
|
58
|
+
punct_density = Utilities.punctuation_density(node)
|
|
59
|
+
tag_weight = TAG_WEIGHTS[node.name]
|
|
60
|
+
class_weight = class_score(node)
|
|
61
|
+
depth_penalty = Utilities.depth(node) * 4
|
|
62
|
+
sibling_bonus = sibling_variance(node)
|
|
63
|
+
|
|
64
|
+
score = (
|
|
65
|
+
text_length * 0.35 +
|
|
66
|
+
punct_density * 280 -
|
|
67
|
+
link_density * 160 +
|
|
68
|
+
tag_weight +
|
|
69
|
+
class_weight +
|
|
70
|
+
sibling_bonus -
|
|
71
|
+
depth_penalty
|
|
72
|
+
)
|
|
73
|
+
|
|
74
|
+
{ node: node, score: score }
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
def minimum_score
|
|
78
|
+
95
|
|
79
|
+
end
|
|
80
|
+
|
|
81
|
+
def class_score(node)
|
|
82
|
+
tokens = Utilities.class_id_tokens(node)
|
|
83
|
+
score = tokens.count { |token| POSITIVE_TOKENS.include?(token) } * 40
|
|
84
|
+
score -= tokens.count { |token| NEGATIVE_TOKENS.include?(token) } * 60
|
|
85
|
+
score
|
|
86
|
+
end
|
|
87
|
+
|
|
88
|
+
def sibling_variance(node)
|
|
89
|
+
parent = node.parent
|
|
90
|
+
return 0 unless parent
|
|
91
|
+
|
|
92
|
+
siblings = Utilities.element_children(parent)
|
|
93
|
+
return 0 if siblings.length < 2
|
|
94
|
+
|
|
95
|
+
lengths = siblings.map { |sibling| Utilities.text_length(sibling) }
|
|
96
|
+
mean = lengths.sum.to_f / lengths.length
|
|
97
|
+
variance = lengths.map { |length| (length - mean)**2 }.sum.to_f / lengths.length
|
|
98
|
+
Math.sqrt(variance) * 0.25
|
|
99
|
+
end
|
|
100
|
+
|
|
101
|
+
def expand(node)
|
|
102
|
+
return node unless node.parent
|
|
103
|
+
|
|
104
|
+
before = neighboring_nodes(node, -1).reverse
|
|
105
|
+
after = neighboring_nodes(node, 1)
|
|
106
|
+
|
|
107
|
+
wrap_fragment(before + [node] + after)
|
|
108
|
+
end
|
|
109
|
+
|
|
110
|
+
def neighboring_nodes(node, direction)
|
|
111
|
+
siblings = []
|
|
112
|
+
current = node
|
|
113
|
+
loop do
|
|
114
|
+
current = if direction.negative?
|
|
115
|
+
Utilities.previous_element(current)
|
|
116
|
+
else
|
|
117
|
+
Utilities.next_element(current)
|
|
118
|
+
end
|
|
119
|
+
break unless current
|
|
120
|
+
|
|
121
|
+
break unless include_in_expansion?(current)
|
|
122
|
+
|
|
123
|
+
siblings << current
|
|
124
|
+
end
|
|
125
|
+
siblings
|
|
126
|
+
end
|
|
127
|
+
|
|
128
|
+
def include_in_expansion?(node)
|
|
129
|
+
%w[h1 h2 h3 h4 h5 h6 img blockquote p ul ol figure].include?(node.name)
|
|
130
|
+
end
|
|
131
|
+
|
|
132
|
+
def wrap_fragment(nodes)
|
|
133
|
+
container = Oga::XML::Element.new(name: "article")
|
|
134
|
+
nodes.each { |node| container.children << node }
|
|
135
|
+
container
|
|
136
|
+
end
|
|
137
|
+
|
|
138
|
+
def confidence(score)
|
|
139
|
+
return 0.0 if score.to_f <= 0.0
|
|
140
|
+
|
|
141
|
+
value = 1.0 / (1.0 + Math.exp(-(score - 100) / 12.0))
|
|
142
|
+
value.clamp(0.0, 0.95)
|
|
143
|
+
end
|
|
144
|
+
|
|
145
|
+
def title_from_meta(doc)
|
|
146
|
+
Utilities.meta_content(
|
|
147
|
+
doc,
|
|
148
|
+
"meta[property='og:title']",
|
|
149
|
+
"meta[name='twitter:title']",
|
|
150
|
+
"meta[name='title']"
|
|
151
|
+
) || doc.at_css("title")&.text&.strip
|
|
152
|
+
end
|
|
153
|
+
|
|
154
|
+
def published_at_from_meta(doc)
|
|
155
|
+
Utilities.parse_time(
|
|
156
|
+
Utilities.meta_content(
|
|
157
|
+
doc,
|
|
158
|
+
"meta[property='article:published_time']",
|
|
159
|
+
"meta[name='pubdate']",
|
|
160
|
+
"meta[name='publish_date']",
|
|
161
|
+
"meta[name='date']"
|
|
162
|
+
)
|
|
163
|
+
)
|
|
164
|
+
end
|
|
165
|
+
|
|
166
|
+
def byline_from_meta(doc)
|
|
167
|
+
Utilities.meta_content(
|
|
168
|
+
doc,
|
|
169
|
+
"meta[name='author']",
|
|
170
|
+
"meta[property='article:author']"
|
|
171
|
+
)
|
|
172
|
+
end
|
|
173
|
+
end
|
|
174
|
+
end
|
|
175
|
+
end
|
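The `confidence` method in `HeuristicProbe` above maps a raw score to a value between 0 and 0.95 with a logistic curve centred on a score of 100. The standalone sketch below copies that formula so the mapping can be explored outside the gem. The comparison against `0.75` uses the threshold the README lists for the heuristic probe; how the extractor applies that threshold is an assumption, since `extractor.rb` is not shown here.

```ruby
# Same formula as HeuristicProbe#confidence in the diff above,
# extracted into a free-standing method for experimentation.
def confidence(score)
  return 0.0 if score.to_f <= 0.0

  value = 1.0 / (1.0 + Math.exp(-(score - 100) / 12.0))
  value.clamp(0.0, 0.95)
end

# Print the confidence for a few raw scores, including the probe's
# minimum_score of 95 and the curve's midpoint of 100.
[95, 100, 113, 114, 1000].each do |score|
  puts format("score=%-5d confidence=%.4f", score, confidence(score))
end
```

A score of exactly 100 yields 0.5, and very large scores are capped at 0.95. Under this formula a candidate needs a score of roughly 113 to 114 (that is, 100 + 12 ln 3) before its confidence reaches 0.75, so a candidate that only just passes `minimum_score` would still fall short of that threshold.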