coelacanth 0.3.10 → 0.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 62040d12a6682d20ad4ec71ec6ab5d45f00a4814f2c4cd8dad22bc555352daac
-   data.tar.gz: 400b7e76c8a0260abc9c713566323dac84cd50758cda8013e54747cd465f909a
+   metadata.gz: 9097f36247caad8f0764313b306398e3707290eca25d5cfffc05f61e97784884
+   data.tar.gz: 19e093800bcb9ae663e0f36a1ad18d15472ff2e76f34f6ddf65c396440aea2e0
  SHA512:
-   metadata.gz: 6ffc542b1004c3cc3b8170f46e6e2f465bd7012ecebd99e9495c88431169f73e34f4e4558fab6c65b30e71b5e10509b2d76cca576cbc44345e098fa174f65732
-   data.tar.gz: 05c11331d46384976a872638e1e601118e49de427eb4a989e76ee73b02773379e6a3172a0a264a015e5db910800cc8fe0188c5d9d178bb2b9522c84b6043c884
+   metadata.gz: 32c30afcf5316814e42f2ed2f5b94efe49daecdd78b7acdf1d10d64ed2c8253a86a0a0be3b66ed1824cfbefb32c5654f3abf0fb08c41411d736281f480375400
+   data.tar.gz: 4e6683d596d50535dd13df26d2c6c4339fe543013d37250c8e55f7605dc441aa3ed888fcf46f9550540583e3234bd52667cd28d1b9e098a18e3eb5faae354bed
data/.env.example ADDED
@@ -0,0 +1,5 @@
+ # Copy this file to .env and fill in the values to configure Coelacanth.
+ # Optional: only set when the remote browser requires authentication.
+ COELACANTH_REMOTE_CLIENT_AUTHORIZATION=
+ COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
+ COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
data/CHANGELOG.md CHANGED
@@ -4,13 +4,8 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

- ## [v0.3.10] - 2025-06-19
- ### :bug: Bug Fixes
- - [`f297847`](https://github.com/slidict/coelacanth/commit/f29784715aba9453599ad7fd6467d4d2d3e9e82c) - improve error handling in Ferrum client and bump version to 0.3.10 *(commit by [@yubele](https://github.com/yubele))*
-
+ ## [v0.4.1] - 2025-11-03
  ### :wrench: Chores
- - [`58d383f`](https://github.com/slidict/coelacanth/commit/58d383fdc34e220e584d92447e938078ed75a889) - **deps**: Bump base64 from 0.2.0 to 0.3.0 *(commit by [@dependabot[bot]](https://github.com/apps/dependabot))*
- - [`6f1b766`](https://github.com/slidict/coelacanth/commit/6f1b766ea8f65d2f9c0f1d8772cbd6f1c840a43b) - **deps**: Bump rake from 13.2.1 to 13.3.0 *(commit by [@dependabot[bot]](https://github.com/apps/dependabot))*
- - [`0c9a6b2`](https://github.com/slidict/coelacanth/commit/0c9a6b2b9f28f686665914054016c90c49407e69) - **deps**: Bump rubocop from 1.75.7 to 1.76.1 *(commit by [@dependabot[bot]](https://github.com/apps/dependabot))*
+ - [`41e89b7`](https://github.com/slidict/coelacanth/commit/41e89b799573f6cfaf0a12e7abc5c260f1905aec) - Bump version from 0.4.0 to 0.4.1 *(commit by [@yubele](https://github.com/yubele))*

- [v0.3.10]: https://github.com/slidict/coelacanth/compare/v0.3.9...v0.3.10
+ [v0.4.1]: https://github.com/slidict/coelacanth/compare/v0.4.0...v0.4.1
data/Gemfile CHANGED
@@ -8,7 +8,7 @@ gemspec
  gem "ferrum", "~> 0.16"
  gem "rake", "~> 13.3"
  gem "rspec", "~> 3.0"
- gem "rubocop", "~> 1.76"
+ gem "rubocop", "~> 1.81"
  gem "oga", "~> 3.4"
  gem "base64", "~> 0.3.0"

data/README.md CHANGED
@@ -1,94 +1,188 @@
- # coelacanth
+ # Coelacanth

  [![Gem Version](https://badge.fury.io/rb/coelacanth.svg)](https://badge.fury.io/rb/coelacanth)
  [![Build Status](https://github.com/slidict/coelacanth/actions/workflows/main.yml/badge.svg)](https://github.com/slidict/coelacanth/actions)
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

- `coelacanth` is a gem that allows you to parse and analyze web pages, extracting key statistics and information for further use within your projects.
+ Coelacanth is a Ruby gem for extracting high-quality article content, metadata, and screenshots from arbitrary web pages. It is
+ built to power content ingestion pipelines that have to withstand layout experiments, CMS redesigns, and inconsistent markup
+ while remaining easy to extend.
+
+ It is the successor to [`web_stat`](https://rubygems.org/gems/web_stat) and continues the same goal of reliable article
+ extraction under the `slidict` umbrella. Compared to [`web_stat`](https://github.com/slidict/web_stat/) the gem has been
+ re-architected with a modern extractor pipeline, built-in screenshot capture, and a clearer configuration story so you can drop
+ it into contemporary ingestion stacks without bespoke glue code.
+
+ ## Table of contents
+ - [Features](#features)
+ - [Requirements](#requirements)
+ - [Installation](#installation)
+ - [Quick start](#quick-start)
+ - [Extractor pipeline](#extractor-pipeline)
+ - [Configuration](#configuration)
+ - [Development workflow](#development-workflow)
+ - [Testing](#testing)
+ - [Contributing](#contributing)
+ - [License](#license)

- ## Installation
-
- Add this line to your application's Gemfile:
+ ## Features
+ - **Layout-resilient extraction** – Multi-stage extractor falls back from structured metadata to heuristics and lightweight
+ machine learning so you continue to get clean article bodies even when markup drifts.
+ - **UTF-8 normalization** – HTML responses are normalized into UTF-8 before parsing to play nicely with Japanese and other
+ multi-byte sources.
+ - **Screenshot capture** – Fetches full-page PNGs via a configurable browser client so you can archive visual context alongside
+ the extracted text.
+ - **Redirect resolution** – Follows HTTP redirects and long redirect chains to guarantee the extractor works on the final
+ landing page.
+ - **Configurable HTTP headers** – Inject custom headers (user agent, authorization, etc.) into the remote browser session for
+ authenticated or geo-targeted crawling.
+
+ ### What's new compared to web_stat?
+
+ - **Multi-stage pipeline** – `web_stat` relied on a single-pass heuristic extractor, whereas Coelacanth layers metadata,
+ heuristic, and optional ML probes that graduate based on confidence thresholds.
+ - **First-class screenshots** – Capture full-page PNGs alongside the extracted text without writing a separate headless browser
+ integration.
+ - **Environment-aware configuration** – Manage remote browser credentials, HTTP headers, and client selection through
+ `config/coelacanth.yml` instead of hand-tuned initializer code.
+ - **Markdown-first output** – Get both Markdown and raw DOM representations from `Coelacanth.analyze` so you can publish the
+ same payload to static-site builders, CMS importers, or downstream summarizers.
+
+ ## Requirements
+ - Ruby **3.4 or newer**
+ - [Bundler](https://bundler.io/) for dependency management
+ - A remote Chrome-compatible WebSocket endpoint when using the default Ferrum client (see [Configuration](#configuration))

+ ## Installation
+ Add the gem to your application:

  ```ruby
- gem 'coelacanth'
+ gem "coelacanth"
  ```

- And then execute:
+ Install the dependencies:

  ```bash
- $ bundle install
+ bundle install
  ```

- Or install it yourself as:
+ Or install the gem directly:

  ```bash
- $ gem install coelacanth
+ gem install coelacanth
  ```

- ### Resolving UID Mismatch Between Docker and Host
+ ## Quick start
+ ```ruby
+ require "coelacanth"

- To resolve issues related to the difference between Docker's UID and the host's UID, add the following line to your .bashrc or similar shell configuration file:
+ result = Coelacanth.analyze("https://example.com/article")

- ```bash
- export UID=${UID}
+ result[:extraction] # => article metadata and body markdown
+ result[:dom] # => Oga DOM representation for downstream processing
+ result[:screenshot] # => PNG screenshot as a binary string
  ```

- This will ensure that the environment variable UID is correctly set in your Docker containers, matching your host system's user ID.
+ The returned hash includes:
+
+ - `:extraction` – output from `Coelacanth::Extractor`, including title, Markdown body (`body_markdown` and
+ `body_markdown_list`), images, listings, published date, and the probe source and confidence score.
+ - `:dom` – a parsed Oga DOM if you need to traverse the document manually.
+ - `:screenshot` – raw PNG data that you can persist or feed to other systems.
+
+ ## Extractor pipeline
94
+ Coelacanth ships with a multi-stage extractor that tries increasingly involved probes until one meets its confidence target:
95
+
96
+ 1. **MetadataProbe** (threshold `0.85`) pulls `schema.org` JSON-LD, Open Graph, Twitter Cards, or semantic containers such as
97
+ `<main>`/`<article>` when available.
98
+ 2. **HeuristicProbe** (threshold `0.75`) scores block-level nodes using text length, link density, punctuation density, DOM
99
+ depth, and sibling variance, then greedily attaches surrounding headers and media.
100
+ 3. **WeakMlProbe** (threshold `0.70`) optionally boosts accuracy with a lightweight classifier that combines heuristic features
101
+ with class and id tokens (e.g., `article-body`, `post`, `content`).
102
+ 4. **FallbackProbe** acts as a safety net by following AMP/print links or summarizing the whole document when the previous
103
+ probes fail.
104
+
105
+ Markdown-based listings are generated from the extracted body so lists such as "Latest news" blocks can be stored alongside the
106
+ article without scanning the rest of the page layout.
107
+
108
+ ## Configuration
109
+ Runtime configuration is stored in `config/coelacanth.yml`. Environments inherit from the `development` section by default.
110
+
111
+ ```yaml
112
+ development:
113
+ client: "ferrum" # Options: "ferrum", "screenshot_one"
114
+ remote_client:
115
+ ws_url: "ws://chrome:3000/chrome"
116
+ timeout: 10
117
+ headers:
118
+ <% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
119
+ Authorization: "<%= auth %>"
120
+ <% end %>
121
+ User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
122
+ screenshot_one:
123
+ key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
124
+ ```
39
125
 
40
- This explanation provides clear instructions on how to resolve the UID mismatch issue using the export command.
126
+ - **Ferrum client** Requires a running Chrome instance that exposes the DevTools protocol via WebSocket. Configure the URL,
127
+ timeout, and any headers to inject.
128
+ - **ScreenshotOne client** – Supply an API key to offload screenshot capture to [ScreenshotOne](https://screenshotone.com/).
129
+ - Configuration is environment-aware: set `RAILS_ENV`/`RACK_ENV` or use Rails' built-in environment handling when the gem is
130
+ used inside a Rails project.
41
131
 
- ## Usage
- To use coelacanth, first require it.
+ ### Environment variables

- ```ruby
- require 'coelacanth'
- ```
+ Configuration values that would otherwise contain credentials are loaded from environment variables. Set the following
+ variables in your shell (or `dotenv` file) before running the gem:

- Then, you can easily parse and extract information from a web page like this:
+ ```bash
+ # Optional: only set when the remote browser requires authentication.
+ export COELACANTH_REMOTE_CLIENT_AUTHORIZATION="Bearer <token>"

- ```ruby
- url = "https://example.com"
- stats = Coelacanth.analyze(url)
+ export COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
+ export COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
  ```

- - rspec
+ If `COELACANTH_REMOTE_CLIENT_AUTHORIZATION` is omitted or left blank, the `Authorization` header is not injected into the
+ remote browser session.

- ```
- $ bundle exec rspec
- ```
+ When using Docker Compose, you can create a `.env` file or export the variables in your environment so the `app` service picks
+ them up automatically.

- ## Features
- - Get dom by oga
- - Get screenshot
+ If you are working inside Docker, make sure the `UID` environment variable matches your host user by exporting it in your shell
+ startup file:

- ## Commit Message Guidelines
+ ```bash
+ export UID=${UID}
+ ```

- To ensure consistency and facilitate automatic updates to the `CHANGELOG.md`, please follow the [Conventional Commits](https://www.conventionalcommits.org/) specification when creating commit messages. This helps maintain a clear and structured commit history.
+ ## Development workflow
+ Clone the repository and install dependencies:

- When submitting a Pull Request (PR), make sure your commits adhere to these guidelines.
+ ```bash
+ git clone https://github.com/slidict/coelacanth.git
+ cd coelacanth
+ bundle install
+ ```

- ### Example of Conventional Commit Messages:
+ You can open an interactive console with the gem loaded via:

- - `feat: add new feature for parsing web pages`
- - `fix: resolve issue with URL redirection`
- - `docs: update README with usage instructions`
- - `chore: update dependencies`
- - `build: update build configuration`
- - `ci: update CI pipeline`
- - `style: fix code style issues`
- - `refactor: refactor code for better readability`
- - `perf: improve performance of data processing`
- - `test: add new tests for URL parsing module`
+ ```bash
+ bin/console
+ ```
+
+ ## Testing
+ Run the test suite with RSpec:

- By following these guidelines, you help ensure that our project's commit history is easy to navigate and that versioning and release notes are generated correctly.
+ ```bash
+ bundle exec rspec
+ ```

  ## Contributing
- Bug reports and pull requests are welcome on GitHub at https://github.com/slidict/coelacanth. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
+ Bug reports and pull requests are welcome on GitHub at
+ [https://github.com/slidict/coelacanth](https://github.com/slidict/coelacanth). Please follow the
+ [Conventional Commits](https://www.conventionalcommits.org/) specification so we can keep the changelog automation healthy.

- ## License
- The gem is available as open-source under the terms of the MIT License.
+ By participating in this project you agree to abide by the [Contributor Covenant](CODE_OF_CONDUCT.md).

- ## Acknowledgments
- Special thanks to all the contributors and open-source projects that make this possible.
+ ## License
+ Coelacanth is available as open source under the terms of the [MIT License](LICENSE.txt).
data/compose.yml CHANGED
@@ -3,8 +3,11 @@ networks:
  driver: bridge
  services:
  app:
- environment:
- - UID=${UID}
+ environment:
+ - UID=${UID}
+ - COELACANTH_REMOTE_CLIENT_AUTHORIZATION=${COELACANTH_REMOTE_CLIENT_AUTHORIZATION:-}
+ - COELACANTH_REMOTE_CLIENT_USER_AGENT=${COELACANTH_REMOTE_CLIENT_USER_AGENT:-}
+ - COELACANTH_SCREENSHOT_ONE_API_KEY=${COELACANTH_SCREENSHOT_ONE_API_KEY:-}
  tty: true
  stdin_open: true
  build:
@@ -4,10 +4,12 @@ development: &development
      ws_url: "ws://chrome:3000/chrome"
      timeout: 10 # seconds
      headers:
-       Authorization: "Bearer 1234567890"
-       User-Agent: "Coelacanth Chrome Extension"
+       <% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
+       Authorization: "<%= auth %>"
+       <% end %>
+       User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
    screenshot_one:
-     key: "your_screenshot_one_api_key_here"
+     key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
  test:
    <<: *development
  production:
  production:
@@ -16,7 +16,7 @@ module Coelacanth::Client
        body = remote_client.body
        body
      rescue => e
-       raise "#{e.class}: #{e.message} RemoteClient: #{@remote_client.inspect}"
+       raise sanitized_remote_client_error(e)
      end

      def get_screenshot
@@ -26,11 +26,21 @@ module Coelacanth::Client
        File.read(tempfile.path)
      rescue => e
        tempfile.close
-       raise "#{e.class}: #{e.message} RemoteClient: #{@remote_client.inspect}"
+       raise sanitized_remote_client_error(e)
      end

      private

+     def sanitized_remote_client_error(error)
+       "#{error.class}: #{error.message} RemoteClient: #{sanitized_remote_client_identifier}"
+     end
+
+     def sanitized_remote_client_identifier
+       return "nil" unless @remote_client
+
+       "#{@remote_client.class.name}(object_id=#{@remote_client.object_id})"
+     end
+
      def remote_client
        return @remote_client if @remote_client

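
The change above replaces `@remote_client.inspect` with a class name and `object_id`. A plausible motivation, not stated in the diff, is that Ruby's default `inspect` prints instance variables, so any credentials held by the client could end up in error messages. The sketch below illustrates that difference with a hypothetical `FakeRemoteClient`; it does not model the real Ferrum client.

```ruby
# Hypothetical stand-in for a client object that keeps credentials in its state.
class FakeRemoteClient
  def initialize(headers)
    @headers = headers
  end
end

# Same identifier format as the hunk above: class name plus object_id only.
def sanitized_identifier(client)
  return "nil" unless client

  "#{client.class.name}(object_id=#{client.object_id})"
end

client = FakeRemoteClient.new("Authorization" => "Bearer secret-token")

# Default Object#inspect includes instance variables, so the token is visible.
leaky = "RemoteClient: #{client.inspect}"
# The sanitized form identifies the object without exposing its state.
safe = "RemoteClient: #{sanitized_identifier(client)}"

puts leaky.include?("secret-token") # true
puts safe.include?("secret-token")  # false
```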
@@ -1,19 +1,29 @@
  # frozen_string_literal: true

- require "open-uri"
  require "ferrum"
+ require_relative "ferrum"
+ require_relative "../http"

  module Coelacanth::Client
    # Coelacanth::Client
    class ScreenshotOne < Coelacanth::Client::Base
      def get_response
-       @origin_response = URI(@url).open
-       @status_code = @origin_response.status[0].to_i
-       body = @origin_response.read
-       body
-     rescue OpenURI::HTTPError => e
-       @status_code = e.io.status[0].to_i
-       raise e
+       uri = URI.parse(@url)
+       response = Coelacanth::HTTP.get_response(
+         uri,
+         open_timeout: Coelacanth::HTTP::DEFAULT_OPEN_TIMEOUT,
+         read_timeout: Coelacanth::HTTP::DEFAULT_READ_TIMEOUT
+       )
+       @origin_response = response
+       @status_code = response.code.to_i
+
+       return response.body if response.is_a?(Net::HTTPSuccess)
+
+       Coelacanth::HTTP.raise_http_error(uri, response)
+     rescue Coelacanth::TimeoutError
+       fallback_response = fallback_client.get_response
+       @status_code = fallback_client.instance_variable_get(:@status_code)
+       fallback_response
      end

      def get_screenshot
@@ -34,10 +44,22 @@ module Coelacanth::Client
        }
        uri.query = URI.encode_www_form(params)

-       response = Net::HTTP.get_response(uri)
+       response = Coelacanth::HTTP.get_response(
+         uri,
+         open_timeout: Coelacanth::HTTP::DEFAULT_OPEN_TIMEOUT,
+         read_timeout: 30
+       )
        raise "Failed to fetch screenshot: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

        response.body
+     rescue Coelacanth::TimeoutError
+       fallback_client.get_screenshot
+     end
+
+     private
+
+     def fallback_client
+       @fallback_client ||= Coelacanth::Client::Ferrum.new(@url, @config)
      end
    end
  end
@@ -15,7 +15,12 @@ module Coelacanth
      end

      def yaml
-       @yaml ||= YAML.unsafe_load(ERB.new(File.read(file)).result)[env]
+       @yaml ||= YAML.safe_load(
+         ERB.new(File.read(file)).result,
+         permitted_classes: [],
+         permitted_symbols: [],
+         aliases: true
+       )[env]
      end

      private
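
The switch from `YAML.unsafe_load` to `YAML.safe_load` keeps `aliases: true`, which matters because the configuration file uses a `&development` anchor and `<<: *development` merges. A minimal sketch of that behaviour, using an inlined document (`CONFIG` is a hypothetical stand-in for the real file):

```ruby
require "yaml"

# Inlined stand-in for the config file, reduced to the anchor/alias structure.
CONFIG = <<~YAML
  development: &development
    client: "ferrum"
  test:
    <<: *development
YAML

# Same keyword arguments as the hunk above: no extra classes or symbols, aliases allowed.
config = YAML.safe_load(CONFIG, permitted_classes: [], permitted_symbols: [], aliases: true)
puts config["test"]["client"] # ferrum, merged in from the development anchor

# With the default aliases: false, Psych rejects the same document.
alias_error =
  begin
    YAML.safe_load(CONFIG)
    nil
  rescue Psych::BadAlias => e
    e
  end
puts alias_error.class
```

Leaving `permitted_classes` and `permitted_symbols` empty means the file can only produce plain hashes, arrays, strings, numbers, booleans, and nil.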
@@ -1,12 +1,18 @@
  # frozen_string_literal: true

  require "oga"
+ require_relative "http"

  module Coelacanth
    # Coelacanth::Dom
    class Dom
-     def oga(url)
-       Oga.parse_xml(Net::HTTP.get_response(URI.parse(url)).body)
+     def oga(url, html: nil)
+       html ||= begin
+         Coelacanth::HTTP.get_response(URI.parse(url)).body
+       rescue Coelacanth::TimeoutError
+         ""
+       end
+       Oga.parse_xml(html.to_s)
      end
    end
  end
@@ -0,0 +1,34 @@
+ # frozen_string_literal: true
+
+ require "oga"
+
+ require_relative "utilities"
+
+ module Coelacanth
+   class Extractor
+     # Attempts final recovery strategies when all other probes fail.
+     class FallbackProbe
+       Result = Struct.new(
+         :title,
+         :node,
+         :published_at,
+         :byline,
+         :source_tag,
+         :confidence,
+         keyword_init: true
+       )
+
+       def call(doc:, url: nil)
+         body = doc.at_css("body") || doc
+         Result.new(
+           title: doc.at_css("title")&.text&.strip,
+           node: body,
+           published_at: nil,
+           byline: nil,
+           source_tag: :fallback,
+           confidence: 0.35
+         )
+       end
+     end
+   end
+ end
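
The README describes probes that run in order until one meets its confidence threshold, with `FallbackProbe` as the safety net. The code that chains them is not part of the hunks shown here, so the following is only a sketch of that idea: `ProbeResult`, `run_pipeline`, and the lambda probes are hypothetical, while the thresholds (`0.85`, `0.75`) and the fallback confidence (`0.35`) are taken from the diff.

```ruby
# Hypothetical result type; the gem's probes define their own Result structs.
ProbeResult = Struct.new(:source_tag, :confidence, keyword_init: true)

# Runs each probe in order and returns the first result that meets its threshold.
def run_pipeline(stages, doc)
  stages.each do |probe, threshold|
    result = probe.call(doc)
    return result if result && result.confidence >= threshold
  end
  nil
end

# Stand-in probes: metadata finds nothing, heuristic is below its threshold.
metadata  = ->(_doc) { nil }
heuristic = ->(_doc) { ProbeResult.new(source_tag: :heuristic, confidence: 0.60) }
fallback  = ->(_doc) { ProbeResult.new(source_tag: :fallback, confidence: 0.35) }

# The fallback stage gets threshold 0.0 so it always accepts its own result.
stages = [[metadata, 0.85], [heuristic, 0.75], [fallback, 0.0]]
winner = run_pipeline(stages, "<html></html>")

puts winner.source_tag # fallback
```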
@@ -0,0 +1,175 @@
+ # frozen_string_literal: true
+
+ require "oga"
+
+ require_relative "utilities"
+
+ module Coelacanth
+   class Extractor
+     # Scores DOM nodes based on simple heuristics to locate the primary article body.
+     class HeuristicProbe
+       Result = Struct.new(
+         :title,
+         :node,
+         :published_at,
+         :byline,
+         :source_tag,
+         :confidence,
+         keyword_init: true
+       )
+
+       BLOCK_SELECTOR = "article, main, section, div".freeze
+       TAG_WEIGHTS = Hash.new(0).merge(
+         "article" => 80,
+         "main" => 60,
+         "section" => 30,
+         "div" => 10
+       ).freeze
+       NEGATIVE_TOKENS = %w[nav footer header sidebar related share menu].freeze
+       POSITIVE_TOKENS = %w[content article body post entry text].freeze
+
+       def call(doc:, url: nil)
+         candidates = doc.css(BLOCK_SELECTOR).map do |node|
+           score_candidate(node)
+         end.compact
+
+         return if candidates.empty?
+
+         best = candidates.max_by { |candidate| candidate[:score] }
+         return if best[:score] < minimum_score
+
+         Result.new(
+           title: title_from_meta(doc),
+           node: expand(best[:node]),
+           published_at: published_at_from_meta(doc),
+           byline: byline_from_meta(doc),
+           source_tag: :heuristic,
+           confidence: confidence(best[:score])
+         )
+       end
+
+ private
52
+
53
+ def score_candidate(node)
54
+ text_length = Utilities.text_length(node)
55
+ return if text_length < 80
56
+
57
+ link_density = Utilities.link_density(node)
58
+ punct_density = Utilities.punctuation_density(node)
59
+ tag_weight = TAG_WEIGHTS[node.name]
60
+ class_weight = class_score(node)
61
+ depth_penalty = Utilities.depth(node) * 4
62
+ sibling_bonus = sibling_variance(node)
63
+
64
+ score = (
65
+ text_length * 0.35 +
66
+ punct_density * 280 -
67
+ link_density * 160 +
68
+ tag_weight +
69
+ class_weight +
70
+ sibling_bonus -
71
+ depth_penalty
72
+ )
73
+
74
+ { node: node, score: score }
75
+ end
76
+
77
+ def minimum_score
78
+ 95
79
+ end
80
+
81
+ def class_score(node)
82
+ tokens = Utilities.class_id_tokens(node)
83
+ score = tokens.count { |token| POSITIVE_TOKENS.include?(token) } * 40
84
+ score -= tokens.count { |token| NEGATIVE_TOKENS.include?(token) } * 60
85
+ score
86
+ end
87
+
88
+ def sibling_variance(node)
89
+ parent = node.parent
90
+ return 0 unless parent
91
+
92
+ siblings = Utilities.element_children(parent)
93
+ return 0 if siblings.length < 2
94
+
95
+ lengths = siblings.map { |sibling| Utilities.text_length(sibling) }
96
+ mean = lengths.sum.to_f / lengths.length
97
+ variance = lengths.map { |length| (length - mean)**2 }.sum.to_f / lengths.length
98
+ Math.sqrt(variance) * 0.25
99
+ end
100
+
101
+ def expand(node)
102
+ return node unless node.parent
103
+
104
+ before = neighboring_nodes(node, -1).reverse
105
+ after = neighboring_nodes(node, 1)
106
+
107
+ wrap_fragment(before + [node] + after)
108
+ end
109
+
110
+ def neighboring_nodes(node, direction)
111
+ siblings = []
112
+ current = node
113
+ loop do
114
+ current = if direction.negative?
115
+ Utilities.previous_element(current)
116
+ else
117
+ Utilities.next_element(current)
118
+ end
119
+ break unless current
120
+
121
+ break unless include_in_expansion?(current)
122
+
123
+ siblings << current
124
+ end
125
+ siblings
126
+ end
127
+
128
+ def include_in_expansion?(node)
129
+ %w[h1 h2 h3 h4 h5 h6 img blockquote p ul ol figure].include?(node.name)
130
+ end
131
+
132
+ def wrap_fragment(nodes)
133
+ container = Oga::XML::Element.new(name: "article")
134
+ nodes.each { |node| container.children << node }
135
+ container
136
+ end
137
+
138
+ def confidence(score)
139
+ return 0.0 if score.to_f <= 0.0
140
+
141
+ value = 1.0 / (1.0 + Math.exp(-(score - 100) / 12.0))
142
+ value.clamp(0.0, 0.95)
143
+ end
144
+
145
+ def title_from_meta(doc)
146
+ Utilities.meta_content(
147
+ doc,
148
+ "meta[property='og:title']",
149
+ "meta[name='twitter:title']",
150
+ "meta[name='title']"
151
+ ) || doc.at_css("title")&.text&.strip
152
+ end
153
+
154
+ def published_at_from_meta(doc)
155
+ Utilities.parse_time(
156
+ Utilities.meta_content(
157
+ doc,
158
+ "meta[property='article:published_time']",
159
+ "meta[name='pubdate']",
160
+ "meta[name='publish_date']",
161
+ "meta[name='date']"
162
+ )
163
+ )
164
+ end
165
+
166
+ def byline_from_meta(doc)
167
+ Utilities.meta_content(
168
+ doc,
169
+ "meta[name='author']",
170
+ "meta[property='article:author']"
171
+ )
172
+ end
173
+ end
174
+ end
175
+ end
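
`HeuristicProbe#confidence` maps the raw score onto a logistic curve centred at 100 with a scale of 12, capped at 0.95. The sketch below copies that formula into a free function to show how raw scores relate to the `0.75` threshold quoted in the README; `score_for` is a hypothetical inverse added only for illustration, and how the extractor actually applies the threshold is not shown in this diff.

```ruby
# Same logistic mapping as HeuristicProbe#confidence, copied out as a free function.
def confidence(score)
  return 0.0 if score.to_f <= 0.0

  value = 1.0 / (1.0 + Math.exp(-(score - 100) / 12.0))
  value.clamp(0.0, 0.95)
end

# Hypothetical helper: inverse of the logistic, i.e. the raw score needed for a target confidence.
def score_for(target)
  100 + 12.0 * Math.log(target / (1.0 - target))
end

puts confidence(100)           # 0.5, the midpoint of the curve
puts confidence(95).round(3)   # about 0.397 at the probe's minimum_score of 95
puts score_for(0.75).round(1)  # about 113.2, the score needed to reach 0.75
puts confidence(1000)          # 0.95, the cap applied by clamp
```

Under these numbers a candidate that only just clears `minimum_score` still reports a confidence well below 0.75, so a score of roughly 113 or more would be needed before the result meets that threshold.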