coelacanth 0.4.3 → 0.5.0 (RubyGems)

This diff shows the contents of publicly released package versions as they appear in their public registries. It is provided for informational purposes only.

Files changed (12)

checksums.yaml +4 -4
data/CHANGELOG.md +5 -4
data/README.md +69 -2
data/config/coelacanth.yml +21 -0
data/lib/coelacanth/extractor/eyecatch_image_extractor.rb +384 -0
data/lib/coelacanth/extractor/morphological_analyzer.rb +552 -0
data/lib/coelacanth/extractor/preprocessor.rb +166 -0
data/lib/coelacanth/extractor.rb +41 -6
data/lib/coelacanth/http.rb +28 -2
data/lib/coelacanth/version.rb +1 -1
data/lib/coelacanth.rb +7 -1
metadata +4 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 163fc75f51d17478d0314620279c1d23f4259fbd278046629a7c9bd293bf2eec
-  data.tar.gz: a47846ba9bb23797a40957a376b77c4e6229b61898046b0ce910a80fc07de83b
+  metadata.gz: 06a629b2865e5c4be5508a92637b2824bce0922b2de1209cee7c8f358ea8b438
+  data.tar.gz: 43eac188f8c3d27e975753ff459c444d9ca49dc6e83aa8d067dca75b3223db87
 SHA512:
-  metadata.gz: 9fb2695856b6eddefbaaccc853b53292dfa3fddaa3d83426ea461aba1485f808b2f5553c5f2a34c79c02185b17641e35cfddf83fc86d0676a4a6b9ed16309a34
-  data.tar.gz: 259a3244d4633134307d3668730ec1345ed5df1c3bfb39e07d4d44f96d3eb5324ae3b0ff6813b9fda8df3baf68eaab031594144d9728626b9540a24351ef48e0
+  metadata.gz: 4dc3c36802dce0be0e9deb9debdeccb5840bafa44a2613e6a59a270242b16f7f977d44b66a1e47472b9edf5a2a4026d057ffc91a490a776118dd493490e5ca9f
+  data.tar.gz: 3e558a85ab45b8f738be4c993413c279c7cb46bca7e44399e8fd40d1aa5e764b90865bbc51c7b38ad62be7501605e4f9d6edd3c0e8885ce8fe81830fda36d362

data/CHANGELOG.md CHANGED

@@ -4,8 +4,9 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
-## [v0.4.3] - 2025-11-05
-### :wrench: Chores
-- [`46fc62f`](https://github.com/slidict/coelacanth/commit/46fc62f1e8222fc878246a4c621ea1f5a6ceccc0) - Bump version to 0.4.3 *(commit by [@yubele](https://github.com/yubele))*
+## [v0.5.0] - 2025-11-08
+### :sparkles: New Features
+- [`d34ef32`](https://github.com/slidict/coelacanth/commit/d34ef32dbb969f7ef86dce6cd587c44a848ee32d) - add YouTube preprocessing support *(commit by [@yubele](https://github.com/yubele))*
+- [`2a566ad`](https://github.com/slidict/coelacanth/commit/2a566adeaaa5b813fded4b9ebd8ce8d90d43ee7c) - add morphological analysis for body markdown *(commit by [@yubele](https://github.com/yubele))*
-[v0.4.3]: https://github.com/slidict/coelacanth/compare/v0.4.2...v0.4.3
+[v0.5.0]: https://github.com/slidict/coelacanth/compare/v0.4.3...v0.5.0

data/README.md CHANGED

@@ -81,14 +81,33 @@ result = Coelacanth.analyze("https://example.com/article")
 result[:extraction] # => article metadata and body markdown
 result[:dom]        # => Oga DOM representation for downstream processing
 result[:screenshot] # => PNG screenshot as a binary string
+result[:response]   # => HTTP status, headers, and final URL
 ```
 The returned hash includes:
-- `:extraction` – output from `Coelacanth::Extractor`, including title, Markdown body (`body_markdown` and
-  `body_markdown_list`), images, listings, published date, and the probe source and confidence score.
+- `:extraction` – output from `Coelacanth::Extractor`, including title, Markdown body (`body_markdown`,
+  `body_markdown_list`, and scored morphemes in `body_morphemes`), the normalized plain-text body (`body_text`),
+  images, listings, published date, detected site name, and the probe source and confidence score. The extractor also echoes the
+  HTTP metadata it received via `response_metadata` for downstream consumers that only operate on the extraction payload.
 - `:dom` – a parsed Oga DOM if you need to traverse the document manually.
 - `:screenshot` – raw PNG data that you can persist or feed to other systems.
+- `:response` – HTTP metadata captured during the initial fetch.
+### Response and extraction metadata
+The `:response` key exposes a hash with the following keys:
+- `:status_code` – Numeric HTTP status (e.g., `200`).
+- `:headers` – A lowercase header hash as returned by `Net::HTTP#each_header`.
+- `:final_url` – The URL that was ultimately fetched after resolving redirects.
+Within the extraction payload (`result[:extraction]`), the following additional metadata is available:
+- `:site_name` – Site or application name inferred from Open Graph/Twitter meta tags or the document `<title>`.
+- `:body_text` – Plain-text body with collapsed whitespace, suitable for search indexing or summarization.
+- `:response_metadata` – Mirrors the top-level `:response` hash so downstream processing can access HTTP metadata without
+  carrying the entire analysis result.
 ## Extractor pipeline
 Coelacanth ships with a multi-stage extractor that tries increasingly involved probes until one meets its confidence target:
@@ -122,14 +141,47 @@ development:
       User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
   screenshot_one:
     key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
+  youtube:
+    api_key: "<%= ENV.fetch("COELACANTH_YOUTUBE_API_KEY", "") %>"
+  morphology:
+    latin_joiners:
+      - ","
+    japanese_hiragana_suffixes:
+      - "ら"
+      - "の"
+      - "え"
+    japanese_category_breaks:
+      - "katakana_to_kanji"
 ```
 - **Ferrum client** – Requires a running Chrome instance that exposes the DevTools protocol via WebSocket. Configure the URL,
   timeout, the network idle timeout, and any headers to inject.
 - **ScreenshotOne client** – Supply an API key to offload screenshot capture to [ScreenshotOne](https://screenshotone.com/).
+- **Eyecatch image extraction** – Representative images are discovered automatically by checking Open Graph/Twitter metadata,
+  Schema.org JSON-LD payloads, and high-signal `<img>` elements (hero/cover images, large dimensions, etc.). No manual XPath
+  maintenance is required.
+- **YouTube Data API** – Set an API key to turn YouTube watch URLs into structured articles using the video description and
+  thumbnail for downstream processing.
 - Configuration is environment-aware: set `RAILS_ENV`/`RACK_ENV` or use Rails' built-in environment handling when the gem is
   used inside a Rails project.
+#### Morphological analyzer tuning
+The terms returned in `body_morphemes` can be tuned per deployment by configuring the optional `morphology` section:
+- `morphology.latin_joiners` — An array of characters that should be treated as connectors between Latin tokens. The default
+  value includes a comma so numbers such as `7,000` stay intact instead of being split into separate terms.
+- `morphology.japanese_hiragana_suffixes` — A whitelist of Hiragana tokens that are allowed to extend Kanji sequences. By
+  default we keep common nominal suffixes such as `ら`, `の`, and the trailing `え` in `訴え` while preventing particles like `に`
+  from merging with the preceding noun. Provide your own list or set the value to `null`/`~` to allow any Hiragana suffix.
+- `morphology.japanese_category_breaks` — An array of transitions (e.g., `katakana_to_kanji`) that should stop Japanese token
+  sequences. This is useful when you want Katakana loanwords such as `タワマン` to stand alone instead of being merged with the
+  Kanji terms that follow them.
+Representative images are downloaded into a temporary directory using the built-in HTTP client. The extractor returns both the
+resolved URL and the local file path via `extraction[:eyecatch_image]`. Remember to move or delete the file once you have
+persisted it—temporary directories are not automatically cleaned up for long-running processes.
 ### Environment variables
 Configuration values that would otherwise contain credentials are loaded from environment variables. Set the following
@@ -141,11 +193,26 @@ export COELACANTH_REMOTE_CLIENT_AUTHORIZATION="Bearer <token>"
 export COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
 export COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
+export COELACANTH_YOUTUBE_API_KEY="your_youtube_data_api_key"
 ```
 If `COELACANTH_REMOTE_CLIENT_AUTHORIZATION` is omitted or left blank, the `Authorization` header is not injected into the
 remote browser session.
+### YouTube Data API integration
+With `COELACANTH_YOUTUBE_API_KEY` configured (or `youtube.api_key` populated directly in `config/coelacanth.yml`),
+`Coelacanth::Extractor` runs a preprocessor that recognizes standard YouTube watch URLs (`youtube.com`, `youtu.be`,
+`m.youtube.com`, etc.). The preprocessor fetches the video snippet from the YouTube Data API and builds an article-like HTML
+document that contains:
+- The video title and publish timestamp as structured metadata (JSON-LD and Open Graph).
+- The full description rendered as Markdown-friendly paragraphs.
+- The highest available thumbnail, passed to the eye-catch/image collector pipeline.
+If the API key is missing or the API request fails, the extractor falls back to the original HTML that was fetched from
+YouTube, so non-video pages continue to behave as before.
 When using Docker Compose, you can create a `.env` file or export the variables in your environment so the `app` service picks
 them up automatically.
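
The README changes above say the YouTube preprocessor recognizes watch URLs on `youtube.com`, `youtu.be`, and `m.youtube.com`. The diff shown here does not include the matching logic, so the following is only a minimal sketch of how such recognition could work. The helper name `youtube_video_id` and the `YOUTUBE_HOSTS` constant are hypothetical and are not taken from the gem.

```ruby
require "uri"

# Hypothetical helper, not part of coelacanth: pulls a video ID out of
# the common YouTube URL shapes named in the README.
YOUTUBE_HOSTS = %w[youtube.com www.youtube.com m.youtube.com].freeze

def youtube_video_id(url)
  uri = URI.parse(url)
  host = uri.host.to_s.downcase

  if host == "youtu.be"
    # Short links carry the ID as the path: https://youtu.be/<id>
    id = uri.path.to_s.delete_prefix("/")
    id.empty? ? nil : id
  elsif YOUTUBE_HOSTS.include?(host) && uri.path == "/watch"
    # Watch pages carry the ID in the "v" query parameter.
    URI.decode_www_form(uri.query.to_s).to_h["v"]
  end
rescue URI::InvalidURIError
  # Unparseable input is treated as "not a YouTube URL".
  nil
end
```

Under these assumptions, `youtube_video_id("https://youtu.be/abc123")` and `youtube_video_id("https://www.youtube.com/watch?v=abc123&t=10")` both return `"abc123"`, while a non-YouTube URL returns `nil`, which corresponds to the fallback path where the original HTML is used unchanged.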

data/config/coelacanth.yml CHANGED

@@ -11,6 +11,27 @@ development: &development
       User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
   screenshot_one:
     key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
+  youtube:
+    api_key: "<%= ENV.fetch("COELACANTH_YOUTUBE_API_KEY", "") %>"
+  morphology:
+    # Example configuration:
+    # latin_joiners:
+    #   - "'"
+    #   - "-"
+    # japanese_hiragana_suffixes:
+    #   - "さん"
+    #   - "ちゃん"
+    # japanese_category_breaks:
+    #   - "kanji_to_katakana"
+    #   - "katakana_to_kanji"
+    latin_joiners:
+      - ","
+    japanese_hiragana_suffixes:
+      - "ら"
+      - "の"
+      - "え"
+    japanese_category_breaks:
+      - "katakana_to_kanji"
 test:
   <<: *development
 production:
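
The `latin_joiners` default above (a single comma) is documented as keeping numbers such as `7,000` intact. The analyzer's own implementation is not shown in this section, so the sketch below only illustrates the idea with a hypothetical `latin_tokens` method. It assumes each joiner is a single character and that a joiner binds only when it sits between two alphanumeric runs.

```ruby
# Hypothetical tokenizer, not coelacanth's analyzer. A joiner character
# glues two alphanumeric runs together only when it is flanked by both.
def latin_tokens(text, joiners: [","])
  # Assumes single-character joiners; each one is escaped for use in a
  # regular-expression character class.
  joiner_class = joiners.map { |joiner| Regexp.escape(joiner) }.join

  pattern = if joiner_class.empty?
              /[A-Za-z0-9]+/
            else
              /[A-Za-z0-9]+(?:[#{joiner_class}][A-Za-z0-9]+)*/
            end

  text.scan(pattern)
end
```

With the default joiner, `latin_tokens("about 7,000 units, roughly")` yields `["about", "7,000", "units", "roughly"]`: the comma inside the number is kept, and the comma followed by a space is dropped. Passing `joiners: []` splits the number into `"7"` and `"000"`.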

data/lib/coelacanth/extractor/eyecatch_image_extractor.rb ADDED

@@ -0,0 +1,384 @@
+# frozen_string_literal: true
+require "json"
+require "set"
+require "tmpdir"
+require "uri"
+require_relative "utilities"
+require_relative "../http"
+module Coelacanth
+  class Extractor
+    # Finds and downloads the representative image for a document.
+    class EyecatchImageExtractor
+      Result = Struct.new(:url, :path, keyword_init: true)
+      POSITIVE_KEYWORDS = %w[eyecatch hero main featured cover headline banner article primary lead].freeze
+      NEGATIVE_KEYWORDS = %w[avatar icon logo emoji badge button profile author comment footer nav thumbnail thumb ad sponsor].freeze
+      METADATA_SOURCES = [
+        { selector: "meta[property='og:image:secure_url']", attribute: "content", score: 140 },
+        { selector: "meta[property='og:image:url']", attribute: "content", score: 135 },
+        { selector: "meta[property='og:image']", attribute: "content", score: 130 },
+        { selector: "meta[name='twitter:image:src']", attribute: "content", score: 125 },
+        { selector: "meta[name='twitter:image']", attribute: "content", score: 120 },
+        { selector: "meta[itemprop='image']", attribute: "content", score: 110 },
+        { selector: "meta[name='thumbnail']", attribute: "content", score: 100 },
+        { selector: "link[rel='image_src']", attribute: "href", score: 95 }
+      ].freeze
+      JSON_LD_IMAGE_KEYS = %w[image imageUrl imageURL thumbnail thumbnailUrl thumbnailURL contentUrl contentURL].freeze
+      LAZY_SOURCE_ATTRIBUTES = %w[data-src data-original data-lazy-src data-lazy data-url data-image data-preview src].freeze
+      def initialize(http_client: Coelacanth::HTTP)
+        @http_client = http_client
+      end
+      def call(doc:, base_url: nil)
+        return unless doc
+        image_url = locate_image_url(doc, base_url)
+        return unless image_url
+        download(image_url)
+      end
+      private
+      attr_reader :http_client
+      def locate_image_url(doc, base_url)
+        candidates = []
+        candidates.concat(metadata_candidates(doc, base_url))
+        candidates.concat(structured_data_candidates(doc, base_url))
+        candidates.concat(document_image_candidates(doc, base_url))
+        best_candidate(candidates)&.dig(:url)
+      end
+      def metadata_candidates(doc, base_url)
+        METADATA_SOURCES.flat_map do |source|
+          doc.css(source[:selector]).filter_map do |node|
+            value = node[source[:attribute]].to_s.strip
+            next if value.empty?
+            url = absolutize(base_url, value)
+            next unless url
+            {
+              url: url,
+              score: source[:score],
+              origin: :metadata
+            }
+          end
+        end
+      end
+      def structured_data_candidates(doc, base_url)
+        doc.css("script[type='application/ld+json']").flat_map do |script|
+          parse_structured_data(script).filter_map do |value|
+            url = absolutize(base_url, value)
+            next unless url
+            {
+              url: url,
+              score: 105,
+              origin: :structured_data
+            }
+          end
+        end
+      end
+      def parse_structured_data(script)
+        payload = script.text.to_s.strip
+        return [] if payload.empty?
+        Array(extract_images_from_jsonld(JSON.parse(payload)))
+      rescue JSON::ParserError
+        []
+      end
+      def extract_images_from_jsonld(data)
+        case data
+        when String
+          return [] unless valid_image_url?(data)
+          [data]
+        when Array
+          data.flat_map { |value| extract_images_from_jsonld(value) }
+        when Hash
+          urls = []
+          JSON_LD_IMAGE_KEYS.each do |key|
+            next unless data.key?(key)
+            urls.concat(Array(extract_images_from_jsonld(data[key])))
+          end
+          if data["@type"].to_s.casecmp("ImageObject").zero? && data["url"].to_s.strip != ""
+            urls << data["url"]
+          end
+          data.each_value do |value|
+            next unless value.is_a?(Array) || value.is_a?(Hash)
+            urls.concat(Array(extract_images_from_jsonld(value)))
+          end
+          urls
+        else
+          []
+        end
+      end
+      def document_image_candidates(doc, base_url)
+        doc.css("img").flat_map do |node|
+          sources_for(node).filter_map do |source|
+            url = absolutize(base_url, source[:url])
+            next unless url
+            score = 60
+            score += descriptor_bonus(source[:weight])
+            score += score_for_image_node(node, url)
+            {
+              url: url,
+              score: score,
+              origin: :document
+            }
+          end
+        end
+      end
+      def sources_for(node)
+        seen = Set.new
+        entries = []
+        LAZY_SOURCE_ATTRIBUTES.each do |attribute|
+          value = node[attribute]
+          next unless valid_image_url?(value)
+          next if seen.include?(value)
+          seen << value
+          entries << { url: value, weight: nil }
+        end
+        [node["srcset"], node["data-srcset"]].compact.each do |srcset|
+          parse_srcset(srcset).each do |entry|
+            next if seen.include?(entry[:url])
+            seen << entry[:url]
+            entries << entry
+          end
+        end
+        if node.parent&.name == "picture"
+          node.parent.css("source").each do |source|
+            [source["src"], source["data-src"]].compact.each do |value|
+              next unless valid_image_url?(value)
+              next if seen.include?(value)
+              seen << value
+              entries << { url: value, weight: nil }
+            end
+            [source["srcset"], source["data-srcset"]].compact.each do |srcset|
+              parse_srcset(srcset).each do |entry|
+                next if seen.include?(entry[:url])
+                seen << entry[:url]
+                entries << entry
+              end
+            end
+          end
+        end
+        entries
+      end
+      def parse_srcset(srcset)
+        return [] if srcset.to_s.strip.empty?
+        srcset.split(",").filter_map do |candidate|
+          parts = candidate.strip.split
+          url = parts[0].to_s.strip
+          next unless valid_image_url?(url)
+          descriptor = parts[1]
+          { url: url, weight: descriptor_weight(descriptor) }
+        end
+      end
+      def descriptor_weight(descriptor)
+        return nil if descriptor.to_s.empty?
+        if descriptor.end_with?("w")
+          descriptor.to_i
+        elsif descriptor.end_with?("x")
+          (descriptor.to_f * 1000).to_i
+        elsif descriptor.end_with?("h")
+          descriptor.to_i
+        else
+          descriptor.to_i
+        end
+      end
+      def descriptor_bonus(weight)
+        return 0 unless weight
+        case weight
+        when 0..399 then 0
+        when 400..799 then 8
+        when 800..1199 then 15
+        else
+          22
+        end
+      end
+      def score_for_image_node(node, url)
+        score = 0
+        tokens = Utilities.class_id_tokens(node).map(&:downcase)
+        score += tokens.count { |token| POSITIVE_KEYWORDS.include?(token) } * 25
+        score -= tokens.count { |token| NEGATIVE_KEYWORDS.include?(token) } * 30
+        alt_text = node["alt"].to_s.downcase
+        score += keyword_score(alt_text, 12)
+        score -= keyword_score(alt_text, 18, NEGATIVE_KEYWORDS)
+        src_score_text = url.downcase
+        score += keyword_score(src_score_text, 8)
+        score -= keyword_score(src_score_text, 16, NEGATIVE_KEYWORDS)
+        width = dimension_from(node["width"], node["data-width"]) || descriptor_dimension(node["srcset"]) || descriptor_dimension(node["data-srcset"])
+        height = dimension_from(node["height"], node["data-height"]) || width
+        score += 18 if width && width >= 700
+        score += 12 if height && height >= 400
+        score -= 20 if width && width <= 64
+        score -= 20 if height && height <= 64
+        ancestors = Utilities.ancestors(node)
+        score += 12 if ancestors.any? { |ancestor| ancestor.respond_to?(:name) && ancestor.name == "figure" }
+        score += 8 if ancestors.any? { |ancestor| ancestor.respond_to?(:name) && ancestor.name == "article" }
+        score -= 18 if ancestors.any? { |ancestor| ancestor.respond_to?(:name) && %w[footer aside nav].include?(ancestor.name) }
+        score
+      end
+      def keyword_score(text, value, keywords = POSITIVE_KEYWORDS)
+        return 0 if text.empty?
+        keywords.count { |keyword| text.include?(keyword) } * value
+      end
+      def dimension_from(*values)
+        values.compact.each do |value|
+          digits = value.to_s.scan(/[0-9]+/).first
+          return digits.to_i if digits
+        end
+        nil
+      end
+      def descriptor_dimension(srcset)
+        candidate = parse_srcset(srcset).max_by { |entry| entry[:weight].to_i }
+        candidate && candidate[:weight]
+      end
+      def valid_image_url?(value)
+        value = value.to_s.strip
+        return false if value.empty?
+        return false if value.match?(/\A(?:data|javascript):/i)
+        true
+      end
+      def best_candidate(candidates)
+        deduped = {}
+        candidates.each do |candidate|
+          next unless candidate[:url]
+          key = candidate[:url]
+          existing = deduped[key]
+          if !existing || candidate[:score] > existing[:score]
+            deduped[key] = candidate
+          end
+        end
+        deduped.values.max_by { |candidate| candidate[:score] }
+      end
+      def absolutize(base_url, value)
+        return if value.nil? || value.empty?
+        if base_url
+          Utilities.absolute_url(base_url, value)
+        else
+          value
+        end
+      rescue URI::Error
+        value
+      end
+      def download(url)
+        response = http_client.get_response(URI.parse(url))
+        return unless http_success?(response)
+        body = response.body.to_s
+        return if body.empty?
+        directory = Dir.mktmpdir("coelacanth-eyecatch-")
+        file_path = File.join(directory, filename_for(url, response))
+        File.binwrite(file_path, body)
+        Result.new(url: url, path: file_path)
+      rescue StandardError
+        nil
+      end
+      def http_success?(response)
+        return false unless response.respond_to?(:code)
+        response.code.to_i.between?(200, 299)
+      end
+      def filename_for(url, response)
+        uri = URI.parse(url)
+        candidate = File.basename(uri.path.to_s)
+        candidate = nil if candidate.nil? or candidate.empty? or candidate == "."
+        extension = File.extname(candidate.to_s)
+        if extension.empty?
+          extension = extension_for_content_type(response)
+          candidate = ["eyecatch", extension.delete_prefix(".")].compact.join(".")
+        end
+        candidate || "eyecatch#{extension_for_content_type(response)}"
+      rescue URI::Error
+        "eyecatch#{extension_for_content_type(response)}"
+      end
+      def extension_for_content_type(response)
+        content_type = if response.respond_to?(:content_type)
+                         response.content_type
+                       elsif response.respond_to?(:[])
+                         response["content-type"]
+                       end
+        content_type = content_type.to_s.split(";").first
+        case content_type
+        when "image/jpeg", "image/jpg" then ".jpg"
+        when "image/png" then ".png"
+        when "image/gif" then ".gif"
+        when "image/webp" then ".webp"
+        when "image/svg+xml" then ".svg"
+        else
+          ".bin"
+        end
+      end
+    end
+  end
+end
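
To make the `srcset` handling in the file above easier to follow, here is a simplified, standalone restatement of `descriptor_weight` and `descriptor_bonus`. The wrapper `srcset_bonuses` is a hypothetical helper added only for illustration. It skips the URL validation the real class performs and is not part of the gem's API.

```ruby
# Simplified restatement of the private helpers in the diff above.
def descriptor_weight(descriptor)
  return nil if descriptor.to_s.empty?

  # Density descriptors ("2x") are scaled by 1000 so they are comparable
  # with width descriptors ("800w"); everything else is read as an integer.
  descriptor.end_with?("x") ? (descriptor.to_f * 1000).to_i : descriptor.to_i
end

def descriptor_bonus(weight)
  case weight
  when nil then 0
  when 0..399 then 0
  when 400..799 then 8
  when 800..1199 then 15
  else 22
  end
end

# Hypothetical wrapper: maps each srcset URL to the bonus it would earn.
def srcset_bonuses(srcset)
  srcset.to_s.split(",").filter_map do |candidate|
    url, descriptor = candidate.strip.split
    next if url.nil? || url.empty?

    [url, descriptor_bonus(descriptor_weight(descriptor))]
  end.to_h
end
```

For `"small.jpg 320w, medium.jpg 640w, large.jpg 1280w, retina.jpg 2x"` this returns bonuses of 0, 8, 22, and 22 respectively. In the extractor these bonuses are added to the base score of 60 for `<img>` candidates. Splitting on commas is a simplification, so URLs that themselves contain commas would be split incorrectly.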