html2text 0.3.1 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (53)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -0
data/README.md +5 -5
data/lib/html2text/version.rb +3 -1
data/lib/html2text.rb +108 -106
metadata +65 -110
data/spec/examples/anchors.html +0 -12
data/spec/examples/anchors.txt +0 -5
data/spec/examples/basic.html +0 -21
data/spec/examples/basic.txt +0 -15
data/spec/examples/dom-processing.html +0 -8
data/spec/examples/dom-processing.txt +0 -1
data/spec/examples/empty.html +0 -0
data/spec/examples/empty.txt +0 -0
data/spec/examples/full_email.html +0 -220
data/spec/examples/full_email.txt +0 -54
data/spec/examples/huge-msoffice.html +0 -1
data/spec/examples/huge-msoffice.txt +0 -25872
data/spec/examples/images.html +0 -54
data/spec/examples/images.txt +0 -27
data/spec/examples/invalid.html +0 -4
data/spec/examples/invalid.txt +0 -1
data/spec/examples/lists.html +0 -24
data/spec/examples/lists.txt +0 -17
data/spec/examples/more-anchors.html +0 -14
data/spec/examples/more-anchors.txt +0 -7
data/spec/examples/msoffice.html +0 -1
data/spec/examples/msoffice.txt +0 -12
data/spec/examples/nbsp.html +0 -1
data/spec/examples/nbsp.txt +0 -1
data/spec/examples/nested-divs.html +0 -17
data/spec/examples/nested-divs.txt +0 -12
data/spec/examples/newlines.html +0 -50
data/spec/examples/newlines.txt +0 -35
data/spec/examples/non-breaking-spaces.html +0 -1
data/spec/examples/non-breaking-spaces.txt +0 -1
data/spec/examples/pre.html +0 -10
data/spec/examples/pre.txt +0 -8
data/spec/examples/table.html +0 -53
data/spec/examples/table.txt +0 -7
data/spec/examples/test3.html +0 -1
data/spec/examples/test3.txt +0 -2
data/spec/examples/test4.html +0 -1
data/spec/examples/test4.txt +0 -5
data/spec/examples/utf8-example.html +0 -4
data/spec/examples/utf8-example.txt +0 -2
data/spec/examples/windows-1252-example.html +0 -4
data/spec/examples/windows-1252-example.txt +0 -2
data/spec/examples/zero-width-non-joiners.html +0 -1
data/spec/examples/zero-width-non-joiners.txt +0 -1
data/spec/examples_spec.rb +0 -41
data/spec/html2text_spec.rb +0 -58
data/spec/spec_helper.rb +0 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7d1902161f7964cd95630662cfe326001842de6ae9cfc791216b2a5c2d6fc763
-  data.tar.gz: 4940f60ec3ea46df4a3117aa7c053d1b30b935c3114bddb81e8d6e81e29fccbb
+  metadata.gz: 32afc21e326c44b7881358081161b9581c396b167fad44614a96cc0b6df91f23
+  data.tar.gz: fe03a0811cbff965e6b720ad1fdfdd55c0aa1e03165c16c84de7ecac39d65c9d
 SHA512:
-  metadata.gz: cd7354466697fc737c336a6abf38e6c70a9480e7d609de135348d4f8b6ab765832929ccd5687fc88209a75d2f82932421a8a59fe8c0754121680d60a0a5f3496
-  data.tar.gz: 39337ef32bc46adf101c06fc33cc98d8960bf31ce1816fde93dfb1a8a6aa75381b28114a8ff0ad363c5335f2bd61df9766ece0ef8c2b325c28d261e9a3552f7b
+  metadata.gz: e26ef2f826da8958c56a390bd4242461abdd26a110216ee3903c902e007b5fb26b38a8f358b420abe6eac480b4a452fedba90b75a36eb7f3f0c2bec4dad040a7
+  data.tar.gz: 31515d14c3ca612f2eb9faaf52655639c3c0b72687fed89c41d83fad816af852854057e3ff0803e9f16e5e1d2b62657699cedef5b9240ab114826d936d0ed3c0

data/CHANGELOG.md CHANGED

@@ -6,6 +6,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.4.0] - 2024-06-08
+### Added
+- Switch from Travis to Github Actions for Build and Test
+- Add rubocop for linting and cleanup existing violations ([#36](https://github.com/soundasleep/html2text_ruby/pull/36))
+### Changed
+- Add support for Ruby 3.x, removed support for Ruby < 3.0 since it is EOL
+- Allow subclassing of `Html2Text` to override the default behaviour ([#30](https://github.com/soundasleep/html2text_ruby/pull/30))
+### Fixed
+- Loosen nokogiri dependency to allow for nokogiri < 2.0 ([#17](https://github.com/soundasleep/html2text_ruby/pull/17))
+- Fix `NoMethodError` when parsing nodes with no name ([#15](https://github.com/soundasleep/html2text_ruby/pull/15))
 ## [0.3.1] - 2019-06-12
 ### Security
 - Bumped nokogiri requirement to ~> 1.10.3, resolving [CVE-2019-11068](https://nvd.nist.gov/vuln/detail/CVE-2019-11068)
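
The "Allow subclassing" entry above appears to correspond to the change in `self.convert` from `Html2Text.new(doc).convert` to `new(doc).convert` shown later in this diff. The following is a minimal sketch of why that matters. It uses hypothetical stand-in classes (`Converter`, `ShoutingConverter`, `convert_pinned`, `decorate`), not the gem's own code, so that it runs without nokogiri.

```ruby
# Hypothetical stand-ins: they only mirror the shape of the change,
# not the real Html2Text implementation.
class Converter
  # Old style: the class name is hard-coded, so a subclass calling this
  # method still gets an instance of the base class.
  def self.convert_pinned(text)
    Converter.new(text).convert
  end

  # New style: a bare `new` is sent to whichever class received the call,
  # so subclasses build instances of themselves.
  def self.convert(text)
    new(text).convert
  end

  def initialize(text)
    @text = text
  end

  def convert
    decorate(@text)
  end

  private

  # Hook that a subclass may override.
  def decorate(text)
    text
  end
end

class ShoutingConverter < Converter
  private

  def decorate(text)
    text.upcase
  end
end

pinned = ShoutingConverter.convert_pinned('hello')
dispatched = ShoutingConverter.convert('hello')
```

With the pinned form the subclass override is never reached, while the bare `new` form picks it up. Under that assumption, a subclass of `Html2Text` in 0.4.0 could override one of its instance methods and still call the inherited `convert`.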

data/README.md CHANGED

@@ -1,5 +1,5 @@
-html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby) [![Total Downloads](https://ruby-gem-downloads-badge.herokuapp.com/html2text?type=total&metric=true)](https://rubygems.org/gems/html2text/)
-==============
+html2text ![Build](https://github.com/soundasleep/html2text_ruby/actions/workflows/build.yml/badge.svg) [![Gem Version](https://badge.fury.io/rb/html2text.svg)](https://rubygems.org/gems/html2text)
+---
 `html2text` is a very simple gem that uses DOM methods to convert HTML into a format similar to what would be
 rendered by a browser - perfect for places where you need a quick text representation. For example:
@@ -20,7 +20,7 @@ rendered by a browser - perfect for places where you need a quick text represent
   <div>Another div</div>
   <div>A div<div>within a div</div></div>
-  <a href="http://foo.com">A link</a>
+  <a href="https://foo.com">A link</a>
 </body>
 </html>
@@ -40,7 +40,7 @@ Another div
 A div
 within a div
-[A link](http://foo.com)
+[A link](https://foo.com)
 ```
 See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/19818) or the related [StackOverflow answer](http://stackoverflow.com/a/2564472/39531).
@@ -63,7 +63,7 @@ text = Html2Text.convert(html)
 ## Tests
-See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle && rspec`.
+See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle exec rake`.
 ## License
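
The `[A link](https://foo.com)` output in the README example follows the href branch of `wrap_link` in `lib/html2text.rb`, which appears further down in this diff. A simplified, standalone restatement of that one rule might look like the sketch below. The helper name `format_link` is hypothetical, and the sketch deliberately ignores the `title` and `name` attributes, links without an `href`, and the extra newline added before headings.

```ruby
# Hypothetical helper, not part of the gem: restates only the href
# comparison from wrap_link as a pure function on strings.
def format_link(text, href)
  text = text.strip

  # When the link text already is the target (optionally minus a
  # mailto:, http:// or https:// prefix), the text is left as-is.
  redundant = [
    text,
    "mailto:#{text}",
    "http://#{text}",
    "https://#{text}"
  ].include?(href)
  return text if redundant

  # A link with no visible text falls back to the bare URL.
  return href if text.empty?

  "[#{text}](#{href})"
end

readme_case = format_link('A link', 'https://foo.com')
bare_domain = format_link('foo.com', 'https://foo.com')
empty_text  = format_link('', 'https://foo.com')
```

Keeping the rule as a pure function makes the three outcomes (Markdown-style link, unchanged text, bare URL) easy to check without parsing any HTML.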

data/lib/html2text/version.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 class Html2Text
-  VERSION = "0.3.1"
+  VERSION = '0.4.0'
 end

data/lib/html2text.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require 'nokogiri'
 class Html2Text
@@ -10,14 +12,14 @@ class Html2Text
   def self.convert(html)
     html = html.to_s
-    if is_office_document?(html)
+    if office_document?(html)
       # Emulate the CSS rendering of Office documents
-      html = html.gsub("<p class=MsoNormal>", "<br>")
-        .gsub("<o:p>&nbsp;</o:p>", "<br>")
-        .gsub("<o:p></o:p>", "")
+      html = html.gsub('<p class=MsoNormal>', '<br>')
+                 .gsub('<o:p>&nbsp;</o:p>', '<br>')
+                 .gsub('<o:p></o:p>', '')
     end
-    if !html.include?("<html")
+    unless html.include?('<html')
       # Stop Nokogiri from inserting in <p> tags
       html = "<div>#{html}</div>"
     end
@@ -25,25 +27,29 @@ class Html2Text
     html = fix_newlines(replace_entities(html))
     doc = Nokogiri::HTML(html)
-    Html2Text.new(doc).convert
+    new(doc).convert
   end
   def self.fix_newlines(text)
+    # rubocop:disable Performance/StringReplacement
     text.gsub("\r\n", "\n").gsub("\r", "\n")
+    # rubocop:enable Performance/StringReplacement
   end
   def self.replace_entities(text)
-    text.gsub("&nbsp;", " ").gsub("\u00a0", " ").gsub("&zwnj;", "")
+    # rubocop:disable Performance/StringReplacement
+    text.gsub('&nbsp;', ' ').gsub("\u00a0", ' ').gsub('&zwnj;', '')
+    # rubocop:enable Performance/StringReplacement
   end
   def convert
     output = iterate_over(doc)
     output = remove_leading_and_trailing_whitespace(output)
     output = remove_unnecessary_empty_lines(output)
-    return output.strip
+    output.strip
   end
-  DO_NOT_TOUCH_WHITESPACE = "<do-not-touch-whitespace>"
+  DO_NOT_TOUCH_WHITESPACE = '<do-not-touch-whitespace>'
   def remove_leading_and_trailing_whitespace(text)
     # ignore any <pre> blocks, which we don't want to interact with
@@ -51,22 +57,22 @@ class Html2Text
     output = []
     pre_blocks.each.with_index do |block, index|
-      if index % 2 == 0
-        output << block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
-      else
-        output << block
-      end
+      output << if index.even?
+                  block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+                else
+                  block
+                end
     end
-    output.join("")
+    output.join
   end
-  private
-  def self.is_office_document?(text)
-    text.include?("urn:schemas-microsoft-com:office")
+  private_class_method def self.office_document?(text)
+    text.include?('urn:schemas-microsoft-com:office')
   end
+  private
   def remove_unnecessary_empty_lines(text)
     text.gsub(/\n\n\n*/im, "\n\n")
   end
@@ -75,187 +81,183 @@ class Html2Text
     # Replace whitespace characters with a space (equivalent to \s)
     # and force any text encoding into UTF-8
     if text.valid_encoding?
-      text.gsub(/[\t\n\f\r ]+/im, " ")
+      text.gsub(/[\t\n\f\r ]+/im, ' ')
     else
-      text.force_encoding("WINDOWS-1252")
-      return trimmed_whitespace(text.encode("UTF-16be", invalid: :replace, replace: "?").encode('UTF-8'))
+      text.force_encoding('WINDOWS-1252')
+      trimmed_whitespace(text.encode('UTF-16be', invalid: :replace, replace: '?').encode('UTF-8'))
     end
   end
   def iterate_over(node)
-    return "\n" if node.name.downcase == "br" && next_node_is_text?(node)
+    return "\n" if node.name.downcase == 'br' && next_node_is_text?(node)
     return trimmed_whitespace(node.text) if node.text?
-    if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
-      return ""
-    end
+    return '' if %w[style head title meta script].include?(node.name.downcase)
-    if node.name.downcase == "pre"
-      return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}"
-    end
+    return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}" if node.name.downcase == 'pre'
     output = []
     output << prefix_whitespace(node)
     output += node.children.map do |child|
-      iterate_over(child)
+      iterate_over(child) unless child.name.nil?
     end
     output << suffix_whitespace(node)
-    output = output.compact.join("") || ""
+    output = output.compact.join || ''
-    if node.name.downcase == "a"
-      output = wrap_link(node, output)
-    elsif node.name.downcase == "img"
-      output = image_text(node)
+    unless node.name.nil?
+      if node.name.downcase == 'a'
+        output = wrap_link(node, output)
+      elsif node.name.downcase == 'img'
+        output = image_text(node)
+      end
     end
-    return output
+    output
   end
+  # rubocop:disable Lint/DuplicateBranch
   def prefix_whitespace(node)
     case node.name.downcase
-      when "hr"
-        "\n---------------------------------------------------------------\n"
+    when 'hr'
+      "\n---------------------------------------------------------------\n"
-      when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
-        "\n\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol', 'ul'
+      "\n\n"
-      when "p"
-        "\n\n"
+    when 'p'
+      "\n\n"
-      when "tr"
-        "\n"
+    when 'tr'
+      "\n"
-      when "div"
-        if node.parent.name == "div" && (node.parent.text.strip == node.text.strip)
-          ""
-        else
-          "\n"
-        end
+    when 'div'
+      if node.parent.name == 'div' && (node.parent.text.strip == node.text.strip)
+        ''
+      else
+        "\n"
+      end
-      when "td", "th"
-        "\t"
+    when 'td', 'th'
+      "\t"
-      when "li"
-        "- "
+    when 'li'
+      '- '
     end
   end
+  # rubocop:enable Lint/DuplicateBranch
+  # rubocop:disable Lint/DuplicateBranch
   def suffix_whitespace(node)
     case node.name.downcase
-      when "h1", "h2", "h3", "h4", "h5", "h6"
-        # add another line
-        "\n\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'
+      # add another line
+      "\n\n"
-      when "p"
-        "\n\n"
+    when 'p'
+      "\n\n"
-      when "br"
-        if next_node_name(node) != "div" && next_node_name(node) != nil
-          "\n"
-        end
+    when 'br'
+      "\n" if next_node_name(node) != 'div' && !next_node_name(node).nil?
-      when "li"
-        "\n"
+    when 'li'
+      "\n"
-      when "div"
-        if next_node_is_text?(node)
-          "\n"
-        elsif next_node_name(node) != "div" && next_node_name(node) != nil
-          "\n"
-        end
+    when 'div'
+      if next_node_is_text?(node)
+        "\n"
+      elsif next_node_name(node) != 'div' && !next_node_name(node).nil?
+        "\n"
+      end
     end
   end
+  # rubocop:enable Lint/DuplicateBranch
   # links are returned in [text](link) format
   def wrap_link(node, output)
-    href = node.attribute("href")
-    name = node.attribute("name")
+    href = node.attribute('href')
+    name = node.attribute('name')
     output = output.strip
     # remove double [[ ]]s from linking images
-    if output[0] == "[" && output[-1] == "]"
+    if output[0] == '[' && output[-1] == ']'
       output = output[1, output.length - 2]
       # for linking images, the title of the <a> overrides the title of the <img>
-      if node.attribute("title")
-        output = node.attribute("title").to_s
-      end
+      output = node.attribute('title').to_s if node.attribute('title')
     end
     # if there is no link text, but a title attr
-    if output.empty? && node.attribute("title")
-      output = node.attribute("title").to_s
-    end
+    output = node.attribute('title').to_s if output.empty? && node.attribute('title')
     if href.nil?
-      if !name.nil?
-        output = "[#{output}]"
-      end
+      output = "[#{output}]" unless name.nil?
     else
       href = href.to_s
       if href != output && href != "mailto:#{output}" &&
-          href != "http://#{output}" && href != "https://#{output}"
-        if output.empty?
-          output = href
-        else
-          output = "[#{output}](#{href})"
-        end
+         href != "http://#{output}" && href != "https://#{output}"
+        output = if output.empty?
+                   href
+                 else
+                   "[#{output}](#{href})"
+                 end
       end
     end
     case next_node_name(node)
-      when "h1", "h2", "h3", "h4", "h5", "h6"
-        output += "\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'
+      output += "\n"
     end
     output
   end
   def image_text(node)
-    if node.attribute("title")
-      "[" + node.attribute("title").to_s + "]"
-    elsif node.attribute("alt")
-      "[" + node.attribute("alt").to_s + "]"
+    if node.attribute('title')
+      "[#{node.attribute('title')}]"
+    elsif node.attribute('alt')
+      "[#{node.attribute('alt')}]"
     else
-      ""
+      ''
     end
   end
   def next_node_name(node)
     next_node = node.next_sibling
-    while next_node != nil
+    until next_node.nil?
       break if next_node.element?
       next_node = next_node.next_sibling
     end
-    if next_node && next_node.element?
-      next_node.name.downcase
-    end
+    return unless next_node&.element?
+    next_node.name.downcase
   end
   def next_node_is_text?(node)
-    return !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
+    !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
   end
   def previous_node_name(node)
     previous_node = node.previous_sibling
-    while previous_node != nil
+    until previous_node.nil?
       break if previous_node.element?
       previous_node = previous_node.previous_sibling
     end
-    if previous_node && previous_node.element?
-      previous_node.name.downcase
-    end
+    return unless previous_node&.element?
+    previous_node.name.downcase
   end
   def previous_node_is_text?(node)
-    return !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
+    !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
   end
   # def previous_node_is_not_text?(node)