RubyGems - html2text - Versions diffs - 0.2.0 → 0.4.0 - Mend

html2text 0.2.0 → 0.4.0

This diff shows the contents of publicly available package versions as released to one of the supported registries. It is provided for informational purposes only and reflects the changes between those versions as they appear in their public registries.

Files changed (29)

checksums.yaml +5 -13
data/CHANGELOG.md +50 -0
data/README.md +19 -14
data/lib/html2text/version.rb +3 -1
data/lib/html2text.rb +158 -69
metadata +90 -73
data/spec/examples/anchors.html +0 -12
data/spec/examples/anchors.txt +0 -5
data/spec/examples/basic.html +0 -21
data/spec/examples/basic.txt +0 -13
data/spec/examples/full_email.html +0 -220
data/spec/examples/full_email.txt +0 -54
data/spec/examples/images.html +0 -54
data/spec/examples/images.txt +0 -27
data/spec/examples/lists.html +0 -24
data/spec/examples/lists.txt +0 -17
data/spec/examples/more-anchors.html +0 -14
data/spec/examples/more-anchors.txt +0 -7
data/spec/examples/nbsp.html +0 -1
data/spec/examples/nbsp.txt +0 -1
data/spec/examples/table.html +0 -53
data/spec/examples/table.txt +0 -7
data/spec/examples/test3.html +0 -1
data/spec/examples/test3.txt +0 -2
data/spec/examples/test4.html +0 -1
data/spec/examples/test4.txt +0 -5
data/spec/examples_spec.rb +0 -29
data/spec/html2text_spec.rb +0 -37
data/spec/spec_helper.rb +0 -4

checksums.yaml CHANGED

@@ -1,15 +1,7 @@
 ---
-!binary "U0hBMQ==":
-  metadata.gz: !binary |-
-    Mjk5MjBiMzliYjc0Y2IyNDRkOThkNTJhNTBjNGFlZTMzNjM5NTU0YQ==
-  data.tar.gz: !binary |-
-    NjlhZDRjZjg4MjhjMjcxNGJkNzcyMDg5Mzk0Y2Q0MjA4MTM2MDJmMg==
+SHA256:
+  metadata.gz: 32afc21e326c44b7881358081161b9581c396b167fad44614a96cc0b6df91f23
+  data.tar.gz: fe03a0811cbff965e6b720ad1fdfdd55c0aa1e03165c16c84de7ecac39d65c9d
 SHA512:
-  metadata.gz: !binary |-
-    MDAxNDJiYzY3Mjg1NjhiMWMzOGFmM2U5ZjJkNzQ0MGYwMTFiYjM5Njg0N2M0
-    OGU4NGM3ZjYwZGJjYzdmZWFlZWUyMzBkNTI1MzIxZDFhMjIwM2E1ZmI2NDI0
-    ZDk3ODViYmRkZGQ4MWUwNmRkMzFmOTE2NjQ3ZWRkZmQ0M2NlYzI=
-  data.tar.gz: !binary |-
-    OWQ3MzM4ZTkyODA2ZmE0YThjZTA5MjhjYTQ1YzNiYjhjMzJmNWUyMDViNDE5
-    NGMxNGJjZDAwYzZjODJlYWRhOTc5NjY0YmFhNTZlOGFlMzNiNzE1ODE5Njgw
-    MmY0ODNmZDMzZTdkNjNjNTBmNTRmNzBjNTY3NDNhMjg0YjlmZWQ=
+  metadata.gz: e26ef2f826da8958c56a390bd4242461abdd26a110216ee3903c902e007b5fb26b38a8f358b420abe6eac480b4a452fedba90b75a36eb7f3f0c2bec4dad040a7
+  data.tar.gz: 31515d14c3ca612f2eb9faaf52655639c3c0b72687fed89c41d83fad816af852854057e3ff0803e9f16e5e1d2b62657699cedef5b9240ab114826d936d0ed3c0
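
The removed entries carry YAML `!binary` tags, so both the key and the digests were stored Base64-encoded, while the added entries are plain hexadecimal. As a minimal sketch (variable names are illustrative and not part of the gem), the old key and the first old digest can be decoded with core Ruby to see what they held:

```ruby
# Values copied verbatim from the removed lines of the checksums.yaml hunk.
old_key = 'U0hBMQ=='
old_metadata_digest = 'Mjk5MjBiMzliYjc0Y2IyNDRkOThkNTJhNTBjNGFlZTMzNjM5NTU0YQ=='

# String#unpack1 with the 'm' directive decodes Base64 (core Ruby, no require needed).
algorithm = old_key.unpack1('m')
digest_hex = old_metadata_digest.unpack1('m')

puts algorithm          # name of the digest algorithm used by 0.2.0
puts digest_hex.length  # number of hex characters in the decoded digest
```

Read this way, the hunk replaces a Base64-wrapped SHA1 section with a plain-hex SHA256 section and keeps SHA512, now also in plain hex.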

data/CHANGELOG.md ADDED

@@ -0,0 +1,50 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [0.4.0] - 2024-06-08
+### Added
+- Switch from Travis to Github Actions for Build and Test
+- Add rubocop for linting and cleanup existing violations ([#36](https://github.com/soundasleep/html2text_ruby/pull/36))
+### Changed
+- Add support for Ruby 3.x, removed support for Ruby < 3.0 since it is EOL
+- Allow subclassing of `Html2Text` to override the default behaviour ([#30](https://github.com/soundasleep/html2text_ruby/pull/30))
+### Fixed
+- Loosen nokogiri dependency to allow for nokogiri < 2.0 ([#17](https://github.com/soundasleep/html2text_ruby/pull/17))
+- Fix `NoMethodError` when parsing nodes with no name ([#15](https://github.com/soundasleep/html2text_ruby/pull/15))
+## [0.3.1] - 2019-06-12
+### Security
+- Bumped nokogiri requirement to ~> 1.10.3, resolving [CVE-2019-11068](https://nvd.nist.gov/vuln/detail/CVE-2019-11068)
+  ([#8](https://github.com/soundasleep/html2text_ruby/issues/8))
+## [0.3.0] - 2019-02-15
+### Added
+- Zero-width non-joiners are now stripped ([#5](https://github.com/soundasleep/html2text_ruby/pull/5))
+- Support both UTF-8 and Windows-1252 encoded files
+- Support converting `<pre>` blocks, including whitespace within these blocks
+- MS Office (MsoNormal) documents are now rendered closer to actual render output
+  - Note this assumes that the input MS Office document has standard `MsoNormal` CSS.
+    This component is _not_ designed to try and interpret CSS within an HTML document.
+### Changed
+- Behaviour with multiple and nested `<p>`, `<div>` tags has been improved to be more in line with
+  actual browser render behaviour (see test suite)
+### Fixed
+- Update nokogiri dependency to 1.8.5
+## [0.2.1] - 2017-09-27
+### Fixed
+- Convert non-string input into strings ([#3](https://github.com/soundasleep/html2text_ruby/pull/3))
+[Unreleased]: https://github.com/soundasleep/html2text_ruby/compare/0.3.1...HEAD
+[0.3.1]: https://github.com/soundasleep/html2text_ruby/compare/0.3.0...0.3.1
+[0.3.0]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.3.0
+[0.2.1]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.2.1

data/README.md CHANGED

@@ -1,7 +1,8 @@
-html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby)
-==============
+html2text ![Build](https://github.com/soundasleep/html2text_ruby/actions/workflows/build.yml/badge.svg) [![Gem Version](https://badge.fury.io/rb/html2text.svg)](https://rubygems.org/gems/html2text)
+---
-`html2text` is a very simple script that uses Ruby's DOM methods to load HTML from a string, and then iterates over the resulting DOM to correctly output plain text. For example:
+`html2text` is a very simple gem that uses DOM methods to convert HTML into a format similar to what would be
+rendered by a browser - perfect for places where you need a quick text representation. For example:
 ```html
 <html>
@@ -19,7 +20,7 @@ html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?
   <div>Another div</div>
   <div>A div<div>within a div</div></div>
-  <a href="http://foo.com">A link</a>
+  <a href="https://foo.com">A link</a>
 </body>
 </html>
@@ -33,18 +34,26 @@ Hello, World!
 This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
 Even mismatched tags.
 A div
 Another div
 A div
 within a div
-[A link](http://foo.com)
+[A link](https://foo.com)
 ```
 See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/19818) or the related [StackOverflow answer](http://stackoverflow.com/a/2564472/39531).
 ## Installing
-TODO Install the gem, then you can:
+Add [the gem](https://rubygems.org/gems/html2text) into your Gemfile and run `bundle install`:
+```ruby
+gem 'html2text'
+```
+Then you can:
 ```ruby
 require 'html2text'
@@ -54,17 +63,13 @@ text = Html2Text.convert(html)
 ## Tests
-See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with:
-```
-bundle install
-rspec
-```
+See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle exec rake`.
 ## License
-`html2text` is licensed under MIT.
+`html2text` is [licensed under MIT](LICENSE.md).
 ## Other versions
-Also see [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
+1. [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
+2. [actionmailer-html2text](https://github.com/soundasleep/actionmailer-html2text), automatically generate text parts for HTML emails sent with ActionMailer.

data/lib/html2text/version.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 class Html2Text
-  VERSION = "0.2.0"
+  VERSION = '0.4.0'
 end

data/lib/html2text.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require 'nokogiri'
 class Html2Text
@@ -8,18 +10,36 @@ class Html2Text
   end
   def self.convert(html)
+    html = html.to_s
+    if office_document?(html)
+      # Emulate the CSS rendering of Office documents
+      html = html.gsub('<p class=MsoNormal>', '<br>')
+                 .gsub('<o:p>&nbsp;</o:p>', '<br>')
+                 .gsub('<o:p></o:p>', '')
+    end
+    unless html.include?('<html')
+      # Stop Nokogiri from inserting in <p> tags
+      html = "<div>#{html}</div>"
+    end
     html = fix_newlines(replace_entities(html))
     doc = Nokogiri::HTML(html)
-    Html2Text.new(doc).convert
+    new(doc).convert
   end
   def self.fix_newlines(text)
+    # rubocop:disable Performance/StringReplacement
     text.gsub("\r\n", "\n").gsub("\r", "\n")
+    # rubocop:enable Performance/StringReplacement
   end
   def self.replace_entities(text)
-    text.gsub("&nbsp;", " ").gsub("\u00a0", " ")
+    # rubocop:disable Performance/StringReplacement
+    text.gsub('&nbsp;', ' ').gsub("\u00a0", ' ').gsub('&zwnj;', '')
+    # rubocop:enable Performance/StringReplacement
   end
   def convert
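
The hunk above adds `to_s` coercion and zero-width non-joiner stripping to the pre-parse normalisation. A minimal standalone sketch of that normalisation follows. The `normalize` helper is hypothetical and exists only for illustration: it chains the same `gsub` calls as `replace_entities` and `fix_newlines`, and it does not call the gem or Nokogiri.

```ruby
# Hypothetical helper mirroring the normalisation steps in the hunk above;
# it is not part of the gem's API.
def normalize(text)
  text.to_s                                                       # accept non-string input
      .gsub('&nbsp;', ' ').gsub("\u00a0", ' ').gsub('&zwnj;', '') # entities
      .gsub("\r\n", "\n").gsub("\r", "\n")                        # newlines
end

sample = "Hello,&nbsp;World!\r\nzero&zwnj;width\rdone"
puts normalize(sample).inspect
puts normalize(42).inspect
```

Replacing `\r\n` before the lone `\r` matters: in the opposite order a Windows line ending would become two newlines.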
@@ -29,149 +49,218 @@ class Html2Text
     output.strip
   end
+  DO_NOT_TOUCH_WHITESPACE = '<do-not-touch-whitespace>'
   def remove_leading_and_trailing_whitespace(text)
-    text.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+    # ignore any <pre> blocks, which we don't want to interact with
+    pre_blocks = text.split(DO_NOT_TOUCH_WHITESPACE)
+    output = []
+    pre_blocks.each.with_index do |block, index|
+      output << if index.even?
+                  block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+                else
+                  block
+                end
+    end
+    output.join
+  end
+  private_class_method def self.office_document?(text)
+    text.include?('urn:schemas-microsoft-com:office')
   end
+  private
   def remove_unnecessary_empty_lines(text)
     text.gsub(/\n\n\n*/im, "\n\n")
   end
   def trimmed_whitespace(text)
     # Replace whitespace characters with a space (equivalent to \s)
-    text.gsub(/[\t\n\f\r ]+/im, " ")
-  end
-  def next_node_name(node)
-    next_node = node.next_sibling
-    while next_node != nil
-      break if next_node.element?
-      next_node = next_node.next_sibling
-    end
-    if next_node && next_node.element?
-      next_node.name.downcase
+    # and force any text encoding into UTF-8
+    if text.valid_encoding?
+      text.gsub(/[\t\n\f\r ]+/im, ' ')
+    else
+      text.force_encoding('WINDOWS-1252')
+      trimmed_whitespace(text.encode('UTF-16be', invalid: :replace, replace: '?').encode('UTF-8'))
     end
   end
   def iterate_over(node)
+    return "\n" if node.name.downcase == 'br' && next_node_is_text?(node)
     return trimmed_whitespace(node.text) if node.text?
-    if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
-      return ""
-    end
+    return '' if %w[style head title meta script].include?(node.name.downcase)
+    return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}" if node.name.downcase == 'pre'
     output = []
     output << prefix_whitespace(node)
     output += node.children.map do |child|
-      iterate_over(child)
+      iterate_over(child) unless child.name.nil?
     end
     output << suffix_whitespace(node)
-    output = output.compact.join("") || ""
+    output = output.compact.join || ''
-    if node.name.downcase == "a"
-      output = wrap_link(node, output)
-    end
-    if node.name.downcase == "img"
-      output = image_text(node)
+    unless node.name.nil?
+      if node.name.downcase == 'a'
+        output = wrap_link(node, output)
+      elsif node.name.downcase == 'img'
+        output = image_text(node)
+      end
     end
     output
   end
+  # rubocop:disable Lint/DuplicateBranch
   def prefix_whitespace(node)
     case node.name.downcase
-      when "hr"
-        "---------------------------------------------------------------\n"
+    when 'hr'
+      "\n---------------------------------------------------------------\n"
-      when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
-        "\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol', 'ul'
+      "\n\n"
+    when 'p'
+      "\n\n"
+    when 'tr'
+      "\n"
-      when "tr", "p", "div"
+    when 'div'
+      if node.parent.name == 'div' && (node.parent.text.strip == node.text.strip)
+        ''
+      else
         "\n"
+      end
-      when "td", "th"
-        "\t"
+    when 'td', 'th'
+      "\t"
-      when "li"
-        "- "
+    when 'li'
+      '- '
     end
   end
+  # rubocop:enable Lint/DuplicateBranch
+  # rubocop:disable Lint/DuplicateBranch
   def suffix_whitespace(node)
     case node.name.downcase
-      when "h1", "h2", "h3", "h4", "h5", "h6"
-        # add another line
-        "\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'
+      # add another line
+      "\n\n"
-      when "p", "br"
-        "\n" if next_node_name(node) != "div"
+    when 'p'
+      "\n\n"
-      when "li"
-        "\n"
+    when 'br'
+      "\n" if next_node_name(node) != 'div' && !next_node_name(node).nil?
+    when 'li'
+      "\n"
-      when "div"
-        # add one line only if the next child isn't a div
-        "\n" if next_node_name(node) != "div" && next_node_name(node) != nil
+    when 'div'
+      if next_node_is_text?(node)
+        "\n"
+      elsif next_node_name(node) != 'div' && !next_node_name(node).nil?
+        "\n"
+      end
     end
   end
+  # rubocop:enable Lint/DuplicateBranch
   # links are returned in [text](link) format
   def wrap_link(node, output)
-    href = node.attribute("href")
-    name = node.attribute("name")
+    href = node.attribute('href')
+    name = node.attribute('name')
     output = output.strip
     # remove double [[ ]]s from linking images
-    if output[0] == "[" && output[-1] == "]"
+    if output[0] == '[' && output[-1] == ']'
       output = output[1, output.length - 2]
       # for linking images, the title of the <a> overrides the title of the <img>
-      if node.attribute("title")
-        output = node.attribute("title").to_s
-      end
+      output = node.attribute('title').to_s if node.attribute('title')
     end
     # if there is no link text, but a title attr
-    if output.empty? && node.attribute("title")
-      output = node.attribute("title").to_s
-    end
+    output = node.attribute('title').to_s if output.empty? && node.attribute('title')
     if href.nil?
-      if !name.nil?
-        output = "[#{output}]"
-      end
+      output = "[#{output}]" unless name.nil?
     else
       href = href.to_s
       if href != output && href != "mailto:#{output}" &&
-          href != "http://#{output}" && href != "https://#{output}"
-        if output.empty?
-          output = href
-        else
-          output = "[#{output}](#{href})"
-        end
+         href != "http://#{output}" && href != "https://#{output}"
+        output = if output.empty?
+                   href
+                 else
+                   "[#{output}](#{href})"
+                 end
       end
     end
     case next_node_name(node)
-      when "h1", "h2", "h3", "h4", "h5", "h6"
-        output += "\n"
+    when 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'
+      output += "\n"
     end
     output
   end
   def image_text(node)
-    if node.attribute("title")
-      "[" + node.attribute("title").to_s + "]"
-    elsif node.attribute("alt")
-      "[" + node.attribute("alt").to_s + "]"
+    if node.attribute('title')
+      "[#{node.attribute('title')}]"
+    elsif node.attribute('alt')
+      "[#{node.attribute('alt')}]"
     else
-      ""
+      ''
     end
   end
+  def next_node_name(node)
+    next_node = node.next_sibling
+    until next_node.nil?
+      break if next_node.element?
+      next_node = next_node.next_sibling
+    end
+    return unless next_node&.element?
+    next_node.name.downcase
+  end
+  def next_node_is_text?(node)
+    !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
+  end
+  def previous_node_name(node)
+    previous_node = node.previous_sibling
+    until previous_node.nil?
+      break if previous_node.element?
+      previous_node = previous_node.previous_sibling
+    end
+    return unless previous_node&.element?
+    previous_node.name.downcase
+  end
+  def previous_node_is_text?(node)
+    !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
+  end
+  # def previous_node_is_not_text?(node)
+  #   return node.previous_sibling.nil? || !node.previous_sibling.text? || node.previous_sibling.text.strip.empty?
+  # end
 end
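
The new `<pre>` handling wraps preformatted text in a marker string and then skips every odd-numbered segment when cleaning whitespace. The sketch below isolates that idea with plain strings. `MARKER` and `collapse_outside_markers` are illustrative names, not the gem's: the gem uses the `DO_NOT_TOUCH_WHITESPACE` constant inside `remove_leading_and_trailing_whitespace`, and only the newline-related substitution is reproduced here.

```ruby
# Illustrative marker with the same value as the gem's DO_NOT_TOUCH_WHITESPACE.
MARKER = '<do-not-touch-whitespace>'

# Hypothetical helper: strip spaces and tabs around newlines everywhere
# except inside MARKER-delimited segments.
def collapse_outside_markers(text)
  text.split(MARKER).each_with_index.map do |segment, index|
    # Even-indexed segments lie outside a marker pair; odd ones lie inside.
    index.even? ? segment.gsub(/[ \t]*\n[ \t]*/, "\n") : segment
  end.join
end

input = "intro   \n   text#{MARKER}  keep\n    this  #{MARKER}   \n  outro"
puts collapse_outside_markers(input).inspect
```

Because the markers always come in pairs, splitting on them yields alternating outside and inside segments, so the index parity is enough to tell them apart. Joining without a separator also drops the markers from the output.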