RubyGems - html2text - Versions diffs - 0.2.0 → 0.3.1 - Mend

html2text 0.2.0 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (37)

checksums.yaml +5 -13
data/CHANGELOG.md +37 -0
data/README.md +16 -11
data/lib/html2text/version.rb +1 -1
data/lib/html2text.rb +113 -26
data/spec/examples/basic.html +21 -21
data/spec/examples/basic.txt +2 -0
data/spec/examples/dom-processing.html +8 -0
data/spec/examples/dom-processing.txt +1 -0
data/spec/examples/empty.html +0 -0
data/spec/examples/empty.txt +0 -0
data/spec/examples/full_email.txt +1 -1
data/spec/examples/huge-msoffice.html +1 -0
data/spec/examples/huge-msoffice.txt +25872 -0
data/spec/examples/invalid.html +4 -0
data/spec/examples/invalid.txt +1 -0
data/spec/examples/msoffice.html +1 -0
data/spec/examples/msoffice.txt +12 -0
data/spec/examples/nested-divs.html +17 -0
data/spec/examples/nested-divs.txt +12 -0
data/spec/examples/newlines.html +50 -0
data/spec/examples/newlines.txt +35 -0
data/spec/examples/non-breaking-spaces.html +1 -0
data/spec/examples/non-breaking-spaces.txt +1 -0
data/spec/examples/pre.html +10 -0
data/spec/examples/pre.txt +8 -0
data/spec/examples/test4.html +1 -1
data/spec/examples/test4.txt +5 -5
data/spec/examples/utf8-example.html +4 -0
data/spec/examples/utf8-example.txt +2 -0
data/spec/examples/windows-1252-example.html +4 -0
data/spec/examples/windows-1252-example.txt +2 -0
data/spec/examples/zero-width-non-joiners.html +1 -0
data/spec/examples/zero-width-non-joiners.txt +1 -0
data/spec/examples_spec.rb +13 -1
data/spec/html2text_spec.rb +21 -0
metadata +96 -34

checksums.yaml CHANGED

@@ -1,15 +1,7 @@
 ---
-!binary "U0hBMQ==":
-  metadata.gz: !binary |-
-    Mjk5MjBiMzliYjc0Y2IyNDRkOThkNTJhNTBjNGFlZTMzNjM5NTU0YQ==
-  data.tar.gz: !binary |-
-    NjlhZDRjZjg4MjhjMjcxNGJkNzcyMDg5Mzk0Y2Q0MjA4MTM2MDJmMg==
+SHA256:
+  metadata.gz: 7d1902161f7964cd95630662cfe326001842de6ae9cfc791216b2a5c2d6fc763
+  data.tar.gz: 4940f60ec3ea46df4a3117aa7c053d1b30b935c3114bddb81e8d6e81e29fccbb
 SHA512:
-  metadata.gz: !binary |-
-    MDAxNDJiYzY3Mjg1NjhiMWMzOGFmM2U5ZjJkNzQ0MGYwMTFiYjM5Njg0N2M0
-    OGU4NGM3ZjYwZGJjYzdmZWFlZWUyMzBkNTI1MzIxZDFhMjIwM2E1ZmI2NDI0
-    ZDk3ODViYmRkZGQ4MWUwNmRkMzFmOTE2NjQ3ZWRkZmQ0M2NlYzI=
-  data.tar.gz: !binary |-
-    OWQ3MzM4ZTkyODA2ZmE0YThjZTA5MjhjYTQ1YzNiYjhjMzJmNWUyMDViNDE5
-    NGMxNGJjZDAwYzZjODJlYWRhOTc5NjY0YmFhNTZlOGFlMzNiNzE1ODE5Njgw
-    MmY0ODNmZDMzZTdkNjNjNTBmNTRmNzBjNTY3NDNhMjg0YjlmZWQ=
+  metadata.gz: cd7354466697fc737c336a6abf38e6c70a9480e7d609de135348d4f8b6ab765832929ccd5687fc88209a75d2f82932421a8a59fe8c0754121680d60a0a5f3496
+  data.tar.gz: 39337ef32bc46adf101c06fc33cc98d8960bf31ce1816fde93dfb1a8a6aa75381b28114a8ff0ad363c5335f2bd61df9766ece0ef8c2b325c28d261e9a3552f7b
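
The removed side of this hunk stores its keys and digests as Base64-wrapped `!binary` YAML scalars, while the added side stores plain hex strings. As a minimal sketch (the variable names are illustrative, and the two literals are copied from the removed lines above), the old values can be decoded with Ruby's core `String#unpack1` to see what they contained:

```ruby
# Values copied from the removed (0.2.0) side of checksums.yaml.
old_key      = "U0hBMQ=="
old_metadata = "Mjk5MjBiMzliYjc0Y2IyNDRkOThkNTJhNTBjNGFlZTMzNjM5NTU0YQ=="

# String#unpack1("m") decodes Base64 using only core Ruby.
algorithm = old_key.unpack1("m")
digest    = old_metadata.unpack1("m")

puts algorithm      # name of the digest algorithm used by the old entry
puts digest         # the hex digest that was wrapped in Base64
puts digest.length  # number of hex characters in that digest
```

Decoding the key shows which algorithm the first old entry used, which makes it easier to compare with the `SHA256` and `SHA512` entries on the added side.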

data/CHANGELOG.md ADDED

@@ -0,0 +1,37 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [0.3.1] - 2019-06-12
+### Security
+- Bumped nokogiri requirement to ~> 1.10.3, resolving [CVE-2019-11068](https://nvd.nist.gov/vuln/detail/CVE-2019-11068)
+  ([#8](https://github.com/soundasleep/html2text_ruby/issues/8))
+## [0.3.0] - 2019-02-15
+### Added
+- Zero-width non-joiners are now stripped ([#5](https://github.com/soundasleep/html2text_ruby/pull/5))
+- Support both UTF-8 and Windows-1252 encoded files
+- Support converting `<pre>` blocks, including whitespace within these blocks
+- MS Office (MsoNormal) documents are now rendered closer to actual render output
+  - Note this assumes that the input MS Office document has standard `MsoNormal` CSS.
+    This component is _not_ designed to try and interpret CSS within an HTML document.
+### Changed
+- Behaviour with multiple and nested `<p>`, `<div>` tags has been improved to be more in line with
+  actual browser render behaviour (see test suite)
+### Fixed
+- Update nokogiri dependency to 1.8.5
+## [0.2.1] - 2017-09-27
+### Fixed
+- Convert non-string input into strings ([#3](https://github.com/soundasleep/html2text_ruby/pull/3))
+[Unreleased]: https://github.com/soundasleep/html2text_ruby/compare/0.3.1...HEAD
+[0.3.1]: https://github.com/soundasleep/html2text_ruby/compare/0.3.0...0.3.1
+[0.3.0]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.3.0
+[0.2.1]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.2.1
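
The 0.3.1 entry above mentions a `~> 1.10.3` requirement on nokogiri. The sketch below shows what that pessimistic constraint accepts, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. It assumes the constraint string is exactly the one quoted in the changelog; the list of candidate versions is made up for illustration.

```ruby
require "rubygems"

# Constraint string as quoted in the changelog entry.
requirement = Gem::Requirement.new("~> 1.10.3")

# Illustrative candidate versions, not taken from the diff.
candidates = %w[1.8.5 1.10.2 1.10.3 1.10.9 1.11.0]

results = candidates.to_h do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

results.each { |version, ok| puts "#{version}: #{ok}" }
```

`~> 1.10.3` is shorthand for `>= 1.10.3` and `< 1.11`, so only patch-level updates within the 1.10 series satisfy it.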

data/README.md CHANGED

@@ -1,7 +1,8 @@
-html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby)
+html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby) [![Total Downloads](https://ruby-gem-downloads-badge.herokuapp.com/html2text?type=total&metric=true)](https://rubygems.org/gems/html2text/)
 ==============
-`html2text` is a very simple script that uses Ruby's DOM methods to load HTML from a string, and then iterates over the resulting DOM to correctly output plain text. For example:
+`html2text` is a very simple gem that uses DOM methods to convert HTML into a format similar to what would be
+rendered by a browser - perfect for places where you need a quick text representation. For example:
 ```html
 <html>
@@ -33,10 +34,12 @@ Hello, World!
 This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
 Even mismatched tags.
 A div
 Another div
 A div
 within a div
 [A link](http://foo.com)
 ```
@@ -44,7 +47,13 @@ See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/198
 ## Installing
-TODO Install the gem, then you can:
+Add [the gem](https://rubygems.org/gems/html2text) into your Gemfile and run `bundle install`:
+```ruby
+gem 'html2text'
+```
+Then you can:
 ```ruby
 require 'html2text'
@@ -54,17 +63,13 @@ text = Html2Text.convert(html)
 ## Tests
-See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with:
-```
-bundle install
-rspec
-```
+See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle && rspec`.
 ## License
-`html2text` is licensed under MIT.
+`html2text` is [licensed under MIT](LICENSE.md).
 ## Other versions
-Also see [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
+1. [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
+2. [actionmailer-html2text](https://github.com/soundasleep/actionmailer-html2text), automatically generate text parts for HTML emails sent with ActionMailer.

data/lib/html2text/version.rb CHANGED

@@ -1,3 +1,3 @@
 class Html2Text
-  VERSION = "0.2.0"
+  VERSION = "0.3.1"
 end

data/lib/html2text.rb CHANGED

@@ -8,6 +8,20 @@ class Html2Text
   end
   def self.convert(html)
+    html = html.to_s
+    if is_office_document?(html)
+      # Emulate the CSS rendering of Office documents
+      html = html.gsub("<p class=MsoNormal>", "<br>")
+        .gsub("<o:p>&nbsp;</o:p>", "<br>")
+        .gsub("<o:p></o:p>", "")
+    end
+    if !html.include?("<html")
+      # Stop Nokogiri from inserting in <p> tags
+      html = "<div>#{html}</div>"
+    end
     html = fix_newlines(replace_entities(html))
     doc = Nokogiri::HTML(html)
@@ -19,18 +33,38 @@ class Html2Text
   end
   def self.replace_entities(text)
-    text.gsub("&nbsp;", " ").gsub("\u00a0", " ")
+    text.gsub("&nbsp;", " ").gsub("\u00a0", " ").gsub("&zwnj;", "")
   end
   def convert
     output = iterate_over(doc)
     output = remove_leading_and_trailing_whitespace(output)
     output = remove_unnecessary_empty_lines(output)
-    output.strip
+    return output.strip
   end
+  DO_NOT_TOUCH_WHITESPACE = "<do-not-touch-whitespace>"
   def remove_leading_and_trailing_whitespace(text)
-    text.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+    # ignore any <pre> blocks, which we don't want to interact with
+    pre_blocks = text.split(DO_NOT_TOUCH_WHITESPACE)
+    output = []
+    pre_blocks.each.with_index do |block, index|
+      if index % 2 == 0
+        output << block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+      else
+        output << block
+      end
+    end
+    output.join("")
+  end
+  private
+  def self.is_office_document?(text)
+    text.include?("urn:schemas-microsoft-com:office")
   end
   def remove_unnecessary_empty_lines(text)
@@ -39,28 +73,28 @@ class Html2Text
   def trimmed_whitespace(text)
     # Replace whitespace characters with a space (equivalent to \s)
-    text.gsub(/[\t\n\f\r ]+/im, " ")
-  end
-  def next_node_name(node)
-    next_node = node.next_sibling
-    while next_node != nil
-      break if next_node.element?
-      next_node = next_node.next_sibling
-    end
-    if next_node && next_node.element?
-      next_node.name.downcase
+    # and force any text encoding into UTF-8
+    if text.valid_encoding?
+      text.gsub(/[\t\n\f\r ]+/im, " ")
+    else
+      text.force_encoding("WINDOWS-1252")
+      return trimmed_whitespace(text.encode("UTF-16be", invalid: :replace, replace: "?").encode('UTF-8'))
     end
   end
   def iterate_over(node)
+    return "\n" if node.name.downcase == "br" && next_node_is_text?(node)
     return trimmed_whitespace(node.text) if node.text?
     if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
       return ""
     end
+    if node.name.downcase == "pre"
+      return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}"
+    end
     output = []
     output << prefix_whitespace(node)
@@ -73,25 +107,34 @@ class Html2Text
     if node.name.downcase == "a"
       output = wrap_link(node, output)
-    end
-    if node.name.downcase == "img"
+    elsif node.name.downcase == "img"
       output = image_text(node)
     end
-    output
+    return output
   end
   def prefix_whitespace(node)
     case node.name.downcase
       when "hr"
-        "---------------------------------------------------------------\n"
+        "\n---------------------------------------------------------------\n"
       when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
-        "\n"
+        "\n\n"
-      when "tr", "p", "div"
+      when "p"
+        "\n\n"
+      when "tr"
         "\n"
+      when "div"
+        if node.parent.name == "div" && (node.parent.text.strip == node.text.strip)
+          ""
+        else
+          "\n"
+        end
       when "td", "th"
         "\t"
@@ -104,17 +147,25 @@ class Html2Text
     case node.name.downcase
       when "h1", "h2", "h3", "h4", "h5", "h6"
         # add another line
-        "\n"
+        "\n\n"
-      when "p", "br"
-        "\n" if next_node_name(node) != "div"
+      when "p"
+        "\n\n"
+      when "br"
+        if next_node_name(node) != "div" && next_node_name(node) != nil
+          "\n"
+        end
       when "li"
         "\n"
       when "div"
-        # add one line only if the next child isn't a div
-        "\n" if next_node_name(node) != "div" && next_node_name(node) != nil
+        if next_node_is_text?(node)
+          "\n"
+        elsif next_node_name(node) != "div" && next_node_name(node) != nil
+          "\n"
+        end
     end
   end
@@ -174,4 +225,40 @@ class Html2Text
       ""
     end
   end
+  def next_node_name(node)
+    next_node = node.next_sibling
+    while next_node != nil
+      break if next_node.element?
+      next_node = next_node.next_sibling
+    end
+    if next_node && next_node.element?
+      next_node.name.downcase
+    end
+  end
+  def next_node_is_text?(node)
+    return !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
+  end
+  def previous_node_name(node)
+    previous_node = node.previous_sibling
+    while previous_node != nil
+      break if previous_node.element?
+      previous_node = previous_node.previous_sibling
+    end
+    if previous_node && previous_node.element?
+      previous_node.name.downcase
+    end
+  end
+  def previous_node_is_text?(node)
+    return !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
+  end
+  # def previous_node_is_not_text?(node)
+  #   return node.previous_sibling.nil? || !node.previous_sibling.text? || node.previous_sibling.text.strip.empty?
+  # end
 end
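
The new `remove_leading_and_trailing_whitespace` in the diff above protects `<pre>` content by wrapping it in a sentinel string and then tidying only the segments outside the sentinels. The following is a stand-alone sketch of that idea, not the gem's code: `SENTINEL` and `tidy_outside_sentinels` are illustrative names, and the sample input is invented.

```ruby
# Illustrative marker; the gem uses its own DO_NOT_TOUCH_WHITESPACE constant.
SENTINEL = "<do-not-touch-whitespace>"

# Splitting on the sentinel yields alternating segments: even indexes are
# ordinary text, odd indexes are the protected blocks between two sentinels.
def tidy_outside_sentinels(text)
  text.split(SENTINEL).each_with_index.map do |block, index|
    if index.even?
      # Trim spaces and tabs around newlines, and spaces around tabs.
      block.gsub(/[ \t]*\n[ \t]*/, "\n").gsub(/ *\t */, "\t")
    else
      block # protected segment, left byte-for-byte intact
    end
  end.join
end

sample = "intro   \n   line\n#{SENTINEL}  keep\n    this  #{SENTINEL}\n  outro"
puts tidy_outside_sentinels(sample)
```

The parity trick relies on sentinels always appearing in pairs. An unpaired sentinel would cause the rest of the text to be treated as protected.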

data/spec/examples/basic.html CHANGED

@@ -1,21 +1,21 @@
-<html>
-<title>Ignored Title</title>
-<body>
-  <h1>Hello, World!</h1>
-  <p>This is some e-mail content.
-  Even though it has whitespace and newlines, the e-mail converter
-  will handle it correctly.
-  <p>Even mismatched tags.</p>
-  <div>A div</div>
-  <div>Another div</div>
-  <div>A div<div>within a div</div></div>
-  <p>Another line<br />Yet another line</p>
-  <a href="http://foo.com">A link</a>
-</body>
-</html>
+<html>
+<title>Ignored Title</title>
+<body>
+  <h1>Hello, World!</h1>
+  <p>This is some e-mail content.
+  Even though it has whitespace and newlines, the e-mail converter
+  will handle it correctly.
+  <p>Even mismatched tags.</p>
+  <div>A div</div>
+  <div>Another div</div>
+  <div>A div<div>within a div</div></div>
+  <p>Another line<br />Yet another line</p>
+  <a href="http://foo.com">A link</a>
+</body>
+</html>
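
Every line in this hunk is removed and re-added with the same visible text. The diff does not say why. One common cause, offered here only as an assumption, is a change in invisible bytes such as line endings. A minimal sketch for checking that on two copies of a line (the inputs below are invented, not read from the gem):

```ruby
# Hypothetical inputs: one line as it might appear in each version.
old_line = "<html>\r\n"
new_line = "<html>\n"

# Report which line terminator, if any, a line ends with.
def line_ending(line)
  return "CRLF" if line.end_with?("\r\n")
  return "LF"   if line.end_with?("\n")
  "none"
end

same_text = old_line.chomp == new_line.chomp

puts same_text              # whether the visible text matches
puts line_ending(old_line)  # terminator of the old copy
puts line_ending(new_line)  # terminator of the new copy
```

Running the same check over the real files from both gem versions would confirm or rule out this explanation.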

data/spec/examples/basic.txt CHANGED

@@ -3,6 +3,7 @@ Hello, World!
 This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
 Even mismatched tags.
 A div
 Another div
 A div
@@ -10,4 +11,5 @@ within a div
 Another line
 Yet another line
 [A link](http://foo.com)

data/spec/examples/dom-processing.html ADDED

@@ -0,0 +1,8 @@
+<html>
+<body>
+<?a
+I am a random piece of code
+?>
+Hello
+</body>
+</html>

data/spec/examples/dom-processing.txt ADDED

@@ -0,0 +1 @@
+Hello

data/spec/examples/empty.html ADDED

File without changes

data/spec/examples/empty.txt ADDED

File without changes

data/spec/examples/full_email.txt CHANGED

@@ -6,7 +6,6 @@ Hi Susan
 Here is your cat report.
 You have found 5 cats less than anyone else
 [Find more cats](http://localhost/cats)
 Down the road
@@ -20,6 +19,7 @@ You're currently finding about
 per day
 [Number of cats found]
 ---------------------------------------------------------------
 Your last cat was found two days ago.