html2text 0.2.0 → 0.3.1

checksums.yaml CHANGED
@@ -1,15 +1,7 @@
1
1
  ---
2
- !binary "U0hBMQ==":
3
- metadata.gz: !binary |-
4
- Mjk5MjBiMzliYjc0Y2IyNDRkOThkNTJhNTBjNGFlZTMzNjM5NTU0YQ==
5
- data.tar.gz: !binary |-
6
- NjlhZDRjZjg4MjhjMjcxNGJkNzcyMDg5Mzk0Y2Q0MjA4MTM2MDJmMg==
2
+ SHA256:
3
+ metadata.gz: 7d1902161f7964cd95630662cfe326001842de6ae9cfc791216b2a5c2d6fc763
4
+ data.tar.gz: 4940f60ec3ea46df4a3117aa7c053d1b30b935c3114bddb81e8d6e81e29fccbb
7
5
  SHA512:
8
- metadata.gz: !binary |-
9
- MDAxNDJiYzY3Mjg1NjhiMWMzOGFmM2U5ZjJkNzQ0MGYwMTFiYjM5Njg0N2M0
10
- OGU4NGM3ZjYwZGJjYzdmZWFlZWUyMzBkNTI1MzIxZDFhMjIwM2E1ZmI2NDI0
11
- ZDk3ODViYmRkZGQ4MWUwNmRkMzFmOTE2NjQ3ZWRkZmQ0M2NlYzI=
12
- data.tar.gz: !binary |-
13
- OWQ3MzM4ZTkyODA2ZmE0YThjZTA5MjhjYTQ1YzNiYjhjMzJmNWUyMDViNDE5
14
- NGMxNGJjZDAwYzZjODJlYWRhOTc5NjY0YmFhNTZlOGFlMzNiNzE1ODE5Njgw
15
- MmY0ODNmZDMzZTdkNjNjNTBmNTRmNzBjNTY3NDNhMjg0YjlmZWQ=
6
+ metadata.gz: cd7354466697fc737c336a6abf38e6c70a9480e7d609de135348d4f8b6ab765832929ccd5687fc88209a75d2f82932421a8a59fe8c0754121680d60a0a5f3496
7
+ data.tar.gz: 39337ef32bc46adf101c06fc33cc98d8960bf31ce1816fde93dfb1a8a6aa75381b28114a8ff0ad363c5335f2bd61df9766ece0ef8c2b325c28d261e9a3552f7b
data/CHANGELOG.md ADDED
@@ -0,0 +1,37 @@
1
+ # Changelog
2
+ All notable changes to this project will be documented in this file.
3
+
4
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
5
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## [Unreleased]
8
+
9
+ ## [0.3.1] - 2019-06-12
10
+ ### Security
11
+ - Bumped nokogiri requirement to ~> 1.10.3, resolving [CVE-2019-11068](https://nvd.nist.gov/vuln/detail/CVE-2019-11068)
12
+ ([#8](https://github.com/soundasleep/html2text_ruby/issues/8))
13
+
14
+ ## [0.3.0] - 2019-02-15
15
+ ### Added
16
+ - Zero-width non-joiners are now stripped ([#5](https://github.com/soundasleep/html2text_ruby/pull/5))
17
+ - Support both UTF-8 and Windows-1252 encoded files
18
+ - Support converting `<pre>` blocks, including whitespace within these blocks
19
+ - MS Office (MsoNormal) documents are now rendered closer to actual render output
20
+ - Note this assumes that the input MS Office document has standard `MsoNormal` CSS.
21
+ This component is _not_ designed to try and interpret CSS within an HTML document.
22
+
23
+ ### Changed
24
+ - Behaviour with multiple and nested `<p>`, `<div>` tags has been improved to be more in line with
25
+ actual browser render behaviour (see test suite)
26
+
27
+ ### Fixed
28
+ - Update nokogiri dependency to 1.8.5
29
+
30
+ ## [0.2.1] - 2017-09-27
31
+ ### Fixed
32
+ - Convert non-string input into strings ([#3](https://github.com/soundasleep/html2text_ruby/pull/3))
33
+
34
+ [Unreleased]: https://github.com/soundasleep/html2text_ruby/compare/0.3.1...HEAD
35
+ [0.3.1]: https://github.com/soundasleep/html2text_ruby/compare/0.3.0...0.3.1
36
+ [0.3.0]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.3.0
37
+ [0.2.1]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.2.1
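The 0.2.1 and 0.3.0 entries above both concern input normalisation that happens before the HTML is parsed. A minimal sketch of that idea follows. The helper name `normalize_input` is hypothetical and is not part of the gem's API, and the sketch handles only the literal `&nbsp;` and `&zwnj;` entities plus the U+00A0 character.

```ruby
# Hypothetical helper for illustration only; the gem does not expose this name.
def normalize_input(input)
  # 0.2.1: non-string input (nil, Integer, ...) is converted with to_s first.
  text = input.to_s

  # Non-breaking spaces, as an entity or as a literal U+00A0, become plain spaces.
  text = text.gsub("&nbsp;", " ").gsub("\u00a0", " ")

  # 0.3.0: the zero-width non-joiner entity is dropped entirely.
  text.gsub("&zwnj;", "")
end

puts normalize_input("a&nbsp;b&zwnj;c")
```

Calling `to_s` first means the later `gsub` calls never see `nil`, so callers can pass whatever a template or database field returned.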
data/README.md CHANGED
@@ -1,7 +1,8 @@
1
- html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby)
1
+ html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby) [![Total Downloads](https://ruby-gem-downloads-badge.herokuapp.com/html2text?type=total&metric=true)](https://rubygems.org/gems/html2text/)
2
2
  ==============
3
3
 
4
- `html2text` is a very simple script that uses Ruby's DOM methods to load HTML from a string, and then iterates over the resulting DOM to correctly output plain text. For example:
4
+ `html2text` is a very simple gem that uses DOM methods to convert HTML into a format similar to what would be
5
+ rendered by a browser - perfect for places where you need a quick text representation. For example:
5
6
 
6
7
  ```html
7
8
  <html>
@@ -33,10 +34,12 @@ Hello, World!
33
34
  This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
34
35
 
35
36
  Even mismatched tags.
37
+
36
38
  A div
37
39
  Another div
38
40
  A div
39
41
  within a div
42
+
40
43
  [A link](http://foo.com)
41
44
  ```
42
45
 
@@ -44,7 +47,13 @@ See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/198
44
47
 
45
48
  ## Installing
46
49
 
47
- TODO Install the gem, then you can:
50
+ Add [the gem](https://rubygems.org/gems/html2text) into your Gemfile and run `bundle install`:
51
+
52
+ ```ruby
53
+ gem 'html2text'
54
+ ```
55
+
56
+ Then you can:
48
57
 
49
58
  ```ruby
50
59
  require 'html2text'
@@ -54,17 +63,13 @@ text = Html2Text.convert(html)
54
63
 
55
64
  ## Tests
56
65
 
57
- See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with:
58
-
59
- ```
60
- bundle install
61
- rspec
62
- ```
66
+ See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle && rspec`.
63
67
 
64
68
  ## License
65
69
 
66
- `html2text` is licensed under MIT.
70
+ `html2text` is [licensed under MIT](LICENSE.md).
67
71
 
68
72
  ## Other versions
69
73
 
70
- Also see [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
74
+ 1. [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
75
+ 2. [actionmailer-html2text](https://github.com/soundasleep/actionmailer-html2text), automatically generate text parts for HTML emails sent with ActionMailer.
@@ -1,3 +1,3 @@
1
1
  class Html2Text
2
- VERSION = "0.2.0"
2
+ VERSION = "0.3.1"
3
3
  end
data/lib/html2text.rb CHANGED
@@ -8,6 +8,20 @@ class Html2Text
8
8
  end
9
9
 
10
10
  def self.convert(html)
11
+ html = html.to_s
12
+
13
+ if is_office_document?(html)
14
+ # Emulate the CSS rendering of Office documents
15
+ html = html.gsub("<p class=MsoNormal>", "<br>")
16
+ .gsub("<o:p>&nbsp;</o:p>", "<br>")
17
+ .gsub("<o:p></o:p>", "")
18
+ end
19
+
20
+ if !html.include?("<html")
21
+ # Stop Nokogiri from inserting in <p> tags
22
+ html = "<div>#{html}</div>"
23
+ end
24
+
11
25
  html = fix_newlines(replace_entities(html))
12
26
  doc = Nokogiri::HTML(html)
13
27
 
@@ -19,18 +33,38 @@ class Html2Text
19
33
  end
20
34
 
21
35
  def self.replace_entities(text)
22
- text.gsub("&nbsp;", " ").gsub("\u00a0", " ")
36
+ text.gsub("&nbsp;", " ").gsub("\u00a0", " ").gsub("&zwnj;", "")
23
37
  end
24
38
 
25
39
  def convert
26
40
  output = iterate_over(doc)
27
41
  output = remove_leading_and_trailing_whitespace(output)
28
42
  output = remove_unnecessary_empty_lines(output)
29
- output.strip
43
+ return output.strip
30
44
  end
31
45
 
46
+ DO_NOT_TOUCH_WHITESPACE = "<do-not-touch-whitespace>"
47
+
32
48
  def remove_leading_and_trailing_whitespace(text)
33
- text.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
49
+ # ignore any <pre> blocks, which we don't want to interact with
50
+ pre_blocks = text.split(DO_NOT_TOUCH_WHITESPACE)
51
+
52
+ output = []
53
+ pre_blocks.each.with_index do |block, index|
54
+ if index % 2 == 0
55
+ output << block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
56
+ else
57
+ output << block
58
+ end
59
+ end
60
+
61
+ output.join("")
62
+ end
63
+
64
+ private
65
+
66
+ def self.is_office_document?(text)
67
+ text.include?("urn:schemas-microsoft-com:office")
34
68
  end
35
69
 
36
70
  def remove_unnecessary_empty_lines(text)
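The hunk above protects `<pre>` content by wrapping it in a sentinel string and skipping whitespace cleanup for every second segment after a split. A standalone sketch of that technique is shown below. The names `SENTINEL` and `clean_outside_sentinels` are assumptions made for illustration; only the sentinel value and the two regular expressions are taken from the diff.

```ruby
# Sentinel wrapped around protected content; same value as DO_NOT_TOUCH_WHITESPACE in the diff.
SENTINEL = "<do-not-touch-whitespace>"

# Hypothetical standalone version of the whitespace cleanup.
def clean_outside_sentinels(text)
  segments = text.split(SENTINEL).each_with_index.map do |block, index|
    if index.even?
      # Outside a protected block: trim spaces and tabs around newlines and around tabs.
      block.gsub(/[ \t]*\n[ \t]*/, "\n").gsub(/ *\t */, "\t")
    else
      # Inside a protected block: keep every character untouched.
      block
    end
  end
  segments.join
end

sample = "intro  \n  text#{SENTINEL}  keep \n  this  #{SENTINEL}  \n  outro"
puts clean_outside_sentinels(sample).inspect
```

The even/odd rule relies on sentinels always appearing in pairs. Because `String#split` keeps a leading empty string, a sentinel at the very start of the text still places the protected content at an odd index.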
@@ -39,28 +73,28 @@ class Html2Text
39
73
 
40
74
  def trimmed_whitespace(text)
41
75
  # Replace whitespace characters with a space (equivalent to \s)
42
- text.gsub(/[\t\n\f\r ]+/im, " ")
43
- end
44
-
45
- def next_node_name(node)
46
- next_node = node.next_sibling
47
- while next_node != nil
48
- break if next_node.element?
49
- next_node = next_node.next_sibling
50
- end
51
-
52
- if next_node && next_node.element?
53
- next_node.name.downcase
76
+ # and force any text encoding into UTF-8
77
+ if text.valid_encoding?
78
+ text.gsub(/[\t\n\f\r ]+/im, " ")
79
+ else
80
+ text.force_encoding("WINDOWS-1252")
81
+ return trimmed_whitespace(text.encode("UTF-16be", invalid: :replace, replace: "?").encode('UTF-8'))
54
82
  end
55
83
  end
56
84
 
57
85
  def iterate_over(node)
86
+ return "\n" if node.name.downcase == "br" && next_node_is_text?(node)
87
+
58
88
  return trimmed_whitespace(node.text) if node.text?
59
89
 
60
90
  if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
61
91
  return ""
62
92
  end
63
93
 
94
+ if node.name.downcase == "pre"
95
+ return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}"
96
+ end
97
+
64
98
  output = []
65
99
 
66
100
  output << prefix_whitespace(node)
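The `trimmed_whitespace` change above falls back to Windows-1252 when a string is not valid in its current encoding. The sketch below illustrates the same fallback under stated assumptions: `to_utf8` is a hypothetical helper name, and it transcodes directly to UTF-8 with `String#encode` rather than going through UTF-16BE as the diff does.

```ruby
# Hypothetical helper: return valid UTF-8, treating invalid input as Windows-1252.
def to_utf8(text)
  return text if text.valid_encoding?

  # Work on a copy so the caller's string is not mutated by force_encoding.
  reinterpreted = text.dup.force_encoding("Windows-1252")

  # Bytes with no Windows-1252 mapping are replaced with "?" instead of raising.
  reinterpreted.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# 0xE9 is "é" in Windows-1252 but is not a valid UTF-8 sequence on its own.
raw = "caf\xE9".b.force_encoding("UTF-8")
puts to_utf8(raw)
```

Unlike the diff, which calls `force_encoding` on its argument in place, this sketch copies the string first, which is a design choice of the example and not a claim about the gem.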
@@ -73,25 +107,34 @@ class Html2Text
73
107
 
74
108
  if node.name.downcase == "a"
75
109
  output = wrap_link(node, output)
76
- end
77
- if node.name.downcase == "img"
110
+ elsif node.name.downcase == "img"
78
111
  output = image_text(node)
79
112
  end
80
113
 
81
- output
114
+ return output
82
115
  end
83
116
 
84
117
  def prefix_whitespace(node)
85
118
  case node.name.downcase
86
119
  when "hr"
87
- "---------------------------------------------------------------\n"
120
+ "\n---------------------------------------------------------------\n"
88
121
 
89
122
  when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
90
- "\n"
123
+ "\n\n"
91
124
 
92
- when "tr", "p", "div"
125
+ when "p"
126
+ "\n\n"
127
+
128
+ when "tr"
93
129
  "\n"
94
130
 
131
+ when "div"
132
+ if node.parent.name == "div" && (node.parent.text.strip == node.text.strip)
133
+ ""
134
+ else
135
+ "\n"
136
+ end
137
+
95
138
  when "td", "th"
96
139
  "\t"
97
140
 
@@ -104,17 +147,25 @@ class Html2Text
104
147
  case node.name.downcase
105
148
  when "h1", "h2", "h3", "h4", "h5", "h6"
106
149
  # add another line
107
- "\n"
150
+ "\n\n"
108
151
 
109
- when "p", "br"
110
- "\n" if next_node_name(node) != "div"
152
+ when "p"
153
+ "\n\n"
154
+
155
+ when "br"
156
+ if next_node_name(node) != "div" && next_node_name(node) != nil
157
+ "\n"
158
+ end
111
159
 
112
160
  when "li"
113
161
  "\n"
114
162
 
115
163
  when "div"
116
- # add one line only if the next child isn't a div
117
- "\n" if next_node_name(node) != "div" && next_node_name(node) != nil
164
+ if next_node_is_text?(node)
165
+ "\n"
166
+ elsif next_node_name(node) != "div" && next_node_name(node) != nil
167
+ "\n"
168
+ end
118
169
  end
119
170
  end
120
171
 
@@ -174,4 +225,40 @@ class Html2Text
174
225
  ""
175
226
  end
176
227
  end
228
+
229
+ def next_node_name(node)
230
+ next_node = node.next_sibling
231
+ while next_node != nil
232
+ break if next_node.element?
233
+ next_node = next_node.next_sibling
234
+ end
235
+
236
+ if next_node && next_node.element?
237
+ next_node.name.downcase
238
+ end
239
+ end
240
+
241
+ def next_node_is_text?(node)
242
+ return !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
243
+ end
244
+
245
+ def previous_node_name(node)
246
+ previous_node = node.previous_sibling
247
+ while previous_node != nil
248
+ break if previous_node.element?
249
+ previous_node = previous_node.previous_sibling
250
+ end
251
+
252
+ if previous_node && previous_node.element?
253
+ previous_node.name.downcase
254
+ end
255
+ end
256
+
257
+ def previous_node_is_text?(node)
258
+ return !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
259
+ end
260
+
261
+ # def previous_node_is_not_text?(node)
262
+ # return node.previous_sibling.nil? || !node.previous_sibling.text? || node.previous_sibling.text.strip.empty?
263
+ # end
177
264
  end
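The preprocessing added at the top of `Html2Text.convert` in this file can be read in isolation from the Nokogiri parsing that follows it. The sketch below reproduces those string-level steps in a standalone method. The name `preprocess` is hypothetical; the namespace check and the three substitutions are copied from the diff.

```ruby
# Hypothetical standalone version of the pre-parse steps in Html2Text.convert.
def preprocess(html)
  html = html.to_s

  # MS Office exports declare this XML namespace; the diff uses it as the detection signal.
  if html.include?("urn:schemas-microsoft-com:office")
    # Approximate the default MsoNormal rendering with line breaks.
    html = html.gsub("<p class=MsoNormal>", "<br>")
    html = html.gsub("<o:p>&nbsp;</o:p>", "<br>")
    html = html.gsub("<o:p></o:p>", "")
  end

  # Fragments without an <html> element are wrapped in a <div>, which per the
  # diff's own comment stops Nokogiri from inserting <p> tags.
  html = "<div>#{html}</div>" unless html.include?("<html")
  html
end

puts preprocess("Hello")
```

The substitutions match exact strings only, so an Office document that writes the attribute as `class="MsoNormal"` would pass through unchanged. This is consistent with the changelog note that the component does not try to interpret CSS.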
@@ -1,21 +1,21 @@
1
- <html>
2
- <title>Ignored Title</title>
3
- <body>
4
- <h1>Hello, World!</h1>
5
-
6
- <p>This is some e-mail content.
7
- Even though it has whitespace and newlines, the e-mail converter
8
- will handle it correctly.
9
-
10
- <p>Even mismatched tags.</p>
11
-
12
- <div>A div</div>
13
- <div>Another div</div>
14
- <div>A div<div>within a div</div></div>
15
-
16
- <p>Another line<br />Yet another line</p>
17
-
18
- <a href="http://foo.com">A link</a>
19
-
20
- </body>
21
- </html>
1
+ <html>
2
+ <title>Ignored Title</title>
3
+ <body>
4
+ <h1>Hello, World!</h1>
5
+
6
+ <p>This is some e-mail content.
7
+ Even though it has whitespace and newlines, the e-mail converter
8
+ will handle it correctly.
9
+
10
+ <p>Even mismatched tags.</p>
11
+
12
+ <div>A div</div>
13
+ <div>Another div</div>
14
+ <div>A div<div>within a div</div></div>
15
+
16
+ <p>Another line<br />Yet another line</p>
17
+
18
+ <a href="http://foo.com">A link</a>
19
+
20
+ </body>
21
+ </html>
@@ -3,6 +3,7 @@ Hello, World!
3
3
  This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
4
4
 
5
5
  Even mismatched tags.
6
+
6
7
  A div
7
8
  Another div
8
9
  A div
@@ -10,4 +11,5 @@ within a div
10
11
 
11
12
  Another line
12
13
  Yet another line
14
+
13
15
  [A link](http://foo.com)
@@ -0,0 +1,8 @@
1
+ <html>
2
+ <body>
3
+ <?a
4
+ I am a random piece of code
5
+ ?>
6
+ Hello
7
+ </body>
8
+ </html>
@@ -0,0 +1 @@
1
+ Hello
File without changes
File without changes
@@ -6,7 +6,6 @@ Hi Susan
6
6
  Here is your cat report.
7
7
 
8
8
  You have found 5 cats less than anyone else
9
-
10
9
  [Find more cats](http://localhost/cats)
11
10
 
12
11
  Down the road
@@ -20,6 +19,7 @@ You're currently finding about
20
19
  per day
21
20
 
22
21
  [Number of cats found]
22
+
23
23
  ---------------------------------------------------------------
24
24
 
25
25
  Your last cat was found two days ago.