html2text 0.2.0 → 0.3.1
- checksums.yaml +5 -13
- data/CHANGELOG.md +37 -0
- data/README.md +16 -11
- data/lib/html2text/version.rb +1 -1
- data/lib/html2text.rb +113 -26
- data/spec/examples/basic.html +21 -21
- data/spec/examples/basic.txt +2 -0
- data/spec/examples/dom-processing.html +8 -0
- data/spec/examples/dom-processing.txt +1 -0
- data/spec/examples/empty.html +0 -0
- data/spec/examples/empty.txt +0 -0
- data/spec/examples/full_email.txt +1 -1
- data/spec/examples/huge-msoffice.html +1 -0
- data/spec/examples/huge-msoffice.txt +25872 -0
- data/spec/examples/invalid.html +4 -0
- data/spec/examples/invalid.txt +1 -0
- data/spec/examples/msoffice.html +1 -0
- data/spec/examples/msoffice.txt +12 -0
- data/spec/examples/nested-divs.html +17 -0
- data/spec/examples/nested-divs.txt +12 -0
- data/spec/examples/newlines.html +50 -0
- data/spec/examples/newlines.txt +35 -0
- data/spec/examples/non-breaking-spaces.html +1 -0
- data/spec/examples/non-breaking-spaces.txt +1 -0
- data/spec/examples/pre.html +10 -0
- data/spec/examples/pre.txt +8 -0
- data/spec/examples/test4.html +1 -1
- data/spec/examples/test4.txt +5 -5
- data/spec/examples/utf8-example.html +4 -0
- data/spec/examples/utf8-example.txt +2 -0
- data/spec/examples/windows-1252-example.html +4 -0
- data/spec/examples/windows-1252-example.txt +2 -0
- data/spec/examples/zero-width-non-joiners.html +1 -0
- data/spec/examples/zero-width-non-joiners.txt +1 -0
- data/spec/examples_spec.rb +13 -1
- data/spec/html2text_spec.rb +21 -0
- metadata +96 -34
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
 ---
-
-  metadata.gz:
-
-  data.tar.gz: !binary |-
-    NjlhZDRjZjg4MjhjMjcxNGJkNzcyMDg5Mzk0Y2Q0MjA4MTM2MDJmMg==
+SHA256:
+  metadata.gz: 7d1902161f7964cd95630662cfe326001842de6ae9cfc791216b2a5c2d6fc763
+  data.tar.gz: 4940f60ec3ea46df4a3117aa7c053d1b30b935c3114bddb81e8d6e81e29fccbb
 SHA512:
-  metadata.gz:
-
-    OGU4NGM3ZjYwZGJjYzdmZWFlZWUyMzBkNTI1MzIxZDFhMjIwM2E1ZmI2NDI0
-    ZDk3ODViYmRkZGQ4MWUwNmRkMzFmOTE2NjQ3ZWRkZmQ0M2NlYzI=
-  data.tar.gz: !binary |-
-    OWQ3MzM4ZTkyODA2ZmE0YThjZTA5MjhjYTQ1YzNiYjhjMzJmNWUyMDViNDE5
-    NGMxNGJjZDAwYzZjODJlYWRhOTc5NjY0YmFhNTZlOGFlMzNiNzE1ODE5Njgw
-    MmY0ODNmZDMzZTdkNjNjNTBmNTRmNzBjNTY3NDNhMjg0YjlmZWQ=
+  metadata.gz: cd7354466697fc737c336a6abf38e6c70a9480e7d609de135348d4f8b6ab765832929ccd5687fc88209a75d2f82932421a8a59fe8c0754121680d60a0a5f3496
+  data.tar.gz: 39337ef32bc46adf101c06fc33cc98d8960bf31ce1816fde93dfb1a8a6aa75381b28114a8ff0ad363c5335f2bd61df9766ece0ef8c2b325c28d261e9a3552f7b
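The checksums change replaces Base64-wrapped digest strings with plain hex and adds a SHA256 section. As a rough illustration (not RubyGems' own code), the sketch below decodes the removed `data.tar.gz` value and shows the hex format of the new entries. The helper names `decode_legacy_digest` and `sha256_hex` are hypothetical.

```ruby
require "digest"

# Digest text copied from the removed 0.2.0 data.tar.gz entry (Base64).
LEGACY_DATA_TAR_GZ = "NjlhZDRjZjg4MjhjMjcxNGJkNzcyMDg5Mzk0Y2Q0MjA4MTM2MDJmMg=="

# Hypothetical helper: undo the Base64 wrapping used by the old file format.
def decode_legacy_digest(base64_text)
  base64_text.unpack1("m") # "m" is Ruby's Base64 unpack directive
end

# Hypothetical helper: hex SHA256 of a byte string, the format of the new entries.
def sha256_hex(bytes)
  Digest::SHA256.hexdigest(bytes)
end

legacy = decode_legacy_digest(LEGACY_DATA_TAR_GZ)
puts legacy                   # a hex string rather than raw bytes
puts legacy.length            # 40 hex characters, the length of a SHA-1 digest
puts sha256_hex("abc").length # 64 hex characters for SHA256
```

To check a downloaded archive against the new entries, `Digest::SHA256.file(path).hexdigest` would produce a value in the same format.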
data/CHANGELOG.md
ADDED
@@ -0,0 +1,37 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [0.3.1] - 2019-06-12
+### Security
+- Bumped nokogiri requirement to ~> 1.10.3, resolving [CVE-2019-11068](https://nvd.nist.gov/vuln/detail/CVE-2019-11068)
+  ([#8](https://github.com/soundasleep/html2text_ruby/issues/8))
+
+## [0.3.0] - 2019-02-15
+### Added
+- Zero-width non-joiners are now stripped ([#5](https://github.com/soundasleep/html2text_ruby/pull/5))
+- Support both UTF-8 and Windows-1252 encoded files
+- Support converting `<pre>` blocks, including whitespace within these blocks
+- MS Office (MsoNormal) documents are now rendered closer to actual render output
+  - Note this assumes that the input MS Office document has standard `MsoNormal` CSS.
+    This component is _not_ designed to try and interpret CSS within an HTML document.
+
+### Changed
+- Behaviour with multiple and nested `<p>`, `<div>` tags has been improved to be more in line with
+  actual browser render behaviour (see test suite)
+
+### Fixed
+- Update nokogiri dependency to 1.8.5
+
+## [0.2.1] - 2017-09-27
+### Fixed
+- Convert non-string input into strings ([#3](https://github.com/soundasleep/html2text_ruby/pull/3))
+
+[Unreleased]: https://github.com/soundasleep/html2text_ruby/compare/0.3.1...HEAD
+[0.3.1]: https://github.com/soundasleep/html2text_ruby/compare/0.3.0...0.3.1
+[0.3.0]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.3.0
+[0.2.1]: https://github.com/soundasleep/html2text_ruby/compare/0.2.1...0.2.1
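The 0.3.1 entry tightens the nokogiri requirement to `~> 1.10.3`. To make concrete which versions that pessimistic constraint admits, here is a small sketch using RubyGems' `Gem::Requirement`. The constant and method names are invented for the example and do not come from the gem.

```ruby
require "rubygems"

# The constraint quoted in the changelog entry.
NOKOGIRI_REQUIREMENT = Gem::Requirement.new("~> 1.10.3")

# Hypothetical helper: does a given version string satisfy the constraint?
def allowed_nokogiri?(version)
  NOKOGIRI_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

%w[1.10.2 1.10.3 1.10.9 1.11.0].each do |version|
  puts "#{version}: #{allowed_nokogiri?(version)}"
end
```

`~> 1.10.3` is shorthand for `>= 1.10.3` together with `< 1.11`, so patch releases are accepted and the next minor release is not.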
data/README.md
CHANGED
@@ -1,7 +1,8 @@
-html2text [](https://travis-ci.org/soundasleep/html2text_ruby)
+html2text [](https://travis-ci.org/soundasleep/html2text_ruby) [](https://rubygems.org/gems/html2text/)
 ==============
 
-`html2text` is a very simple
+`html2text` is a very simple gem that uses DOM methods to convert HTML into a format similar to what would be
+rendered by a browser - perfect for places where you need a quick text representation. For example:
 
 ```html
 <html>
@@ -33,10 +34,12 @@ Hello, World!
 This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
 
 Even mismatched tags.
+
 A div
 Another div
 A div
 within a div
+
 [A link](http://foo.com)
 ```
 
@@ -44,7 +47,13 @@ See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/198
 
 ## Installing
 
-
+Add [the gem](https://rubygems.org/gems/html2text) into your Gemfile and run `bundle install`:
+
+```ruby
+gem 'html2text'
+```
+
+Then you can:
 
 ```ruby
 require 'html2text'
@@ -54,17 +63,13 @@ text = Html2Text.convert(html)
 
 ## Tests
 
-See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with
-
-```
-bundle install
-rspec
-```
+See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with `bundle && rspec`.
 
 ## License
 
-`html2text` is licensed under MIT.
+`html2text` is [licensed under MIT](LICENSE.md).
 
 ## Other versions
 
-
+1. [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
+2. [actionmailer-html2text](https://github.com/soundasleep/actionmailer-html2text), automatically generate text parts for HTML emails sent with ActionMailer.
data/lib/html2text/version.rb
CHANGED
data/lib/html2text.rb
CHANGED
@@ -8,6 +8,20 @@ class Html2Text
   end
 
   def self.convert(html)
+    html = html.to_s
+
+    if is_office_document?(html)
+      # Emulate the CSS rendering of Office documents
+      html = html.gsub("<p class=MsoNormal>", "<br>")
+        .gsub("<o:p>&nbsp;</o:p>", "<br>")
+        .gsub("<o:p></o:p>", "")
+    end
+
+    if !html.include?("<html")
+      # Stop Nokogiri from inserting in <p> tags
+      html = "<div>#{html}</div>"
+    end
+
     html = fix_newlines(replace_entities(html))
     doc = Nokogiri::HTML(html)
 
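The pre-processing added to `convert` can be tried without Nokogiri. The following is a minimal sketch of those string steps only, under the assumption that the middle Office pattern targets the `&nbsp;` entity. `preprocess` and `office_document?` are illustrative names, not the gem's API.

```ruby
# Hypothetical stand-in for the gem's Office detection.
def office_document?(html)
  html.include?("urn:schemas-microsoft-com:office")
end

# Sketch of the string-level steps run before parsing:
# coerce to a String, flatten Office markup, wrap fragments in a <div>.
def preprocess(input)
  html = input.to_s

  if office_document?(html)
    html = html.gsub("<p class=MsoNormal>", "<br>")
               .gsub("<o:p>&nbsp;</o:p>", "<br>") # assumed entity form
               .gsub("<o:p></o:p>", "")
  end

  # Fragments without an <html> tag are wrapped so the parser
  # does not add its own <p> wrapper around bare text.
  html = "<div>#{html}</div>" unless html.include?("<html")
  html
end

puts preprocess(nil)
puts preprocess(42)
puts preprocess("Hello <b>world</b>")
```

Calling `to_s` first is what lets `nil`, numbers and other non-string input pass through, which matches the 0.2.1 changelog note about converting non-string input.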
@@ -19,18 +33,38 @@ class Html2Text
   end
 
   def self.replace_entities(text)
-    text.gsub("&nbsp;", " ").gsub("\u00a0", " ")
+    text.gsub("&nbsp;", " ").gsub("\u00a0", " ").gsub("&zwnj;", "")
   end
 
   def convert
     output = iterate_over(doc)
     output = remove_leading_and_trailing_whitespace(output)
     output = remove_unnecessary_empty_lines(output)
-    output.strip
+    return output.strip
   end
 
+  DO_NOT_TOUCH_WHITESPACE = "<do-not-touch-whitespace>"
+
   def remove_leading_and_trailing_whitespace(text)
-
+    # ignore any <pre> blocks, which we don't want to interact with
+    pre_blocks = text.split(DO_NOT_TOUCH_WHITESPACE)
+
+    output = []
+    pre_blocks.each.with_index do |block, index|
+      if index % 2 == 0
+        output << block.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
+      else
+        output << block
+      end
+    end
+
+    output.join("")
+  end
+
+  private
+
+  def self.is_office_document?(text)
+    text.include?("urn:schemas-microsoft-com:office")
   end
 
   def remove_unnecessary_empty_lines(text)
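The `DO_NOT_TOUCH_WHITESPACE` handling relies on a split-and-alternate trick: once the text is split on a marker that always appears in pairs, even-indexed chunks lie outside the protected regions and odd-indexed chunks lie inside them. A self-contained sketch of that idea follows. The constant and method names here are illustrative and differ from the gem's.

```ruby
# Illustrative marker; the gem uses its own DO_NOT_TOUCH_WHITESPACE constant.
WHITESPACE_MARKER = "<do-not-touch-whitespace>"

# Trim whitespace around newlines and tabs, but only outside marker pairs.
def tidy_outside_markers(text)
  text.split(WHITESPACE_MARKER).each_with_index.map do |block, index|
    if index.even?
      # Outside a protected region: normalise whitespace.
      block.gsub(/[ \t]*\n[ \t]*/, "\n").gsub(/ *\t */, "\t")
    else
      # Inside a protected region (for example <pre> content): keep verbatim.
      block
    end
  end.join
end

sample = "a  \n  b\n#{WHITESPACE_MARKER}  x \n  y#{WHITESPACE_MARKER}\n  c"
puts tidy_outside_markers(sample).inspect
```

The approach assumes the marker text never occurs in the document itself and that markers are never nested; a stray unpaired marker would flip which chunks count as protected.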
@@ -39,28 +73,28 @@ class Html2Text
 
   def trimmed_whitespace(text)
     # Replace whitespace characters with a space (equivalent to \s)
-    text
-
-
-
-
-
-      break if next_node.element?
-      next_node = next_node.next_sibling
-    end
-
-    if next_node && next_node.element?
-      next_node.name.downcase
+    # and force any text encoding into UTF-8
+    if text.valid_encoding?
+      text.gsub(/[\t\n\f\r ]+/im, " ")
+    else
+      text.force_encoding("WINDOWS-1252")
+      return trimmed_whitespace(text.encode("UTF-16be", invalid: :replace, replace: "?").encode('UTF-8'))
     end
   end
 
   def iterate_over(node)
+    return "\n" if node.name.downcase == "br" && next_node_is_text?(node)
+
     return trimmed_whitespace(node.text) if node.text?
 
     if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
       return ""
     end
 
+    if node.name.downcase == "pre"
+      return "\n#{DO_NOT_TOUCH_WHITESPACE}#{node.text}#{DO_NOT_TOUCH_WHITESPACE}"
+    end
+
     output = []
 
     output << prefix_whitespace(node)
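The encoding fallback in `trimmed_whitespace` treats text that is not valid in its declared encoding as Windows-1252 and transcodes it to UTF-8. The sketch below shows the same idea in isolation. Note that it transcodes directly to UTF-8 rather than going through UTF-16BE as the diff does, and `to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: return +text+ unchanged when its bytes are valid in
# its current encoding, otherwise reinterpret them as Windows-1252 and
# transcode to UTF-8.
def to_utf8(text)
  return text if text.valid_encoding?

  text.dup                              # avoid mutating the caller's string
      .force_encoding("WINDOWS-1252")   # relabel the bytes, no conversion yet
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

broken = "caf\xE9" # 0xE9 alone is not valid UTF-8; it is e-acute in Windows-1252
puts broken.valid_encoding?
puts to_utf8(broken)
puts to_utf8(broken).encoding
```

`force_encoding` only changes the label on the bytes, while `encode` performs the actual conversion, which is why both calls are needed.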
@@ -73,25 +107,34 @@ class Html2Text
 
     if node.name.downcase == "a"
       output = wrap_link(node, output)
-
-    if node.name.downcase == "img"
+    elsif node.name.downcase == "img"
       output = image_text(node)
     end
 
-    output
+    return output
   end
 
   def prefix_whitespace(node)
     case node.name.downcase
     when "hr"
-      "---------------------------------------------------------------\n"
+      "\n---------------------------------------------------------------\n"
 
     when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
-      "\n"
+      "\n\n"
 
-    when "
+    when "p"
+      "\n\n"
+
+    when "tr"
       "\n"
 
+    when "div"
+      if node.parent.name == "div" && (node.parent.text.strip == node.text.strip)
+        ""
+      else
+        "\n"
+      end
+
     when "td", "th"
       "\t"
 
@@ -104,17 +147,25 @@ class Html2Text
     case node.name.downcase
     when "h1", "h2", "h3", "h4", "h5", "h6"
       # add another line
-      "\n"
+      "\n\n"
 
-    when "p"
-      "\n"
+    when "p"
+      "\n\n"
+
+    when "br"
+      if next_node_name(node) != "div" && next_node_name(node) != nil
+        "\n"
+      end
 
     when "li"
       "\n"
 
     when "div"
-
-
+      if next_node_is_text?(node)
+        "\n"
+      elsif next_node_name(node) != "div" && next_node_name(node) != nil
+        "\n"
+      end
     end
   end
 
@@ -174,4 +225,40 @@ class Html2Text
       ""
     end
   end
+
+  def next_node_name(node)
+    next_node = node.next_sibling
+    while next_node != nil
+      break if next_node.element?
+      next_node = next_node.next_sibling
+    end
+
+    if next_node && next_node.element?
+      next_node.name.downcase
+    end
+  end
+
+  def next_node_is_text?(node)
+    return !node.next_sibling.nil? && node.next_sibling.text? && !node.next_sibling.text.strip.empty?
+  end
+
+  def previous_node_name(node)
+    previous_node = node.previous_sibling
+    while previous_node != nil
+      break if previous_node.element?
+      previous_node = previous_node.previous_sibling
+    end
+
+    if previous_node && previous_node.element?
+      previous_node.name.downcase
+    end
+  end
+
+  def previous_node_is_text?(node)
+    return !node.previous_sibling.nil? && node.previous_sibling.text? && !node.previous_sibling.text.strip.empty?
+  end
+
+  # def previous_node_is_not_text?(node)
+  #   return node.previous_sibling.nil? || !node.previous_sibling.text? || node.previous_sibling.text.strip.empty?
+  # end
 end
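The new `next_node_name` helper skips over non-element siblings (text, comments) until it reaches an element, then returns that element's lowercased name. To show the traversal without depending on Nokogiri, this sketch uses a small `Struct` as a stand-in node. `FakeNode` and `next_element_name` are hypothetical and only mimic the three methods the walk needs.

```ruby
# Hypothetical stand-in for a DOM node: just a name, an element flag,
# and a pointer to the next sibling.
FakeNode = Struct.new(:name, :element, :next_sibling) do
  def element?
    element
  end
end

# Walk forward through siblings until an element is found.
# Returns its lowercased name, or nil when no element sibling follows.
def next_element_name(node)
  sibling = node.next_sibling
  sibling = sibling.next_sibling while sibling && !sibling.element?
  sibling&.name&.downcase
end

# Simulates <br>, followed by a text node, followed by <DIV>.
div  = FakeNode.new("DIV", true, nil)
text = FakeNode.new("text", false, div)
br   = FakeNode.new("br", true, text)

puts next_element_name(br).inspect
puts next_element_name(div).inspect
```

Returning `nil` at the end of the sibling list is what allows callers in the diff to compare the result against both `"div"` and `nil`.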
data/spec/examples/basic.html
CHANGED
@@ -1,21 +1,21 @@
-<html>
-<title>Ignored Title</title>
-<body>
-<h1>Hello, World!</h1>
-
-<p>This is some e-mail content.
-Even though it has whitespace and newlines, the e-mail converter
-will handle it correctly.
-
-<p>Even mismatched tags.</p>
-
-<div>A div</div>
-<div>Another div</div>
-<div>A div<div>within a div</div></div>
-
-<p>Another line<br />Yet another line</p>
-
-<a href="http://foo.com">A link</a>
-
-</body>
-</html>
+<html>
+<title>Ignored Title</title>
+<body>
+<h1>Hello, World!</h1>
+
+<p>This is some e-mail content.
+Even though it has whitespace and newlines, the e-mail converter
+will handle it correctly.
+
+<p>Even mismatched tags.</p>
+
+<div>A div</div>
+<div>Another div</div>
+<div>A div<div>within a div</div></div>
+
+<p>Another line<br />Yet another line</p>
+
+<a href="http://foo.com">A link</a>
+
+</body>
+</html>
data/spec/examples/basic.txt
CHANGED
@@ -3,6 +3,7 @@ Hello, World!
 This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
 
 Even mismatched tags.
+
 A div
 Another div
 A div
@@ -10,4 +11,5 @@ within a div
 
 Another line
 Yet another line
+
 [A link](http://foo.com)
@@ -0,0 +1 @@
+Hello
File without changes
File without changes
@@ -6,7 +6,6 @@ Hi Susan
 Here is your cat report.
 
 You have found 5 cats less than anyone else
-
 [Find more cats](http://localhost/cats)
 
 Down the road
@@ -20,6 +19,7 @@ You're currently finding about
 per day
 
 [Number of cats found]
+
 ---------------------------------------------------------------
 
 Your last cat was found two days ago.