ghostwriter 0.4.0 → 1.1.0
- checksums.yaml +4 -4
- data/README.md +337 -22
- data/RELEASE_NOTES.md +77 -0
- data/dirt-textify.gemspec +4 -2
- data/lib/ghostwriter/version.rb +1 -1
- data/lib/ghostwriter/writer.rb +133 -44
- metadata +6 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b87477fb4de8fc196eda6ea049eb3f1355ae583f0622c11b98543a67cfe71c8b
+  data.tar.gz: fa4e66701a3326ccceef69b3eb8f8591f8eea337f6f4f37c5b18a820bd120d42
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b5c545e83c7073b3c5923efc3ffcb6ecfd78650c472a482b5ec16fd7ef5533859ce6fb24d32378aafd20c83e35acd558c136856b46f0c7e0373beb7e3191be34
+  data.tar.gz: 8b259323e6653c112885ddf6fe4ee99b31c01e5625ee2fcf22d0259330dc090df11a1b7ee2dc1507c222b5f79c8322717aba9455b9bdbdca04334c0becc5959a
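Note on checksums.yaml: the entries are hex digests of the two archives packed inside the `.gem` file. The following is a minimal sketch of how such a digest could be checked with Ruby's standard `digest` library. `checksum_matches?` is a hypothetical helper name, and the empty string stands in for real archive bytes, which are not part of this diff.

```ruby
require 'digest'

# Hypothetical helper: compares the SHA256 hex digest of some bytes
# against an expected value such as a checksums.yaml entry.
def checksum_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.strip.downcase
end

# SHA256 of the empty string, used here only as a well-known test vector.
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

puts checksum_matches?('', EMPTY_SHA256)         # true
puts checksum_matches?('tampered', EMPTY_SHA256) # false
```

For a real check, the bytes would come from reading `metadata.gz` or `data.tar.gz` in binary mode after unpacking the gem archive.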
data/README.md
CHANGED
@@ -1,14 +1,15 @@
 # Ghostwriter
 
-
+A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
 
-It's sort of like a reverse-markdown.
+It's sort of like a reverse-markdown or a very, very simple screen reader.
 
 ## But Why, Though?
 
-* Spam filters tend to like emails with a plain text alternative
 * Some email clients won't or can’t handle HTML at all
 * Some people explicitly choose plaintext just by preference or accessibility
+* Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people, I will yell
+  at you)
 
 ## Installation
 
@@ -28,26 +29,339 @@ Or install it manually with:
 
 ## Usage
 
-Create a `Ghostwriter::Writer` with the html you want modified
+Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
 
 ```ruby
-html =
+html = <<~HTML
+  <html>
+    <body>
+      <p>This is some text with <a href="tenjin.ca">a link</a></p>
+      <p>It handles other stuff, too.</p>
+      <hr>
+      <h1>Stuff Like</h1>
+      <ul>
+        <li>Images</li>
+        <li>Lists</li>
+        <li>Tables</li>
+        <li>And more</li>
+      </ul>
+    </body>
+  </html>
+HTML
+
+ghostwriter = Ghostwriter::Writer.new
+
+puts ghostwriter.textify(html)
+```
+
+Produces:
+
+```
+This is some text with a link (tenjin.ca)
+
+It handles other stuff, too.
+
+
+----------
+
+-- Stuff Like --
+- Images
+- Lists
+- Tables
+- And more
+```
+
+### Links
+
+Links are converted to the link text followed by the link target in brackets:
+
+```html
+Visit our <a href="https://example.com">Website</a>
+```
+
+Becomes:
+
+```
+Visit our Website (https://example.com)
+```
+
+#### Relative Links
+
+Since emails are wholly distinct from your web address, relative links might break.
+
+To avoid this problem, either use the `<base>` header tag:
+
+```html
+
+<html>
+  <head>
+    <base href="https://www.example.com">
+  </head>
+  <body>
+    Use the base tag to <a href="/contact">expand</a> links.
+  </body>
+</html>
+```
+
+Becomes:
+
+```
+Use the base tag to expand (https://www.example.com/contact) links.
+```
+
+Or you can use the `link_base` configuration:
+
+```ruby
+Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
+```
+
+### Images
+
+Images with alt text are converted:
+
+```html
+<img src="logo.jpg" alt="ACME Anvils" />
+```
+
+Becomes:
+
+```
+ACME Anvils (logo.jpg)
+```
+
+But images lacking alt text or with a presentation ARIA role are ignored:
+
+```html
+<!-- these will just become an empty string -->
+<img src="decoration.jpg">
+<img src="logo.jpg" role="presentation">
+```
+
+And images with data URIs won't include the data portion.
+
+```html
+
+<img src=""
+     alt="Data picture" />
+```
+
+Becomes:
+
+```
+Data picture (embedded)
+```
+
+### Paragraphs and Linebreaks
+
+Paragraphs are padded with a newline at the end. Line break tags add an empty line.
+
+```html
+<p>I would like to propose a toast.</p>
+<p>This meal we enjoy together would be improved by one.</p>
+<br />
+<p>... Plug in the toaster and I'll get the bread.</p>
+```
+
+```
+I would like to propose a toast.
+
+This meal we enjoy together would be improved by one.
+
+
+... Plug in the toaster and I'll get the bread.
 
-
+```
+
+### Headings
+
+Headings are wrapped with a marker per heading level:
+
+```html
+<h1>Dog Maintenance and Repair</h1>
+<h2>Food Input Port</h2>
+<h3>Exhaust Port Considerations</h3>
+```
+
+Becomes:
+
+```
+-- Dog Maintenance and Repair --
+---- Food Input Port ----
+------ Exhaust Port Considerations ------
+```
+
+The `<header>` tag is treated like an `<h1>` tag.
+
+### Lists
+
+Lists are converted, too. They are padded with newlines and are given simple markers:
+
+```html
+
+<ul>
+  <li>Planes</li>
+  <li>Trains</li>
+  <li>Automobiles</li>
+</ul>
+<ol>
+  <li>I get knocked down</li>
+  <li>I get up again</li>
+  <li>Never gonna keep me down</li>
+</ol>
+```
 
-
+Becomes:
 
 ```
+- Planes
+- Trains
+- Automobiles
 
-
+1. I get knocked down
+2. I get up again
+3. Never gonna keep me down
+```
+
+### Tables
+
+Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
+table is purely presentational, mark it with `role="presentation"`. See below for details.
+
+For real data tables, Ghostwriter tries to maintain table structure for simple tables:
+
+```html
+
+<table>
+  <thead>
+    <tr>
+      <th>Ship</th>
+      <th>Captain</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Enterprise</td>
+      <td>Jean-Luc Picard</td>
+    </tr>
+    <tr>
+      <td>TARDIS</td>
+      <td>The Doctor</td>
+    </tr>
+    <tr>
+      <td>Planet Express Ship</td>
+      <td>Turanga Leela</td>
+    </tr>
+  </tbody>
+</table>
+```
+
+Becomes:
+
+```
+| Ship                | Captain         |
+|---------------------|-----------------|
+| Enterprise          | Jean-Luc Picard |
+| TARDIS              | The Doctor      |
+| Planet Express Ship | Turanga Leela   |
+```
+
+### Customizing Output
+
+Ghostwriter has some constructor options to customize output.
+
+You can set heading markers.
 
 ```ruby
-html =
+html = <<~HTML
+  <h1>Emergency Cat Procedures</h1>
+HTML
 
-Ghostwriter::Writer.new(
+writer = Ghostwriter::Writer.new(heading_marker: '#')
+
+puts writer.textify(html)
+```
 
-
+Produces:
 
+```
+# Emergency Cat Procedures #
+```
+
+You can also set list item markers. Ordered markers can be anything that responds to `#next` (eg. any `Enumerator`)
+
+```ruby
+html = <<~HTML
+  <ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
+  <ul><li>Teapot</li><li>Kettle</li></ul>
+HTML
+
+writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
+
+puts writer.textify(html)
+```
+
+Produces:
+
+```
+a. Mercury
+b. Venus
+c. Mars
+
+* Teapot
+* Kettle
+```
+
+And tables can be customized:
+
+```ruby
+writer = Ghostwriter::Writer.new(table_row: '.',
+                                 table_column: '#',
+                                 table_corner: '+')
+
+puts writer.textify <<~HTML
+  <table>
+    <thead>
+      <tr><th>Moon</th><th>Portfolio</th></tr>
+    </thead>
+    <tbody>
+      <tr><td>Phobos</td><td>Fear & Panic</td></tr>
+      <tr><td>Deimos</td><td>Dread and Terror</td></tr>
+    </tbody>
+  </table>
+HTML
+```
+
+Produces:
+
+```
+# Moon   # Portfolio        #
++........+..................+
+# Phobos # Fear & Panic     #
+# Deimos # Dread and Terror #
+
+```
+
+#### Presentation ARIA Role
+
+Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
+
+```html
+
+<table role="presentation">
+  <tr>
+    <td>The table is a lie</td>
+  </tr>
+</table>
+<ul role="presentation">
+  <li>No such list</li>
+</ul>
+```
+
+Becomes:
+
+```
+The table is a lie
+No such list
 ```
 
 ### Mail Gem Example
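Note on ordered list markers: the README text added in this hunk says an ordered marker can be anything that responds to `#next`. The following is a minimal sketch of that idea in plain Ruby, without the gem. `number_items` is a hypothetical helper written for illustration, not part of Ghostwriter's API; it only mirrors the marker-stepping loop visible in the writer.rb changes.

```ruby
# Hypothetical helper (not part of Ghostwriter): prefixes each item with a
# marker, then steps the marker forward with #next for the following item.
def number_items(items, marker, after_marker: '.')
  items.map do |item|
    line = "#{marker}#{after_marker} #{item}"
    marker = marker.next # String#next and Integer#next both return a successor
    line
  end
end

puts number_items(%w[Mercury Venus Mars], 'a') # a. / b. / c.
puts number_items(%w[Mercury Venus Mars], '1') # 1. / 2. / 3.
puts number_items(%w[Mercury Venus Mars], 9)   # 9. / 10. / 11.
```

String and Integer markers are the cases shown here. Whether another object works as a marker depends on its `#next` returning something that interpolates sensibly via `#to_s`.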
@@ -58,9 +372,10 @@ through Ghostwriter:
 ```ruby
 require 'mail'
 
-html
+html = 'My email and a <a href="https://tenjin.ca">link</a>'
+ghostwriter = Ghostwriter::Writer.new
 
-
+Mail.deliver do
   to 'bob@example.com'
   from 'dot@example.com'
   subject 'Using Ghostwriter with Mail'
@@ -71,7 +386,7 @@ mail = Mail.deliver do
   end
 
   text_part do
-    body
+    body ghostwriter.textify(html)
   end
 end
 
@@ -90,19 +405,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
 can also run `bin/console` for an interactive prompt that will allow you to experiment.
 
 #### Local Install
-To install this gem onto your local machine only, run
 
-
+To install this gem onto your local machine only, run
+
+`bundle exec rake install`
 
 #### Gem Release
+
 To release a gem to the world at large
 
-
-
-
-
-and push the `.gem` file to [rubygems.org](https://rubygems.org).
-3. Do a wee dance
+1. Update the version number in `version.rb`,
+2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
+   the `.gem` file to [rubygems.org](https://rubygems.org).
+3. Do a wee dance
 
 ## License
 
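Note on relative links: the README's "Relative Links" section describes prefixing relative targets with a base URL. The following is a rough standalone sketch of that resolution step using Ruby's standard `uri` library. `resolve_link` and its arguments are hypothetical names chosen for illustration. It follows the shape of the link-target code in the writer.rb diff but is not the gem's public API.

```ruby
require 'uri'

# Hypothetical standalone version of the link-target logic: absolute URLs
# pass through unchanged, relative ones are prefixed with the base.
def resolve_link(href, base)
  uri = URI(href)
  uri.absolute? ? uri.to_s : base + uri.to_s
rescue URI::InvalidURIError
  # Unparseable targets fall back to the raw value, minus a tel:/mailto: prefix.
  href.sub(/\A(tel|mailto):/, '').strip
end

puts resolve_link('/contact', 'https://www.example.com')
# https://www.example.com/contact
puts resolve_link('https://example.com/about', 'https://www.example.com')
# https://example.com/about
puts resolve_link('tel:+1 800 555 0199', 'https://www.example.com')
# +1 800 555 0199 (the spaces make URI() raise, so the fallback applies)
```

Plain string concatenation is used for the base, as in the diff. It assumes the base has no trailing slash when the relative path starts with one.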
data/RELEASE_NOTES.md
CHANGED
@@ -1,5 +1,82 @@
 # Release Notes
 
+## 1.1.0 (2021-03-23)
+
+### Major
+
+* none
+
+### Minor
+
+* Added customization for headings
+* Headings now marked more for higher order headings
+* Added customization for list markers
+* Added customization for table markers
+* Writer is now immutable
+
+### Bugfixes
+
+* none
+
+## 1.0.1 (2021-03-22)
+
+### Major
+
+* none
+
+### Minor
+
+* Updated README
+
+### Bugfixes
+
+* Fixed hr padding behaviour
+
+## 1.0.0 (2021-03-21)
+
+### Major
+
+* Moved `link_base` parameter to constructor
+* Moved input HTML parameter to `#textify`
+
+### Minor
+
+* Treats tables and lists with role="presentation" as simple containers
+* Now handles ordered and unordered lists
+* Images are now replaced with their alt text
+
+### Bugfixes
+
+* none
+
+## 0.4.2 (2021-03-17)
+
+### Major
+
+* none
+
+### Minor
+
+* none
+
+### Bugfixes
+
+* Works with links using `tel:` and `mailto:` schemas.
+
+## 0.4.1 (2021-03-17)
+
+### Major
+
+* none
+
+### Minor
+
+* No longer provides link target in brackets after link text when they are the same
+
+### Bugfixes
+
+* Added explicit testing for HTML entity interpretation
+
 ## 0.4.0 (2021-03-16)
 
 ### Major
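Note on "Writer is now immutable": in the writer.rb diff, the constructor ends with a call to `freeze`. The following is a small sketch of that pattern on a stand-in class. `FrozenConfig` and its setter are hypothetical and exist only to show what happens when something tries to reassign state afterwards.

```ruby
# Hypothetical stand-in for a configuration object. Only the
# freeze-in-initialize pattern is taken from the diff.
class FrozenConfig
  attr_reader :heading_marker

  def initialize(heading_marker: '--')
    @heading_marker = heading_marker

    freeze # instance variables cannot be reassigned after this point
  end

  # Deliberately present to demonstrate the failure mode.
  def heading_marker=(value)
    @heading_marker = value
  end
end

config = FrozenConfig.new(heading_marker: '#')

begin
  config.heading_marker = '='
rescue FrozenError => e
  puts "rejected: #{e.class}"
end

puts config.heading_marker # still '#'
```

`freeze` is shallow: it stops instance variables from being reassigned, but a mutable string held by the object could still be modified in place unless it is frozen too.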
data/dirt-textify.gemspec
CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
   spec.authors = ['Robin Miller']
   spec.email = ['robin@tenjin.ca']
 
-  spec.summary = '
+  spec.summary = 'Converts HTML to plain text'
   spec.description = <<~DESC
-
+    Converts HTML to plain text, preserving as much legibility and functionality as possible.
+
+    Ideal for providing a plaintext multipart segment of email messages.
   DESC
   spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
   spec.license = 'MIT'
data/lib/ghostwriter/version.rb
CHANGED
data/lib/ghostwriter/writer.rb
CHANGED
@@ -3,99 +3,188 @@
 module Ghostwriter
   # Main Ghostwriter converter object.
   class Writer
-
-
+    attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
+
+    # Creates a new ghostwriter
+    #
+    # @param [String] link_base the url to prefix relative links with
+    def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
+                   table_column: '|', table_row: '-', table_corner: '|')
+      @link_base = link_base
+      @heading_marker = heading_marker
+      @ul_marker = ul_marker
+      @ol_marker = ol_marker
+      @table_column = table_column
+      @table_row = table_row
+      @table_corner = table_corner
+
+      freeze
     end
 
     # Strips HTML down to plain text.
     #
-    # @param
-
-
+    # @param html [String] the HTML to be convert to text
+    #
+    # @return converted text
+    def textify(html)
+      doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
 
-      doc
+      doc.search('style, script').remove
 
-      doc
-      doc
+      replace_anchors(doc)
+      replace_images(doc)
+
+      simple_replace(doc, '*[role="presentation"]', "\n")
 
-      replace_anchors(doc, link_base)
       replace_headers(doc)
-
+      replace_lists(doc)
+      replace_tables(doc)
 
-      simple_replace(doc, 'hr', "\n----------\n")
+      simple_replace(doc, 'hr', "\n----------\n\n")
       simple_replace(doc, 'br', "\n")
+      simple_replace(doc, 'p', "\n\n")
 
-
-      #   link_node.inner_html = link_node.inner_html + "\n\n"
-      # end
-
-      # trim, but only single-space character
-      doc.text.gsub(/^ +| +$/, '')
+      normalize_lines(doc)
     end
 
     private
 
-    def
-
+    def normalize_lines(doc)
+      doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
     end
 
-    def replace_anchors(doc
+    def replace_anchors(doc)
+      doc.search('a').each do |link_node|
+        href = get_link_target(link_node, get_link_base(doc))
+
+        link_node.inner_html = if link_matches(href, link_node.inner_html)
+                                 href.to_s
+                               else
+                                 "#{ link_node.inner_html } (#{ href })"
+                               end
+      end
+    end
+
+    def link_matches(first, second)
+      first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
+    end
+
+    def get_link_base(doc)
       # <base> node is unique by W3C spec
-
-      base_url = base ? base['href'] : link_base
+      base_node = doc.search('base').first
 
-
-
-      href = base_url + href.to_s unless href.absolute?
+      base_node ? base_node['href'] : @link_base
+    end
 
-
+    def get_link_target(link_node, base)
+      href = URI(link_node['href'])
+      if href.absolute?
+        href
+      else
+        base + href.to_s
       end
+    rescue URI::InvalidURIError
+      link_node['href'].gsub(/^(tel|mailto):/, '').strip
     end
 
     def replace_headers(doc)
-      doc.search('header, h1
-        node.
+      doc.search('header, h1').each do |node|
+        node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
+                       .squeeze(' '))
+      end
+
+      (2..6).each do |n|
+        doc.search("h#{ n }").each do |node|
+          node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
+                         .squeeze(' '))
+        end
+      end
+    end
+
+    def replace_images(doc)
+      doc.search('img[role=presentation]').remove
+
+      doc.search('img').each do |img_node|
+        src = img_node['src']
+        alt = img_node['alt']
+
+        src = 'embedded' if src.start_with? 'data:'
+
+        img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
       end
     end
 
-    def
+    def replace_lists(doc)
+      doc.search('ol').each do |list_node|
+        replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
+      end
+
+      doc.search('ul').each do |list_node|
+        replace_list_items(list_node, @ul_marker)
+      end
+
+      doc.search('ul, ol').each do |list_node|
+        list_node.replace("#{ list_node.inner_html }\n")
+      end
+    end
+
+    def replace_list_items(list_node, marker, after_marker: '', increment: false)
+      list_node.search('./li').each do |list_item|
+        list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
+
+        marker = marker.next if increment
+      end
+    end
+
+    def replace_tables(doc)
       doc.css('table').each do |table|
-
-
-            node.inner_html.length
-          end
-        end
+        # remove whitespace between nodes
+        table.search('//text()[normalize-space()=""]').remove
 
-        column_sizes =
+        column_sizes = calculate_column_sizes(table)
 
         table.search('./thead/tr', './tbody/tr', './tr').each do |row|
           replace_table_nodes(row, column_sizes)
 
-          row.
+          row.replace("#{ row.inner_html }#{ @table_column }\n")
         end
 
-        table
-
+        add_table_header_underline(table, column_sizes)
+
+        table.replace("\n#{ table.inner_html }\n")
+      end
+    end
 
-
+    def calculate_column_sizes(table)
+      column_sizes = table.search('tr').collect do |row|
+        row.search('th', 'td').collect do |node|
+          node.text.length
         end
+      end
+
+      column_sizes.transpose.collect(&:max)
+    end
+
+    def add_table_header_underline(table, column_sizes)
+      table.search('./thead').each do |thead|
+        lines = column_sizes.collect { |len| @table_row * (len + 2) }
+        underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
 
-
+        thead.replace("#{ thead.inner_html }#{ underline_row }\n")
       end
     end
 
     def replace_table_nodes(row, column_sizes)
       row.search('th', 'td').each_with_index do |node, i|
-        new_content =
+        new_content = node.text.ljust(column_sizes[i] + 1)
 
-        #
-        node.inner_html = new_content.ljust(column_sizes[i] + 2)
+        node.replace("#{ @table_column } #{ new_content }")
       end
     end
 
     def simple_replace(doc, tag, replacement)
       doc.search(tag).each do |node|
-        node.replace(replacement)
+        node.replace(node.inner_html + replacement)
       end
     end
   end
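Note on table layout: the new `calculate_column_sizes` collects cell lengths per row, transposes them into columns, and takes each column's maximum. The following is a simplified sketch of that sizing step plus the padding and header underline, working on plain arrays of strings instead of Nokogiri nodes. `column_sizes` and `render_table` are hypothetical names used only for this illustration.

```ruby
# Illustrative only: rows are arrays of strings rather than HTML nodes.
# Returns the widest cell length for each column.
def column_sizes(rows)
  rows.map { |row| row.map(&:length) }.transpose.map(&:max)
end

# Pads every cell to its column width and underlines the header row.
def render_table(header, body, column: '|', row: '-', corner: '|')
  sizes = column_sizes([header] + body)

  format_row = lambda do |cells|
    padded = cells.each_with_index.map { |cell, i| " #{cell.ljust(sizes[i])} " }
    "#{column}#{padded.join(column)}#{column}"
  end

  # Each underline segment spans the column width plus its two padding spaces.
  underline = "#{corner}#{sizes.map { |len| row * (len + 2) }.join(corner)}#{corner}"

  [format_row.call(header), underline, *body.map(&format_row)].join("\n")
end

puts render_table(%w[Moon Portfolio],
                  [['Phobos', 'Fear & Panic'], ['Deimos', 'Dread and Terror']],
                  column: '#', row: '.', corner: '+')
```

`Array#transpose` raises an `IndexError` when rows have different lengths, so this approach only suits regular tables with the same number of cells in every row.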
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: ghostwriter
 version: !ruby/object:Gem::Version
-  version:
+  version: 1.1.0
 platform: ruby
 authors:
 - Robin Miller
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2021-03-
+date: 2021-03-23 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -94,9 +94,10 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.10'
-description:
+description: |
+  Converts HTML to plain text, preserving as much legibility and functionality as possible.
 
-
+  Ideal for providing a plaintext multipart segment of email messages.
 email:
 - robin@tenjin.ca
 executables: []
@@ -142,5 +143,5 @@ requirements: []
 rubygems_version: 3.1.2
 signing_key:
 specification_version: 4
-summary:
+summary: Converts HTML to plain text
 test_files: []