ghostwriter 0.4.2 → 1.2.1
- checksums.yaml +4 -4
- data/README.md +308 -80
- data/RELEASE_NOTES.md +65 -0
- data/dirt-textify.gemspec +8 -6
- data/lib/ghostwriter/version.rb +1 -1
- data/lib/ghostwriter/writer.rb +105 -41
- metadata +16 -15
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: ccebf53b35ad212e0c7e353a3cb4d79e1f91c63cec6490c5fff7ff56b72eed70
|
4
|
+
data.tar.gz: 5b3d36839dcef24d80605421970eb39e17efe3ec24d3126d753ac7a5039512a5
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: d6039d9d2b3c1d6606da533c108c8087faa8b985ac0bf6adac7bc1ad4310cc601f76840ff32e83255847bedbd77dbc6cc280054f6eb4975dbe815a98b2a07373
|
7
|
+
data.tar.gz: d005669ca03f3ff465c351df0e47eba0372ca6061057102e14b0d879ddb0c5191a88bfaa389fa82a865787e6ff3b5d3fb351ae8b6351f108aef404a3bd66dd7a
|
data/README.md
CHANGED
@@ -1,14 +1,19 @@
|
|
1
1
|
# Ghostwriter
|
2
2
|
|
3
|
-
|
3
|
+
A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
|
4
4
|
|
5
|
-
It's sort of like a reverse-markdown.
|
5
|
+
It's sort of like a reverse-markdown or a *very* simple screen reader.
|
6
6
|
|
7
7
|
## But Why, Though?
|
8
8
|
|
9
|
-
*
|
10
|
-
* Some
|
11
|
-
*
|
9
|
+
* Some email clients won't or can’t offer HTML support.
|
10
|
+
* Some people explicitly choose plaintext for accessibility or just plain preference.
|
11
|
+
* Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people,
|
12
|
+
not only might you be
|
13
|
+
[breaking](https://fightspam.gc.ca)
|
14
|
+
[various](https://gdpr.eu/)
|
15
|
+
[laws](https://www.ftc.gov/tips-advice/business-center/guidance/can-spam-act-compliance-guide-business),
|
16
|
+
I will also personally curse you)
|
12
17
|
|
13
18
|
## Installation
|
14
19
|
|
@@ -28,111 +33,234 @@ Or install it manually with:
|
|
28
33
|
|
29
34
|
## Usage
|
30
35
|
|
31
|
-
Create a `Ghostwriter::Writer` with the html you want modified
|
36
|
+
Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
|
32
37
|
|
33
38
|
```ruby
|
34
|
-
html =
|
39
|
+
html = <<~HTML
|
40
|
+
<html>
|
41
|
+
<body>
|
42
|
+
<p>This is some text with <a href="tenjin.ca">a link</a></p>
|
43
|
+
<p>It handles other stuff, too.</p>
|
44
|
+
<hr>
|
45
|
+
<h1>Stuff Like</h1>
|
46
|
+
<ul>
|
47
|
+
<li>Images</li>
|
48
|
+
<li>Lists</li>
|
49
|
+
<li>Tables</li>
|
50
|
+
<li>And more</li>
|
51
|
+
</ul>
|
52
|
+
</body>
|
53
|
+
</html>
|
54
|
+
HTML
|
55
|
+
|
56
|
+
ghostwriter = Ghostwriter::Writer.new
|
35
57
|
|
36
|
-
|
58
|
+
puts ghostwriter.textify(html)
|
37
59
|
```
|
60
|
+
|
38
61
|
Produces:
|
62
|
+
|
39
63
|
```
|
40
|
-
This is some
|
64
|
+
This is some text with a link (tenjin.ca)
|
41
65
|
|
42
|
-
|
66
|
+
It handles other stuff, too.
|
67
|
+
|
68
|
+
|
69
|
+
----------
|
70
|
+
|
71
|
+
-- Stuff Like --
|
72
|
+
- Images
|
73
|
+
- Lists
|
74
|
+
- Tables
|
75
|
+
- And more
|
43
76
|
```
|
44
77
|
|
45
78
|
### Links
|
46
79
|
|
47
80
|
Links are converted to the link text followed by the link target in brackets:
|
48
81
|
|
49
|
-
```
|
50
|
-
|
51
|
-
Ghostwriter::Writer.new(html).textify
|
82
|
+
```html
|
83
|
+
Visit our <a href="https://example.com">Website</a>
|
52
84
|
```
|
53
85
|
|
54
|
-
|
86
|
+
Becomes:
|
87
|
+
|
55
88
|
```
|
56
89
|
Visit our Website (https://example.com)
|
57
90
|
```
|
58
91
|
|
59
92
|
#### Relative Links
|
93
|
+
|
60
94
|
Since emails are wholly distinct from your web address, relative links might break.
|
61
95
|
|
62
96
|
To avoid this problem, either use the `<base>` header tag:
|
63
97
|
|
64
|
-
```
|
65
|
-
html = <<~HTML
|
66
|
-
<html>
|
67
|
-
<head>
|
68
|
-
<base href="https://www.example.com/">
|
69
|
-
</head>
|
70
|
-
<body>
|
71
|
-
Relative links get <a href="/contact">expanded</a> using the head's base tag.
|
72
|
-
</body>
|
73
|
-
</html>
|
74
|
-
HTML
|
98
|
+
```html
|
75
99
|
|
76
|
-
|
100
|
+
<html>
|
101
|
+
<head>
|
102
|
+
<base href="https://www.example.com">
|
103
|
+
</head>
|
104
|
+
<body>
|
105
|
+
Use the base tag to <a href="/contact">expand</a> links.
|
106
|
+
</body>
|
107
|
+
</html>
|
77
108
|
```
|
78
|
-
|
109
|
+
|
110
|
+
Becomes:
|
111
|
+
|
79
112
|
```
|
80
|
-
|
113
|
+
Use the base tag to expand (https://www.example.com/contact) links.
|
81
114
|
```
|
82
115
|
|
83
|
-
Or you can use the `link_base`
|
116
|
+
Or you can use the `link_base` configuration:
|
117
|
+
|
84
118
|
```ruby
|
85
|
-
|
119
|
+
Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
|
120
|
+
```
|
121
|
+
|
122
|
+
### Images
|
86
123
|
|
87
|
-
|
124
|
+
Images with alt text are converted:
|
125
|
+
|
126
|
+
```html
|
127
|
+
<img src="logo.jpg" alt="ACME Anvils" />
|
88
128
|
```
|
89
129
|
|
90
|
-
|
130
|
+
Becomes:
|
131
|
+
|
91
132
|
```
|
92
|
-
|
133
|
+
ACME Anvils (logo.jpg)
|
93
134
|
```
|
94
135
|
|
95
|
-
|
96
|
-
Tables are often used email structuring because support for more modern CSS is inconsistent.
|
136
|
+
But images lacking alt text or with a presentation ARIA role are ignored:
|
97
137
|
|
98
|
-
|
138
|
+
```html
|
139
|
+
<!-- these will just become an empty string -->
|
140
|
+
<img src="decoration.jpg">
|
141
|
+
<img src="logo.jpg" role="presentation">
|
142
|
+
```
|
99
143
|
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
<head>
|
104
|
-
<base href="https://www.example.com/">
|
105
|
-
</head>
|
106
|
-
<body>
|
107
|
-
<table>
|
108
|
-
<thead>
|
109
|
-
<tr>
|
110
|
-
<th>Ship</th>
|
111
|
-
<th>Captain</th>
|
112
|
-
</tr>
|
113
|
-
</thead>
|
114
|
-
<tbody>
|
115
|
-
<tr>
|
116
|
-
<td>Enterprise</td>
|
117
|
-
<td>Jean-Luc Picard</td>
|
118
|
-
</tr>
|
119
|
-
<tr>
|
120
|
-
<td>TARDIS</td>
|
121
|
-
<td>The Doctor</td>
|
122
|
-
</tr>
|
123
|
-
<tr>
|
124
|
-
<td>Planet Express Ship</td>
|
125
|
-
<td>Turanga Leela</td>
|
126
|
-
</tr>
|
127
|
-
</tbody>
|
128
|
-
</table>
|
129
|
-
</body>
|
130
|
-
</html>
|
131
|
-
HTML
|
144
|
+
And images with data URIs won't include the data portion.
|
145
|
+
|
146
|
+
```html
|
132
147
|
|
133
|
-
|
148
|
+
<img src=""
|
149
|
+
alt="Data picture" />
|
134
150
|
```
|
135
|
-
|
151
|
+
|
152
|
+
Becomes:
|
153
|
+
|
154
|
+
```
|
155
|
+
Data picture (embedded)
|
156
|
+
```
|
157
|
+
|
158
|
+
### Paragraphs and Linebreaks
|
159
|
+
|
160
|
+
Paragraphs are padded with a newline at the end. Line break tags add an empty line.
|
161
|
+
|
162
|
+
```html
|
163
|
+
<p>I would like to propose a toast.</p>
|
164
|
+
<p>This meal we enjoy together would be improved by one.</p>
|
165
|
+
<br />
|
166
|
+
<p>... Plug in the toaster and I'll get the bread.</p>
|
167
|
+
```
|
168
|
+
|
169
|
+
```
|
170
|
+
I would like to propose a toast.
|
171
|
+
|
172
|
+
This meal we enjoy together would be improved by one.
|
173
|
+
|
174
|
+
|
175
|
+
... Plug in the toaster and I'll get the bread.
|
176
|
+
|
177
|
+
```
|
178
|
+
|
179
|
+
### Headings
|
180
|
+
|
181
|
+
Headings are wrapped with a marker per heading level:
|
182
|
+
|
183
|
+
```html
|
184
|
+
<h1>Dog Maintenance and Repair</h1>
|
185
|
+
<h2>Food Input Port</h2>
|
186
|
+
<h3>Exhaust Port Considerations</h3>
|
187
|
+
```
|
188
|
+
|
189
|
+
Becomes:
|
190
|
+
|
191
|
+
```
|
192
|
+
-- Dog Maintenance and Repair --
|
193
|
+
---- Food Input Port ----
|
194
|
+
------ Exhaust Port Considerations ------
|
195
|
+
```
|
196
|
+
|
197
|
+
The `<header>` tag is treated like an `<h1>` tag.
|
198
|
+
|
199
|
+
### Lists
|
200
|
+
|
201
|
+
Lists are converted, too. They are padded with newlines and are given simple markers:
|
202
|
+
|
203
|
+
```html
|
204
|
+
|
205
|
+
<ul>
|
206
|
+
<li>Planes</li>
|
207
|
+
<li>Trains</li>
|
208
|
+
<li>Automobiles</li>
|
209
|
+
</ul>
|
210
|
+
<ol>
|
211
|
+
<li>I get knocked down</li>
|
212
|
+
<li>I get up again</li>
|
213
|
+
<li>Never gonna keep me down</li>
|
214
|
+
</ol>
|
215
|
+
```
|
216
|
+
|
217
|
+
Becomes:
|
218
|
+
|
219
|
+
```
|
220
|
+
- Planes
|
221
|
+
- Trains
|
222
|
+
- Automobiles
|
223
|
+
|
224
|
+
1. I get knocked down
|
225
|
+
2. I get up again
|
226
|
+
3. Never gonna keep me down
|
227
|
+
```
|
228
|
+
|
229
|
+
### Tables
|
230
|
+
|
231
|
+
Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
|
232
|
+
table is purely presentational, mark it with `role="presentation"`. See below for details.
|
233
|
+
|
234
|
+
For real data tables, Ghostwriter tries to maintain table structure for simple tables:
|
235
|
+
|
236
|
+
```html
|
237
|
+
|
238
|
+
<table>
|
239
|
+
<thead>
|
240
|
+
<tr>
|
241
|
+
<th>Ship</th>
|
242
|
+
<th>Captain</th>
|
243
|
+
</tr>
|
244
|
+
</thead>
|
245
|
+
<tbody>
|
246
|
+
<tr>
|
247
|
+
<td>Enterprise</td>
|
248
|
+
<td>Jean-Luc Picard</td>
|
249
|
+
</tr>
|
250
|
+
<tr>
|
251
|
+
<td>TARDIS</td>
|
252
|
+
<td>The Doctor</td>
|
253
|
+
</tr>
|
254
|
+
<tr>
|
255
|
+
<td>Planet Express Ship</td>
|
256
|
+
<td>Turanga Leela</td>
|
257
|
+
</tr>
|
258
|
+
</tbody>
|
259
|
+
</table>
|
260
|
+
```
|
261
|
+
|
262
|
+
Becomes:
|
263
|
+
|
136
264
|
```
|
137
265
|
| Ship | Captain |
|
138
266
|
|---------------------|-----------------|
|
@@ -141,6 +269,105 @@ Produces:
|
|
141
269
|
| Planet Express Ship | Turanga Leela |
|
142
270
|
```
|
143
271
|
|
272
|
+
### Customizing Output
|
273
|
+
|
274
|
+
Ghostwriter has some constructor options to customize output.
|
275
|
+
|
276
|
+
You can set heading markers.
|
277
|
+
|
278
|
+
```ruby
|
279
|
+
html = <<~HTML
|
280
|
+
<h1>Emergency Cat Procedures</h1>
|
281
|
+
HTML
|
282
|
+
|
283
|
+
writer = Ghostwriter::Writer.new(heading_marker: '#')
|
284
|
+
|
285
|
+
puts writer.textify(html)
|
286
|
+
```
|
287
|
+
|
288
|
+
Produces:
|
289
|
+
|
290
|
+
```
|
291
|
+
# Emergency Cat Procedures #
|
292
|
+
```
|
293
|
+
|
294
|
+
You can also set list item markers. Ordered markers can be anything that responds to `#next` (e.g. any `Enumerator`):
|
295
|
+
|
296
|
+
```ruby
|
297
|
+
html = <<~HTML
|
298
|
+
<ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
|
299
|
+
<ul><li>Teapot</li><li>Kettle</li></ul>
|
300
|
+
HTML
|
301
|
+
|
302
|
+
writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
|
303
|
+
|
304
|
+
puts writer.textify(html)
|
305
|
+
```
|
306
|
+
|
307
|
+
Produces:
|
308
|
+
|
309
|
+
```
|
310
|
+
a. Mercury
|
311
|
+
b. Venus
|
312
|
+
c. Mars
|
313
|
+
|
314
|
+
* Teapot
|
315
|
+
* Kettle
|
316
|
+
```
|
317
|
+
|
318
|
+
And tables can be customized:
|
319
|
+
|
320
|
+
```ruby
|
321
|
+
writer = Ghostwriter::Writer.new(table_row: '.',
|
322
|
+
table_column: '#',
|
323
|
+
table_corner: '+')
|
324
|
+
|
325
|
+
puts writer.textify <<~HTML
|
326
|
+
<table>
|
327
|
+
<thead>
|
328
|
+
<tr><th>Moon</th><th>Portfolio</th></tr>
|
329
|
+
</thead>
|
330
|
+
<tbody>
|
331
|
+
<tr><td>Phobos</td><td>Fear & Panic</td></tr>
|
332
|
+
<tr><td>Deimos</td><td>Dread and Terror</td></tr>
|
333
|
+
</tbody>
|
334
|
+
</table>
|
335
|
+
HTML
|
336
|
+
```
|
337
|
+
|
338
|
+
Produces:
|
339
|
+
|
340
|
+
```
|
341
|
+
# Moon # Portfolio #
|
342
|
+
+........+..................+
|
343
|
+
# Phobos # Fear & Panic #
|
344
|
+
# Deimos # Dread and Terror #
|
345
|
+
|
346
|
+
```
|
347
|
+
|
348
|
+
#### Presentation ARIA Role
|
349
|
+
|
350
|
+
Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
|
351
|
+
|
352
|
+
```html
|
353
|
+
|
354
|
+
<table role="presentation">
|
355
|
+
<tr>
|
356
|
+
<td>The table is a lie</td>
|
357
|
+
</tr>
|
358
|
+
</table>
|
359
|
+
<ul role="presentation">
|
360
|
+
<li>No such list</li>
|
361
|
+
</ul>
|
362
|
+
```
|
363
|
+
|
364
|
+
Becomes:
|
365
|
+
|
366
|
+
```
|
367
|
+
The table is a lie
|
368
|
+
No such list
|
369
|
+
```
|
370
|
+
|
144
371
|
### Mail Gem Example
|
145
372
|
|
146
373
|
To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html
|
@@ -149,7 +376,8 @@ through Ghostwriter:
|
|
149
376
|
```ruby
|
150
377
|
require 'mail'
|
151
378
|
|
152
|
-
html
|
379
|
+
html = 'My email and a <a href="https://tenjin.ca">link</a>'
|
380
|
+
ghostwriter = Ghostwriter::Writer.new
|
153
381
|
|
154
382
|
Mail.deliver do
|
155
383
|
to 'bob@example.com'
|
@@ -162,7 +390,7 @@ Mail.deliver do
|
|
162
390
|
end
|
163
391
|
|
164
392
|
text_part do
|
165
|
-
body
|
393
|
+
body ghostwriter.textify(html)
|
166
394
|
end
|
167
395
|
end
|
168
396
|
|
@@ -181,19 +409,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
|
|
181
409
|
can also run `bin/console` for an interactive prompt that will allow you to experiment.
|
182
410
|
|
183
411
|
#### Local Install
|
184
|
-
To install this gem onto your local machine only, run
|
185
412
|
|
186
|
-
|
413
|
+
To install this gem onto your local machine only, run
|
414
|
+
|
415
|
+
`bundle exec rake install`
|
187
416
|
|
188
417
|
#### Gem Release
|
418
|
+
|
189
419
|
To release a gem to the world at large
|
190
420
|
|
191
|
-
|
192
|
-
|
193
|
-
|
194
|
-
|
195
|
-
and push the `.gem` file to [rubygems.org](https://rubygems.org).
|
196
|
-
3. Do a wee dance
|
421
|
+
1. Update the version number in `version.rb`,
|
422
|
+
2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
|
423
|
+
the `.gem` file to [rubygems.org](https://rubygems.org).
|
424
|
+
3. Do a wee dance
|
197
425
|
|
198
426
|
## License
|
199
427
|
|
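The marker options described in this README can be illustrated without the gem itself. What follows is a minimal sketch, assuming the behaviour shown in the README examples: a heading marker is repeated once per heading level, and an ordered list marker advances through `String#next`. The helper names `heading_line` and `ordered_list` are hypothetical and are not part of Ghostwriter's API.

```ruby
# Hypothetical helpers illustrating the README's marker behaviour.
# They are NOT part of Ghostwriter's public API.

# Wraps text in a heading marker repeated once per heading level.
def heading_line(text, level, marker: '--')
  wrap = marker * level
  "#{wrap} #{text} #{wrap}"
end

# Prefixes each item with a marker that advances via String#next.
def ordered_list(items, start: '1')
  marker = start
  items.map do |item|
    line = "#{marker}. #{item}"
    marker = marker.next # returns a new string; '1' -> '2', 'a' -> 'b'
    line
  end
end

puts heading_line('Food Input Port', 2)
puts ordered_list(%w[Mercury Venus Mars], start: 'a')
```

Because `String#next` works on letters as well as digits, the same loop covers both the default `'1'` marker and the `ol_marker: 'a'` example.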
data/RELEASE_NOTES.md
CHANGED
@@ -1,5 +1,70 @@
|
|
1
1
|
# Release Notes
|
2
2
|
|
3
|
+
## 1.2.1 (2021-10-29)
|
4
|
+
|
5
|
+
### Major
|
6
|
+
|
7
|
+
* none
|
8
|
+
|
9
|
+
### Minor
|
10
|
+
|
11
|
+
* Updated Nokogiri version to resolve https://github.com/advisories/GHSA-7rrm-v45f-jp64
|
12
|
+
* Updated Ruby version dependency to match
|
13
|
+
* Relaxed dependency upper bounds
|
14
|
+
|
15
|
+
### Bugfixes
|
16
|
+
|
17
|
+
* none
|
18
|
+
|
19
|
+
## 1.1.0 (2021-03-23)
|
20
|
+
|
21
|
+
### Major
|
22
|
+
|
23
|
+
* none
|
24
|
+
|
25
|
+
### Minor
|
26
|
+
|
27
|
+
* Added customization for headings
|
28
|
+
* Headings now repeat the marker more times for higher-numbered heading levels
|
29
|
+
* Added customization for list markers
|
30
|
+
* Added customization for table markers
|
31
|
+
* Writer is now immutable
|
32
|
+
|
33
|
+
### Bugfixes
|
34
|
+
|
35
|
+
* none
|
36
|
+
|
37
|
+
## 1.0.1 (2021-03-22)
|
38
|
+
|
39
|
+
### Major
|
40
|
+
|
41
|
+
* none
|
42
|
+
|
43
|
+
### Minor
|
44
|
+
|
45
|
+
* Updated README
|
46
|
+
|
47
|
+
### Bugfixes
|
48
|
+
|
49
|
+
* Fixed hr padding behaviour
|
50
|
+
|
51
|
+
## 1.0.0 (2021-03-21)
|
52
|
+
|
53
|
+
### Major
|
54
|
+
|
55
|
+
* Moved `link_base` parameter to constructor
|
56
|
+
* Moved input HTML parameter to `#textify`
|
57
|
+
|
58
|
+
### Minor
|
59
|
+
|
60
|
+
* Treats tables and lists with role="presentation" as simple containers
|
61
|
+
* Now handles ordered and unordered lists
|
62
|
+
* Images are now replaced with their alt text
|
63
|
+
|
64
|
+
### Bugfixes
|
65
|
+
|
66
|
+
* none
|
67
|
+
|
3
68
|
## 0.4.2 (2021-03-17)
|
4
69
|
|
5
70
|
### Major
|
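The 1.1.0 note "Writer is now immutable" pairs with the `freeze` call added to the constructor in `writer.rb`. As a rough sketch of what that pattern implies for callers, the hypothetical class below (`FrozenOptions`, not a Ghostwriter class) freezes itself on construction, so changing an option means building a new object. The `with` helper is an assumption added for illustration only.

```ruby
# Hypothetical stand-in for an object whose options are fixed at construction.
# This is NOT Ghostwriter::Writer; it only mirrors the freeze-in-initialize idea.
class FrozenOptions
  attr_reader :link_base, :heading_marker

  def initialize(link_base: '', heading_marker: '--')
    @link_base      = link_base
    @heading_marker = heading_marker

    freeze # later instance-variable assignment raises FrozenError
  end

  # Illustrative helper: returns a new instance instead of mutating this one.
  def with(**overrides)
    current = { link_base: link_base, heading_marker: heading_marker }
    self.class.new(**current.merge(overrides))
  end
end

plain  = FrozenOptions.new
hashed = plain.with(heading_marker: '#')

puts plain.frozen?
puts hashed.heading_marker
```

One practical consequence of a frozen object is that a single instance can be shared across threads or requests without one caller's settings leaking into another's.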
data/dirt-textify.gemspec
CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
|
|
10
10
|
spec.authors = ['Robin Miller']
|
11
11
|
spec.email = ['robin@tenjin.ca']
|
12
12
|
|
13
|
-
spec.summary = '
|
13
|
+
spec.summary = 'Converts HTML to plain text'
|
14
14
|
spec.description = <<~DESC
|
15
|
-
|
15
|
+
Converts HTML to plain text, preserving as much legibility and functionality as possible.
|
16
|
+
|
17
|
+
Ideal for providing a plaintext multipart segment of email messages.
|
16
18
|
DESC
|
17
19
|
spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
|
18
20
|
spec.license = 'MIT'
|
@@ -25,13 +27,13 @@ Gem::Specification.new do |spec|
|
|
25
27
|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
|
26
28
|
spec.require_paths = ['lib']
|
27
29
|
|
28
|
-
spec.required_ruby_version = '
|
30
|
+
spec.required_ruby_version = '>= 2.7'
|
29
31
|
|
30
|
-
spec.add_dependency 'nokogiri', '
|
32
|
+
spec.add_dependency 'nokogiri', '>= 1.12'
|
31
33
|
|
32
34
|
spec.add_development_dependency 'bundler', '~> 2.2'
|
33
35
|
spec.add_development_dependency 'rake', '~> 13.0'
|
34
36
|
spec.add_development_dependency 'rspec', '~> 3.3'
|
35
|
-
spec.add_development_dependency 'rubocop', '~> 1.
|
36
|
-
spec.add_development_dependency 'rubocop-performance', '~> 1.
|
37
|
+
spec.add_development_dependency 'rubocop', '~> 1.22'
|
38
|
+
spec.add_development_dependency 'rubocop-performance', '~> 1.11'
|
37
39
|
end
|
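The release notes describe the dependency change as relaxing upper bounds, and the gemspec now uses `>=` for nokogiri while the development dependencies keep `~>`. The difference can be checked with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. This is a small sketch; the candidate version numbers are arbitrary examples, not versions named in this diff.

```ruby
require 'rubygems' # normally loaded already; required here for clarity

open_ended  = Gem::Requirement.new('>= 1.12') # style used for nokogiri
pessimistic = Gem::Requirement.new('~> 1.22') # style used for rubocop

# Arbitrary example versions, chosen only to show the boundary behaviour.
next_minor = Gem::Version.new('1.30.0')
next_major = Gem::Version.new('2.0.0')

# '>=' has no upper bound, so a later major version still satisfies it.
puts open_ended.satisfied_by?(next_major)

# '~> 1.22' accepts later 1.x releases but stops before 2.0.
puts pessimistic.satisfied_by?(next_minor)
puts pessimistic.satisfied_by?(next_major)
```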
data/lib/ghostwriter/version.rb
CHANGED
data/lib/ghostwriter/writer.rb
CHANGED
@@ -3,52 +3,59 @@
|
|
3
3
|
module Ghostwriter
|
4
4
|
# Main Ghostwriter converter object.
|
5
5
|
class Writer
|
6
|
-
|
7
|
-
|
6
|
+
attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
|
7
|
+
|
8
|
+
# Creates a new ghostwriter
|
9
|
+
#
|
10
|
+
# @param [String] link_base the url to prefix relative links with
|
11
|
+
def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
|
12
|
+
table_column: '|', table_row: '-', table_corner: '|')
|
13
|
+
@link_base = link_base
|
14
|
+
@heading_marker = heading_marker
|
15
|
+
@ul_marker = ul_marker
|
16
|
+
@ol_marker = ol_marker
|
17
|
+
@table_column = table_column
|
18
|
+
@table_row = table_row
|
19
|
+
@table_corner = table_corner
|
20
|
+
|
21
|
+
freeze
|
8
22
|
end
|
9
23
|
|
10
24
|
# Strips HTML down to plain text.
|
11
25
|
#
|
12
|
-
# @param
|
13
|
-
|
14
|
-
|
26
|
+
# @param html [String] the HTML to be converted to text
|
27
|
+
#
|
28
|
+
# @return converted text
|
29
|
+
def textify(html)
|
30
|
+
doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
|
31
|
+
|
32
|
+
doc.search('style, script').remove
|
15
33
|
|
16
|
-
doc
|
34
|
+
replace_anchors(doc)
|
35
|
+
replace_images(doc)
|
17
36
|
|
18
|
-
doc
|
19
|
-
doc.search('script').remove
|
37
|
+
simple_replace(doc, '*[role="presentation"]', "\n")
|
20
38
|
|
21
|
-
replace_anchors(doc, link_base)
|
22
39
|
replace_headers(doc)
|
40
|
+
replace_lists(doc)
|
23
41
|
replace_tables(doc)
|
24
42
|
|
25
|
-
simple_replace(doc, 'hr', "\n----------\n")
|
43
|
+
simple_replace(doc, 'hr', "\n----------\n\n")
|
26
44
|
simple_replace(doc, 'br', "\n")
|
45
|
+
simple_replace(doc, 'p', "\n\n")
|
27
46
|
|
28
|
-
|
29
|
-
# link_node.inner_html = link_node.inner_html + "\n\n"
|
30
|
-
# end
|
31
|
-
|
32
|
-
# trim, but only single-space character
|
33
|
-
doc.text.gsub(/^ +| +$/, '')
|
47
|
+
normalize_lines(doc)
|
34
48
|
end
|
35
49
|
|
36
50
|
private
|
37
51
|
|
38
|
-
def
|
39
|
-
|
52
|
+
def normalize_lines(doc)
|
53
|
+
doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
|
40
54
|
end
|
41
55
|
|
42
|
-
def replace_anchors(doc
|
43
|
-
base = get_link_base(doc, default: link_base)
|
44
|
-
|
56
|
+
def replace_anchors(doc)
|
45
57
|
doc.search('a').each do |link_node|
|
46
|
-
|
47
|
-
href = URI(link_node['href'])
|
48
|
-
href = base + href.to_s unless href.absolute?
|
49
|
-
rescue URI::InvalidURIError
|
50
|
-
href = link_node['href'].gsub(/^(tel|mailto):/, '').strip
|
51
|
-
end
|
58
|
+
href = get_link_target(link_node, get_link_base(doc))
|
52
59
|
|
53
60
|
link_node.inner_html = if link_matches(href, link_node.inner_html)
|
54
61
|
href.to_s
|
@@ -62,39 +69,96 @@ module Ghostwriter
|
|
62
69
|
first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
|
63
70
|
end
|
64
71
|
|
65
|
-
def get_link_base(doc
|
72
|
+
def get_link_base(doc)
|
66
73
|
# <base> node is unique by W3C spec
|
67
74
|
base_node = doc.search('base').first
|
68
75
|
|
69
|
-
base_node ? base_node['href'] :
|
76
|
+
base_node ? base_node['href'] : @link_base
|
77
|
+
end
|
78
|
+
|
79
|
+
def get_link_target(link_node, base)
|
80
|
+
href = URI(link_node['href'])
|
81
|
+
if href.absolute?
|
82
|
+
href
|
83
|
+
else
|
84
|
+
base + href.to_s
|
85
|
+
end
|
86
|
+
rescue URI::InvalidURIError
|
87
|
+
link_node['href'].gsub(/^(tel|mailto):/, '').strip
|
70
88
|
end
|
71
89
|
|
72
90
|
def replace_headers(doc)
|
73
|
-
doc.search('header, h1
|
74
|
-
node.
|
91
|
+
doc.search('header, h1').each do |node|
|
92
|
+
node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
|
93
|
+
.squeeze(' '))
|
94
|
+
end
|
95
|
+
|
96
|
+
(2..6).each do |n|
|
97
|
+
doc.search("h#{ n }").each do |node|
|
98
|
+
node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
|
99
|
+
.squeeze(' '))
|
100
|
+
end
|
101
|
+
end
|
102
|
+
end
|
103
|
+
|
104
|
+
def replace_images(doc)
|
105
|
+
doc.search('img[role=presentation]').remove
|
106
|
+
|
107
|
+
doc.search('img').each do |img_node|
|
108
|
+
src = img_node['src']
|
109
|
+
alt = img_node['alt']
|
110
|
+
|
111
|
+
src = 'embedded' if src.start_with? 'data:'
|
112
|
+
|
113
|
+
img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
|
114
|
+
end
|
115
|
+
end
|
116
|
+
|
117
|
+
def replace_lists(doc)
|
118
|
+
doc.search('ol').each do |list_node|
|
119
|
+
replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
|
120
|
+
end
|
121
|
+
|
122
|
+
doc.search('ul').each do |list_node|
|
123
|
+
replace_list_items(list_node, @ul_marker)
|
124
|
+
end
|
125
|
+
|
126
|
+
doc.search('ul, ol').each do |list_node|
|
127
|
+
list_node.replace("#{ list_node.inner_html }\n")
|
128
|
+
end
|
129
|
+
end
|
130
|
+
|
131
|
+
def replace_list_items(list_node, marker, after_marker: '', increment: false)
|
132
|
+
list_node.search('./li').each do |list_item|
|
133
|
+
list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
|
134
|
+
|
135
|
+
marker = marker.next if increment
|
75
136
|
end
|
76
137
|
end
|
77
138
|
|
78
139
|
def replace_tables(doc)
|
79
140
|
doc.css('table').each do |table|
|
141
|
+
# remove whitespace between nodes
|
142
|
+
table.search('//text()[normalize-space()=""]').remove
|
143
|
+
|
80
144
|
column_sizes = calculate_column_sizes(table)
|
81
145
|
|
82
146
|
table.search('./thead/tr', './tbody/tr', './tr').each do |row|
|
83
147
|
replace_table_nodes(row, column_sizes)
|
84
148
|
|
85
|
-
row.
|
149
|
+
row.replace("#{ row.inner_html }#{ @table_column }\n")
|
86
150
|
end
|
87
151
|
|
88
152
|
add_table_header_underline(table, column_sizes)
|
89
153
|
|
90
|
-
table.
|
154
|
+
table.replace("\n#{ table.inner_html }\n")
|
91
155
|
end
|
92
156
|
end
|
93
157
|
|
94
158
|
def calculate_column_sizes(table)
|
95
159
|
column_sizes = table.search('tr').collect do |row|
|
96
160
|
row.search('th', 'td').collect do |node|
|
97
|
-
node.
|
161
|
+
node.text.length
|
98
162
|
end
|
99
163
|
end
|
100
164
|
|
@@ -102,25 +166,25 @@ module Ghostwriter
|
|
102
166
|
end
|
103
167
|
|
104
168
|
def add_table_header_underline(table, column_sizes)
|
105
|
-
table.search('./thead').each do |
|
106
|
-
|
169
|
+
table.search('./thead').each do |thead|
|
170
|
+
lines = column_sizes.collect { |len| @table_row * (len + 2) }
|
171
|
+
underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
|
107
172
|
|
108
|
-
|
173
|
+
thead.replace("#{ thead.inner_html }#{ underline_row }\n")
|
109
174
|
end
|
110
175
|
end
|
111
176
|
|
112
177
|
def replace_table_nodes(row, column_sizes)
|
113
178
|
row.search('th', 'td').each_with_index do |node, i|
|
114
|
-
new_content =
|
179
|
+
new_content = node.text.ljust(column_sizes[i] + 1)
|
115
180
|
|
116
|
-
#
|
117
|
-
node.inner_html = new_content.ljust(column_sizes[i] + 2)
|
181
|
+
node.replace("#{ @table_column } #{ new_content }")
|
118
182
|
end
|
119
183
|
end
|
120
184
|
|
121
185
|
def simple_replace(doc, tag, replacement)
|
122
186
|
doc.search(tag).each do |node|
|
123
|
-
node.replace(replacement)
|
187
|
+
node.replace(node.inner_html + replacement)
|
124
188
|
end
|
125
189
|
end
|
126
190
|
end
|
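The new `get_link_target` method separates link resolution from anchor replacement. The sketch below restates that logic as a standalone function using only the standard `uri` library, so it can be tried without Nokogiri. The name `resolve_link` and its string-based signature are assumptions for illustration; the gem's method takes a link node and is private.

```ruby
require 'uri'

# Hypothetical standalone version of the link-resolution idea.
# It takes the href string directly rather than a Nokogiri node.
def resolve_link(href, base)
  uri = URI(href)

  # Absolute links pass through; relative ones are prefixed with the base.
  uri.absolute? ? uri.to_s : base + uri.to_s
rescue URI::InvalidURIError
  # Unparseable targets (for example a phone number containing spaces)
  # fall back to the raw text with any tel:/mailto: prefix removed.
  href.gsub(/^(tel|mailto):/, '').strip
end

puts resolve_link('/contact', 'https://www.example.com')
puts resolve_link('https://tenjin.ca', 'https://www.example.com')
puts resolve_link('tel:+1 555 0100', 'https://www.example.com')
```

Note that the base is joined by plain string concatenation, as in the diff, so a base ending in `/` combined with an href starting with `/` would yield a double slash.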
metadata
CHANGED
@@ -1,29 +1,29 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: ghostwriter
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 1.2.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Robin Miller
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2021-
|
11
|
+
date: 2021-10-29 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
|
-
- -
|
17
|
+
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: 1.
|
19
|
+
version: '1.12'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
|
-
- -
|
24
|
+
- - ">="
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: 1.
|
26
|
+
version: '1.12'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: bundler
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -72,31 +72,32 @@ dependencies:
|
|
72
72
|
requirements:
|
73
73
|
- - "~>"
|
74
74
|
- !ruby/object:Gem::Version
|
75
|
-
version: '1.
|
75
|
+
version: '1.22'
|
76
76
|
type: :development
|
77
77
|
prerelease: false
|
78
78
|
version_requirements: !ruby/object:Gem::Requirement
|
79
79
|
requirements:
|
80
80
|
- - "~>"
|
81
81
|
- !ruby/object:Gem::Version
|
82
|
-
version: '1.
|
82
|
+
version: '1.22'
|
83
83
|
- !ruby/object:Gem::Dependency
|
84
84
|
name: rubocop-performance
|
85
85
|
requirement: !ruby/object:Gem::Requirement
|
86
86
|
requirements:
|
87
87
|
- - "~>"
|
88
88
|
- !ruby/object:Gem::Version
|
89
|
-
version: '1.
|
89
|
+
version: '1.11'
|
90
90
|
type: :development
|
91
91
|
prerelease: false
|
92
92
|
version_requirements: !ruby/object:Gem::Requirement
|
93
93
|
requirements:
|
94
94
|
- - "~>"
|
95
95
|
- !ruby/object:Gem::Version
|
96
|
-
version: '1.
|
97
|
-
description:
|
96
|
+
version: '1.11'
|
97
|
+
description: |
|
98
|
+
Converts HTML to plain text, preserving as much legibility and functionality as possible.
|
98
99
|
|
99
|
-
|
100
|
+
Ideal for providing a plaintext multipart segment of email messages.
|
100
101
|
email:
|
101
102
|
- robin@tenjin.ca
|
102
103
|
executables: []
|
@@ -130,9 +131,9 @@ require_paths:
|
|
130
131
|
- lib
|
131
132
|
required_ruby_version: !ruby/object:Gem::Requirement
|
132
133
|
requirements:
|
133
|
-
- - "
|
134
|
+
- - ">="
|
134
135
|
- !ruby/object:Gem::Version
|
135
|
-
version: '2.
|
136
|
+
version: '2.7'
|
136
137
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
137
138
|
requirements:
|
138
139
|
- - ">="
|
@@ -142,5 +143,5 @@ requirements: []
|
|
142
143
|
rubygems_version: 3.1.2
|
143
144
|
signing_key:
|
144
145
|
specification_version: 4
|
145
|
-
summary:
|
146
|
+
summary: Converts HTML to plain text
|
146
147
|
test_files: []
|