ghostwriter 0.4.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: afedaad685b3c06c7baf6e7b1a7b00d8397f9ad1446ddfdcf835dfc99df19969
4
- data.tar.gz: 786a612861e5d11e19672057371720c57dd7dc67e1f9b7333bf9269b819b0548
3
+ metadata.gz: b87477fb4de8fc196eda6ea049eb3f1355ae583f0622c11b98543a67cfe71c8b
4
+ data.tar.gz: fa4e66701a3326ccceef69b3eb8f8591f8eea337f6f4f37c5b18a820bd120d42
5
5
  SHA512:
6
- metadata.gz: 610473e1214cd68edd7b1ce952fd43c44c22a29bc0895bebcc57b4e5e00385bccaec2008e23252eaf438a8b9a1ebd56af0a4c3956be53791ef1b5ee8b75dc1a3
7
- data.tar.gz: 3f3bee72a1515077e7ccff0b1baae94607e33c84c83a84f8b51fdf3234aeaecaa48ddf0d6cfc2a38584ce6c0d0d7f52db5b4021638eb611089087ee638ba2d16
6
+ metadata.gz: b5c545e83c7073b3c5923efc3ffcb6ecfd78650c472a482b5ec16fd7ef5533859ce6fb24d32378aafd20c83e35acd558c136856b46f0c7e0373beb7e3191be34
7
+ data.tar.gz: 8b259323e6653c112885ddf6fe4ee99b31c01e5625ee2fcf22d0259330dc090df11a1b7ee2dc1507c222b5f79c8322717aba9455b9bdbdca04334c0becc5959a
data/README.md CHANGED
@@ -1,14 +1,15 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites HTML as plain text while preserving as much legibility and functionality as possible.
3
+ A Ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
4
 
5
- It's sort of like a reverse-markdown.
5
+ It's sort of like a reverse-markdown or a very, very simple screen reader.
6
6
 
7
7
  ## But Why, Though?
8
8
 
9
- * Spam filters tend to like emails with a plain text alternative
10
9
  * Some email clients won't or can’t handle HTML at all
11
10
  * Some people explicitly choose plaintext just by preference or accessibility
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people, I will yell
12
+ at you)
12
13
 
13
14
  ## Installation
14
15
 
@@ -28,26 +29,339 @@ Or install it manually with:
28
29
 
29
30
  ## Usage
30
31
 
31
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
32
+ Create a `Ghostwriter::Writer` and call `#textify` with the HTML string you want converted:
32
33
 
33
34
  ```ruby
34
- html = '<html><body>This is some markup <a href="tenjin.ca">and a link</a><p>Other tags translate, too</p></body></html>'
35
+ html = <<~HTML
36
+ <html>
37
+ <body>
38
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
39
+ <p>It handles other stuff, too.</p>
40
+ <hr>
41
+ <h1>Stuff Like</h1>
42
+ <ul>
43
+ <li>Images</li>
44
+ <li>Lists</li>
45
+ <li>Tables</li>
46
+ <li>And more</li>
47
+ </ul>
48
+ </body>
49
+ </html>
50
+ HTML
51
+
52
+ ghostwriter = Ghostwriter::Writer.new
53
+
54
+ puts ghostwriter.textify(html)
55
+ ```
56
+
57
+ Produces:
58
+
59
+ ```
60
+ This is some text with a link (tenjin.ca)
61
+
62
+ It handles other stuff, too.
63
+
64
+
65
+ ----------
66
+
67
+ -- Stuff Like --
68
+ - Images
69
+ - Lists
70
+ - Tables
71
+ - And more
72
+ ```
73
+
74
+ ### Links
75
+
76
+ Links are converted to the link text followed by the link target in parentheses:
77
+
78
+ ```html
79
+ Visit our <a href="https://example.com">Website</a>
80
+ ```
81
+
82
+ Becomes:
83
+
84
+ ```
85
+ Visit our Website (https://example.com)
86
+ ```
87
+
88
+ #### Relative Links
89
+
90
+ Because an email is read outside your website, relative links have no page address to resolve against and might break.
91
+
92
+ To avoid this problem, either use the `<base>` header tag:
93
+
94
+ ```html
95
+
96
+ <html>
97
+ <head>
98
+ <base href="https://www.example.com">
99
+ </head>
100
+ <body>
101
+ Use the base tag to <a href="/contact">expand</a> links.
102
+ </body>
103
+ </html>
104
+ ```
105
+
106
+ Becomes:
107
+
108
+ ```
109
+ Use the base tag to expand (https://www.example.com/contact) links.
110
+ ```
111
+
112
+ Or you can use the `link_base` configuration:
113
+
114
+ ```ruby
115
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
116
+ ```
117
+
118
+ ### Images
119
+
120
+ Images with alt text are converted:
121
+
122
+ ```html
123
+ <img src="logo.jpg" alt="ACME Anvils" />
124
+ ```
125
+
126
+ Becomes:
127
+
128
+ ```
129
+ ACME Anvils (logo.jpg)
130
+ ```
131
+
132
+ But images lacking alt text or with a presentation ARIA role are ignored:
133
+
134
+ ```html
135
+ <!-- these will just become an empty string -->
136
+ <img src="decoration.jpg">
137
+ <img src="logo.jpg" role="presentation">
138
+ ```
139
+
140
+ And images with data URIs won't include the data portion.
141
+
142
+ ```html
143
+
144
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
145
+ alt="Data picture" />
146
+ ```
147
+
148
+ Becomes:
149
+
150
+ ```
151
+ Data picture (embedded)
152
+ ```
153
+
154
+ ### Paragraphs and Linebreaks
155
+
156
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
157
+
158
+ ```html
159
+ <p>I would like to propose a toast.</p>
160
+ <p>This meal we enjoy together would be improved by one.</p>
161
+ <br />
162
+ <p>... Plug in the toaster and I'll get the bread.</p>
163
+ ```
164
+
165
+ ```
166
+ I would like to propose a toast.
167
+
168
+ This meal we enjoy together would be improved by one.
169
+
170
+
171
+ ... Plug in the toaster and I'll get the bread.
35
172
 
36
- Ghostwriter::Writer.new(html).textify
173
+ ```
174
+
175
+ ### Headings
176
+
177
+ Headings are wrapped with a marker per heading level:
178
+
179
+ ```html
180
+ <h1>Dog Maintenance and Repair</h1>
181
+ <h2>Food Input Port</h2>
182
+ <h3>Exhaust Port Considerations</h3>
183
+ ```
184
+
185
+ Becomes:
186
+
187
+ ```
188
+ -- Dog Maintenance and Repair --
189
+ ---- Food Input Port ----
190
+ ------ Exhaust Port Considerations ------
191
+ ```
192
+
193
+ The `<header>` tag is treated like an `<h1>` tag.
194
+
195
+ ### Lists
196
+
197
+ Lists are converted, too. They are padded with newlines and are given simple markers:
198
+
199
+ ```html
200
+
201
+ <ul>
202
+ <li>Planes</li>
203
+ <li>Trains</li>
204
+ <li>Automobiles</li>
205
+ </ul>
206
+ <ol>
207
+ <li>I get knocked down</li>
208
+ <li>I get up again</li>
209
+ <li>Never gonna keep me down</li>
210
+ </ol>
211
+ ```
37
212
 
38
- => "This is some markup and a link (tenjin.ca)\nOther tags translate, too\n\n"
213
+ Becomes:
39
214
 
40
215
  ```
216
+ - Planes
217
+ - Trains
218
+ - Automobiles
41
219
 
42
- `#textify` will use the `<base>` tag if found in the HTML source, or if one is provided explicitly:
220
+ 1. I get knocked down
221
+ 2. I get up again
222
+ 3. Never gonna keep me down
223
+ ```
224
+
225
+ ### Tables
226
+
227
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
228
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
229
+
230
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
231
+
232
+ ```html
233
+
234
+ <table>
235
+ <thead>
236
+ <tr>
237
+ <th>Ship</th>
238
+ <th>Captain</th>
239
+ </tr>
240
+ </thead>
241
+ <tbody>
242
+ <tr>
243
+ <td>Enterprise</td>
244
+ <td>Jean-Luc Picard</td>
245
+ </tr>
246
+ <tr>
247
+ <td>TARDIS</td>
248
+ <td>The Doctor</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Planet Express Ship</td>
252
+ <td>Turanga Leela</td>
253
+ </tr>
254
+ </tbody>
255
+ </table>
256
+ ```
257
+
258
+ Becomes:
259
+
260
+ ```
261
+ | Ship                | Captain         |
262
+ |---------------------|-----------------|
263
+ | Enterprise          | Jean-Luc Picard |
264
+ | TARDIS              | The Doctor      |
265
+ | Planet Express Ship | Turanga Leela   |
266
+ ```
267
+
268
+ ### Customizing Output
269
+
270
+ Ghostwriter has some constructor options to customize output.
271
+
272
+ You can set heading markers.
43
273
 
44
274
  ```ruby
45
- html = '<html><body>Relative links <a href="/contact">Link</a></body></html>'
275
+ html = <<~HTML
276
+ <h1>Emergency Cat Procedures</h1>
277
+ HTML
46
278
 
47
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
279
+ writer = Ghostwriter::Writer.new(heading_marker: '#')
280
+
281
+ puts writer.textify(html)
282
+ ```
48
283
 
49
- => "Relative links Link (tenjin.ca/contact)"
284
+ Produces:
50
285
 
286
+ ```
287
+ # Emergency Cat Procedures #
288
+ ```
289
+
290
+ You can also set list item markers. Ordered markers can be anything that responds to `#next` (e.g. a `String` such as `'a'`, or an `Integer`).
291
+
292
+ ```ruby
293
+ html = <<~HTML
294
+ <ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
295
+ <ul><li>Teapot</li><li>Kettle</li></ul>
296
+ HTML
297
+
298
+ writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
299
+
300
+ puts writer.textify(html)
301
+ ```
302
+
303
+ Produces:
304
+
305
+ ```
306
+ a. Mercury
307
+ b. Venus
308
+ c. Mars
309
+
310
+ * Teapot
311
+ * Kettle
312
+ ```
313
+
314
+ And tables can be customized:
315
+
316
+ ```ruby
317
+ writer = Ghostwriter::Writer.new(table_row: '.',
318
+ table_column: '#',
319
+ table_corner: '+')
320
+
321
+ puts writer.textify <<~HTML
322
+ <table>
323
+ <thead>
324
+ <tr><th>Moon</th><th>Portfolio</th></tr>
325
+ </thead>
326
+ <tbody>
327
+ <tr><td>Phobos</td><td>Fear & Panic</td></tr>
328
+ <tr><td>Deimos</td><td>Dread and Terror</td></tr>
329
+ </tbody>
330
+ </table>
331
+ HTML
332
+ ```
333
+
334
+ Produces:
335
+
336
+ ```
337
+ # Moon   # Portfolio        #
338
+ +........+..................+
339
+ # Phobos # Fear & Panic     #
340
+ # Deimos # Dread and Terror #
341
+
342
+ ```
343
+
344
+ #### Presentation ARIA Role
345
+
346
+ Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
347
+
348
+ ```html
349
+
350
+ <table role="presentation">
351
+ <tr>
352
+ <td>The table is a lie</td>
353
+ </tr>
354
+ </table>
355
+ <ul role="presentation">
356
+ <li>No such list</li>
357
+ </ul>
358
+ ```
359
+
360
+ Becomes:
361
+
362
+ ```
363
+ The table is a lie
364
+ No such list
51
365
  ```
52
366
 
53
367
  ### Mail Gem Example
@@ -58,9 +372,10 @@ through Ghostwriter:
58
372
  ```ruby
59
373
  require 'mail'
60
374
 
61
- html = 'My email and a <a href="http://tenjin.ca">link</a>'
375
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
376
+ ghostwriter = Ghostwriter::Writer.new
62
377
 
63
- mail = Mail.deliver do
378
+ Mail.deliver do
64
379
  to 'bob@example.com'
65
380
  from 'dot@example.com'
66
381
  subject 'Using Ghostwriter with Mail'
@@ -71,7 +386,7 @@ mail = Mail.deliver do
71
386
  end
72
387
 
73
388
  text_part do
74
- body Ghostwriter::Writer.new(html).textify
389
+ body ghostwriter.textify(html)
75
390
  end
76
391
  end
77
392
 
@@ -90,19 +405,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
90
405
  can also run `bin/console` for an interactive prompt that will allow you to experiment.
91
406
 
92
407
  #### Local Install
93
- To install this gem onto your local machine only, run
94
408
 
95
- `bundle exec rake install`
409
+ To install this gem onto your local machine only, run
410
+
411
+ `bundle exec rake install`
96
412
 
97
413
  #### Gem Release
414
+
98
415
  To release a gem to the world at large
99
416
 
100
- 1. Update the version number in `version.rb`,
101
- 2. Run `bundle exec rake release`,
102
- which will create a git tag for the version,
103
- push git commits and tags,
104
- and push the `.gem` file to [rubygems.org](https://rubygems.org).
105
- 3. Do a wee dance
417
+ 1. Update the version number in `version.rb`,
418
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
419
+ the `.gem` file to [rubygems.org](https://rubygems.org).
420
+ 3. Do a wee dance
106
421
 
107
422
  ## License
108
423
 
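The "Customizing Output" section of the README hunk above describes heading markers that grow with the heading level and ordered list markers that advance via `#next`. The following is a minimal, stdlib-only sketch of those two ideas. The helper names `heading` and `mark_items` are hypothetical and are not part of the gem's API; they only illustrate the behaviour the README documents.

```ruby
# Hypothetical helpers for illustration only; they are not part of Ghostwriter.

# Wraps text in a marker repeated once per heading level,
# e.g. level 2 with '--' gives '---- text ----'.
def heading(text, level, marker: '--')
  wrap = marker * level
  "#{wrap} #{text} #{wrap}"
end

# Prefixes each item with the current marker, then advances the marker
# by calling #next on it (String#next and Integer#next both work).
def mark_items(items, marker, after_marker: '.')
  items.map do |item|
    line = "#{marker}#{after_marker} #{item}"
    marker = marker.next
    line
  end
end

puts heading('Food Input Port', 2)
puts mark_items(%w[Mercury Venus Mars], 'a')
puts mark_items(%w[Mercury Venus Mars], 1)
```

Because the marker is advanced with `#next`, a string marker rolls over the way `String#next` does: `'z'` is followed by `'aa'` and `'9'` by `'10'`.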
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,82 @@
1
1
  # Release Notes
2
2
 
3
+ ## 1.1.0 (2021-03-23)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Added customization for headings
12
+ * Heading markers now repeat once per heading level, so deeper headings get longer markers
13
+ * Added customization for list markers
14
+ * Added customization for table markers
15
+ * Writer is now immutable
16
+
17
+ ### Bugfixes
18
+
19
+ * none
20
+
21
+ ## 1.0.1 (2021-03-22)
22
+
23
+ ### Major
24
+
25
+ * none
26
+
27
+ ### Minor
28
+
29
+ * Updated README
30
+
31
+ ### Bugfixes
32
+
33
+ * Fixed hr padding behaviour
34
+
35
+ ## 1.0.0 (2021-03-21)
36
+
37
+ ### Major
38
+
39
+ * Moved `link_base` parameter to constructor
40
+ * Moved input HTML parameter to `#textify`
41
+
42
+ ### Minor
43
+
44
+ * Treats tables and lists with role="presentation" as simple containers
45
+ * Now handles ordered and unordered lists
46
+ * Images are now replaced with their alt text
47
+
48
+ ### Bugfixes
49
+
50
+ * none
51
+
52
+ ## 0.4.2 (2021-03-17)
53
+
54
+ ### Major
55
+
56
+ * none
57
+
58
+ ### Minor
59
+
60
+ * none
61
+
62
+ ### Bugfixes
63
+
64
+ * Works with links using `tel:` and `mailto:` schemes.
65
+
66
+ ## 0.4.1 (2021-03-17)
67
+
68
+ ### Major
69
+
70
+ * none
71
+
72
+ ### Minor
73
+
74
+ * No longer provides link target in brackets after link text when they are the same
75
+
76
+ ### Bugfixes
77
+
78
+ * Added explicit testing for HTML entity interpretation
79
+
3
80
  ## 0.4.0 (2021-03-16)
4
81
 
5
82
  ### Major
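The 0.4.1 entry above says the link target is no longer appended when it is the same as the link text. A rough, stdlib-only sketch of that comparison follows. The method names `same_target?` and `render_link` are hypothetical, and the normalization (dropping a leading `http://` or `https://` and a trailing slash) is modelled on the `link_matches` method shown in the writer diff rather than copied from it.

```ruby
# Hypothetical stand-alone sketch; not the gem's actual methods.

# Treats two strings as the same target when they differ only by an
# http(s) scheme prefix or a trailing slash.
def same_target?(href, text)
  normalize = ->(s) { s.to_s.sub(%r{\Ahttps?://}, '').chomp('/') }
  normalize.call(href) == normalize.call(text)
end

# Emits only the target when text and target match; otherwise emits
# the text followed by the target in parentheses.
def render_link(text, href)
  same_target?(href, text) ? href.to_s : "#{text} (#{href})"
end

puts render_link('Website', 'https://example.com')
puts render_link('example.com', 'https://example.com/')
```

The first call keeps both parts because the text and the target differ; the second collapses to the target alone.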
data/dirt-textify.gemspec CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
10
10
  spec.authors = ['Robin Miller']
11
11
  spec.email = ['robin@tenjin.ca']
12
12
 
13
- spec.summary = 'Intelligently extracts plaintext from an HTML document.'
13
+ spec.summary = 'Converts HTML to plain text'
14
14
  spec.description = <<~DESC
15
- Transforms HTML into plaintext while preserving legibility and functionality.
15
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
16
+
17
+ Ideal for providing a plaintext multipart segment of email messages.
16
18
  DESC
17
19
  spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
18
20
  spec.license = 'MIT'
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Ghostwriter
4
- VERSION = '0.4.0'
4
+ VERSION = '1.1.0'
5
5
  end
@@ -3,99 +3,188 @@
3
3
  module Ghostwriter
4
4
  # Main Ghostwriter converter object.
5
5
  class Writer
6
- def initialize(html)
7
- @source_html = html
6
+ attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
7
+
8
+ # Creates a new ghostwriter
9
+ #
10
+ # @param [String] link_base the url to prefix relative links with
11
+ def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
12
+ table_column: '|', table_row: '-', table_corner: '|')
13
+ @link_base = link_base
14
+ @heading_marker = heading_marker
15
+ @ul_marker = ul_marker
16
+ @ol_marker = ol_marker
17
+ @table_column = table_column
18
+ @table_row = table_row
19
+ @table_corner = table_corner
20
+
21
+ freeze
8
22
  end
9
23
 
10
24
  # Strips HTML down to plain text.
11
25
  #
12
- # @param link_base the url to prefix relative links with
13
- def textify(link_base: '')
14
- html = normalize_whitespace(@source_html).gsub('</p>', "</p>\n\n")
26
+ # @param html [String] the HTML to be converted to text
27
+ #
28
+ # @return converted text
29
+ def textify(html)
30
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
15
31
 
16
- doc = Nokogiri::HTML(html)
32
+ doc.search('style, script').remove
17
33
 
18
- doc.search('style').remove
19
- doc.search('script').remove
34
+ replace_anchors(doc)
35
+ replace_images(doc)
36
+
37
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
38
 
21
- replace_anchors(doc, link_base)
22
39
  replace_headers(doc)
23
- replace_table(doc)
40
+ replace_lists(doc)
41
+ replace_tables(doc)
24
42
 
25
- simple_replace(doc, 'hr', "\n----------\n")
43
+ simple_replace(doc, 'hr', "\n----------\n\n")
26
44
  simple_replace(doc, 'br', "\n")
45
+ simple_replace(doc, 'p', "\n\n")
27
46
 
28
- # doc.search('p').each do |link_node|
29
- # link_node.inner_html = link_node.inner_html + "\n\n"
30
- # end
31
-
32
- # trim, but only single-space character
33
- doc.text.gsub(/^ +| +$/, '')
47
+ normalize_lines(doc)
34
48
  end
35
49
 
36
50
  private
37
51
 
38
- def normalize_whitespace(html)
39
- html.gsub(/\s/, ' ').squeeze(' ')
52
+ def normalize_lines(doc)
53
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
40
54
  end
41
55
 
42
- def replace_anchors(doc, link_base)
56
+ def replace_anchors(doc)
57
+ doc.search('a').each do |link_node|
58
+ href = get_link_target(link_node, get_link_base(doc))
59
+
60
+ link_node.inner_html = if link_matches(href, link_node.inner_html)
61
+ href.to_s
62
+ else
63
+ "#{ link_node.inner_html } (#{ href })"
64
+ end
65
+ end
66
+ end
67
+
68
+ def link_matches(first, second)
69
+ first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
70
+ end
71
+
72
+ def get_link_base(doc)
43
73
  # <base> node is unique by W3C spec
44
- base = doc.search('base').first
45
- base_url = base ? base['href'] : link_base
74
+ base_node = doc.search('base').first
46
75
 
47
- doc.search('a').each do |link_node|
48
- href = URI(link_node['href'])
49
- href = base_url + href.to_s unless href.absolute?
76
+ base_node ? base_node['href'] : @link_base
77
+ end
50
78
 
51
- link_node.inner_html = "#{ link_node.inner_html } (#{ href })"
79
+ def get_link_target(link_node, base)
80
+ href = URI(link_node['href'])
81
+ if href.absolute?
82
+ href
83
+ else
84
+ base + href.to_s
52
85
  end
86
+ rescue URI::InvalidURIError
87
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
53
88
  end
54
89
 
55
90
  def replace_headers(doc)
56
- doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
57
- node.inner_html = "- #{ node.inner_html } -\n".squeeze(' ')
91
+ doc.search('header, h1').each do |node|
92
+ node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
93
+ .squeeze(' '))
94
+ end
95
+
96
+ (2..6).each do |n|
97
+ doc.search("h#{ n }").each do |node|
98
+ node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
99
+ .squeeze(' '))
100
+ end
101
+ end
102
+ end
103
+
104
+ def replace_images(doc)
105
+ doc.search('img[role=presentation]').remove
106
+
107
+ doc.search('img').each do |img_node|
108
+ src = img_node['src']
109
+ alt = img_node['alt']
110
+
111
+ src = 'embedded' if src&.start_with?('data:')
112
+
113
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
58
114
  end
59
115
  end
60
116
 
61
- def replace_table(doc)
117
+ def replace_lists(doc)
118
+ doc.search('ol').each do |list_node|
119
+ replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
120
+ end
121
+
122
+ doc.search('ul').each do |list_node|
123
+ replace_list_items(list_node, @ul_marker)
124
+ end
125
+
126
+ doc.search('ul, ol').each do |list_node|
127
+ list_node.replace("#{ list_node.inner_html }\n")
128
+ end
129
+ end
130
+
131
+ def replace_list_items(list_node, marker, after_marker: '', increment: false)
132
+ list_node.search('./li').each do |list_item|
133
+ list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
134
+
135
+ marker = marker.next if increment
136
+ end
137
+ end
138
+
139
+ def replace_tables(doc)
62
140
  doc.css('table').each do |table|
63
- column_sizes = table.search('tr').collect do |row|
64
- row.search('th', 'td').collect do |node|
65
- node.inner_html.length
66
- end
67
- end
141
+ # remove whitespace between nodes
142
+ table.search('//text()[normalize-space()=""]').remove
68
143
 
69
- column_sizes = column_sizes.transpose.collect(&:max)
144
+ column_sizes = calculate_column_sizes(table)
70
145
 
71
146
  table.search('./thead/tr', './tbody/tr', './tr').each do |row|
72
147
  replace_table_nodes(row, column_sizes)
73
148
 
74
- row.inner_html = "#{ row.inner_html }|\n"
149
+ row.replace("#{ row.inner_html }#{ @table_column }\n")
75
150
  end
76
151
 
77
- table.search('./thead').each do |row|
78
- header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
152
+ add_table_header_underline(table, column_sizes)
153
+
154
+ table.replace("\n#{ table.inner_html }\n")
155
+ end
156
+ end
79
157
 
80
- row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
158
+ def calculate_column_sizes(table)
159
+ column_sizes = table.search('tr').collect do |row|
160
+ row.search('th', 'td').collect do |node|
161
+ node.text.length
81
162
  end
163
+ end
164
+
165
+ column_sizes.transpose.collect(&:max)
166
+ end
167
+
168
+ def add_table_header_underline(table, column_sizes)
169
+ table.search('./thead').each do |thead|
170
+ lines = column_sizes.collect { |len| @table_row * (len + 2) }
171
+ underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
82
172
 
83
- table.inner_html = "#{ table.inner_html }\n"
173
+ thead.replace("#{ thead.inner_html }#{ underline_row }\n")
84
174
  end
85
175
  end
86
176
 
87
177
  def replace_table_nodes(row, column_sizes)
88
178
  row.search('th', 'td').each_with_index do |node, i|
89
- new_content = "| #{ node.inner_html }".squeeze(' ')
179
+ new_content = node.text.ljust(column_sizes[i] + 1)
90
180
 
91
- # +2 for the extra spacing between text and pipe
92
- node.inner_html = new_content.ljust(column_sizes[i] + 2)
181
+ node.replace("#{ @table_column } #{ new_content }")
93
182
  end
94
183
  end
95
184
 
96
185
  def simple_replace(doc, tag, replacement)
97
186
  doc.search(tag).each do |node|
98
- node.replace(replacement)
187
+ node.replace(node.inner_html + replacement)
99
188
  end
100
189
  end
101
190
  end
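The table handling in the hunk above sizes each column to its widest cell, pads every cell with `ljust`, and underlines the header with the row marker joined by the corner marker. The sketch below reproduces that layout logic without Nokogiri, under the assumption that a table is simply an array of rows of strings whose first row is the header. `render_table` is a hypothetical name, not a method of the gem.

```ruby
# Hypothetical, Nokogiri-free sketch of the table layout. Rows are arrays of
# cell strings; the first row is treated as the header.
def render_table(rows, column: '|', row: '-', corner: '|')
  # Widest cell per column. Array#transpose raises IndexError if the rows
  # have different lengths, so this sketch assumes a rectangular table.
  sizes = rows.map { |cells| cells.map(&:length) }.transpose.map(&:max)

  # Each cell becomes "<column> <text padded to width + 1>"; a closing
  # column marker ends the row.
  lines = rows.map do |cells|
    cells.each_with_index.map { |text, i| "#{column} #{text.ljust(sizes[i] + 1)}" }.join + column
  end

  # Header underline: row marker repeated (width + 2) times per column,
  # separated and enclosed by the corner marker.
  underline = corner + sizes.map { |len| row * (len + 2) }.join(corner) + corner

  lines.insert(1, underline).join("\n")
end

puts render_table(
  [%w[Moon Portfolio], ['Phobos', 'Fear & Panic'], ['Deimos', 'Dread and Terror']],
  column: '#', row: '.', corner: '+'
)
```

The `+ 1` and `+ 2` offsets account for the single space on either side of the cell text, which is why the underline lines up with the column markers.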
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ghostwriter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.0
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Robin Miller
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2021-03-17 00:00:00.000000000 Z
11
+ date: 2021-03-23 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -94,9 +94,10 @@ dependencies:
94
94
  - - "~>"
95
95
  - !ruby/object:Gem::Version
96
96
  version: '1.10'
97
- description: 'Transforms HTML into plaintext while preserving legibility and functionality.
97
+ description: |
98
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
98
99
 
99
- '
100
+ Ideal for providing a plaintext multipart segment of email messages.
100
101
  email:
101
102
  - robin@tenjin.ca
102
103
  executables: []
@@ -142,5 +143,5 @@ requirements: []
142
143
  rubygems_version: 3.1.2
143
144
  signing_key:
144
145
  specification_version: 4
145
- summary: Intelligently extracts plaintext from an HTML document.
146
+ summary: Converts HTML to plain text
146
147
  test_files: []