ghostwriter 0.4.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: afedaad685b3c06c7baf6e7b1a7b00d8397f9ad1446ddfdcf835dfc99df19969
- data.tar.gz: 786a612861e5d11e19672057371720c57dd7dc67e1f9b7333bf9269b819b0548
+ metadata.gz: b87477fb4de8fc196eda6ea049eb3f1355ae583f0622c11b98543a67cfe71c8b
+ data.tar.gz: fa4e66701a3326ccceef69b3eb8f8591f8eea337f6f4f37c5b18a820bd120d42
  SHA512:
- metadata.gz: 610473e1214cd68edd7b1ce952fd43c44c22a29bc0895bebcc57b4e5e00385bccaec2008e23252eaf438a8b9a1ebd56af0a4c3956be53791ef1b5ee8b75dc1a3
- data.tar.gz: 3f3bee72a1515077e7ccff0b1baae94607e33c84c83a84f8b51fdf3234aeaecaa48ddf0d6cfc2a38584ce6c0d0d7f52db5b4021638eb611089087ee638ba2d16
+ metadata.gz: b5c545e83c7073b3c5923efc3ffcb6ecfd78650c472a482b5ec16fd7ef5533859ce6fb24d32378aafd20c83e35acd558c136856b46f0c7e0373beb7e3191be34
+ data.tar.gz: 8b259323e6653c112885ddf6fe4ee99b31c01e5625ee2fcf22d0259330dc090df11a1b7ee2dc1507c222b5f79c8322717aba9455b9bdbdca04334c0becc5959a
data/README.md CHANGED
@@ -1,14 +1,15 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites HTML as plain text while preserving as much legibility and functionality as possible.
3
+ A Ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
4
 
5
- It's sort of like a reverse-markdown.
5
+ It's sort of like a reverse-markdown or a very, very simple screen reader.
6
6
 
7
7
  ## But Why, Though?
8
8
 
9
- * Spam filters tend to like emails with a plain text alternative
10
9
  * Some email clients won't or can’t handle HTML at all
11
10
  * Some people explicitly choose plaintext, either by preference or for accessibility
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people, I will yell
12
+ at you)
12
13
 
13
14
  ## Installation
14
15
 
@@ -28,26 +29,339 @@ Or install it manually with:
28
29
 
29
30
  ## Usage
30
31
 
31
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
32
+ Create a `Ghostwriter::Writer` and call `#textify` with the HTML string you want converted:
32
33
 
33
34
  ```ruby
34
- html = '<html><body>This is some markup <a href="tenjin.ca">and a link</a><p>Other tags translate, too</p></body></html>'
35
+ html = <<~HTML
36
+ <html>
37
+ <body>
38
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
39
+ <p>It handles other stuff, too.</p>
40
+ <hr>
41
+ <h1>Stuff Like</h1>
42
+ <ul>
43
+ <li>Images</li>
44
+ <li>Lists</li>
45
+ <li>Tables</li>
46
+ <li>And more</li>
47
+ </ul>
48
+ </body>
49
+ </html>
50
+ HTML
51
+
52
+ ghostwriter = Ghostwriter::Writer.new
53
+
54
+ puts ghostwriter.textify(html)
55
+ ```
56
+
57
+ Produces:
58
+
59
+ ```
60
+ This is some text with a link (tenjin.ca)
61
+
62
+ It handles other stuff, too.
63
+
64
+
65
+ ----------
66
+
67
+ -- Stuff Like --
68
+ - Images
69
+ - Lists
70
+ - Tables
71
+ - And more
72
+ ```
73
+
74
+ ### Links
75
+
76
+ Links are converted to the link text followed by the link target in brackets:
77
+
78
+ ```html
79
+ Visit our <a href="https://example.com">Website</a>
80
+ ```
81
+
82
+ Becomes:
83
+
84
+ ```
85
+ Visit our Website (https://example.com)
86
+ ```
87
+
88
+ #### Relative Links
89
+
90
+ Since emails are wholly distinct from your web address, relative links might break.
91
+
92
+ To avoid this problem, either use the `<base>` header tag:
93
+
94
+ ```html
95
+
96
+ <html>
97
+ <head>
98
+ <base href="https://www.example.com">
99
+ </head>
100
+ <body>
101
+ Use the base tag to <a href="/contact">expand</a> links.
102
+ </body>
103
+ </html>
104
+ ```
105
+
106
+ Becomes:
107
+
108
+ ```
109
+ Use the base tag to expand (https://www.example.com/contact) links.
110
+ ```
111
+
112
+ Or you can use the `link_base` configuration:
113
+
114
+ ```ruby
115
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
116
+ ```
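A minimal sketch of how a relative link could be expanded against a base, assuming only Ruby's standard `uri` library. `resolve_link` is a hypothetical helper written for illustration, not part of Ghostwriter's public API; it prefixes the base only when the href has no scheme of its own.

```ruby
require 'uri'

# Hypothetical helper (not the gem's API): expands relative hrefs against a base.
def resolve_link(href, base)
  uri = URI(href)

  # Absolute URIs already carry a scheme, so they are left untouched.
  uri.absolute? ? uri.to_s : base + uri.to_s
end

puts resolve_link('/contact', 'https://www.example.com')
puts resolve_link('https://example.com/about', 'https://www.example.com')
```

The base is joined by plain string concatenation here, so a base with a trailing slash and an href with a leading slash would produce a doubled slash.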
117
+
118
+ ### Images
119
+
120
+ Images with alt text are converted:
121
+
122
+ ```html
123
+ <img src="logo.jpg" alt="ACME Anvils" />
124
+ ```
125
+
126
+ Becomes:
127
+
128
+ ```
129
+ ACME Anvils (logo.jpg)
130
+ ```
131
+
132
+ But images lacking alt text or with a presentation ARIA role are ignored:
133
+
134
+ ```html
135
+ <!-- these will just become an empty string -->
136
+ <img src="decoration.jpg">
137
+ <img src="logo.jpg" role="presentation">
138
+ ```
139
+
140
+ And images with data URIs won't include the data portion.
141
+
142
+ ```html
143
+
144
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
145
+ alt="Data picture" />
146
+ ```
147
+
148
+ Becomes:
149
+
150
+ ```
151
+ Data picture (embedded)
152
+ ```
153
+
154
+ ### Paragraphs and Linebreaks
155
+
156
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
157
+
158
+ ```html
159
+ <p>I would like to propose a toast.</p>
160
+ <p>This meal we enjoy together would be improved by one.</p>
161
+ <br />
162
+ <p>... Plug in the toaster and I'll get the bread.</p>
163
+ ```
164
+
165
+ ```
166
+ I would like to propose a toast.
167
+
168
+ This meal we enjoy together would be improved by one.
169
+
170
+
171
+ ... Plug in the toaster and I'll get the bread.
35
172
 
36
- Ghostwriter::Writer.new(html).textify
173
+ ```
174
+
175
+ ### Headings
176
+
177
+ Headings are wrapped with a marker per heading level:
178
+
179
+ ```html
180
+ <h1>Dog Maintenance and Repair</h1>
181
+ <h2>Food Input Port</h2>
182
+ <h3>Exhaust Port Considerations</h3>
183
+ ```
184
+
185
+ Becomes:
186
+
187
+ ```
188
+ -- Dog Maintenance and Repair --
189
+ ---- Food Input Port ----
190
+ ------ Exhaust Port Considerations ------
191
+ ```
192
+
193
+ The `<header>` tag is treated like an `<h1>` tag.
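The per-level markers can be pictured as the marker string repeated once per heading level. The following is an illustrative sketch only: `mark_heading` is a hypothetical name and the default marker `'--'` is taken from the output shown above.

```ruby
# Hypothetical helper (not the gem's API): wraps text in a marker
# repeated once per heading level.
def mark_heading(text, level, marker: '--')
  markers = marker * level

  "#{markers} #{text} #{markers}"
end

puts mark_heading('Dog Maintenance and Repair', 1)
puts mark_heading('Food Input Port', 2)
puts mark_heading('Exhaust Port Considerations', 3)
```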
194
+
195
+ ### Lists
196
+
197
+ Lists are converted, too. They are padded with newlines and are given simple markers:
198
+
199
+ ```html
200
+
201
+ <ul>
202
+ <li>Planes</li>
203
+ <li>Trains</li>
204
+ <li>Automobiles</li>
205
+ </ul>
206
+ <ol>
207
+ <li>I get knocked down</li>
208
+ <li>I get up again</li>
209
+ <li>Never gonna keep me down</li>
210
+ </ol>
211
+ ```
37
212
 
38
- => "This is some markup and a link (tenjin.ca)\nOther tags translate, too\n\n"
213
+ Becomes:
39
214
 
40
215
  ```
216
+ - Planes
217
+ - Trains
218
+ - Automobiles
41
219
 
42
- `#textify` will use the `<base>` tag if found in the HTML source, or if one is provided explicitly:
220
+ 1. I get knocked down
221
+ 2. I get up again
222
+ 3. Never gonna keep me down
223
+ ```
224
+
225
+ ### Tables
226
+
227
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
228
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
229
+
230
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
231
+
232
+ ```html
233
+
234
+ <table>
235
+ <thead>
236
+ <tr>
237
+ <th>Ship</th>
238
+ <th>Captain</th>
239
+ </tr>
240
+ </thead>
241
+ <tbody>
242
+ <tr>
243
+ <td>Enterprise</td>
244
+ <td>Jean-Luc Picard</td>
245
+ </tr>
246
+ <tr>
247
+ <td>TARDIS</td>
248
+ <td>The Doctor</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Planet Express Ship</td>
252
+ <td>Turanga Leela</td>
253
+ </tr>
254
+ </tbody>
255
+ </table>
256
+ ```
257
+
258
+ Becomes:
259
+
260
+ ```
261
+ | Ship | Captain |
262
+ |---------------------|-----------------|
263
+ | Enterprise | Jean-Luc Picard |
264
+ | TARDIS | The Doctor |
265
+ | Planet Express Ship | Turanga Leela |
266
+ ```
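One way to get aligned columns like these is to find the widest cell in each column and pad every cell to that width. The sketch below works on plain arrays of strings rather than HTML, and `format_row` is a hypothetical helper, so treat it as an illustration of the idea rather than the gem's implementation.

```ruby
rows = [
  ['Ship', 'Captain'],
  ['Enterprise', 'Jean-Luc Picard'],
  ['TARDIS', 'The Doctor']
]

# Widest cell per column: measure every cell, then flip rows into columns.
widths = rows.map { |row| row.map(&:length) }.transpose.map(&:max)

# Hypothetical helper: pads each cell to its column width.
def format_row(cells, widths)
  padded = cells.each_with_index.map { |cell, i| cell.ljust(widths[i]) }

  "| #{padded.join(' | ')} |"
end

# The underline is two characters wider per column to cover the cell padding.
underline = "|#{widths.map { |w| '-' * (w + 2) }.join('|')}|"

lines = [format_row(rows.first, widths), underline] +
        rows.drop(1).map { |row| format_row(row, widths) }

puts lines
```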
267
+
268
+ ### Customizing Output
269
+
270
+ Ghostwriter has some constructor options to customize output.
271
+
272
+ You can set heading markers.
43
273
 
44
274
  ```ruby
45
- html = '<html><body>Relative links <a href="/contact">Link</a></body></html>'
275
+ html = <<~HTML
276
+ <h1>Emergency Cat Procedures</h1>
277
+ HTML
46
278
 
47
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
279
+ writer = Ghostwriter::Writer.new(heading_marker: '#')
280
+
281
+ puts writer.textify(html)
282
+ ```
48
283
 
49
- => "Relative links Link (tenjin.ca/contact)"
284
+ Produces:
50
285
 
286
+ ```
287
+ # Emergency Cat Procedures #
288
+ ```
289
+
290
+ You can also set list item markers. Ordered markers can be anything that responds to `#next` (e.g. a `String` such as `'a'` or an `Integer`):
291
+
292
+ ```ruby
293
+ html = <<~HTML
294
+ <ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
295
+ <ul><li>Teapot</li><li>Kettle</li></ul>
296
+ HTML
297
+
298
+ writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
299
+
300
+ puts writer.textify(html)
301
+ ```
302
+
303
+ Produces:
304
+
305
+ ```
306
+ a. Mercury
307
+ b. Venus
308
+ c. Mars
309
+
310
+ * Teapot
311
+ * Kettle
312
+ ```
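The ordered markers rely on `#next` returning the successor of the current marker, which Ruby's `String` and `Integer` both provide. A small sketch of that idea, using a hypothetical `number_items` helper that is not part of the gem:

```ruby
# Hypothetical helper (not the gem's API): prefixes each item with a marker
# and advances the marker with #next after every item.
def number_items(items, marker)
  items.map do |item|
    line   = "#{marker}. #{item}"
    marker = marker.next

    line
  end
end

puts number_items(%w[Mercury Venus Mars], 'a')
puts number_items(%w[Mercury Venus Mars], '1')
puts number_items(%w[Mercury Venus Mars], 9)
```

`String#next` carries over like an odometer, so a string marker of `'9'` is followed by `'10'` and `'z'` by `'aa'`.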
313
+
314
+ And tables can be customized:
315
+
316
+ ```ruby
317
+ writer = Ghostwriter::Writer.new(table_row: '.',
318
+ table_column: '#',
319
+ table_corner: '+')
320
+
321
+ puts writer.textify <<~HTML
322
+ <table>
323
+ <thead>
324
+ <tr><th>Moon</th><th>Portfolio</th></tr>
325
+ </thead>
326
+ <tbody>
327
+ <tr><td>Phobos</td><td>Fear & Panic</td></tr>
328
+ <tr><td>Deimos</td><td>Dread and Terror</td></tr>
329
+ </tbody>
330
+ </table>
331
+ HTML
332
+ ```
333
+
334
+ Produces:
335
+
336
+ ```
337
+ # Moon # Portfolio #
338
+ +........+..................+
339
+ # Phobos # Fear & Panic #
340
+ # Deimos # Dread and Terror #
341
+
342
+ ```
343
+
344
+ #### Presentation ARIA Role
345
+
346
+ Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
347
+
348
+ ```html
349
+
350
+ <table role="presentation">
351
+ <tr>
352
+ <td>The table is a lie</td>
353
+ </tr>
354
+ </table>
355
+ <ul role="presentation">
356
+ <li>No such list</li>
357
+ </ul>
358
+ ```
359
+
360
+ Becomes:
361
+
362
+ ```
363
+ The table is a lie
364
+ No such list
51
365
  ```
52
366
 
53
367
  ### Mail Gem Example
@@ -58,9 +372,10 @@ through Ghostwriter:
58
372
  ```ruby
59
373
  require 'mail'
60
374
 
61
- html = 'My email and a <a href="http://tenjin.ca">link</a>'
375
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
376
+ ghostwriter = Ghostwriter::Writer.new
62
377
 
63
- mail = Mail.deliver do
378
+ Mail.deliver do
64
379
  to 'bob@example.com'
65
380
  from 'dot@example.com'
66
381
  subject 'Using Ghostwriter with Mail'
@@ -71,7 +386,7 @@ mail = Mail.deliver do
71
386
  end
72
387
 
73
388
  text_part do
74
- body Ghostwriter::Writer.new(html).textify
389
+ body ghostwriter.textify(html)
75
390
  end
76
391
  end
77
392
 
@@ -90,19 +405,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
90
405
  can also run `bin/console` for an interactive prompt that will allow you to experiment.
91
406
 
92
407
  #### Local Install
93
- To install this gem onto your local machine only, run
94
408
 
95
- `bundle exec rake install`
409
+ To install this gem onto your local machine only, run
410
+
411
+ `bundle exec rake install`
96
412
 
97
413
  #### Gem Release
414
+
98
415
  To release a gem to the world at large
99
416
 
100
- 1. Update the version number in `version.rb`,
101
- 2. Run `bundle exec rake release`,
102
- which will create a git tag for the version,
103
- push git commits and tags,
104
- and push the `.gem` file to [rubygems.org](https://rubygems.org).
105
- 3. Do a wee dance
417
+ 1. Update the version number in `version.rb`,
418
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
419
+ the `.gem` file to [rubygems.org](https://rubygems.org).
420
+ 3. Do a wee dance
106
421
 
107
422
  ## License
108
423
 
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,82 @@
1
1
  # Release Notes
2
2
 
3
+ ## 1.1.0 (2021-03-23)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Added customization for headings
12
+ * Headings now repeat the marker once per heading level, so deeper headings get longer markers
13
+ * Added customization for list markers
14
+ * Added customization for table markers
15
+ * Writer is now immutable
16
+
17
+ ### Bugfixes
18
+
19
+ * none
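For readers unfamiliar with the pattern, an immutable object in Ruby is commonly built by calling `freeze` at the end of the constructor. The sketch below uses a hypothetical `FrozenSettings` class, not the gem's `Writer`, and assumes Ruby 2.5 or newer for `FrozenError`.

```ruby
# Hypothetical class for illustration; it only demonstrates the freeze pattern.
class FrozenSettings
  attr_reader :marker

  def initialize(marker: '--')
    @marker = marker

    # After this call, any attempt to reassign an instance variable raises.
    freeze
  end

  # A setter that can never succeed once the object is frozen.
  def marker=(value)
    @marker = value
  end
end

settings = FrozenSettings.new

begin
  settings.marker = '#'
rescue FrozenError => e
  puts "rejected: #{e.class}"
end
```

Note that `freeze` is shallow: it prevents reassigning instance variables but does not freeze the objects they refer to.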
20
+
21
+ ## 1.0.1 (2021-03-22)
22
+
23
+ ### Major
24
+
25
+ * none
26
+
27
+ ### Minor
28
+
29
+ * Updated README
30
+
31
+ ### Bugfixes
32
+
33
+ * Fixed hr padding behaviour
34
+
35
+ ## 1.0.0 (2021-03-21)
36
+
37
+ ### Major
38
+
39
+ * Moved `link_base` parameter to constructor
40
+ * Moved input HTML parameter to `#textify`
41
+
42
+ ### Minor
43
+
44
+ * Treats tables and lists with role="presentation" as simple containers
45
+ * Now handles ordered and unordered lists
46
+ * Images are now replaced with their alt text
47
+
48
+ ### Bugfixes
49
+
50
+ * none
51
+
52
+ ## 0.4.2 (2021-03-17)
53
+
54
+ ### Major
55
+
56
+ * none
57
+
58
+ ### Minor
59
+
60
+ * none
61
+
62
+ ### Bugfixes
63
+
64
+ * Works with links using `tel:` and `mailto:` schemes.
65
+
66
+ ## 0.4.1 (2021-03-17)
67
+
68
+ ### Major
69
+
70
+ * none
71
+
72
+ ### Minor
73
+
74
+ * No longer provides link target in brackets after link text when they are the same
75
+
76
+ ### Bugfixes
77
+
78
+ * Added explicit testing for HTML entity interpretation
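The 0.4.1 change can be illustrated by comparing link text and target after dropping the `http(s)://` prefix and any trailing slash. This is a sketch under that assumption; `same_target?` and `render_link` are hypothetical names, and the function works on plain strings rather than parsed HTML.

```ruby
# Hypothetical helpers for illustration, operating on plain strings.
def same_target?(href, text)
  # Ignore the scheme and a trailing slash when comparing.
  normalize = ->(value) { value.to_s.sub(%r{\Ahttps?://}, '').chomp('/') }

  normalize.call(href) == normalize.call(text)
end

def render_link(text, href)
  # When text and target match, repeating the target in brackets adds nothing.
  same_target?(href, text) ? href : "#{text} (#{href})"
end

puts render_link('Website', 'https://example.com')
puts render_link('example.com', 'https://example.com/')
```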
79
+
3
80
  ## 0.4.0 (2021-03-16)
4
81
 
5
82
  ### Major
data/dirt-textify.gemspec CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
  spec.authors = ['Robin Miller']
  spec.email = ['robin@tenjin.ca']
 
- spec.summary = 'Intelligently extracts plaintext from an HTML document.'
+ spec.summary = 'Converts HTML to plain text'
  spec.description = <<~DESC
- Transforms HTML into plaintext while preserving legibility and functionality.
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
+
+ Ideal for providing a plaintext multipart segment of email messages.
  DESC
  spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
  spec.license = 'MIT'
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Ghostwriter
- VERSION = '0.4.0'
+ VERSION = '1.1.0'
  end
@@ -3,99 +3,188 @@
3
3
  module Ghostwriter
4
4
  # Main Ghostwriter converter object.
5
5
  class Writer
6
- def initialize(html)
7
- @source_html = html
6
+ attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
7
+
8
+ # Creates a new ghostwriter
9
+ #
10
+ # @param [String] link_base the url to prefix relative links with
11
+ def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
12
+ table_column: '|', table_row: '-', table_corner: '|')
13
+ @link_base = link_base
14
+ @heading_marker = heading_marker
15
+ @ul_marker = ul_marker
16
+ @ol_marker = ol_marker
17
+ @table_column = table_column
18
+ @table_row = table_row
19
+ @table_corner = table_corner
20
+
21
+ freeze
8
22
  end
9
23
 
10
24
  # Strips HTML down to plain text.
11
25
  #
12
- # @param link_base the url to prefix relative links with
13
- def textify(link_base: '')
14
- html = normalize_whitespace(@source_html).gsub('</p>', "</p>\n\n")
26
+ # @param html [String] the HTML to convert to text
27
+ #
28
+ # @return converted text
29
+ def textify(html)
30
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
15
31
 
16
- doc = Nokogiri::HTML(html)
32
+ doc.search('style, script').remove
17
33
 
18
- doc.search('style').remove
19
- doc.search('script').remove
34
+ replace_anchors(doc)
35
+ replace_images(doc)
36
+
37
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
38
 
21
- replace_anchors(doc, link_base)
22
39
  replace_headers(doc)
23
- replace_table(doc)
40
+ replace_lists(doc)
41
+ replace_tables(doc)
24
42
 
25
- simple_replace(doc, 'hr', "\n----------\n")
43
+ simple_replace(doc, 'hr', "\n----------\n\n")
26
44
  simple_replace(doc, 'br', "\n")
45
+ simple_replace(doc, 'p', "\n\n")
27
46
 
28
- # doc.search('p').each do |link_node|
29
- # link_node.inner_html = link_node.inner_html + "\n\n"
30
- # end
31
-
32
- # trim, but only single-space character
33
- doc.text.gsub(/^ +| +$/, '')
47
+ normalize_lines(doc)
34
48
  end
35
49
 
36
50
  private
37
51
 
38
- def normalize_whitespace(html)
39
- html.gsub(/\s/, ' ').squeeze(' ')
52
+ def normalize_lines(doc)
53
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
40
54
  end
41
55
 
42
- def replace_anchors(doc, link_base)
56
+ def replace_anchors(doc)
57
+ doc.search('a').each do |link_node|
58
+ href = get_link_target(link_node, get_link_base(doc))
59
+
60
+ link_node.inner_html = if link_matches(href, link_node.inner_html)
61
+ href.to_s
62
+ else
63
+ "#{ link_node.inner_html } (#{ href })"
64
+ end
65
+ end
66
+ end
67
+
68
+ def link_matches(first, second)
69
+ first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
70
+ end
71
+
72
+ def get_link_base(doc)
43
73
  # <base> node is unique by W3C spec
44
- base = doc.search('base').first
45
- base_url = base ? base['href'] : link_base
74
+ base_node = doc.search('base').first
46
75
 
47
- doc.search('a').each do |link_node|
48
- href = URI(link_node['href'])
49
- href = base_url + href.to_s unless href.absolute?
76
+ base_node ? base_node['href'] : @link_base
77
+ end
50
78
 
51
- link_node.inner_html = "#{ link_node.inner_html } (#{ href })"
79
+ def get_link_target(link_node, base)
80
+ href = URI(link_node['href'])
81
+ if href.absolute?
82
+ href
83
+ else
84
+ base + href.to_s
52
85
  end
86
+ rescue URI::InvalidURIError
87
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
53
88
  end
54
89
 
55
90
  def replace_headers(doc)
56
- doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
57
- node.inner_html = "- #{ node.inner_html } -\n".squeeze(' ')
91
+ doc.search('header, h1').each do |node|
92
+ node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
93
+ .squeeze(' '))
94
+ end
95
+
96
+ (2..6).each do |n|
97
+ doc.search("h#{ n }").each do |node|
98
+ node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
99
+ .squeeze(' '))
100
+ end
101
+ end
102
+ end
103
+
104
+ def replace_images(doc)
105
+ doc.search('img[role=presentation]').remove
106
+
107
+ doc.search('img').each do |img_node|
108
+ src = img_node['src']
109
+ alt = img_node['alt']
110
+
111
+ src = 'embedded' if src.start_with? 'data:'
112
+
113
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
58
114
  end
59
115
  end
60
116
 
61
- def replace_table(doc)
117
+ def replace_lists(doc)
118
+ doc.search('ol').each do |list_node|
119
+ replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
120
+ end
121
+
122
+ doc.search('ul').each do |list_node|
123
+ replace_list_items(list_node, @ul_marker)
124
+ end
125
+
126
+ doc.search('ul, ol').each do |list_node|
127
+ list_node.replace("#{ list_node.inner_html }\n")
128
+ end
129
+ end
130
+
131
+ def replace_list_items(list_node, marker, after_marker: '', increment: false)
132
+ list_node.search('./li').each do |list_item|
133
+ list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
134
+
135
+ marker = marker.next if increment
136
+ end
137
+ end
138
+
139
+ def replace_tables(doc)
62
140
  doc.css('table').each do |table|
63
- column_sizes = table.search('tr').collect do |row|
64
- row.search('th', 'td').collect do |node|
65
- node.inner_html.length
66
- end
67
- end
141
+ # remove whitespace between nodes
142
+ table.search('//text()[normalize-space()=""]').remove
68
143
 
69
- column_sizes = column_sizes.transpose.collect(&:max)
144
+ column_sizes = calculate_column_sizes(table)
70
145
 
71
146
  table.search('./thead/tr', './tbody/tr', './tr').each do |row|
72
147
  replace_table_nodes(row, column_sizes)
73
148
 
74
- row.inner_html = "#{ row.inner_html }|\n"
149
+ row.replace("#{ row.inner_html }#{ @table_column }\n")
75
150
  end
76
151
 
77
- table.search('./thead').each do |row|
78
- header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
152
+ add_table_header_underline(table, column_sizes)
153
+
154
+ table.replace("\n#{ table.inner_html }\n")
155
+ end
156
+ end
79
157
 
80
- row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
158
+ def calculate_column_sizes(table)
159
+ column_sizes = table.search('tr').collect do |row|
160
+ row.search('th', 'td').collect do |node|
161
+ node.text.length
81
162
  end
163
+ end
164
+
165
+ column_sizes.transpose.collect(&:max)
166
+ end
167
+
168
+ def add_table_header_underline(table, column_sizes)
169
+ table.search('./thead').each do |thead|
170
+ lines = column_sizes.collect { |len| @table_row * (len + 2) }
171
+ underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
82
172
 
83
- table.inner_html = "#{ table.inner_html }\n"
173
+ thead.replace("#{ thead.inner_html }#{ underline_row }\n")
84
174
  end
85
175
  end
86
176
 
87
177
  def replace_table_nodes(row, column_sizes)
88
178
  row.search('th', 'td').each_with_index do |node, i|
89
- new_content = "| #{ node.inner_html }".squeeze(' ')
179
+ new_content = node.text.ljust(column_sizes[i] + 1)
90
180
 
91
- # +2 for the extra spacing between text and pipe
92
- node.inner_html = new_content.ljust(column_sizes[i] + 2)
181
+ node.replace("#{ @table_column } #{ new_content }")
93
182
  end
94
183
  end
95
184
 
96
185
  def simple_replace(doc, tag, replacement)
97
186
  doc.search(tag).each do |node|
98
- node.replace(replacement)
187
+ node.replace(node.inner_html + replacement)
99
188
  end
100
189
  end
101
190
  end
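The final line clean-up step in the hunk above can be tried on a plain string. This sketch copies the same chain of `String` and `Array` calls into a standalone method that takes text instead of a Nokogiri document; the sample input is made up for illustration.

```ruby
# Standalone variant for illustration: takes a string, not a parsed document.
def normalize_lines(text)
  # Trim the whole text, trim each line, then end with exactly one newline.
  text.strip.split("\n").map(&:strip).join("\n").concat("\n")
end

raw = "  First paragraph \n\n   Second paragraph\t\n\n\n"

puts normalize_lines(raw).inspect
```

Blank lines between paragraphs survive because only the leading and trailing whitespace of each line is removed.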
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: ghostwriter
  version: !ruby/object:Gem::Version
- version: 0.4.0
+ version: 1.1.0
  platform: ruby
  authors:
  - Robin Miller
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2021-03-17 00:00:00.000000000 Z
+ date: 2021-03-23 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri
@@ -94,9 +94,10 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.10'
- description: 'Transforms HTML into plaintext while preserving legibility and functionality.
+ description: |
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
 
- '
+ Ideal for providing a plaintext multipart segment of email messages.
  email:
  - robin@tenjin.ca
  executables: []
@@ -142,5 +143,5 @@ requirements: []
  rubygems_version: 3.1.2
  signing_key:
  specification_version: 4
- summary: Intelligently extracts plaintext from an HTML document.
+ summary: Converts HTML to plain text
  test_files: []