ghostwriter 0.4.2 → 1.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d48bada0259aa38eb1cf98bfbf101970dce0f979ffb7293a8c23b5d5ca393d7d
4
- data.tar.gz: d1f4ac853988b75497b4c1ff7c4a16ba8cddd47be93e0112001fd5960a61a565
3
+ metadata.gz: ccebf53b35ad212e0c7e353a3cb4d79e1f91c63cec6490c5fff7ff56b72eed70
4
+ data.tar.gz: 5b3d36839dcef24d80605421970eb39e17efe3ec24d3126d753ac7a5039512a5
5
5
  SHA512:
6
- metadata.gz: 5d85b5f1e5ef90c91fd03288012805c195c0a9c6ea8a3d7efb10b03b55443e060252f3f1e82f18f2854234174aa9daa80cc35a7b15f774798af5593a84b55881
7
- data.tar.gz: afaf20dbb5876e667fc33d3ff35eff58d1cb076e5bec941863a4b54689ac99bbdf2716cd52145bc7e5cd99f64c0d5954023cf160e01240c24c6a16036fca2eee
6
+ metadata.gz: d6039d9d2b3c1d6606da533c108c8087faa8b985ac0bf6adac7bc1ad4310cc601f76840ff32e83255847bedbd77dbc6cc280054f6eb4975dbe815a98b2a07373
7
+ data.tar.gz: d005669ca03f3ff465c351df0e47eba0372ca6061057102e14b0d879ddb0c5191a88bfaa389fa82a865787e6ff3b5d3fb351ae8b6351f108aef404a3bd66dd7a
data/README.md CHANGED
@@ -1,14 +1,19 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites HTML as plain text while preserving as much legibility and functionality as possible.
3
+ A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
4
 
5
- It's sort of like a reverse-markdown.
5
+ It's sort of like a reverse-markdown or a *very* simple screen reader.
6
6
 
7
7
  ## But Why, Though?
8
8
 
9
- * Spam filters tend to like emails with a plain text alternative
10
- * Some email clients won't or can’t handle HTML at all
11
- * Some people explicitly choose plaintext just by preference or accessibility
9
+ * Some email clients won't or can’t offer HTML support.
10
+ * Some people explicitly choose plaintext for accessibility or just plain preference.
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people,
12
+ not only might you be
13
+ [breaking](https://fightspam.gc.ca)
14
+ [various](https://gdpr.eu/)
15
+ [laws](https://www.ftc.gov/tips-advice/business-center/guidance/can-spam-act-compliance-guide-business),
16
+ I will also personally curse you)
12
17
 
13
18
  ## Installation
14
19
 
@@ -28,111 +33,234 @@ Or install it manually with:
28
33
 
29
34
  ## Usage
30
35
 
31
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
36
+ Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
32
37
 
33
38
  ```ruby
34
- html = '<html><body><p>This is some markup <a href="tenjin.ca">and a link</a></p><p>Other tags translate, too</p></body></html>'
39
+ html = <<~HTML
40
+ <html>
41
+ <body>
42
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
43
+ <p>It handles other stuff, too.</p>
44
+ <hr>
45
+ <h1>Stuff Like</h1>
46
+ <ul>
47
+ <li>Images</li>
48
+ <li>Lists</li>
49
+ <li>Tables</li>
50
+ <li>And more</li>
51
+ </ul>
52
+ </body>
53
+ </html>
54
+ HTML
55
+
56
+ ghostwriter = Ghostwriter::Writer.new
35
57
 
36
- Ghostwriter::Writer.new(html).textify
58
+ puts ghostwriter.textify(html)
37
59
  ```
60
+
38
61
  Produces:
62
+
39
63
  ```
40
- This is some markup and a link (tenjin.ca)
64
+ This is some text with a link (tenjin.ca)
41
65
 
42
- Other tags translate, too
66
+ It handles other stuff, too.
67
+
68
+
69
+ ----------
70
+
71
+ -- Stuff Like --
72
+ - Images
73
+ - Lists
74
+ - Tables
75
+ - And more
43
76
  ```
44
77
 
45
78
  ### Links
46
79
 
47
80
  Links are converted to the link text followed by the link target in brackets:
48
81
 
49
- ```ruby
50
- html = '<html><body>Visit our <a href="https://example.com">Website</a><body></html>'
51
- Ghostwriter::Writer.new(html).textify
82
+ ```html
83
+ Visit our <a href="https://example.com">Website</a>
52
84
  ```
53
85
 
54
- Produces:
86
+ Becomes:
87
+
55
88
  ```
56
89
  Visit our Website (https://example.com)
57
90
  ```
58
91
 
59
92
  #### Relative Links
93
+
60
94
  Since emails are wholly distinct from your web address, relative links might break.
61
95
 
62
96
  To avoid this problem, either use the `<base>` header tag:
63
97
 
64
- ```ruby
65
- html = <<~HTML
66
- <html>
67
- <head>
68
- <base href="https://www.example.com/">
69
- </head>
70
- <body>
71
- Relative links get <a href="/contact">expanded</a> using the head's base tag.
72
- </body>
73
- </html>
74
- HTML
98
+ ```html
75
99
 
76
- Ghostwriter::Writer.new(html).textify
100
+ <html>
101
+ <head>
102
+ <base href="https://www.example.com">
103
+ </head>
104
+ <body>
105
+ Use the base tag to <a href="/contact">expand</a> links.
106
+ </body>
107
+ </html>
77
108
  ```
78
- Produces:
109
+
110
+ Becomes:
111
+
79
112
  ```
80
- Relative links get expanded (https://www.example.com//contact) using the head's base tag.
113
+ Use the base tag to expand (https://www.example.com/contact) links.
81
114
  ```
82
115
 
83
- Or you can use the `link_base` parameter:
116
+ Or you can use the `link_base` configuration:
117
+
84
118
  ```ruby
85
- html = '<html><body>Relative links get <a href="/contact">expanded</a></body></html> using the link_base parmeter, too.'
119
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
120
+ ```
121
+
122
+ ### Images
86
123
 
87
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
124
+ Images with alt text are converted:
125
+
126
+ ```html
127
+ <img src="logo.jpg" alt="ACME Anvils" />
88
128
  ```
89
129
 
90
- Produces:
130
+ Becomes:
131
+
91
132
  ```
92
- "Relative links get expanded (tenjin.ca/contact) using the link_base parmeter, too."
133
+ ACME Anvils (logo.jpg)
93
134
  ```
94
135
 
95
- ### Tables
96
- Tables are often used email structuring because support for more modern CSS is inconsistent.
136
+ But images lacking alt text or with a presentation ARIA role are ignored:
97
137
 
98
- Ghostwriter tries to maintain table structure, but this will quickly devolve for complex structures.
138
+ ```html
139
+ <!-- these will just become an empty string -->
140
+ <img src="decoration.jpg">
141
+ <img src="logo.jpg" role="presentation">
142
+ ```
99
143
 
100
- ```ruby
101
- html = <<~HTML
102
- <html>
103
- <head>
104
- <base href="https://www.example.com/">
105
- </head>
106
- <body>
107
- <table>
108
- <thead>
109
- <tr>
110
- <th>Ship</th>
111
- <th>Captain</th>
112
- </tr>
113
- </thead>
114
- <tbody>
115
- <tr>
116
- <td>Enterprise</td>
117
- <td>Jean-Luc Picard</td>
118
- </tr>
119
- <tr>
120
- <td>TARDIS</td>
121
- <td>The Doctor</td>
122
- </tr>
123
- <tr>
124
- <td>Planet Express Ship</td>
125
- <td>Turanga Leela</td>
126
- </tr>
127
- </tbody>
128
- </table>
129
- </body>
130
- </html>
131
- HTML
144
+ And images with data URIs won't include the data portion.
145
+
146
+ ```html
132
147
 
133
- Ghostwriter::Writer.new(html).textify
148
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
149
+ alt="Data picture" />
134
150
  ```
135
- Produces:
151
+
152
+ Becomes:
153
+
154
+ ```
155
+ Data picture (embedded)
156
+ ```
157
+
158
+ ### Paragraphs and Linebreaks
159
+
160
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
161
+
162
+ ```html
163
+ <p>I would like to propose a toast.</p>
164
+ <p>This meal we enjoy together would be improved by one.</p>
165
+ <br />
166
+ <p>... Plug in the toaster and I'll get the bread.</p>
167
+ ```
168
+
169
+ ```
170
+ I would like to propose a toast.
171
+
172
+ This meal we enjoy together would be improved by one.
173
+
174
+
175
+ ... Plug in the toaster and I'll get the bread.
176
+
177
+ ```
178
+
179
+ ### Headings
180
+
181
+ Headings are wrapped with a marker per heading level:
182
+
183
+ ```html
184
+ <h1>Dog Maintenance and Repair</h1>
185
+ <h2>Food Input Port</h2>
186
+ <h3>Exhaust Port Considerations</h3>
187
+ ```
188
+
189
+ Becomes:
190
+
191
+ ```
192
+ -- Dog Maintenance and Repair --
193
+ ---- Food Input Port ----
194
+ ------ Exhaust Port Considerations ------
195
+ ```
196
+
197
+ The `<header>` tag is treated like an `<h1>` tag.
198
+
199
+ ### Lists
200
+
201
+ Lists are converted, too. They are padded with newlines and are given simple markers:
202
+
203
+ ```html
204
+
205
+ <ul>
206
+ <li>Planes</li>
207
+ <li>Trains</li>
208
+ <li>Automobiles</li>
209
+ </ul>
210
+ <ol>
211
+ <li>I get knocked down</li>
212
+ <li>I get up again</li>
213
+ <li>Never gonna keep me down</li>
214
+ </ol>
215
+ ```
216
+
217
+ Becomes:
218
+
219
+ ```
220
+ - Planes
221
+ - Trains
222
+ - Automobiles
223
+
224
+ 1. I get knocked down
225
+ 2. I get up again
226
+ 3. Never gonna keep me down
227
+ ```
228
+
229
+ ### Tables
230
+
231
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
232
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
233
+
234
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
235
+
236
+ ```html
237
+
238
+ <table>
239
+ <thead>
240
+ <tr>
241
+ <th>Ship</th>
242
+ <th>Captain</th>
243
+ </tr>
244
+ </thead>
245
+ <tbody>
246
+ <tr>
247
+ <td>Enterprise</td>
248
+ <td>Jean-Luc Picard</td>
249
+ </tr>
250
+ <tr>
251
+ <td>TARDIS</td>
252
+ <td>The Doctor</td>
253
+ </tr>
254
+ <tr>
255
+ <td>Planet Express Ship</td>
256
+ <td>Turanga Leela</td>
257
+ </tr>
258
+ </tbody>
259
+ </table>
260
+ ```
261
+
262
+ Becomes:
263
+
136
264
  ```
137
265
  | Ship | Captain |
138
266
  |---------------------|-----------------|
@@ -141,6 +269,105 @@ Produces:
141
269
  | Planet Express Ship | Turanga Leela |
142
270
  ```
143
271
 
272
+ ### Customizing Output
273
+
274
+ Ghostwriter has some constructor options to customize output.
275
+
276
+ You can set heading markers.
277
+
278
+ ```ruby
279
+ html = <<~HTML
280
+ <h1>Emergency Cat Procedures</h1>
281
+ HTML
282
+
283
+ writer = Ghostwriter::Writer.new(heading_marker: '#')
284
+
285
+ puts writer.textify(html)
286
+ ```
287
+
288
+ Produces:
289
+
290
+ ```
291
+ # Emergency Cat Procedures #
292
+ ```
293
+
294
+ You can also set list item markers. Ordered markers can be anything that responds to `#next` (e.g. any `Enumerator`):
295
+
296
+ ```ruby
297
+ html = <<~HTML
298
+ <ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
299
+ <ul><li>Teapot</li><li>Kettle</li></ul>
300
+ HTML
301
+
302
+ writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
303
+
304
+ puts writer.textify(html)
305
+ ```
306
+
307
+ Produces:
308
+
309
+ ```
310
+ a. Mercury
311
+ b. Venus
312
+ c. Mars
313
+
314
+ * Teapot
315
+ * Kettle
316
+ ```
317
+
318
+ And tables can be customized:
319
+
320
+ ```ruby
321
+ writer = Ghostwriter::Writer.new(table_row: '.',
322
+ table_column: '#',
323
+ table_corner: '+')
324
+
325
+ puts writer.textify <<~HTML
326
+ <table>
327
+ <thead>
328
+ <tr><th>Moon</th><th>Portfolio</th></tr>
329
+ </thead>
330
+ <tbody>
331
+ <tr><td>Phobos</td><td>Fear & Panic</td></tr>
332
+ <tr><td>Deimos</td><td>Dread and Terror</td></tr>
333
+ </tbody>
334
+ </table>
335
+ HTML
336
+ ```
337
+
338
+ Produces:
339
+
340
+ ```
341
+ # Moon # Portfolio #
342
+ +........+..................+
343
+ # Phobos # Fear & Panic #
344
+ # Deimos # Dread and Terror #
345
+
346
+ ```
347
+
348
+ #### Presentation ARIA Role
349
+
350
+ Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
351
+
352
+ ```html
353
+
354
+ <table role="presentation">
355
+ <tr>
356
+ <td>The table is a lie</td>
357
+ </tr>
358
+ </table>
359
+ <ul role="presentation">
360
+ <li>No such list</li>
361
+ </ul>
362
+ ```
363
+
364
+ Becomes:
365
+
366
+ ```
367
+ The table is a lie
368
+ No such list
369
+ ```
370
+
144
371
  ### Mail Gem Example
145
372
 
146
373
  To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html
@@ -149,7 +376,8 @@ through Ghostwriter:
149
376
  ```ruby
150
377
  require 'mail'
151
378
 
152
- html = 'My email and a <a href="https://tenjin.ca">link</a>'
379
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
380
+ ghostwriter = Ghostwriter::Writer.new
153
381
 
154
382
  Mail.deliver do
155
383
  to 'bob@example.com'
@@ -162,7 +390,7 @@ Mail.deliver do
162
390
  end
163
391
 
164
392
  text_part do
165
- body Ghostwriter::Writer.new(html).textify
393
+ body ghostwriter.textify(html)
166
394
  end
167
395
  end
168
396
 
@@ -181,19 +409,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
181
409
  can also run `bin/console` for an interactive prompt that will allow you to experiment.
182
410
 
183
411
  #### Local Install
184
- To install this gem onto your local machine only, run
185
412
 
186
- `bundle exec rake install`
413
+ To install this gem onto your local machine only, run
414
+
415
+ `bundle exec rake install`
187
416
 
188
417
  #### Gem Release
418
+
189
419
  To release a gem to the world at large
190
420
 
191
- 1. Update the version number in `version.rb`,
192
- 2. Run `bundle exec rake release`,
193
- which will create a git tag for the version,
194
- push git commits and tags,
195
- and push the `.gem` file to [rubygems.org](https://rubygems.org).
196
- 3. Do a wee dance
421
+ 1. Update the version number in `version.rb`,
422
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
423
+ the `.gem` file to [rubygems.org](https://rubygems.org).
424
+ 3. Do a wee dance
197
425
 
198
426
  ## License
199
427
 
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,70 @@
1
1
  # Release Notes
2
2
 
3
+ ## 1.2.1 (2021-10-29)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Updated Nokogiri version to resolve https://github.com/advisories/GHSA-7rrm-v45f-jp64
12
+ * Updated Ruby version dependency to match
13
+ * Relaxed dependency upper bounds
14
+
15
+ ### Bugfixes
16
+
17
+ * none
18
+
19
+ ## 1.1.0 (2021-03-23)
20
+
21
+ ### Major
22
+
23
+ * none
24
+
25
+ ### Minor
26
+
27
+ * Added customization for headings
28
+ * Heading markers are now repeated once per heading level (e.g. `h3` gets three markers on each side)
29
+ * Added customization for list markers
30
+ * Added customization for table markers
31
+ * Writer is now immutable
32
+
33
+ ### Bugfixes
34
+
35
+ * none
36
+
37
+ ## 1.0.1 (2021-03-22)
38
+
39
+ ### Major
40
+
41
+ * none
42
+
43
+ ### Minor
44
+
45
+ * Updated README
46
+
47
+ ### Bugfixes
48
+
49
+ * Fixed hr padding behaviour
50
+
51
+ ## 1.0.0 (2021-03-21)
52
+
53
+ ### Major
54
+
55
+ * Moved `link_base` parameter to constructor
56
+ * Moved input HTML parameter to `#textify`
57
+
58
+ ### Minor
59
+
60
+ * Treats tables and lists with role="presentation" as simple containers
61
+ * Now handles ordered and unordered lists
62
+ * Images are now replaced with their alt text
63
+
64
+ ### Bugfixes
65
+
66
+ * none
67
+
3
68
  ## 0.4.2 (2021-03-17)
4
69
 
5
70
  ### Major
data/dirt-textify.gemspec CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
10
10
  spec.authors = ['Robin Miller']
11
11
  spec.email = ['robin@tenjin.ca']
12
12
 
13
- spec.summary = 'Intelligently extracts plaintext from an HTML document.'
13
+ spec.summary = 'Converts HTML to plain text'
14
14
  spec.description = <<~DESC
15
- Transforms HTML into plaintext while preserving legibility and functionality.
15
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
16
+
17
+ Ideal for providing a plaintext multipart segment of email messages.
16
18
  DESC
17
19
  spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
18
20
  spec.license = 'MIT'
@@ -25,13 +27,13 @@ Gem::Specification.new do |spec|
25
27
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
26
28
  spec.require_paths = ['lib']
27
29
 
28
- spec.required_ruby_version = '~> 2.4'
30
+ spec.required_ruby_version = '>= 2.7'
29
31
 
30
- spec.add_dependency 'nokogiri', '= 1.8.4'
32
+ spec.add_dependency 'nokogiri', '>= 1.12'
31
33
 
32
34
  spec.add_development_dependency 'bundler', '~> 2.2'
33
35
  spec.add_development_dependency 'rake', '~> 13.0'
34
36
  spec.add_development_dependency 'rspec', '~> 3.3'
35
- spec.add_development_dependency 'rubocop', '~> 1.11'
36
- spec.add_development_dependency 'rubocop-performance', '~> 1.10'
37
+ spec.add_development_dependency 'rubocop', '~> 1.22'
38
+ spec.add_development_dependency 'rubocop-performance', '~> 1.11'
37
39
  end
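A note on the constraint changes above: the Ruby requirement moves from a pessimistic `~> 2.4` to an open-ended `>= 2.7`, and Nokogiri moves from an exact pin to a lower bound. The sketch below uses RubyGems' own `Gem::Requirement` to show roughly what each constraint accepts. The helper `allowed?`, the constant names, and the version numbers being probed are illustrative assumptions, not part of the gem.

```ruby
require 'rubygems' # usually loaded already; explicit so the sketch stands alone

# Constraints copied from the old and new gemspec lines.
OLD_RUBY     = Gem::Requirement.new('~> 2.4')  # means >= 2.4 and < 3.0
NEW_RUBY     = Gem::Requirement.new('>= 2.7')  # no upper bound
OLD_NOKOGIRI = Gem::Requirement.new('= 1.8.4') # exact pin
NEW_NOKOGIRI = Gem::Requirement.new('>= 1.12') # lower bound only

# Hypothetical helper: does this version string satisfy the requirement?
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts allowed?(OLD_RUBY, '2.6.8')      # true
puts allowed?(OLD_RUBY, '3.0.2')      # false, "~> 2.4" stops before 3.0
puts allowed?(NEW_RUBY, '2.6.8')      # false, 2.7 is now the minimum
puts allowed?(NEW_RUBY, '3.0.2')      # true
puts allowed?(OLD_NOKOGIRI, '1.12.5') # false, only 1.8.4 matched the pin
puts allowed?(NEW_NOKOGIRI, '1.12.5') # true
```

In practice this suggests that projects on Ruby 2.4 to 2.6 stay on the 0.4.x line, while Ruby 3.x users can only resolve the 1.2.1 constraints.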
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Ghostwriter
4
- VERSION = '0.4.2'
4
+ VERSION = '1.2.1'
5
5
  end
@@ -3,52 +3,59 @@
3
3
  module Ghostwriter
4
4
  # Main Ghostwriter converter object.
5
5
  class Writer
6
- def initialize(html)
7
- @source_html = html
6
+ attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
7
+
8
+ # Creates a new ghostwriter
9
+ #
10
+ # @param [String] link_base the url to prefix relative links with
11
+ def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
12
+ table_column: '|', table_row: '-', table_corner: '|')
13
+ @link_base = link_base
14
+ @heading_marker = heading_marker
15
+ @ul_marker = ul_marker
16
+ @ol_marker = ol_marker
17
+ @table_column = table_column
18
+ @table_row = table_row
19
+ @table_corner = table_corner
20
+
21
+ freeze
8
22
  end
9
23
 
10
24
  # Strips HTML down to plain text.
11
25
  #
12
- # @param link_base the url to prefix relative links with
13
- def textify(link_base: '')
14
- html = normalize_whitespace(@source_html).gsub('</p>', "</p>\n\n")
26
+ # @param html [String] the HTML to be converted to text
27
+ #
28
+ # @return converted text
29
+ def textify(html)
30
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
31
+
32
+ doc.search('style, script').remove
15
33
 
16
- doc = Nokogiri::HTML(html)
34
+ replace_anchors(doc)
35
+ replace_images(doc)
17
36
 
18
- doc.search('style').remove
19
- doc.search('script').remove
37
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
38
 
21
- replace_anchors(doc, link_base)
22
39
  replace_headers(doc)
40
+ replace_lists(doc)
23
41
  replace_tables(doc)
24
42
 
25
- simple_replace(doc, 'hr', "\n----------\n")
43
+ simple_replace(doc, 'hr', "\n----------\n\n")
26
44
  simple_replace(doc, 'br', "\n")
45
+ simple_replace(doc, 'p', "\n\n")
27
46
 
28
- # doc.search('p').each do |link_node|
29
- # link_node.inner_html = link_node.inner_html + "\n\n"
30
- # end
31
-
32
- # trim, but only single-space character
33
- doc.text.gsub(/^ +| +$/, '')
47
+ normalize_lines(doc)
34
48
  end
35
49
 
36
50
  private
37
51
 
38
- def normalize_whitespace(html)
39
- html.gsub(/\s/, ' ').squeeze(' ')
52
+ def normalize_lines(doc)
53
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
40
54
  end
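The new `normalize_lines` replaces the old single-regex trim. A minimal sketch of the same chain applied to a plain String may make its effect easier to see; taking a String argument instead of a Nokogiri document is an assumption made here so the example runs without the gem.

```ruby
# Stand-in for Writer#normalize_lines. The real method reads doc.text from a
# Nokogiri document; this version takes the text directly (an assumption).
def normalize_lines(text)
  text.strip              # drop leading/trailing whitespace of the whole text
      .split("\n")        # one entry per line, interior blank lines are kept
      .collect(&:strip)   # trim each line individually
      .join("\n")
      .concat("\n")       # always end with exactly one newline
end

raw = "  First paragraph. \n\n   Second paragraph.  \n\n\n"

puts normalize_lines(raw).inspect
# "First paragraph.\n\nSecond paragraph.\n"
```

Blank lines between paragraphs survive, while indentation and trailing blank lines do not, which appears to be the intent of the change.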
41
55
 
42
- def replace_anchors(doc, link_base)
43
- base = get_link_base(doc, default: link_base)
44
-
56
+ def replace_anchors(doc)
45
57
  doc.search('a').each do |link_node|
46
- begin
47
- href = URI(link_node['href'])
48
- href = base + href.to_s unless href.absolute?
49
- rescue URI::InvalidURIError
50
- href = link_node['href'].gsub(/^(tel|mailto):/, '').strip
51
- end
58
+ href = get_link_target(link_node, get_link_base(doc))
52
59
 
53
60
  link_node.inner_html = if link_matches(href, link_node.inner_html)
54
61
  href.to_s
@@ -62,39 +69,96 @@ module Ghostwriter
62
69
  first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
63
70
  end
64
71
 
65
- def get_link_base(doc, default:)
72
+ def get_link_base(doc)
66
73
  # <base> node is unique by W3C spec
67
74
  base_node = doc.search('base').first
68
75
 
69
- base_node ? base_node['href'] : default
76
+ base_node ? base_node['href'] : @link_base
77
+ end
78
+
79
+ def get_link_target(link_node, base)
80
+ href = URI(link_node['href'])
81
+ if href.absolute?
82
+ href
83
+ else
84
+ base + href.to_s
85
+ end
86
+ rescue URI::InvalidURIError
87
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
70
88
  end
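The extracted `get_link_target` has three outcomes: absolute links pass through, relative links get the base prepended, and hrefs that `URI` cannot parse fall back to stripping a `tel:` or `mailto:` prefix. A hedged standalone sketch of that branching follows; `link_target` is a hypothetical name, and it takes the raw href String and returns a String rather than working on a Nokogiri node.

```ruby
require 'uri'

# Hypothetical standalone variant of get_link_target.
def link_target(href, base)
  uri = URI(href)
  # absolute? is true only when the URI carries a scheme
  uri.absolute? ? uri.to_s : base + uri.to_s
rescue URI::InvalidURIError
  # e.g. hrefs containing spaces; keep the readable part only
  href.gsub(/^(tel|mailto):/, '').strip
end

puts link_target('https://example.com/about', 'https://www.example.com')
# https://example.com/about
puts link_target('/contact', 'https://www.example.com')
# https://www.example.com/contact
puts link_target('tel: 555 0100', 'https://www.example.com')
# 555 0100
```

Note that the base is joined by plain string concatenation, so a trailing slash on the base combined with a leading slash on the href would produce a double slash.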
71
89
 
72
90
  def replace_headers(doc)
73
- doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
74
- node.inner_html = "- #{ node.inner_html } -\n".squeeze(' ')
91
+ doc.search('header, h1').each do |node|
92
+ node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
93
+ .squeeze(' '))
94
+ end
95
+
96
+ (2..6).each do |n|
97
+ doc.search("h#{ n }").each do |node|
98
+ node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
99
+ .squeeze(' '))
100
+ end
101
+ end
102
+ end
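The heading change above boils down to repeating the marker once per heading level on both sides of the text. A small sketch under that reading; `heading_line` is a hypothetical helper that works on plain strings instead of Nokogiri nodes and omits the trailing newline.

```ruby
# Hypothetical helper mirroring the string built in replace_headers.
def heading_line(text, level, marker: '--')
  # squeeze(' ') collapses runs of spaces, as the original does
  "#{marker * level} #{text} #{marker * level}".squeeze(' ')
end

puts heading_line('Dog Maintenance and Repair', 1)
# -- Dog Maintenance and Repair --
puts heading_line('Food Input Port', 2)
# ---- Food Input Port ----
puts heading_line('Emergency Cat Procedures', 1, marker: '#')
# # Emergency Cat Procedures #
```

Because the marker is multiplied as a String, a multi-character marker such as the default `--` grows by its full length at each level.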
103
+
104
+ def replace_images(doc)
105
+ doc.search('img[role=presentation]').remove
106
+
107
+ doc.search('img').each do |img_node|
108
+ src = img_node['src']
109
+ alt = img_node['alt']
110
+
111
+ src = 'embedded' if src.start_with? 'data:'
112
+
113
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
114
+ end
115
+ end
116
+
117
+ def replace_lists(doc)
118
+ doc.search('ol').each do |list_node|
119
+ replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
120
+ end
121
+
122
+ doc.search('ul').each do |list_node|
123
+ replace_list_items(list_node, @ul_marker)
124
+ end
125
+
126
+ doc.search('ul, ol').each do |list_node|
127
+ list_node.replace("#{ list_node.inner_html }\n")
128
+ end
129
+ end
130
+
131
+ def replace_list_items(list_node, marker, after_marker: '', increment: false)
132
+ list_node.search('./li').each do |list_item|
133
+ list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
134
+
135
+ marker = marker.next if increment
75
136
  end
76
137
  end
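Ordered lists rely on calling `#next` on the marker after each item. For String markers that is `String#next`, which increments digits and letters with carry. A sketch on plain arrays, assuming a hypothetical `list_lines` helper in place of the Nokogiri-based `replace_list_items`:

```ruby
# Hypothetical helper mirroring replace_list_items on an Array of Strings.
def list_lines(items, marker, after_marker: '', increment: false)
  items.map do |item|
    line = "#{marker}#{after_marker} #{item}"
    marker = marker.next if increment # String#next: '1' -> '2', 'a' -> 'b'
    line
  end
end

puts list_lines(%w[Mercury Venus Mars], 'a', after_marker: '.', increment: true)
# a. Mercury
# b. Venus
# c. Mars

puts list_lines(%w[Teapot Kettle], '*')
# * Teapot
# * Kettle

puts '9'.next # 10
puts 'z'.next # aa
```

The carry behaviour means numeric markers keep counting past nine and alphabetic markers roll over to two letters after `z`.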
77
138
 
78
139
  def replace_tables(doc)
79
140
  doc.css('table').each do |table|
141
+ # remove whitespace between nodes
142
+ table.search('//text()[normalize-space()=""]').remove
143
+
80
144
  column_sizes = calculate_column_sizes(table)
81
145
 
82
146
  table.search('./thead/tr', './tbody/tr', './tr').each do |row|
83
147
  replace_table_nodes(row, column_sizes)
84
148
 
85
- row.inner_html = "#{ row.inner_html }|\n"
149
+ row.replace("#{ row.inner_html }#{ @table_column }\n")
86
150
  end
87
151
 
88
152
  add_table_header_underline(table, column_sizes)
89
153
 
90
- table.inner_html = "#{ table.inner_html }\n"
154
+ table.replace("\n#{ table.inner_html }\n")
91
155
  end
92
156
  end
93
157
 
94
158
  def calculate_column_sizes(table)
95
159
  column_sizes = table.search('tr').collect do |row|
96
160
  row.search('th', 'td').collect do |node|
97
- node.inner_html.length
161
+ node.text.length
98
162
  end
99
163
  end
100
164
 
@@ -102,25 +166,25 @@ module Ghostwriter
102
166
  end
103
167
 
104
168
  def add_table_header_underline(table, column_sizes)
105
- table.search('./thead').each do |row|
106
- header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
169
+ table.search('./thead').each do |thead|
170
+ lines = column_sizes.collect { |len| @table_row * (len + 2) }
171
+ underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
107
172
 
108
- row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
173
+ thead.replace("#{ thead.inner_html }#{ underline_row }\n")
109
174
  end
110
175
  end
111
176
 
112
177
  def replace_table_nodes(row, column_sizes)
113
178
  row.search('th', 'td').each_with_index do |node, i|
114
- new_content = "| #{ node.inner_html }".squeeze(' ')
179
+ new_content = node.text.ljust(column_sizes[i] + 1)
115
180
 
116
- # +2 for the extra spacing between text and pipe
117
- node.inner_html = new_content.ljust(column_sizes[i] + 2)
181
+ node.replace("#{ @table_column } #{ new_content }")
118
182
  end
119
183
  end
120
184
 
121
185
  def simple_replace(doc, tag, replacement)
122
186
  doc.search(tag).each do |node|
123
- node.replace(replacement)
187
+ node.replace(node.inner_html + replacement)
124
188
  end
125
189
  end
126
190
  end
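The table methods above pad each cell to its column width and draw one underline beneath the header using the configurable `table_row`, `table_column` and `table_corner` strings. The sketch below reproduces that layout on a plain array of rows. `render_table` is a hypothetical helper, and computing widths as the longest cell per column is an assumption, since the tail of `calculate_column_sizes` is not visible in this diff.

```ruby
# Hypothetical plain-Ruby rendering of the table layout.
# Assumes every row has the same number of cells and the first row is the header.
def render_table(rows, column: '|', row: '-', corner: '|')
  # widest cell in each column (assumed behaviour of calculate_column_sizes)
  sizes = rows.transpose.map { |cells| cells.map(&:length).max }

  lines = rows.map do |cells|
    padded = cells.each_with_index.map do |cell, i|
      "#{column} #{cell.ljust(sizes[i] + 1)}" # one space of padding each side
    end
    "#{padded.join}#{column}"
  end

  # +2 accounts for the padding space on each side of the cell text
  underline = "#{corner}#{sizes.map { |len| row * (len + 2) }.join(corner)}#{corner}"

  [lines.first, underline, *lines.drop(1)].join("\n")
end

moons = [
  ['Moon', 'Portfolio'],
  ['Phobos', 'Fear & Panic'],
  ['Deimos', 'Dread and Terror']
]

puts render_table(moons, column: '#', row: '.', corner: '+')
```

Every rendered line comes out the same width, which is what keeps the columns aligned in a monospaced plain-text email.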
metadata CHANGED
@@ -1,29 +1,29 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ghostwriter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.2
4
+ version: 1.2.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Robin Miller
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2021-03-18 00:00:00.000000000 Z
11
+ date: 2021-10-29 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - '='
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: 1.8.4
19
+ version: '1.12'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - '='
24
+ - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: 1.8.4
26
+ version: '1.12'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: bundler
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -72,31 +72,32 @@ dependencies:
72
72
  requirements:
73
73
  - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '1.11'
75
+ version: '1.22'
76
76
  type: :development
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
80
  - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '1.11'
82
+ version: '1.22'
83
83
  - !ruby/object:Gem::Dependency
84
84
  name: rubocop-performance
85
85
  requirement: !ruby/object:Gem::Requirement
86
86
  requirements:
87
87
  - - "~>"
88
88
  - !ruby/object:Gem::Version
89
- version: '1.10'
89
+ version: '1.11'
90
90
  type: :development
91
91
  prerelease: false
92
92
  version_requirements: !ruby/object:Gem::Requirement
93
93
  requirements:
94
94
  - - "~>"
95
95
  - !ruby/object:Gem::Version
96
- version: '1.10'
97
- description: 'Transforms HTML into plaintext while preserving legibility and functionality.
96
+ version: '1.11'
97
+ description: |
98
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
98
99
 
99
- '
100
+ Ideal for providing a plaintext multipart segment of email messages.
100
101
  email:
101
102
  - robin@tenjin.ca
102
103
  executables: []
@@ -130,9 +131,9 @@ require_paths:
130
131
  - lib
131
132
  required_ruby_version: !ruby/object:Gem::Requirement
132
133
  requirements:
133
- - - "~>"
134
+ - - ">="
134
135
  - !ruby/object:Gem::Version
135
- version: '2.4'
136
+ version: '2.7'
136
137
  required_rubygems_version: !ruby/object:Gem::Requirement
137
138
  requirements:
138
139
  - - ">="
@@ -142,5 +143,5 @@ requirements: []
142
143
  rubygems_version: 3.1.2
143
144
  signing_key:
144
145
  specification_version: 4
145
- summary: Intelligently extracts plaintext from an HTML document.
146
+ summary: Converts HTML to plain text
146
147
  test_files: []