ghostwriter 0.4.2 → 1.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d48bada0259aa38eb1cf98bfbf101970dce0f979ffb7293a8c23b5d5ca393d7d
4
- data.tar.gz: d1f4ac853988b75497b4c1ff7c4a16ba8cddd47be93e0112001fd5960a61a565
3
+ metadata.gz: ccebf53b35ad212e0c7e353a3cb4d79e1f91c63cec6490c5fff7ff56b72eed70
4
+ data.tar.gz: 5b3d36839dcef24d80605421970eb39e17efe3ec24d3126d753ac7a5039512a5
5
5
  SHA512:
6
- metadata.gz: 5d85b5f1e5ef90c91fd03288012805c195c0a9c6ea8a3d7efb10b03b55443e060252f3f1e82f18f2854234174aa9daa80cc35a7b15f774798af5593a84b55881
7
- data.tar.gz: afaf20dbb5876e667fc33d3ff35eff58d1cb076e5bec941863a4b54689ac99bbdf2716cd52145bc7e5cd99f64c0d5954023cf160e01240c24c6a16036fca2eee
6
+ metadata.gz: d6039d9d2b3c1d6606da533c108c8087faa8b985ac0bf6adac7bc1ad4310cc601f76840ff32e83255847bedbd77dbc6cc280054f6eb4975dbe815a98b2a07373
7
+ data.tar.gz: d005669ca03f3ff465c351df0e47eba0372ca6061057102e14b0d879ddb0c5191a88bfaa389fa82a865787e6ff3b5d3fb351ae8b6351f108aef404a3bd66dd7a
data/README.md CHANGED
@@ -1,14 +1,19 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites HTML as plain text while preserving as much legibility and functionality as possible.
3
+ A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
4
 
5
- It's sort of like a reverse-markdown.
5
+ It's sort of like a reverse-markdown or a *very* simple screen reader.
6
6
 
7
7
  ## But Why, Though?
8
8
 
9
- * Spam filters tend to like emails with a plain text alternative
10
- * Some email clients won't or can’t handle HTML at all
11
- * Some people explicitly choose plaintext just by preference or accessibility
9
+ * Some email clients won't or can’t offer HTML support.
10
+ * Some people explicitly choose plaintext for accessibility or just plain preference.
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people,
12
+ not only might you be
13
+ [breaking](https://fightspam.gc.ca)
14
+ [various](https://gdpr.eu/)
15
+ [laws](https://www.ftc.gov/tips-advice/business-center/guidance/can-spam-act-compliance-guide-business),
16
+ I will also personally curse you)
12
17
 
13
18
  ## Installation
14
19
 
@@ -28,111 +33,234 @@ Or install it manually with:
28
33
 
29
34
  ## Usage
30
35
 
31
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
36
+ Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
32
37
 
33
38
  ```ruby
34
- html = '<html><body><p>This is some markup <a href="tenjin.ca">and a link</a></p><p>Other tags translate, too</p></body></html>'
39
+ html = <<~HTML
40
+ <html>
41
+ <body>
42
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
43
+ <p>It handles other stuff, too.</p>
44
+ <hr>
45
+ <h1>Stuff Like</h1>
46
+ <ul>
47
+ <li>Images</li>
48
+ <li>Lists</li>
49
+ <li>Tables</li>
50
+ <li>And more</li>
51
+ </ul>
52
+ </body>
53
+ </html>
54
+ HTML
55
+
56
+ ghostwriter = Ghostwriter::Writer.new
35
57
 
36
- Ghostwriter::Writer.new(html).textify
58
+ puts ghostwriter.textify(html)
37
59
  ```
60
+
38
61
  Produces:
62
+
39
63
  ```
40
- This is some markup and a link (tenjin.ca)
64
+ This is some text with a link (tenjin.ca)
41
65
 
42
- Other tags translate, too
66
+ It handles other stuff, too.
67
+
68
+
69
+ ----------
70
+
71
+ -- Stuff Like --
72
+ - Images
73
+ - Lists
74
+ - Tables
75
+ - And more
43
76
  ```
44
77
 
45
78
  ### Links
46
79
 
47
80
  Links are converted to the link text followed by the link target in parentheses:
48
81
 
49
- ```ruby
50
- html = '<html><body>Visit our <a href="https://example.com">Website</a><body></html>'
51
- Ghostwriter::Writer.new(html).textify
82
+ ```html
83
+ Visit our <a href="https://example.com">Website</a>
52
84
  ```
53
85
 
54
- Produces:
86
+ Becomes:
87
+
55
88
  ```
56
89
  Visit our Website (https://example.com)
57
90
  ```
58
91
 
59
92
  #### Relative Links
93
+
60
94
  Since emails are wholly distinct from your web address, relative links might break.
61
95
 
62
96
  To avoid this problem, either use the `<base>` header tag:
63
97
 
64
- ```ruby
65
- html = <<~HTML
66
- <html>
67
- <head>
68
- <base href="https://www.example.com/">
69
- </head>
70
- <body>
71
- Relative links get <a href="/contact">expanded</a> using the head's base tag.
72
- </body>
73
- </html>
74
- HTML
98
+ ```html
75
99
 
76
- Ghostwriter::Writer.new(html).textify
100
+ <html>
101
+ <head>
102
+ <base href="https://www.example.com">
103
+ </head>
104
+ <body>
105
+ Use the base tag to <a href="/contact">expand</a> links.
106
+ </body>
107
+ </html>
77
108
  ```
78
- Produces:
109
+
110
+ Becomes:
111
+
79
112
  ```
80
- Relative links get expanded (https://www.example.com//contact) using the head's base tag.
113
+ Use the base tag to expand (https://www.example.com/contact) links.
81
114
  ```
82
115
 
83
- Or you can use the `link_base` parameter:
116
+ Or you can use the `link_base` configuration:
117
+
84
118
  ```ruby
85
- html = '<html><body>Relative links get <a href="/contact">expanded</a></body></html> using the link_base parmeter, too.'
119
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
120
+ ```
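The relative-link handling described above can be sketched with only the standard library. This is an illustrative approximation of the `get_link_target` logic shown later in this diff, not the gem's code; the method name `link_target` is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against a base by plain concatenation.
def link_target(raw_href, base)
  href = URI(raw_href)
  # Absolute URIs (those with a scheme) are kept as-is;
  # anything else is prefixed with the base.
  href.absolute? ? href.to_s : base + href.to_s
rescue URI::InvalidURIError
  # Unparseable hrefs fall back to the raw value minus a tel:/mailto: prefix.
  raw_href.gsub(/^(tel|mailto):/, '').strip
end

puts link_target('/contact', 'https://www.example.com')
puts link_target('https://example.com', 'tenjin.ca')
```

Because the base is simply prepended, a base ending in `/` combined with an href starting with `/` yields a doubled slash in this sketch.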
121
+
122
+ ### Images
86
123
 
87
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
124
+ Images with alt text are converted:
125
+
126
+ ```html
127
+ <img src="logo.jpg" alt="ACME Anvils" />
88
128
  ```
89
129
 
90
- Produces:
130
+ Becomes:
131
+
91
132
  ```
92
- "Relative links get expanded (tenjin.ca/contact) using the link_base parmeter, too."
133
+ ACME Anvils (logo.jpg)
93
134
  ```
94
135
 
95
- ### Tables
96
- Tables are often used email structuring because support for more modern CSS is inconsistent.
136
+ But images lacking alt text or with a presentation ARIA role are ignored:
97
137
 
98
- Ghostwriter tries to maintain table structure, but this will quickly devolve for complex structures.
138
+ ```html
139
+ <!-- these will just become an empty string -->
140
+ <img src="decoration.jpg">
141
+ <img src="logo.jpg" role="presentation">
142
+ ```
99
143
 
100
- ```ruby
101
- html = <<~HTML
102
- <html>
103
- <head>
104
- <base href="https://www.example.com/">
105
- </head>
106
- <body>
107
- <table>
108
- <thead>
109
- <tr>
110
- <th>Ship</th>
111
- <th>Captain</th>
112
- </tr>
113
- </thead>
114
- <tbody>
115
- <tr>
116
- <td>Enterprise</td>
117
- <td>Jean-Luc Picard</td>
118
- </tr>
119
- <tr>
120
- <td>TARDIS</td>
121
- <td>The Doctor</td>
122
- </tr>
123
- <tr>
124
- <td>Planet Express Ship</td>
125
- <td>Turanga Leela</td>
126
- </tr>
127
- </tbody>
128
- </table>
129
- </body>
130
- </html>
131
- HTML
144
+ And images with data URIs won't include the data portion.
145
+
146
+ ```html
132
147
 
133
- Ghostwriter::Writer.new(html).textify
148
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
149
+ alt="Data picture" />
134
150
  ```
135
- Produces:
151
+
152
+ Becomes:
153
+
154
+ ```
155
+ Data picture (embedded)
156
+ ```
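The three image rules above (alt text kept, presentation or alt-less images dropped, data URIs shortened) can be summarised in a small sketch. The `image_text` method and its keyword arguments are hypothetical and work on plain strings rather than Nokogiri nodes; treating a missing `src` as an empty string is an assumption of this sketch.

```ruby
# Hypothetical helper mirroring the README's image rules on plain values.
def image_text(src:, alt:, role: nil)
  # Presentational images and images without alt text produce nothing.
  return '' if role == 'presentation'
  return '' if alt.nil? || alt.empty?

  # Data URIs are replaced by a short placeholder instead of the payload.
  # Assumption: a nil src is treated as an empty string.
  shown_src = src.to_s.start_with?('data:') ? 'embedded' : src
  "#{alt} (#{shown_src})"
end

puts image_text(src: 'logo.jpg', alt: 'ACME Anvils')
puts image_text(src: 'data:image/gif;base64,R0lGOD', alt: 'Data picture')
```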
157
+
158
+ ### Paragraphs and Linebreaks
159
+
160
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
161
+
162
+ ```html
163
+ <p>I would like to propose a toast.</p>
164
+ <p>This meal we enjoy together would be improved by one.</p>
165
+ <br />
166
+ <p>... Plug in the toaster and I'll get the bread.</p>
167
+ ```
168
+
169
+ ```
170
+ I would like to propose a toast.
171
+
172
+ This meal we enjoy together would be improved by one.
173
+
174
+
175
+ ... Plug in the toaster and I'll get the bread.
176
+
177
+ ```
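The trailing cleanup that keeps this output tidy can be sketched on a plain string. It follows the `normalize_lines` method that appears later in this diff, but takes text instead of a parsed document; that change of input is an assumption made so the example runs without Nokogiri.

```ruby
# Sketch: trim the whole text, trim each line, and end with one newline.
def normalize_lines(text)
  text.strip.split("\n").collect(&:strip).join("\n") + "\n"
end

raw = "  I would like to propose a toast.  \n\n   ... Plug in the toaster. \n\n"
print normalize_lines(raw)
```

Blank lines between paragraphs survive because empty strings remain in the split array; only leading and trailing whitespace is removed.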
178
+
179
+ ### Headings
180
+
181
+ Headings are wrapped with a marker per heading level:
182
+
183
+ ```html
184
+ <h1>Dog Maintenance and Repair</h1>
185
+ <h2>Food Input Port</h2>
186
+ <h3>Exhaust Port Considerations</h3>
187
+ ```
188
+
189
+ Becomes:
190
+
191
+ ```
192
+ -- Dog Maintenance and Repair --
193
+ ---- Food Input Port ----
194
+ ------ Exhaust Port Considerations ------
195
+ ```
196
+
197
+ The `<header>` tag is treated like an `<h1>` tag.
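The per-level marker rule can be sketched as plain string repetition. `wrap_heading` is a hypothetical helper modelled on the `replace_headers` hunk later in this diff; it takes the heading text and level directly rather than a parsed node.

```ruby
# Hypothetical helper: repeat the marker once per heading level on both sides.
def wrap_heading(text, level, marker: '--')
  markers = marker * level
  # squeeze(' ') collapses runs of spaces, as the source hunk does.
  "#{markers} #{text} #{markers}".squeeze(' ')
end

puts wrap_heading('Dog Maintenance and Repair', 1)
puts wrap_heading('Food Input Port', 2)
puts wrap_heading('Exhaust Port Considerations', 3)
```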
198
+
199
+ ### Lists
200
+
201
+ Lists are converted, too. They are padded with newlines and are given simple markers:
202
+
203
+ ```html
204
+
205
+ <ul>
206
+ <li>Planes</li>
207
+ <li>Trains</li>
208
+ <li>Automobiles</li>
209
+ </ul>
210
+ <ol>
211
+ <li>I get knocked down</li>
212
+ <li>I get up again</li>
213
+ <li>Never gonna keep me down</li>
214
+ </ol>
215
+ ```
216
+
217
+ Becomes:
218
+
219
+ ```
220
+ - Planes
221
+ - Trains
222
+ - Automobiles
223
+
224
+ 1. I get knocked down
225
+ 2. I get up again
226
+ 3. Never gonna keep me down
227
+ ```
228
+
229
+ ### Tables
230
+
231
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
232
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
233
+
234
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
235
+
236
+ ```html
237
+
238
+ <table>
239
+ <thead>
240
+ <tr>
241
+ <th>Ship</th>
242
+ <th>Captain</th>
243
+ </tr>
244
+ </thead>
245
+ <tbody>
246
+ <tr>
247
+ <td>Enterprise</td>
248
+ <td>Jean-Luc Picard</td>
249
+ </tr>
250
+ <tr>
251
+ <td>TARDIS</td>
252
+ <td>The Doctor</td>
253
+ </tr>
254
+ <tr>
255
+ <td>Planet Express Ship</td>
256
+ <td>Turanga Leela</td>
257
+ </tr>
258
+ </tbody>
259
+ </table>
260
+ ```
261
+
262
+ Becomes:
263
+
136
264
  ```
137
265
  | Ship | Captain |
138
266
  |---------------------|-----------------|
@@ -141,6 +269,105 @@ Produces:
141
269
  | Planet Express Ship | Turanga Leela |
142
270
  ```
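A rough sketch of how such a table could be laid out: measure the widest cell per column, pad each cell, and draw an underline below the header. `render_table` and its parameters are hypothetical; it works on arrays of strings and assumes every row has the same number of cells, which the gem itself may not require.

```ruby
# Hypothetical helper: render a header and rows as a padded text table.
def render_table(header, rows, column: '|', row: '-', corner: '|')
  # Widest cell per column (transpose assumes equally sized rows).
  widths = ([header] + rows).transpose.map { |cells| cells.map(&:length).max }

  format_row = lambda do |cells|
    padded = cells.each_with_index.map do |cell, i|
      "#{column} #{cell.ljust(widths[i] + 1)}"
    end
    padded.join + column
  end

  # Underline is two characters wider than each column to cover the padding.
  underline = corner + widths.map { |w| row * (w + 2) }.join(corner) + corner

  [format_row.call(header), underline, *rows.map(&format_row)].join("\n")
end

puts render_table(%w[Ship Captain], [['TARDIS', 'The Doctor']])
```

Passing different `column`, `row` and `corner` strings gives the same effect as the table customization options described in the README.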
143
271
 
272
+ ### Customizing Output
273
+
274
+ Ghostwriter has some constructor options to customize output.
275
+
276
+ You can set heading markers.
277
+
278
+ ```ruby
279
+ html = <<~HTML
280
+ <h1>Emergency Cat Procedures</h1>
281
+ HTML
282
+
283
+ writer = Ghostwriter::Writer.new(heading_marker: '#')
284
+
285
+ puts writer.textify(html)
286
+ ```
287
+
288
+ Produces:
289
+
290
+ ```
291
+ # Emergency Cat Procedures #
292
+ ```
293
+
294
+ You can also set list item markers. Ordered markers can be anything that responds to `#next` (e.g. any `Enumerator`):
295
+
296
+ ```ruby
297
+ html = <<~HTML
298
+ <ol><li>Mercury</li><li>Venus</li><li>Mars</li></ol>
299
+ <ul><li>Teapot</li><li>Kettle</li></ul>
300
+ HTML
301
+
302
+ writer = Ghostwriter::Writer.new(ul_marker: '*', ol_marker: 'a')
303
+
304
+ puts writer.textify(html)
305
+ ```
306
+
307
+ Produces:
308
+
309
+ ```
310
+ a. Mercury
311
+ b. Venus
312
+ c. Mars
313
+
314
+ * Teapot
315
+ * Kettle
316
+ ```
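The ordered-list behaviour relies on `String#next`, which advances `'1'` to `'2'` and `'a'` to `'b'`. A minimal sketch, using a hypothetical `number_items` helper that works on an array of strings instead of list nodes:

```ruby
# Hypothetical helper: prefix each item with a marker that advances via #next.
def number_items(items, marker: '1', after_marker: '.')
  items.map do |item|
    line = "#{marker}#{after_marker} #{item}"
    marker = marker.next # '1' -> '2', 'a' -> 'b', '9' -> '10', 'z' -> 'aa'
    line
  end
end

puts number_items(%w[Mercury Venus Mars], marker: 'a')
```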
317
+
318
+ And tables can be customized:
319
+
320
+ ```ruby
321
+ writer = Ghostwriter::Writer.new(table_row: '.',
322
+ table_column: '#',
323
+ table_corner: '+')
324
+
325
+ puts writer.textify <<~HTML
326
+ <table>
327
+ <thead>
328
+ <tr><th>Moon</th><th>Portfolio</th></tr>
329
+ </thead>
330
+ <tbody>
331
+ <tr><td>Phobos</td><td>Fear & Panic</td></tr>
332
+ <tr><td>Deimos</td><td>Dread and Terror</td></tr>
333
+ </tbody>
334
+ </table>
335
+ HTML
336
+ ```
337
+
338
+ Produces:
339
+
340
+ ```
341
+ # Moon # Portfolio #
342
+ +........+..................+
343
+ # Phobos # Fear & Panic #
344
+ # Deimos # Dread and Terror #
345
+
346
+ ```
347
+
348
+ #### Presentation ARIA Role
349
+
350
+ Tags with `role="presentation"` will be treated as a simple container and the normal behaviour will be suppressed.
351
+
352
+ ```html
353
+
354
+ <table role="presentation">
355
+ <tr>
356
+ <td>The table is a lie</td>
357
+ </tr>
358
+ </table>
359
+ <ul role="presentation">
360
+ <li>No such list</li>
361
+ </ul>
362
+ ```
363
+
364
+ Becomes:
365
+
366
+ ```
367
+ The table is a lie
368
+ No such list
369
+ ```
370
+
144
371
  ### Mail Gem Example
145
372
 
146
373
  To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html
@@ -149,7 +376,8 @@ through Ghostwriter:
149
376
  ```ruby
150
377
  require 'mail'
151
378
 
152
- html = 'My email and a <a href="https://tenjin.ca">link</a>'
379
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
380
+ ghostwriter = Ghostwriter::Writer.new
153
381
 
154
382
  Mail.deliver do
155
383
  to 'bob@example.com'
@@ -162,7 +390,7 @@ Mail.deliver do
162
390
  end
163
391
 
164
392
  text_part do
165
- body Ghostwriter::Writer.new(html).textify
393
+ body ghostwriter.textify(html)
166
394
  end
167
395
  end
168
396
 
@@ -181,19 +409,19 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
181
409
  can also run `bin/console` for an interactive prompt that will allow you to experiment.
182
410
 
183
411
  #### Local Install
184
- To install this gem onto your local machine only, run
185
412
 
186
- `bundle exec rake install`
413
+ To install this gem onto your local machine only, run
414
+
415
+ `bundle exec rake install`
187
416
 
188
417
  #### Gem Release
418
+
189
419
  To release a gem to the world at large
190
420
 
191
- 1. Update the version number in `version.rb`,
192
- 2. Run `bundle exec rake release`,
193
- which will create a git tag for the version,
194
- push git commits and tags,
195
- and push the `.gem` file to [rubygems.org](https://rubygems.org).
196
- 3. Do a wee dance
421
+ 1. Update the version number in `version.rb`,
422
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
423
+ the `.gem` file to [rubygems.org](https://rubygems.org).
424
+ 3. Do a wee dance
197
425
 
198
426
  ## License
199
427
 
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,70 @@
1
1
  # Release Notes
2
2
 
3
+ ## 1.2.1 (2021-10-29)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Updated Nokogiri version to resolve https://github.com/advisories/GHSA-7rrm-v45f-jp64
12
+ * Updated Ruby version dependency to match
13
+ * Relaxed dependency upper bounds
14
+
15
+ ### Bugfixes
16
+
17
+ * none
18
+
19
+ ## 1.1.0 (2021-03-23)
20
+
21
+ ### Major
22
+
23
+ * none
24
+
25
+ ### Minor
26
+
27
+ * Added customization for headings
28
+ * Headings are now marked with more markers at deeper heading levels (h2 through h6)
29
+ * Added customization for list markers
30
+ * Added customization for table markers
31
+ * Writer is now immutable
32
+
33
+ ### Bugfixes
34
+
35
+ * none
36
+
37
+ ## 1.0.1 (2021-03-22)
38
+
39
+ ### Major
40
+
41
+ * none
42
+
43
+ ### Minor
44
+
45
+ * Updated README
46
+
47
+ ### Bugfixes
48
+
49
+ * Fixed hr padding behaviour
50
+
51
+ ## 1.0.0 (2021-03-21)
52
+
53
+ ### Major
54
+
55
+ * Moved `link_base` parameter to constructor
56
+ * Moved input HTML parameter to `#textify`
57
+
58
+ ### Minor
59
+
60
+ * Treats tables and lists with role="presentation" as simple containers
61
+ * Now handles ordered and unordered lists
62
+ * Images are now replaced with their alt text
63
+
64
+ ### Bugfixes
65
+
66
+ * none
67
+
3
68
  ## 0.4.2 (2021-03-17)
4
69
 
5
70
  ### Major
data/dirt-textify.gemspec CHANGED
@@ -10,9 +10,11 @@ Gem::Specification.new do |spec|
10
10
  spec.authors = ['Robin Miller']
11
11
  spec.email = ['robin@tenjin.ca']
12
12
 
13
- spec.summary = 'Intelligently extracts plaintext from an HTML document.'
13
+ spec.summary = 'Converts HTML to plain text'
14
14
  spec.description = <<~DESC
15
- Transforms HTML into plaintext while preserving legibility and functionality.
15
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
16
+
17
+ Ideal for providing a plaintext multipart segment of email messages.
16
18
  DESC
17
19
  spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
18
20
  spec.license = 'MIT'
@@ -25,13 +27,13 @@ Gem::Specification.new do |spec|
25
27
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
26
28
  spec.require_paths = ['lib']
27
29
 
28
- spec.required_ruby_version = '~> 2.4'
30
+ spec.required_ruby_version = '>= 2.7'
29
31
 
30
- spec.add_dependency 'nokogiri', '= 1.8.4'
32
+ spec.add_dependency 'nokogiri', '>= 1.12'
31
33
 
32
34
  spec.add_development_dependency 'bundler', '~> 2.2'
33
35
  spec.add_development_dependency 'rake', '~> 13.0'
34
36
  spec.add_development_dependency 'rspec', '~> 3.3'
35
- spec.add_development_dependency 'rubocop', '~> 1.11'
36
- spec.add_development_dependency 'rubocop-performance', '~> 1.10'
37
+ spec.add_development_dependency 'rubocop', '~> 1.22'
38
+ spec.add_development_dependency 'rubocop-performance', '~> 1.11'
37
39
  end
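The practical effect of the constraint changes in this gemspec hunk can be checked with RubyGems' own requirement classes. The `allows?` helper is hypothetical; `Gem::Requirement` and `Gem::Version` ship with RubyGems.

```ruby
require 'rubygems'

# Hypothetical helper: does a version string satisfy a requirement string?
def allows?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# '~> 2.4' means >= 2.4 and < 3.0, so Ruby 3 was excluded before this change.
puts allows?('~> 2.4', '3.0.0')   # old Ruby constraint
puts allows?('>= 2.7', '3.0.0')   # new Ruby constraint
puts allows?('= 1.8.4', '1.12.5') # old nokogiri pin
puts allows?('>= 1.12', '1.12.5') # new nokogiri lower bound
```

A `>=` constraint has no upper bound, which matches the release note about relaxed dependency upper bounds.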
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Ghostwriter
4
- VERSION = '0.4.2'
4
+ VERSION = '1.2.1'
5
5
  end
@@ -3,52 +3,59 @@
3
3
  module Ghostwriter
4
4
  # Main Ghostwriter converter object.
5
5
  class Writer
6
- def initialize(html)
7
- @source_html = html
6
+ attr_reader :link_base, :heading_marker, :ul_marker, :ol_marker, :table_row, :table_column, :table_corner
7
+
8
+ # Creates a new ghostwriter
9
+ #
10
+ # @param [String] link_base the url to prefix relative links with
11
+ def initialize(link_base: '', heading_marker: '--', ul_marker: '-', ol_marker: '1',
12
+ table_column: '|', table_row: '-', table_corner: '|')
13
+ @link_base = link_base
14
+ @heading_marker = heading_marker
15
+ @ul_marker = ul_marker
16
+ @ol_marker = ol_marker
17
+ @table_column = table_column
18
+ @table_row = table_row
19
+ @table_corner = table_corner
20
+
21
+ freeze
8
22
  end
9
23
 
10
24
  # Strips HTML down to plain text.
11
25
  #
12
- # @param link_base the url to prefix relative links with
13
- def textify(link_base: '')
14
- html = normalize_whitespace(@source_html).gsub('</p>', "</p>\n\n")
26
+ # @param html [String] the HTML to convert to text
27
+ #
28
+ # @return converted text
29
+ def textify(html)
30
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
31
+
32
+ doc.search('style, script').remove
15
33
 
16
- doc = Nokogiri::HTML(html)
34
+ replace_anchors(doc)
35
+ replace_images(doc)
17
36
 
18
- doc.search('style').remove
19
- doc.search('script').remove
37
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
38
 
21
- replace_anchors(doc, link_base)
22
39
  replace_headers(doc)
40
+ replace_lists(doc)
23
41
  replace_tables(doc)
24
42
 
25
- simple_replace(doc, 'hr', "\n----------\n")
43
+ simple_replace(doc, 'hr', "\n----------\n\n")
26
44
  simple_replace(doc, 'br', "\n")
45
+ simple_replace(doc, 'p', "\n\n")
27
46
 
28
- # doc.search('p').each do |link_node|
29
- # link_node.inner_html = link_node.inner_html + "\n\n"
30
- # end
31
-
32
- # trim, but only single-space character
33
- doc.text.gsub(/^ +| +$/, '')
47
+ normalize_lines(doc)
34
48
  end
35
49
 
36
50
  private
37
51
 
38
- def normalize_whitespace(html)
39
- html.gsub(/\s/, ' ').squeeze(' ')
52
+ def normalize_lines(doc)
53
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
40
54
  end
41
55
 
42
- def replace_anchors(doc, link_base)
43
- base = get_link_base(doc, default: link_base)
44
-
56
+ def replace_anchors(doc)
45
57
  doc.search('a').each do |link_node|
46
- begin
47
- href = URI(link_node['href'])
48
- href = base + href.to_s unless href.absolute?
49
- rescue URI::InvalidURIError
50
- href = link_node['href'].gsub(/^(tel|mailto):/, '').strip
51
- end
58
+ href = get_link_target(link_node, get_link_base(doc))
52
59
 
53
60
  link_node.inner_html = if link_matches(href, link_node.inner_html)
54
61
  href.to_s
@@ -62,39 +69,96 @@ module Ghostwriter
62
69
  first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
63
70
  end
64
71
 
65
- def get_link_base(doc, default:)
72
+ def get_link_base(doc)
66
73
  # <base> node is unique by W3C spec
67
74
  base_node = doc.search('base').first
68
75
 
69
- base_node ? base_node['href'] : default
76
+ base_node ? base_node['href'] : @link_base
77
+ end
78
+
79
+ def get_link_target(link_node, base)
80
+ href = URI(link_node['href'])
81
+ if href.absolute?
82
+ href
83
+ else
84
+ base + href.to_s
85
+ end
86
+ rescue URI::InvalidURIError
87
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
70
88
  end
71
89
 
72
90
  def replace_headers(doc)
73
- doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
74
- node.inner_html = "- #{ node.inner_html } -\n".squeeze(' ')
91
+ doc.search('header, h1').each do |node|
92
+ node.replace("#{ @heading_marker } #{ node.inner_html } #{ @heading_marker }\n"
93
+ .squeeze(' '))
94
+ end
95
+
96
+ (2..6).each do |n|
97
+ doc.search("h#{ n }").each do |node|
98
+ node.replace("#{ @heading_marker * n } #{ node.inner_html } #{ @heading_marker * n }\n"
99
+ .squeeze(' '))
100
+ end
101
+ end
102
+ end
103
+
104
+ def replace_images(doc)
105
+ doc.search('img[role=presentation]').remove
106
+
107
+ doc.search('img').each do |img_node|
108
+ src = img_node['src']
109
+ alt = img_node['alt']
110
+
111
+ src = 'embedded' if src.start_with? 'data:'
112
+
113
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
114
+ end
115
+ end
116
+
117
+ def replace_lists(doc)
118
+ doc.search('ol').each do |list_node|
119
+ replace_list_items(list_node, @ol_marker, after_marker: '.', increment: true)
120
+ end
121
+
122
+ doc.search('ul').each do |list_node|
123
+ replace_list_items(list_node, @ul_marker)
124
+ end
125
+
126
+ doc.search('ul, ol').each do |list_node|
127
+ list_node.replace("#{ list_node.inner_html }\n")
128
+ end
129
+ end
130
+
131
+ def replace_list_items(list_node, marker, after_marker: '', increment: false)
132
+ list_node.search('./li').each do |list_item|
133
+ list_item.replace("#{ marker }#{ after_marker } #{ list_item.inner_html }\n")
134
+
135
+ marker = marker.next if increment
75
136
  end
76
137
  end
77
138
 
78
139
  def replace_tables(doc)
79
140
  doc.css('table').each do |table|
141
+ # remove whitespace between nodes
142
+ table.search('//text()[normalize-space()=""]').remove
143
+
80
144
  column_sizes = calculate_column_sizes(table)
81
145
 
82
146
  table.search('./thead/tr', './tbody/tr', './tr').each do |row|
83
147
  replace_table_nodes(row, column_sizes)
84
148
 
85
- row.inner_html = "#{ row.inner_html }|\n"
149
+ row.replace("#{ row.inner_html }#{ @table_column }\n")
86
150
  end
87
151
 
88
152
  add_table_header_underline(table, column_sizes)
89
153
 
90
- table.inner_html = "#{ table.inner_html }\n"
154
+ table.replace("\n#{ table.inner_html }\n")
91
155
  end
92
156
  end
93
157
 
94
158
  def calculate_column_sizes(table)
95
159
  column_sizes = table.search('tr').collect do |row|
96
160
  row.search('th', 'td').collect do |node|
97
- node.inner_html.length
161
+ node.text.length
98
162
  end
99
163
  end
100
164
 
@@ -102,25 +166,25 @@ module Ghostwriter
102
166
  end
103
167
 
104
168
  def add_table_header_underline(table, column_sizes)
105
- table.search('./thead').each do |row|
106
- header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
169
+ table.search('./thead').each do |thead|
170
+ lines = column_sizes.collect { |len| @table_row * (len + 2) }
171
+ underline_row = "#{ table_corner }#{ lines.join(@table_corner) }#{ @table_corner }"
107
172
 
108
- row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
173
+ thead.replace("#{ thead.inner_html }#{ underline_row }\n")
109
174
  end
110
175
  end
111
176
 
112
177
  def replace_table_nodes(row, column_sizes)
113
178
  row.search('th', 'td').each_with_index do |node, i|
114
- new_content = "| #{ node.inner_html }".squeeze(' ')
179
+ new_content = node.text.ljust(column_sizes[i] + 1)
115
180
 
116
- # +2 for the extra spacing between text and pipe
117
- node.inner_html = new_content.ljust(column_sizes[i] + 2)
181
+ node.replace("#{ @table_column } #{ new_content }")
118
182
  end
119
183
  end
120
184
 
121
185
  def simple_replace(doc, tag, replacement)
122
186
  doc.search(tag).each do |node|
123
- node.replace(replacement)
187
+ node.replace(node.inner_html + replacement)
124
188
  end
125
189
  end
126
190
  end
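The constructor in this hunk ends with `freeze`, which is what makes the writer immutable as of 1.1.0. A minimal sketch of the same pattern, using a hypothetical `SketchWriter` class with only two of the options:

```ruby
# Hypothetical class showing the keyword-options-then-freeze pattern.
class SketchWriter
  attr_reader :heading_marker, :ul_marker

  def initialize(heading_marker: '--', ul_marker: '-')
    @heading_marker = heading_marker
    @ul_marker      = ul_marker

    freeze # no instance variable can be reassigned after this point
  end
end

writer = SketchWriter.new(heading_marker: '#')
puts writer.frozen?

begin
  writer.instance_variable_set(:@heading_marker, '=')
rescue FrozenError => e
  puts "rejected: #{e.class}"
end
```

Note that `freeze` is shallow: it prevents reassigning the instance variables but does not freeze the objects they refer to.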
metadata CHANGED
@@ -1,29 +1,29 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ghostwriter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.2
4
+ version: 1.2.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Robin Miller
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2021-03-18 00:00:00.000000000 Z
11
+ date: 2021-10-29 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - '='
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: 1.8.4
19
+ version: '1.12'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - '='
24
+ - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: 1.8.4
26
+ version: '1.12'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: bundler
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -72,31 +72,32 @@ dependencies:
72
72
  requirements:
73
73
  - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '1.11'
75
+ version: '1.22'
76
76
  type: :development
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
80
  - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '1.11'
82
+ version: '1.22'
83
83
  - !ruby/object:Gem::Dependency
84
84
  name: rubocop-performance
85
85
  requirement: !ruby/object:Gem::Requirement
86
86
  requirements:
87
87
  - - "~>"
88
88
  - !ruby/object:Gem::Version
89
- version: '1.10'
89
+ version: '1.11'
90
90
  type: :development
91
91
  prerelease: false
92
92
  version_requirements: !ruby/object:Gem::Requirement
93
93
  requirements:
94
94
  - - "~>"
95
95
  - !ruby/object:Gem::Version
96
- version: '1.10'
97
- description: 'Transforms HTML into plaintext while preserving legibility and functionality.
96
+ version: '1.11'
97
+ description: |
98
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
98
99
 
99
- '
100
+ Ideal for providing a plaintext multipart segment of email messages.
100
101
  email:
101
102
  - robin@tenjin.ca
102
103
  executables: []
@@ -130,9 +131,9 @@ require_paths:
130
131
  - lib
131
132
  required_ruby_version: !ruby/object:Gem::Requirement
132
133
  requirements:
133
- - - "~>"
134
+ - - ">="
134
135
  - !ruby/object:Gem::Version
135
- version: '2.4'
136
+ version: '2.7'
136
137
  required_rubygems_version: !ruby/object:Gem::Requirement
137
138
  requirements:
138
139
  - - ">="
@@ -142,5 +143,5 @@ requirements: []
142
143
  rubygems_version: 3.1.2
143
144
  signing_key:
144
145
  specification_version: 4
145
- summary: Intelligently extracts plaintext from an HTML document.
146
+ summary: Converts HTML to plain text
146
147
  test_files: []