ghostwriter 0.3.0 → 1.0.1

checksums.yaml CHANGED
@@ -1,15 +1,7 @@
1
1
  ---
2
- !binary "U0hBMQ==":
3
- metadata.gz: !binary |-
4
- MjAxZjQzNDU2NTEyMDgwODc3NTAyYjQyYTRhNGZjOTEwMzg4MTE0Yw==
5
- data.tar.gz: !binary |-
6
- NjBjZWU4OWExMmYyOGU1NDMzNTI4MDhkNzEzMGU3YWFhNjdiM2M0MA==
2
+ SHA256:
3
+ metadata.gz: 2868e207e695355f8e9f40521b38d1f96b72b45c9ab73207396a87e5b4d535cd
4
+ data.tar.gz: cf111d734daa4bf94e4d9c2924dbd6c7d12b38b2b26a55db79110730f306fccc
7
5
  SHA512:
8
- metadata.gz: !binary |-
9
- ZDcxMDQ5N2IzMTU0YzlhMzZmMDBkODk2NDVhZjEwZTkyOTY0ZDFmMzZiMmYw
10
- NDc3ZmY5NjgyNzkxOTUzMDg4YWU2NmNkZDgyMDdkNjc4OTMzMmIzYWY1MWY2
11
- OWRhMzgxNjc5ODM4Yjc4NGE5ZTBmODBmNGUzOGU3NDY2YTA5NDk=
12
- data.tar.gz: !binary |-
13
- NTIzYTdjNmRmZTRkMTc4NGIxZWJiYjBhYWVkMDI4ZDM1NTc4NDQ4ZTdkZDZk
14
- YTVlN2Y1OGMzYTg3MjlkYmNhMjUyMGQwNTZlMmIzNjYxYmQyMDIxMjJiOTVi
15
- N2YzMzczODBkZWMxZmY2MmJkYzJkYjQxZjJlZjBjZTM2OWJkMjQ=
6
+ metadata.gz: 8d71989a44d8d2da33496172c600ec38063100c53642850982a9a93ccefbea37e40847e93edfde09d3b0f0dad98f457296f01c070387d86b11cc06e2ee9e04c1
7
+ data.tar.gz: 0767f0d24a895477aee922bd960380608185b741c0d87975bba65eaa03270539e2af2776bbcef2f689b604b76fde7882dbe9322e522b190858c2699359ce8a3b
data/.rubocop.yml ADDED
@@ -0,0 +1 @@
1
+ inherit_from: ../.rubocop.yml
data/.ruby-version CHANGED
@@ -1 +1 @@
1
- ruby-1.9.3
1
+ ruby-2.7.1
data/Gemfile CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  source 'https://rubygems.org'
2
4
 
3
5
  # Specify gem dependencies in ghostwriter.gemspec
data/README.md CHANGED
@@ -1,6 +1,15 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites your emails to conform to varying email client requirements.
3
+ A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
+
5
+ It's sort of like a reverse-markdown or a very, very simple screen reader.
6
+
7
+ ## But Why, Though?
8
+
9
+ * Some email clients won't or can’t handle HTML at all
10
+ * Some people explicitly choose plaintext just by preference or accessibility
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people, I will yell
12
+ at you)
4
13
 
5
14
  ## Installation
6
15
 
@@ -12,86 +21,329 @@ gem 'ghostwriter'
12
21
 
13
22
  And then execute:
14
23
 
15
- $ bundle
24
+ bundle
16
25
 
17
26
  Or install it manually with:
18
27
 
19
- $ gem install ghostwriter
28
+ gem install ghostwriter
20
29
 
21
30
  ## Usage
22
31
 
23
- ###Stripping HTML
32
+ Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
33
+
34
+ ```ruby
35
+ html = <<~HTML
36
+ <html>
37
+ <body>
38
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
39
+ <p>It handles other stuff, too.</p>
40
+ <hr>
41
+ <h1>Stuff Like</h1>
42
+ <ul>
43
+ <li>Images</li>
44
+ <li>Lists</li>
45
+ <li>Tables</li>
46
+ <li>And more</li>
47
+ </ul>
48
+ </body>
49
+ </html>
50
+ HTML
51
+
52
+ ghostwriter = Ghostwriter::Writer.new
53
+
54
+ puts ghostwriter.textify(html)
55
+ ```
56
+
57
+ Produces:
24
58
 
25
- Transform HTML into plaintext while preserving as much legibility and functionality as possible.
26
- It's prime use is in quickly producing an automatic plaintext version of HTML emails.
59
+ ```
60
+ This is some text with a link (tenjin.ca)
27
61
 
28
- Why offer plaintext?
62
+ It handles other stuff, too.
29
63
 
30
- * Spam filters prefer included plain text alternative
31
- * Some email clients and apps can’t handle HTML
32
- * Some people explicitly choose plaintext, either by requirement or simple preference
33
64
 
34
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
65
+ ----------
35
66
 
36
- ```ruby
37
- html = '<html><body>This is some markup <a href="tenjin.ca">and a link</a><p>Other tags translate, too</p></body></html>'
67
+ -- Stuff Like --
68
+ - Images
69
+ - Lists
70
+ - Tables
71
+ - And more
72
+ ```
73
+
74
+ ### Links
75
+
76
+ Links are converted to the link text followed by the link target in brackets:
77
+
78
+ ```html
79
+ Visit our <a href="https://example.com">Website</a>
80
+ ```
81
+
82
+ Becomes:
83
+
84
+ ```
85
+ Visit our Website (https://example.com)
86
+ ```
87
+
88
+ #### Relative Links
89
+
90
+ Since emails are wholly distinct from your web address, relative links might break.
91
+
92
+ To avoid this problem, either use the `<base>` header tag:
93
+
94
+ ```html
95
+
96
+ <html>
97
+ <head>
98
+ <base href="https://www.example.com">
99
+ </head>
100
+ <body>
101
+ Use the base tag to <a href="/contact">expand</a> links.
102
+ </body>
103
+ </html>
104
+ ```
38
105
 
39
- Ghostwriter::Writer.new(html).textify
106
+ Becomes:
40
107
 
41
- => "This is some markup and a link (tenjin.ca)\nOther tags translate, too\n\n"
42
-
108
+ ```
109
+ Use the base tag to expand (https://www.example.com/contact) links.
43
110
  ```
44
111
 
45
- `#textify` will use a `<base>` tag if included in the HTML source, or if one is provided explicitly:
112
+ Or you can use the `link_base` configuration:
46
113
 
47
114
  ```ruby
48
- html = '<html><body>Relative links <a href="/contact">Link</a></body></html>'
115
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
116
+ ```
117
+
118
+ ### Images
119
+
120
+ Images with alt text are converted:
121
+
122
+ ```html
123
+ <img src="logo.jpg" alt="ACME Anvils" />
124
+ ```
125
+
126
+ Becomes:
127
+
128
+ ```
129
+ ACME Anvils (logo.jpg)
130
+ ```
131
+
132
+ But images lacking alt text or with a presentation ARIA role are ignored:
133
+
134
+ ```html
135
+ <!-- these will just become an empty string -->
136
+ <img src="decoration.jpg">
137
+ <img src="logo.jpg" role="presentation">
138
+ ```
139
+
140
+ And images with data URIs won't include the data portion.
141
+
142
+ ```html
143
+
144
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
145
+ alt="Data picture" />
146
+ ```
147
+
148
+ Becomes:
149
+
150
+ ```
151
+ Data picture (embedded)
152
+ ```
153
+
154
+ ### Paragraphs and Linebreaks
155
+
156
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
157
+
158
+ ```html
159
+ <p>I would like to propose a toast.</p>
160
+ <p>This meal we enjoy together would be improved by one.</p>
161
+ <br />
162
+ <p>... Plug in the toaster and I'll get the bread.</p>
163
+ ```
164
+
165
+ ```
166
+ I would like to propose a toast.
167
+
168
+ This meal we enjoy together would be improved by one.
169
+
170
+
171
+ ... Plug in the toaster and I'll get the bread.
172
+
173
+ ```
174
+
175
+ ### Headers
176
+
177
+ For now, headers are all treated the same and given a simple marker:
178
+
179
+ ```html
180
+ <h1>Dog Maintenance and Repair</h1>
181
+ <h2>Food Input Port</h2>
182
+ <h3>Exhaust Port Considerations</h3>
183
+ ```
184
+
185
+ Becomes:
186
+
187
+ ```
188
+ -- Dog Maintenance and Repair --
189
+ -- Food Input Port --
190
+ -- Exhaust Port Considerations --
191
+ ```
192
+
193
+ ### Lists
194
+
195
+ Lists are converted, too. They are padded with newlines and are given simple markers:
196
+
197
+ ```html
198
+
199
+ <ul>
200
+ <li>Planes</li>
201
+ <li>Trains</li>
202
+ <li>Automobiles</li>
203
+ </ul>
204
+ <ol>
205
+ <li>I get knocked down</li>
206
+ <li>I get up again</li>
207
+ <li>Never gonna keep me down</li>
208
+ </ol>
209
+ ```
210
+
211
+ Becomes:
212
+
213
+ ```
49
214
 
50
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
215
+ - Planes
216
+ - Trains
217
+ - Automobiles
51
218
 
52
- => "Relative links Link (tenjin.ca/contact)"
219
+ 1. I get knocked down
220
+ 2. I get up again
221
+ 3. Never gonna keep me down
53
222
 
54
223
  ```
55
224
 
56
- #### Mail Gem Example
225
+ ### Tables
226
+
227
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
228
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
229
+
230
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
231
+
232
+ ```html
233
+
234
+ <table>
235
+ <thead>
236
+ <tr>
237
+ <th>Ship</th>
238
+ <th>Captain</th>
239
+ </tr>
240
+ </thead>
241
+ <tbody>
242
+ <tr>
243
+ <td>Enterprise</td>
244
+ <td>Jean-Luc Picard</td>
245
+ </tr>
246
+ <tr>
247
+ <td>TARDIS</td>
248
+ <td>The Doctor</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Planet Express Ship</td>
252
+ <td>Turanga Leela</td>
253
+ </tr>
254
+ </tbody>
255
+ </table>
256
+ ```
257
+
258
+ Becomes:
259
+
260
+ ```
261
+ | Ship | Captain |
262
+ |---------------------|-----------------|
263
+ | Enterprise | Jean-Luc Picard |
264
+ | TARDIS | The Doctor |
265
+ | Planet Express Ship | Turanga Leela |
266
+ ```
267
+
268
+ ### Presentation ARIA Role
269
+
270
+ Lists and tables with `role="presentation"` will be treated as a simple container and the normal behaviour will be
271
+ suppressed.
272
+
273
+ ```html
274
+
275
+ <table role="presentation">
276
+ <tr>
277
+ <td>The table is a lie</td>
278
+ </tr>
279
+ </table>
280
+ <ul role="presentation">
281
+ <li>No such list</li>
282
+ </ul>
283
+ ```
284
+
285
+ Becomes:
286
+
287
+ ```
288
+ The table is a lie
289
+ No such list
290
+ ```
291
+
292
+ ### Mail Gem Example
57
293
 
58
- To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html through Ghostwriter:
294
+ To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html
295
+ through Ghostwriter:
59
296
 
60
297
  ```ruby
61
298
  require 'mail'
62
299
 
63
- html = 'My email and a <a href="http://tenjin.ca">link</a>'
300
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
301
+ ghostwriter = Ghostwriter::Writer.new
64
302
 
65
- mail = Mail.deliver do
66
- to 'bob@example.com'
67
- from 'dot@example.com'
68
- subject 'Using Ghostwriter with Mail'
303
+ Mail.deliver do
304
+ to 'bob@example.com'
305
+ from 'dot@example.com'
306
+ subject 'Using Ghostwriter with Mail'
69
307
 
70
- html_part do
308
+ html_part do
71
309
  content_type 'text/html; charset=UTF-8'
72
310
  body html
73
- end
74
-
75
- text_part do
76
- body Ghostwriter::Writer.new(html).textify
77
- end
311
+ end
312
+
313
+ text_part do
314
+ body ghostwriter.textify(html)
315
+ end
78
316
  end
79
317
 
80
318
  ```
81
319
 
82
320
  ## Contributing
321
+
83
322
  Bug reports and pull requests are welcome on GitHub at https://github.com/TenjinInc/ghostwriter
84
323
 
85
- This project is intended to be a friendly space for collaboration, and contributors are expected to adhere to the
324
+ This project is intended to be a friendly space for collaboration, and contributors are expected to adhere to the
86
325
  [Contributor Covenant](contributor-covenant.org) code of conduct.
87
326
 
88
327
  ### Core Developers
89
- After checking out the repo, run `bundle install` to install dependencies. Then, run `rake spec` to run the tests.
90
- You can also run `bin/console` for an interactive prompt that will allow you to experiment.
91
328
 
92
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the
93
- version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version,
94
- push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
329
+ After checking out the repo, run `bundle install` to install dependencies. Then, run `rake spec` to run the tests. You
330
+ can also run `bin/console` for an interactive prompt that will allow you to experiment.
331
+
332
+ #### Local Install
333
+
334
+ To install this gem onto your local machine only, run
335
+
336
+ `bundle exec rake install`
337
+
338
+ #### Gem Release
339
+
340
+ To release a gem to the world at large
341
+
342
+ 1. Update the version number in `version.rb`,
343
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
344
+ the `.gem` file to [rubygems.org](https://rubygems.org).
345
+ 3. Do a wee dance
95
346
 
96
347
  ## License
348
+
97
349
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/RELEASE_NOTES.md ADDED
@@ -0,0 +1,91 @@
1
+ # Release Notes
2
+
3
+ ## 1.0.1 (2021-03-22)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Updated README
12
+
13
+ ### Bugfixes
14
+
15
+ * Fixed hr padding behaviour
16
+
17
+ ## 1.0.0 (2021-03-21)
18
+
19
+ ### Major
20
+
21
+ * Moved `link_base` parameter to constructor
22
+ * Moved input HTML parameter to `#textify`
23
+
24
+ ### Minor
25
+
26
+ * Treats tables and lists with role="presentation" as simple containers
27
+ * Now handles ordered and unordered lists
28
+ * Images are now replaced with their alt text
29
+
30
+ ### Bugfixes
31
+
32
+ * none
33
+
34
+ ## 0.4.2 (2021-03-17)
35
+
36
+ ### Major
37
+
38
+ * none
39
+
40
+ ### Minor
41
+
42
+ * none
43
+
44
+ ### Bugfixes
45
+
46
+ * Works with links using `tel:` and `mailto:` schemas.
47
+
48
+ ## 0.4.1 (2021-03-17)
49
+
50
+ ### Major
51
+
52
+ * none
53
+
54
+ ### Minor
55
+
56
+ * No longer provides link target in brackets after link text when they are the same
57
+
58
+ ### Bugfixes
59
+
60
+ * Added explicit testing for HTML entity interpretation
61
+
62
+ ## 0.4.0 (2021-03-16)
63
+
64
+ ### Major
65
+
66
+ * Updated gem dependencies
67
+
68
+ ### Minor
69
+
70
+ * Updated docs
71
+ * Added support for tables
72
+
73
+ ### Bugfixes
74
+
75
+ * none
76
+
77
+ ## 0.3.0 (2016-03-06)
78
+
79
+ ### Major
80
+
81
+ * Renamed to Ghostwriter
82
+
83
+ ### Minor
84
+
85
+ * Docs: Added instruction for using textify with mail gem
86
+
87
+ ### Bugfixes
88
+
89
+ * none
90
+
91
+
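The 0.4.1 note above ("No longer provides link target in brackets after link text when they are the same") can be illustrated with a small standalone sketch. It follows the comparison used by the `link_matches` helper in the Writer diff, but runs without Nokogiri. The names `link_matches?` and `render_link` are hypothetical and exist only for this illustration; they are not part of the gem's public API.

```ruby
# Standalone sketch, not the gem's implementation.
# Two strings count as "the same link" when they are equal after dropping
# a leading http:// or https:// scheme and one trailing slash.
def link_matches?(href, text)
  normalize = ->(value) { value.to_s.gsub(%r{^https?://}, '').chomp('/') }
  normalize.call(href) == normalize.call(text)
end

# Hypothetical helper: renders link text, adding the target in brackets
# only when it tells the reader something the text does not.
def render_link(text, href)
  link_matches?(href, text) ? href.to_s : "#{text} (#{href})"
end

puts render_link('Website', 'https://example.com')
puts render_link('example.com', 'https://example.com/')
```

In this sketch the first call keeps both the text and the target, while the second collapses to the target alone because the text already names it.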
data/Rakefile CHANGED
@@ -1,6 +1,8 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'bundler/gem_tasks'
2
4
  require 'rspec/core/rake_task'
3
5
 
4
6
  RSpec::Core::RakeTask.new(:spec)
5
7
 
6
- task :default => :spec
8
+ task default: :spec
data/bin/console CHANGED
@@ -1,4 +1,6 @@
1
1
  #!/usr/bin/env ruby
2
+ #
3
+ # frozen_string_literal: true
2
4
 
3
5
  require 'bundler/setup'
4
6
  require 'ghostwriter'
@@ -10,5 +12,5 @@ require 'ghostwriter'
10
12
  # require "pry"
11
13
  # Pry.start
12
14
 
13
- require "irb"
15
+ require 'irb'
14
16
  IRB.start
data/dirt-textify.gemspec CHANGED
@@ -1,27 +1,39 @@
1
- # coding: utf-8
2
- lib = File.expand_path('../lib', __FILE__)
1
+ # frozen_string_literal: true
2
+
3
+ lib = File.expand_path('lib', __dir__)
3
4
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
5
  require 'ghostwriter/version'
5
6
 
6
- Gem::Specification.new do |gemspec|
7
- gemspec.name = 'ghostwriter'
8
- gemspec.version = Ghostwriter::VERSION
9
- gemspec.authors = ['Robin Miller']
10
- gemspec.email = ['robin@tenjin.ca']
7
+ Gem::Specification.new do |spec|
8
+ spec.name = 'ghostwriter'
9
+ spec.version = Ghostwriter::VERSION
10
+ spec.authors = ['Robin Miller']
11
+ spec.email = ['robin@tenjin.ca']
12
+
13
+ spec.summary = 'Converts HTML to plain text'
14
+ spec.description = <<~DESC
15
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
16
+
17
+ Ideal for providing a plaintext multipart segment of email messages.
18
+ DESC
19
+ spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
20
+ spec.license = 'MIT'
21
+
22
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
23
+ f.match(%r{^(test|spec|features)/})
24
+ end
11
25
 
12
- gemspec.summary = %q{Intelligently extracts plaintext from an HTML document.}
13
- gemspec.description = %q{Transforms HTML into plaintext while preserving legibility and functionality. }
14
- gemspec.homepage = 'https://github.com/TenjinInc/ghostwriter'
15
- gemspec.license = 'MIT'
26
+ spec.bindir = 'exe'
27
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
28
+ spec.require_paths = ['lib']
16
29
 
17
- gemspec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
18
- gemspec.bindir = 'exe'
19
- gemspec.executables = gemspec.files.grep(%r{^exe/}) { |f| File.basename(f) }
20
- gemspec.require_paths = ['lib']
30
+ spec.required_ruby_version = '~> 2.4'
21
31
 
22
- gemspec.add_dependency 'nokogiri', '~> 1.6'
32
+ spec.add_dependency 'nokogiri', '= 1.8.4'
23
33
 
24
- gemspec.add_development_dependency 'bundler', '~> 1.10'
25
- gemspec.add_development_dependency 'rake', '~> 10.0'
26
- gemspec.add_development_dependency 'rspec', '~> 3.3'
34
+ spec.add_development_dependency 'bundler', '~> 2.2'
35
+ spec.add_development_dependency 'rake', '~> 13.0'
36
+ spec.add_development_dependency 'rspec', '~> 3.3'
37
+ spec.add_development_dependency 'rubocop', '~> 1.11'
38
+ spec.add_development_dependency 'rubocop-performance', '~> 1.10'
27
39
  end
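The gemspec above tightens two constraints: `required_ruby_version` becomes `'~> 2.4'` and the nokogiri dependency becomes an exact pin, `'= 1.8.4'`. As a rough illustration of what those requirement strings accept, the sketch below evaluates them with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `allowed?` and the sample version numbers are assumptions chosen for this example.

```ruby
require 'rubygems' # normally already loaded; kept so the sketch is explicit

# The two requirement strings as they appear in the 1.0.1 gemspec.
ruby_constraint = Gem::Requirement.new('~> 2.4')
nokogiri_pin    = Gem::Requirement.new('= 1.8.4')

# Hypothetical helper for this illustration only.
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

# '~> 2.4' is the pessimistic operator: at least 2.4, below 3.0.
puts allowed?(ruby_constraint, '2.7.1') # true
puts allowed?(ruby_constraint, '3.0.0') # false
puts allowed?(ruby_constraint, '2.3.8') # false

# '= 1.8.4' accepts exactly one release.
puts allowed?(nokogiri_pin, '1.8.4')    # true
puts allowed?(nokogiri_pin, '1.8.5')    # false
```

One practical consequence, under these constraints, is that the gem would not install on Ruby 3.x and would not resolve alongside any other nokogiri release.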
data/lib/ghostwriter.rb CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'ghostwriter/version'
2
4
  require 'ghostwriter/writer'
3
5
  require 'nokogiri'
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Ghostwriter
2
- VERSION = '0.3.0'
4
+ VERSION = '1.0.1'
3
5
  end
@@ -1,54 +1,164 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Ghostwriter
4
+ # Main Ghostwriter converter object.
2
5
  class Writer
3
- def initialize(html)
4
- @source_html = html
6
+ # Creates a new ghostwriter
7
+ #
8
+ # @param [String] link_base the url to prefix relative links with
9
+ def initialize(link_base: '')
10
+ @link_base = link_base
11
+ @list_marker = '-'
5
12
  end
6
13
 
7
- # Intelligently strips HTML down to text.
14
+ # Strips HTML down to plain text.
15
+ #
16
+ # @param html [String] the HTML to be converted to text
8
17
  #
9
- # Options:
10
- # link_base: the url to prefix relative links with
11
- def textify(options={})
12
- html = @source_html.dup
18
+ # @return converted text
19
+ def textify(html)
20
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
13
21
 
14
- html.gsub!(/\n|\t/, ' ')
15
- html.squeeze!(' ')
22
+ doc.search('style, script').remove
16
23
 
17
- html.gsub!('</p>', "</p>\n\n")
24
+ replace_anchors(doc)
25
+ replace_images(doc)
18
26
 
19
- doc = Nokogiri::HTML(html)
27
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
28
 
21
- doc.search('style').remove
22
- doc.search('script').remove
29
+ replace_headers(doc)
30
+ replace_lists(doc)
31
+ replace_tables(doc)
23
32
 
24
- base = doc.search('base').first #<base> is unique by W3C spec
33
+ simple_replace(doc, 'hr', "\n----------\n\n")
34
+ simple_replace(doc, 'br', "\n")
35
+ simple_replace(doc, 'p', "\n\n")
25
36
 
26
- base_url = base ? base['href'] : options[:link_base] || ''
37
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
38
+ end
39
+
40
+ private
27
41
 
42
+ def normalize_whitespace(html)
43
+ html.gsub(/\s/, ' ').squeeze(' ')
44
+ end
45
+
46
+ def replace_anchors(doc)
28
47
  doc.search('a').each do |link_node|
29
- href = URI(link_node['href'])
30
- href = base_url + href.to_s unless href.absolute?
48
+ href = get_link_target(link_node, get_link_base(doc))
49
+
50
+ link_node.inner_html = if link_matches(href, link_node.inner_html)
51
+ href.to_s
52
+ else
53
+ "#{ link_node.inner_html } (#{ href })"
54
+ end
55
+ end
56
+ end
57
+
58
+ def link_matches(first, second)
59
+ first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
60
+ end
61
+
62
+ def get_link_base(doc)
63
+ # <base> node is unique by W3C spec
64
+ base_node = doc.search('base').first
65
+
66
+ base_node ? base_node['href'] : @link_base
67
+ end
31
68
 
32
- link_node.inner_html = "#{link_node.inner_html} (#{href})"
69
+ def get_link_target(link_node, base)
70
+ href = URI(link_node['href'])
71
+ if href.absolute?
72
+ href
73
+ else
74
+ base + href.to_s
33
75
  end
76
+ rescue URI::InvalidURIError
77
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
78
+ end
34
79
 
80
+ def replace_headers(doc)
35
81
  doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
36
- node.inner_html = "- #{node.inner_html} -\n".squeeze(' ')
82
+ node.inner_html = "-- #{ node.inner_html } --\n".squeeze(' ')
37
83
  end
84
+ end
38
85
 
39
- doc.search('hr').each do |node|
40
- node.replace "\n----------\n"
86
+ def replace_images(doc)
87
+ doc.search('img[role=presentation]').remove
88
+
89
+ doc.search('img').each do |img_node|
90
+ src = img_node['src']
91
+ alt = img_node['alt']
92
+
93
+ src = 'embedded' if src.start_with? 'data:'
94
+
95
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
41
96
  end
97
+ end
98
+
99
+ def replace_lists(doc)
100
+ doc.search('ul, ol').each do |list_node|
101
+ list_node.search('./li').each_with_index do |list_item, i|
102
+ marker = if list_node.node_name == 'ol'
103
+ "#{ i + 1 }."
104
+ else
105
+ @list_marker
106
+ end
42
107
 
43
- doc.search('br').each do |node|
44
- node.replace "\n"
108
+ list_item.inner_html = "#{ marker } #{ list_item.inner_html }\n".squeeze(' ')
109
+ end
110
+
111
+ list_node.replace("#{ list_node.inner_html }\n")
45
112
  end
113
+ end
114
+
115
+ def replace_tables(doc)
116
+ doc.css('table').each do |table|
117
+ column_sizes = calculate_column_sizes(table)
46
118
 
47
- # doc.search('p').each do |link_node|
48
- # link_node.inner_html = link_node.inner_html + "\n\n"
49
- # end
119
+ table.search('./thead/tr', './tbody/tr', './tr').each do |row|
120
+ replace_table_nodes(row, column_sizes)
50
121
 
51
- doc.text.gsub(/^[ ]+|[ ]+$/, '')
122
+ row.inner_html = "#{ row.inner_html }|\n"
123
+ end
124
+
125
+ add_table_header_underline(table, column_sizes)
126
+
127
+ table.inner_html = "#{ table.inner_html }\n"
128
+ end
129
+ end
130
+
131
+ def calculate_column_sizes(table)
132
+ column_sizes = table.search('tr').collect do |row|
133
+ row.search('th', 'td').collect do |node|
134
+ node.inner_html.length
135
+ end
136
+ end
137
+
138
+ column_sizes.transpose.collect(&:max)
139
+ end
140
+
141
+ def add_table_header_underline(table, column_sizes)
142
+ table.search('./thead').each do |row|
143
+ header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
144
+
145
+ row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
146
+ end
147
+ end
148
+
149
+ def replace_table_nodes(row, column_sizes)
150
+ row.search('th', 'td').each_with_index do |node, i|
151
+ new_content = "| #{ node.inner_html }".squeeze(' ')
152
+
153
+ # +2 for the extra spacing between text and pipe
154
+ node.inner_html = new_content.ljust(column_sizes[i] + 2)
155
+ end
156
+ end
157
+
158
+ def simple_replace(doc, tag, replacement)
159
+ doc.search(tag).each do |node|
160
+ node.replace(node.inner_html + replacement)
161
+ end
52
162
  end
53
163
  end
54
- end
164
+ end
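The table handling in the Writer diff above sizes each column to its widest cell (`calculate_column_sizes`), pads cells with `ljust`, and adds a dashed underline after the header. The sketch below shows the same idea on plain arrays of strings, without Nokogiri. It is an illustration under stated assumptions, not the gem's code: `format_table` is a hypothetical name, the first row is assumed to be the header, and every row is assumed to have the same number of cells (which `transpose` requires).

```ruby
# Widest cell per column: measure every cell, then flip rows into columns.
def column_sizes(rows)
  rows.map { |row| row.map(&:length) }.transpose.map(&:max)
end

# Hypothetical helper for this illustration: renders rows as a pipe table
# and underlines the first row as the header.
def format_table(rows)
  sizes = column_sizes(rows)

  lines = rows.map do |row|
    cells = row.each_with_index.map { |cell, i| cell.ljust(sizes[i]) }
    "| #{cells.join(' | ')} |"
  end

  # +2 covers the single space of padding on each side of a cell
  underline = "|#{sizes.map { |len| '-' * (len + 2) }.join('|')}|"

  lines.insert(1, underline).join("\n")
end

rows = [
  ['Ship', 'Captain'],
  ['TARDIS', 'The Doctor']
]

puts format_table(rows)
```

Because every row is padded to the same column widths, the pipes and the underline line up in a monospaced plain-text email.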
metadata CHANGED
@@ -1,86 +1,119 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ghostwriter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.0
4
+ version: 1.0.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Robin Miller
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2016-03-07 00:00:00.000000000 Z
11
+ date: 2021-03-23 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - ~>
17
+ - - '='
18
18
  - !ruby/object:Gem::Version
19
- version: '1.6'
19
+ version: 1.8.4
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - ~>
24
+ - - '='
25
25
  - !ruby/object:Gem::Version
26
- version: '1.6'
26
+ version: 1.8.4
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: bundler
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - ~>
31
+ - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: '1.10'
33
+ version: '2.2'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - ~>
38
+ - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: '1.10'
40
+ version: '2.2'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rake
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - ~>
45
+ - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '10.0'
47
+ version: '13.0'
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - ~>
52
+ - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '10.0'
54
+ version: '13.0'
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: rspec
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
- - - ~>
59
+ - - "~>"
60
60
  - !ruby/object:Gem::Version
61
61
  version: '3.3'
62
62
  type: :development
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
- - - ~>
66
+ - - "~>"
67
67
  - !ruby/object:Gem::Version
68
68
  version: '3.3'
69
- description: ! 'Transforms HTML into plaintext while preserving legibility and functionality. '
69
+ - !ruby/object:Gem::Dependency
70
+ name: rubocop
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '1.11'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '1.11'
83
+ - !ruby/object:Gem::Dependency
84
+ name: rubocop-performance
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '1.10'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '1.10'
97
+ description: |
98
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
99
+
100
+ Ideal for providing a plaintext multipart segment of email messages.
70
101
  email:
71
102
  - robin@tenjin.ca
72
103
  executables: []
73
104
  extensions: []
74
105
  extra_rdoc_files: []
75
106
  files:
76
- - .gitignore
77
- - .rspec
78
- - .ruby-version
79
- - .travis.yml
107
+ - ".gitignore"
108
+ - ".rspec"
109
+ - ".rubocop.yml"
110
+ - ".ruby-version"
111
+ - ".travis.yml"
80
112
  - CODE_OF_CONDUCT.md
81
113
  - Gemfile
82
114
  - LICENSE.txt
83
115
  - README.md
116
+ - RELEASE_NOTES.md
84
117
  - Rakefile
85
118
  - bin/console
86
119
  - bin/setup
@@ -98,18 +131,17 @@ require_paths:
98
131
  - lib
99
132
  required_ruby_version: !ruby/object:Gem::Requirement
100
133
  requirements:
101
- - - ! '>='
134
+ - - "~>"
102
135
  - !ruby/object:Gem::Version
103
- version: '0'
136
+ version: '2.4'
104
137
  required_rubygems_version: !ruby/object:Gem::Requirement
105
138
  requirements:
106
- - - ! '>='
139
+ - - ">="
107
140
  - !ruby/object:Gem::Version
108
141
  version: '0'
109
142
  requirements: []
110
- rubyforge_project:
111
- rubygems_version: 2.4.3
143
+ rubygems_version: 3.1.2
112
144
  signing_key:
113
145
  specification_version: 4
114
- summary: Intelligently extracts plaintext from an HTML document.
146
+ summary: Converts HTML to plain text
115
147
  test_files: []