ghostwriter 0.3.0 → 1.0.1

checksums.yaml CHANGED
@@ -1,15 +1,7 @@
1
1
  ---
2
- !binary "U0hBMQ==":
3
- metadata.gz: !binary |-
4
- MjAxZjQzNDU2NTEyMDgwODc3NTAyYjQyYTRhNGZjOTEwMzg4MTE0Yw==
5
- data.tar.gz: !binary |-
6
- NjBjZWU4OWExMmYyOGU1NDMzNTI4MDhkNzEzMGU3YWFhNjdiM2M0MA==
2
+ SHA256:
3
+ metadata.gz: 2868e207e695355f8e9f40521b38d1f96b72b45c9ab73207396a87e5b4d535cd
4
+ data.tar.gz: cf111d734daa4bf94e4d9c2924dbd6c7d12b38b2b26a55db79110730f306fccc
7
5
  SHA512:
8
- metadata.gz: !binary |-
9
- ZDcxMDQ5N2IzMTU0YzlhMzZmMDBkODk2NDVhZjEwZTkyOTY0ZDFmMzZiMmYw
10
- NDc3ZmY5NjgyNzkxOTUzMDg4YWU2NmNkZDgyMDdkNjc4OTMzMmIzYWY1MWY2
11
- OWRhMzgxNjc5ODM4Yjc4NGE5ZTBmODBmNGUzOGU3NDY2YTA5NDk=
12
- data.tar.gz: !binary |-
13
- NTIzYTdjNmRmZTRkMTc4NGIxZWJiYjBhYWVkMDI4ZDM1NTc4NDQ4ZTdkZDZk
14
- YTVlN2Y1OGMzYTg3MjlkYmNhMjUyMGQwNTZlMmIzNjYxYmQyMDIxMjJiOTVi
15
- N2YzMzczODBkZWMxZmY2MmJkYzJkYjQxZjJlZjBjZTM2OWJkMjQ=
6
+ metadata.gz: 8d71989a44d8d2da33496172c600ec38063100c53642850982a9a93ccefbea37e40847e93edfde09d3b0f0dad98f457296f01c070387d86b11cc06e2ee9e04c1
7
+ data.tar.gz: 0767f0d24a895477aee922bd960380608185b741c0d87975bba65eaa03270539e2af2776bbcef2f689b604b76fde7882dbe9322e522b190858c2699359ce8a3b
data/.rubocop.yml ADDED
@@ -0,0 +1 @@
1
+ inherit_from: ../.rubocop.yml
data/.ruby-version CHANGED
@@ -1 +1 @@
1
- ruby-1.9.3
1
+ ruby-2.7.1
data/Gemfile CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  source 'https://rubygems.org'
2
4
 
3
5
  # Specify gem dependencies in ghostwriter.gemspec
data/README.md CHANGED
@@ -1,6 +1,15 @@
1
1
  # Ghostwriter
2
2
 
3
- Ghostwriter rewrites your emails to conform to varying email client requirements.
3
+ A ruby gem that converts HTML to plain text, preserving as much legibility and functionality as possible.
4
+
5
+ It's sort of like a reverse-markdown or a very, very simple screen reader.
6
+
7
+ ## But Why, Though?
8
+
9
+ * Some email clients won't or can’t handle HTML at all
10
+ * Some people explicitly choose plaintext just by preference or accessibility
11
+ * Spam filters tend to prefer emails with a plain text alternative (but if you use this gem to spam people, I will yell
12
+ at you)
4
13
 
5
14
  ## Installation
6
15
 
@@ -12,86 +21,329 @@ gem 'ghostwriter'
12
21
 
13
22
  And then execute:
14
23
 
15
- $ bundle
24
+ bundle
16
25
 
17
26
  Or install it manually with:
18
27
 
19
- $ gem install ghostwriter
28
+ gem install ghostwriter
20
29
 
21
30
  ## Usage
22
31
 
23
- ###Stripping HTML
32
+ Create a `Ghostwriter::Writer` and call `#textify` with the html string you want modified:
33
+
34
+ ```ruby
35
+ html = <<~HTML
36
+ <html>
37
+ <body>
38
+ <p>This is some text with <a href="tenjin.ca">a link</a></p>
39
+ <p>It handles other stuff, too.</p>
40
+ <hr>
41
+ <h1>Stuff Like</h1>
42
+ <ul>
43
+ <li>Images</li>
44
+ <li>Lists</li>
45
+ <li>Tables</li>
46
+ <li>And more</li>
47
+ </ul>
48
+ </body>
49
+ </html>
50
+ HTML
51
+
52
+ ghostwriter = Ghostwriter::Writer.new
53
+
54
+ puts ghostwriter.textify(html)
55
+ ```
56
+
57
+ Produces:
24
58
 
25
- Transform HTML into plaintext while preserving as much legibility and functionality as possible.
26
- It's prime use is in quickly producing an automatic plaintext version of HTML emails.
59
+ ```
60
+ This is some text with a link (tenjin.ca)
27
61
 
28
- Why offer plaintext?
62
+ It handles other stuff, too.
29
63
 
30
- * Spam filters prefer included plain text alternative
31
- * Some email clients and apps can’t handle HTML
32
- * Some people explicitly choose plaintext, either by requirement or simple preference
33
64
 
34
- Create a `Ghostwriter::Writer` with the html you want modified, and call `#textify`:
65
+ ----------
35
66
 
36
- ```ruby
37
- html = '<html><body>This is some markup <a href="tenjin.ca">and a link</a><p>Other tags translate, too</p></body></html>'
67
+ -- Stuff Like --
68
+ - Images
69
+ - Lists
70
+ - Tables
71
+ - And more
72
+ ```
73
+
74
+ ### Links
75
+
76
+ Links are converted to the link text followed by the link target in brackets:
77
+
78
+ ```html
79
+ Visit our <a href="https://example.com">Website</a>
80
+ ```
81
+
82
+ Becomes:
83
+
84
+ ```
85
+ Visit our Website (https://example.com)
86
+ ```
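The link rule can be sketched in plain Ruby. This is a minimal illustration, not the gem's API: `format_link` and `same_target?` are hypothetical names, and the comparison is loosely modelled on the private `link_matches` method in the 1.0.1 source, which skips the bracketed target when it matches the link text (see the 0.4.1 release note).

```ruby
# Hypothetical helpers for illustration only; the gem does this work
# internally on Nokogiri nodes rather than on plain strings.

# True when the link text and target differ only by scheme or trailing slash.
def same_target?(href, text)
  normalize = ->(value) { value.to_s.sub(%r{\Ahttps?://}, '').chomp('/') }
  normalize.call(href) == normalize.call(text)
end

# Link text followed by the target in brackets, unless both are the same.
def format_link(text, href)
  same_target?(href, text) ? href : "#{text} (#{href})"
end

puts format_link('Website', 'https://example.com')
puts format_link('example.com', 'https://example.com/')
```

The second call prints only the target, since repeating it in brackets would add nothing.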
87
+
88
+ #### Relative Links
89
+
90
+ Since emails are wholly distinct from your web address, relative links might break.
91
+
92
+ To avoid this problem, either use the `<base>` header tag:
93
+
94
+ ```html
95
+
96
+ <html>
97
+ <head>
98
+ <base href="https://www.example.com">
99
+ </head>
100
+ <body>
101
+ Use the base tag to <a href="/contact">expand</a> links.
102
+ </body>
103
+ </html>
104
+ ```
38
105
 
39
- Ghostwriter::Writer.new(html).textify
106
+ Becomes:
40
107
 
41
- => "This is some markup and a link (tenjin.ca)\nOther tags translate, too\n\n"
42
-
108
+ ```
109
+ Use the base tag to expand (https://www.example.com/contact) links.
43
110
  ```
44
111
 
45
- `#textify` will use a `<base>` tag if included in the HTML source, or if one is provided explicitly:
112
+ Or you can use the `link_base` configuration:
46
113
 
47
114
  ```ruby
48
- html = '<html><body>Relative links <a href="/contact">Link</a></body></html>'
115
+ Ghostwriter::Writer.new(link_base: 'tenjin.ca').textify(html)
116
+ ```
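As a rough sketch of how a relative link might be expanded, the following uses only Ruby's standard `uri` library. The name `resolve_link` is an assumption made for illustration; the logic follows the private `get_link_target` method shown later in this diff, which prefixes the base to any non-absolute target and falls back to stripping `tel:` or `mailto:` when the target is not a parseable URI.

```ruby
require 'uri'

# Hypothetical helper: expands a relative href against a base string.
def resolve_link(href, base: '')
  uri = URI(href)
  # Absolute targets (those with a scheme) are kept as they are.
  uri.absolute? ? uri.to_s : base + uri.to_s
rescue URI::InvalidURIError
  # Unparseable targets, e.g. phone numbers with spaces, lose their prefix.
  href.sub(/\A(tel|mailto):/, '').strip
end

puts resolve_link('/contact', base: 'https://www.example.com')
puts resolve_link('https://example.com/about', base: 'https://www.example.com')
```

Note that the base is joined by plain string concatenation, so a base without a scheme (such as `tenjin.ca`) yields a target without one as well.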
117
+
118
+ ### Images
119
+
120
+ Images with alt text are converted:
121
+
122
+ ```html
123
+ <img src="logo.jpg" alt="ACME Anvils" />
124
+ ```
125
+
126
+ Becomes:
127
+
128
+ ```
129
+ ACME Anvils (logo.jpg)
130
+ ```
131
+
132
+ But images lacking alt text or with a presentation ARIA role are ignored:
133
+
134
+ ```html
135
+ <!-- these will just become an empty string -->
136
+ <img src="decoration.jpg">
137
+ <img src="logo.jpg" role="presentation">
138
+ ```
139
+
140
+ And images with data URIs won't include the data portion.
141
+
142
+ ```html
143
+
144
+ <img src="data:image/gif;base64,R0lGODdhIwAjAMZ/AAkMBxETEBUUDBoaExkaGCIcFx4fGCEfFCcfECkjHiUlHiglGikmFjAqFi8pJCsrJT8sCjMzLDUzJzs0GjkzLTszKTM1Mzg4MD48Mzs+O0tAIElCJ1NCGVdBHUtEMkNFQjlHTFJDOkdGPT1ISUxLRENOT1tMI01PTGdLKk1RU0hTVEtTT0NVVFRTTExYWE9YVGhVP1VZXGFYTWhaMFRcWHFYL1FdXV1dRHdZMVRgYFhgXFdiY11hY1tkX31hJltmZ2pnWnloLGFrbG9oYXlqN3NqTnBqWHxqRItvRIh0Nod0ToF2U5J4LX55Xm97e4B5aZqAQpGAdqOCOZKEYZ2FOJyEVoyKbqiOXpySbLCVcLCXaKWbdKCdfZyhi66dksGdc76fbbije7mkdLOmgq6ogrCpibyvirexisWvhs2vgsGyiLq1lce1lMC5ks28nsfBmcHDq9bAl9PDmMnFo9TGh8zIoM7Jm9vLs9nRo93QqtfSquLQpdXUs+fdterlw////ywAAAAAIwAjAAAH/oArOTo6PYaGOz08P0KMOTZCOzw7PzY/Pz2JPYSDhTSFPTSXPY0tIiIfJz05o5Q/O7A5moc6O4Q0oS8uQisXGCItwTItP5OxOrKjhzSfLzYvgz85ERQXJKcSIkZeJDqOl43StrSEKzo2LhkOGBISDw40JyIVFVEyorBCkZmwtCsrtnLQSJCAwoMFCiwoiECPAr0TjPrtECJwXLMVNARlUCBhQAEFC2SsgWPGDBs3d2RcorSD1SVGr3qskOkihoIH70DO0cOHDx48evD0KQONmQ0aORZJE3VLRYoPBRwoUCCCSx07eoL+xLNnj5UfNFry4BHuR6EcK0qkKJFhAYUE/g+cdHlz1efPrnvM2MjhQlYOWTxktXThIoUKhQoKDHBi5Y0dO0CD5smzJ46NvWJfjYW1w4WKEiWkKkgw9UYdPXTo8Mn6042bvX9pTHoFa5GKzykekP5owEidN1u6PKnzMw+QJ3ttUPr7qKUs0C5KHOyoAMMaNWrmjKlSRYscMFm+nBBUybkLSYsIl3DxwAgcKwWMzGnz5kqTK1e09AEDI0uGE8rJEgNfsuxVggoujGABF1xMoYAVc9RRhxxq5JGVHn3EEYcIGfT1igvGKLfDZyWMkMINa5QhQRNz9CQhT1n5URmHJ8Sygw2BSWLDbaCpgEFPNzxBV4QwApVhHBhg/vABZ0pJIhuCoI0wQhFlkLEGGWfQ9wZ2W6KRBhoUJKncKyK2tMOBPI6wwAxltInlG1uKcQUUV3xpwQUXACSJjbCAxgJoJShggBVtnmGGlm/M4UYcX14QQQQ1PpJjUjmsd5sKCg5gBRdkYMlGG2KwoUYWWYARxgXVnODXqmP9CWgJIESwxhJTbEHGGGbMsSWpaRRBQQQXpPKIiJOgg+BnI4AwwhxcHFHrGGN0KYYYaEhAzQX/7flIDMqx4CoIJY7QxhpY0GorXXXwkUcRj1Lg7gfMDavcCSx4BqsIHpyxRhtT1FCDEmNgF4YY1j6KZ4eXXTast9GVcAIHG2TZRhlT/qCAAg5IZIzCA+1QQ0EGKbgAG7c0pPOAAgQcwEQSZ2R5RhlYVIFEFVccAQEAAASgWEIrXEZYDDHQYAEBAQSAcxBUbCExGWVsMfMVCHSA89QCbHBDX4QRRsPURuMcQBBQYLHGHGuwoYUYVdQQxAIOBCCACVLUgDMBS7rwwgtENHDAAEYLMIAAHhABRRVYKFEDDjjU0AA9HiQhxQQOCDC1BXe/UAQVVATRwAIDDGCAAAd0EAQTTEgBBQ4IIFSBFHFPdYEIFJBAQOUE1K5AAyZgnsQME/jNwAG/e7QBFT4sYEABBiQv6ANDDLDCCwPULr0ADYyeOQcMLMAAAxNAIQUHJwckYEDn5CfvgAEKvECA3+R7nrwB2k+ggQkmaLB3++Sz3zkMIawQCAA7"
145
+ alt="Data picture" />
146
+ ```
147
+
148
+ Becomes:
149
+
150
+ ```
151
+ Data picture (embedded)
152
+ ```
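The three image rules above can be summarised in a small sketch. It assumes the image's attributes are available as a plain hash, and `image_text` is a hypothetical name; the gem itself works on parsed HTML nodes.

```ruby
# Hypothetical helper: returns the plain-text stand-in for an <img> tag,
# given its attributes as a hash of strings.
def image_text(attrs)
  alt = attrs['alt'].to_s
  # Decorative images and images without alt text produce no output.
  return '' if attrs['role'] == 'presentation' || alt.empty?

  src = attrs['src'].to_s
  # Data URIs are replaced so the encoded payload is not printed.
  src = 'embedded' if src.start_with?('data:')

  "#{alt} (#{src})"
end

puts image_text({ 'src' => 'logo.jpg', 'alt' => 'ACME Anvils' })
puts image_text({ 'src' => 'data:image/gif;base64,R0lGOD', 'alt' => 'Data picture' })
```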
153
+
154
+ ### Paragraphs and Linebreaks
155
+
156
+ Paragraphs are padded with a newline at the end. Line break tags add an empty line.
157
+
158
+ ```html
159
+ <p>I would like to propose a toast.</p>
160
+ <p>This meal we enjoy together would be improved by one.</p>
161
+ <br />
162
+ <p>... Plug in the toaster and I'll get the bread.</p>
163
+ ```
164
+
165
+ ```
166
+ I would like to propose a toast.
167
+
168
+ This meal we enjoy together would be improved by one.
169
+
170
+
171
+ ... Plug in the toaster and I'll get the bread.
172
+
173
+ ```
174
+
175
+ ### Headers
176
+
177
+ For now, headers are all treated the same and given a simple marker:
178
+
179
+ ```html
180
+ <h1>Dog Maintenance and Repair</h1>
181
+ <h2>Food Input Port</h2>
182
+ <h3>Exhaust Port Considerations</h3>
183
+ ```
184
+
185
+ Becomes:
186
+
187
+ ```
188
+ -- Dog Maintenance and Repair --
189
+ -- Food Input Port --
190
+ -- Exhaust Port Considerations --
191
+ ```
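A one-line sketch of the header marker, assuming the heading text has already been extracted. `header_text` is a hypothetical name; the `squeeze(' ')` call mirrors the source, which collapses repeated spaces.

```ruby
# Hypothetical helper: wraps a heading in the simple "--" marker used for
# every heading level, collapsing any runs of spaces.
def header_text(title)
  "-- #{title} --".squeeze(' ')
end

puts header_text('Dog Maintenance and Repair')
```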
192
+
193
+ ### Lists
194
+
195
+ Lists are converted, too. They are padded with newlines and are given simple markers:
196
+
197
+ ```html
198
+
199
+ <ul>
200
+ <li>Planes</li>
201
+ <li>Trains</li>
202
+ <li>Automobiles</li>
203
+ </ul>
204
+ <ol>
205
+ <li>I get knocked down</li>
206
+ <li>I get up again</li>
207
+ <li>Never gonna keep me down</li>
208
+ </ol>
209
+ ```
210
+
211
+ Becomes:
212
+
213
+ ```
49
214
 
50
- Ghostwriter::Writer.new(html).textify(link_base: 'tenjin.ca')
215
+ - Planes
216
+ - Trains
217
+ - Automobiles
51
218
 
52
- => "Relative links Link (tenjin.ca/contact)"
219
+ 1. I get knocked down
220
+ 2. I get up again
221
+ 3. Never gonna keep me down
53
222
 
54
223
  ```
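The marker choice can be sketched as follows, assuming the list items are already plain strings. `render_list` is a hypothetical name; in the gem the unordered marker is held in the `@list_marker` instance variable, which defaults to `-`.

```ruby
# Hypothetical helper: renders list items with "-" markers, or with
# 1-based numbers when the list is ordered.
def render_list(items, ordered: false)
  items.each_with_index.map do |item, index|
    marker = ordered ? "#{index + 1}." : '-'
    "#{marker} #{item}"
  end.join("\n")
end

puts render_list(%w[Planes Trains Automobiles])
puts render_list(['I get knocked down', 'I get up again'], ordered: true)
```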
55
224
 
56
- #### Mail Gem Example
225
+ ### Tables
226
+
227
+ Tables are still often used in email structuring because support for more modern HTML and CSS is inconsistent. If your
228
+ table is purely presentational, mark it with `role="presentation"`. See below for details.
229
+
230
+ For real data tables, Ghostwriter tries to maintain table structure for simple tables:
231
+
232
+ ```html
233
+
234
+ <table>
235
+ <thead>
236
+ <tr>
237
+ <th>Ship</th>
238
+ <th>Captain</th>
239
+ </tr>
240
+ </thead>
241
+ <tbody>
242
+ <tr>
243
+ <td>Enterprise</td>
244
+ <td>Jean-Luc Picard</td>
245
+ </tr>
246
+ <tr>
247
+ <td>TARDIS</td>
248
+ <td>The Doctor</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Planet Express Ship</td>
252
+ <td>Turanga Leela</td>
253
+ </tr>
254
+ </tbody>
255
+ </table>
256
+ ```
257
+
258
+ Becomes:
259
+
260
+ ```
261
+ | Ship                | Captain         |
262
+ |---------------------|-----------------|
263
+ | Enterprise          | Jean-Luc Picard |
264
+ | TARDIS              | The Doctor      |
265
+ | Planet Express Ship | Turanga Leela   |
266
+ ```
267
+
268
+ ### Presentation ARIA Role
269
+
270
+ Lists and tables with `role="presentation"` will be treated as a simple container and the normal behaviour will be
271
+ suppressed.
272
+
273
+ ```html
274
+
275
+ <table role="presentation">
276
+ <tr>
277
+ <td>The table is a lie</td>
278
+ </tr>
279
+ </table>
280
+ <ul role="presentation">
281
+ <li>No such list</li>
282
+ </ul>
283
+ ```
284
+
285
+ Becomes:
286
+
287
+ ```
288
+ The table is a lie
289
+ No such list
290
+ ```
291
+
292
+ ### Mail Gem Example
57
293
 
58
- To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by pasisng the html through Ghostwriter:
294
+ To use `#textify` with the [mail](https://github.com/mikel/mail) gem, just provide the text-part by passing the html
295
+ through Ghostwriter:
59
296
 
60
297
  ```ruby
61
298
  require 'mail'
62
299
 
63
- html = 'My email and a <a href="http://tenjin.ca">link</a>'
300
+ html = 'My email and a <a href="https://tenjin.ca">link</a>'
301
+ ghostwriter = Ghostwriter::Writer.new
64
302
 
65
- mail = Mail.deliver do
66
- to 'bob@example.com'
67
- from 'dot@example.com'
68
- subject 'Using Ghostwriter with Mail'
303
+ Mail.deliver do
304
+ to 'bob@example.com'
305
+ from 'dot@example.com'
306
+ subject 'Using Ghostwriter with Mail'
69
307
 
70
- html_part do
308
+ html_part do
71
309
  content_type 'text/html; charset=UTF-8'
72
310
  body html
73
- end
74
-
75
- text_part do
76
- body Ghostwriter::Writer.new(html).textify
77
- end
311
+ end
312
+
313
+ text_part do
314
+ body ghostwriter.textify(html)
315
+ end
78
316
  end
79
317
 
80
318
  ```
81
319
 
82
320
  ## Contributing
321
+
83
322
  Bug reports and pull requests are welcome on GitHub at https://github.com/TenjinInc/ghostwriter
84
323
 
85
- This project is intended to be a friendly space for collaboration, and contributors are expected to adhere to the
324
+ This project is intended to be a friendly space for collaboration, and contributors are expected to adhere to the
86
325
  [Contributor Covenant](contributor-covenant.org) code of conduct.
87
326
 
88
327
  ### Core Developers
89
- After checking out the repo, run `bundle install` to install dependencies. Then, run `rake spec` to run the tests.
90
- You can also run `bin/console` for an interactive prompt that will allow you to experiment.
91
328
 
92
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the
93
- version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version,
94
- push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
329
+ After checking out the repo, run `bundle install` to install dependencies. Then, run `rake spec` to run the tests. You
330
+ can also run `bin/console` for an interactive prompt that will allow you to experiment.
331
+
332
+ #### Local Install
333
+
334
+ To install this gem onto your local machine only, run
335
+
336
+ `bundle exec rake install`
337
+
338
+ #### Gem Release
339
+
340
+ To release a gem to the world at large
341
+
342
+ 1. Update the version number in `version.rb`,
343
+ 2. Run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push
344
+ the `.gem` file to [rubygems.org](https://rubygems.org).
345
+ 3. Do a wee dance
95
346
 
96
347
  ## License
348
+
97
349
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/RELEASE_NOTES.md ADDED
@@ -0,0 +1,91 @@
1
+ # Release Notes
2
+
3
+ ## 1.0.1 (2021-03-22)
4
+
5
+ ### Major
6
+
7
+ * none
8
+
9
+ ### Minor
10
+
11
+ * Updated README
12
+
13
+ ### Bugfixes
14
+
15
+ * Fixed hr padding behaviour
16
+
17
+ ## 1.0.0 (2021-03-21)
18
+
19
+ ### Major
20
+
21
+ * Moved `link_base` parameter to constructor
22
+ * Moved input HTML parameter to `#textify`
23
+
24
+ ### Minor
25
+
26
+ * Treats tables and lists with role="presentation" as simple containers
27
+ * Now handles ordered and unordered lists
28
+ * Images are now replaced with their alt text
29
+
30
+ ### Bugfixes
31
+
32
+ * none
33
+
34
+ ## 0.4.2 (2021-03-17)
35
+
36
+ ### Major
37
+
38
+ * none
39
+
40
+ ### Minor
41
+
42
+ * none
43
+
44
+ ### Bugfixes
45
+
46
+ * Works with links using `tel:` and `mailto:` schemes.
47
+
48
+ ## 0.4.1 (2021-03-17)
49
+
50
+ ### Major
51
+
52
+ * none
53
+
54
+ ### Minor
55
+
56
+ * No longer provides link target in brackets after link text when they are the same
57
+
58
+ ### Bugfixes
59
+
60
+ * Added explicit testing for HTML entity interpretation
61
+
62
+ ## 0.4.0 (2021-03-16)
63
+
64
+ ### Major
65
+
66
+ * Updated gem dependencies
67
+
68
+ ### Minor
69
+
70
+ * Updated docs
71
+ * Added support for tables
72
+
73
+ ### Bugfixes
74
+
75
+ * none
76
+
77
+ ## 0.3.0 (2016-03-06)
78
+
79
+ ### Major
80
+
81
+ * Renamed to Ghostwriter
82
+
83
+ ### Minor
84
+
85
+ * Docs: Added instruction for using textify with mail gem
86
+
87
+ ### Bugfixes
88
+
89
+ * none
90
+
91
+
data/Rakefile CHANGED
@@ -1,6 +1,8 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'bundler/gem_tasks'
2
4
  require 'rspec/core/rake_task'
3
5
 
4
6
  RSpec::Core::RakeTask.new(:spec)
5
7
 
6
- task :default => :spec
8
+ task default: :spec
data/bin/console CHANGED
@@ -1,4 +1,6 @@
1
1
  #!/usr/bin/env ruby
2
+ #
3
+ # frozen_string_literal: true
2
4
 
3
5
  require 'bundler/setup'
4
6
  require 'ghostwriter'
@@ -10,5 +12,5 @@ require 'ghostwriter'
10
12
  # require "pry"
11
13
  # Pry.start
12
14
 
13
- require "irb"
15
+ require 'irb'
14
16
  IRB.start
data/dirt-textify.gemspec CHANGED
@@ -1,27 +1,39 @@
1
- # coding: utf-8
2
- lib = File.expand_path('../lib', __FILE__)
1
+ # frozen_string_literal: true
2
+
3
+ lib = File.expand_path('lib', __dir__)
3
4
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
5
  require 'ghostwriter/version'
5
6
 
6
- Gem::Specification.new do |gemspec|
7
- gemspec.name = 'ghostwriter'
8
- gemspec.version = Ghostwriter::VERSION
9
- gemspec.authors = ['Robin Miller']
10
- gemspec.email = ['robin@tenjin.ca']
7
+ Gem::Specification.new do |spec|
8
+ spec.name = 'ghostwriter'
9
+ spec.version = Ghostwriter::VERSION
10
+ spec.authors = ['Robin Miller']
11
+ spec.email = ['robin@tenjin.ca']
12
+
13
+ spec.summary = 'Converts HTML to plain text'
14
+ spec.description = <<~DESC
15
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
16
+
17
+ Ideal for providing a plaintext multipart segment of email messages.
18
+ DESC
19
+ spec.homepage = 'https://github.com/TenjinInc/ghostwriter'
20
+ spec.license = 'MIT'
21
+
22
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
23
+ f.match(%r{^(test|spec|features)/})
24
+ end
11
25
 
12
- gemspec.summary = %q{Intelligently extracts plaintext from an HTML document.}
13
- gemspec.description = %q{Transforms HTML into plaintext while preserving legibility and functionality. }
14
- gemspec.homepage = 'https://github.com/TenjinInc/ghostwriter'
15
- gemspec.license = 'MIT'
26
+ spec.bindir = 'exe'
27
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
28
+ spec.require_paths = ['lib']
16
29
 
17
- gemspec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
18
- gemspec.bindir = 'exe'
19
- gemspec.executables = gemspec.files.grep(%r{^exe/}) { |f| File.basename(f) }
20
- gemspec.require_paths = ['lib']
30
+ spec.required_ruby_version = '~> 2.4'
21
31
 
22
- gemspec.add_dependency 'nokogiri', '~> 1.6'
32
+ spec.add_dependency 'nokogiri', '= 1.8.4'
23
33
 
24
- gemspec.add_development_dependency 'bundler', '~> 1.10'
25
- gemspec.add_development_dependency 'rake', '~> 10.0'
26
- gemspec.add_development_dependency 'rspec', '~> 3.3'
34
+ spec.add_development_dependency 'bundler', '~> 2.2'
35
+ spec.add_development_dependency 'rake', '~> 13.0'
36
+ spec.add_development_dependency 'rspec', '~> 3.3'
37
+ spec.add_development_dependency 'rubocop', '~> 1.11'
38
+ spec.add_development_dependency 'rubocop-performance', '~> 1.10'
27
39
  end
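The version constraints in this gemspec can be checked with RubyGems' own requirement classes. The snippet below is a small illustration of what the new operators accept; the constant names are made up for the example.

```ruby
require 'rubygems'

# Constraints copied from the gemspec above; constant names are illustrative.
RUBY_CONSTRAINT     = Gem::Requirement.new('~> 2.4')
NOKOGIRI_CONSTRAINT = Gem::Requirement.new('= 1.8.4')

# "~> 2.4" allows 2.4 and later 2.x releases, but not 3.0.
puts RUBY_CONSTRAINT.satisfied_by?(Gem::Version.new('2.7.1'))
puts RUBY_CONSTRAINT.satisfied_by?(Gem::Version.new('3.0.0'))

# "= 1.8.4" pins exactly one release.
puts NOKOGIRI_CONSTRAINT.satisfied_by?(Gem::Version.new('1.8.4'))
puts NOKOGIRI_CONSTRAINT.satisfied_by?(Gem::Version.new('1.8.5'))
```

In practice this means 1.0.1 will not install on Ruby 3.x, and applications that need a different nokogiri release cannot resolve it alongside this version.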
data/lib/ghostwriter.rb CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'ghostwriter/version'
2
4
  require 'ghostwriter/writer'
3
5
  require 'nokogiri'
data/lib/ghostwriter/version.rb CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Ghostwriter
2
- VERSION = '0.3.0'
4
+ VERSION = '1.0.1'
3
5
  end
data/lib/ghostwriter/writer.rb CHANGED
@@ -1,54 +1,164 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Ghostwriter
4
+ # Main Ghostwriter converter object.
2
5
  class Writer
3
- def initialize(html)
4
- @source_html = html
6
+ # Creates a new ghostwriter
7
+ #
8
+ # @param [String] link_base the url to prefix relative links with
9
+ def initialize(link_base: '')
10
+ @link_base = link_base
11
+ @list_marker = '-'
5
12
  end
6
13
 
7
- # Intelligently strips HTML down to text.
14
+ # Strips HTML down to plain text.
15
+ #
16
+ # @param html [String] the HTML to be converted to text
8
17
  #
9
- # Options:
10
- # link_base: the url to prefix relative links with
11
- def textify(options={})
12
- html = @source_html.dup
18
+ # @return converted text
19
+ def textify(html)
20
+ doc = Nokogiri::HTML(html.gsub(/\s+/, ' '))
13
21
 
14
- html.gsub!(/\n|\t/, ' ')
15
- html.squeeze!(' ')
22
+ doc.search('style, script').remove
16
23
 
17
- html.gsub!('</p>', "</p>\n\n")
24
+ replace_anchors(doc)
25
+ replace_images(doc)
18
26
 
19
- doc = Nokogiri::HTML(html)
27
+ simple_replace(doc, '*[role="presentation"]', "\n")
20
28
 
21
- doc.search('style').remove
22
- doc.search('script').remove
29
+ replace_headers(doc)
30
+ replace_lists(doc)
31
+ replace_tables(doc)
23
32
 
24
- base = doc.search('base').first #<base> is unique by W3C spec
33
+ simple_replace(doc, 'hr', "\n----------\n\n")
34
+ simple_replace(doc, 'br', "\n")
35
+ simple_replace(doc, 'p', "\n\n")
25
36
 
26
- base_url = base ? base['href'] : options[:link_base] || ''
37
+ doc.text.strip.split("\n").collect(&:strip).join("\n").concat("\n")
38
+ end
39
+
40
+ private
27
41
 
42
+ def normalize_whitespace(html)
43
+ html.gsub(/\s/, ' ').squeeze(' ')
44
+ end
45
+
46
+ def replace_anchors(doc)
28
47
  doc.search('a').each do |link_node|
29
- href = URI(link_node['href'])
30
- href = base_url + href.to_s unless href.absolute?
48
+ href = get_link_target(link_node, get_link_base(doc))
49
+
50
+ link_node.inner_html = if link_matches(href, link_node.inner_html)
51
+ href.to_s
52
+ else
53
+ "#{ link_node.inner_html } (#{ href })"
54
+ end
55
+ end
56
+ end
57
+
58
+ def link_matches(first, second)
59
+ first.to_s.gsub(%r{^https?://}, '').chomp('/') == second.gsub(%r{^https?://}, '').chomp('/')
60
+ end
61
+
62
+ def get_link_base(doc)
63
+ # <base> node is unique by W3C spec
64
+ base_node = doc.search('base').first
65
+
66
+ base_node ? base_node['href'] : @link_base
67
+ end
31
68
 
32
- link_node.inner_html = "#{link_node.inner_html} (#{href})"
69
+ def get_link_target(link_node, base)
70
+ href = URI(link_node['href'])
71
+ if href.absolute?
72
+ href
73
+ else
74
+ base + href.to_s
33
75
  end
76
+ rescue URI::InvalidURIError
77
+ link_node['href'].gsub(/^(tel|mailto):/, '').strip
78
+ end
34
79
 
80
+ def replace_headers(doc)
35
81
  doc.search('header, h1, h2, h3, h4, h5, h6').each do |node|
36
- node.inner_html = "- #{node.inner_html} -\n".squeeze(' ')
82
+ node.inner_html = "-- #{ node.inner_html } --\n".squeeze(' ')
37
83
  end
84
+ end
38
85
 
39
- doc.search('hr').each do |node|
40
- node.replace "\n----------\n"
86
+ def replace_images(doc)
87
+ doc.search('img[role=presentation]').remove
88
+
89
+ doc.search('img').each do |img_node|
90
+ src = img_node['src']
91
+ alt = img_node['alt']
92
+
93
+ src = 'embedded' if src.start_with? 'data:'
94
+
95
+ img_node.replace("#{ alt } (#{ src })") unless alt.nil? || alt.empty?
41
96
  end
97
+ end
98
+
99
+ def replace_lists(doc)
100
+ doc.search('ul, ol').each do |list_node|
101
+ list_node.search('./li').each_with_index do |list_item, i|
102
+ marker = if list_node.node_name == 'ol'
103
+ "#{ i + 1 }."
104
+ else
105
+ @list_marker
106
+ end
42
107
 
43
- doc.search('br').each do |node|
44
- node.replace "\n"
108
+ list_item.inner_html = "#{ marker } #{ list_item.inner_html }\n".squeeze(' ')
109
+ end
110
+
111
+ list_node.replace("#{ list_node.inner_html }\n")
45
112
  end
113
+ end
114
+
115
+ def replace_tables(doc)
116
+ doc.css('table').each do |table|
117
+ column_sizes = calculate_column_sizes(table)
46
118
 
47
- # doc.search('p').each do |link_node|
48
- # link_node.inner_html = link_node.inner_html + "\n\n"
49
- # end
119
+ table.search('./thead/tr', './tbody/tr', './tr').each do |row|
120
+ replace_table_nodes(row, column_sizes)
50
121
 
51
- doc.text.gsub(/^[ ]+|[ ]+$/, '')
122
+ row.inner_html = "#{ row.inner_html }|\n"
123
+ end
124
+
125
+ add_table_header_underline(table, column_sizes)
126
+
127
+ table.inner_html = "#{ table.inner_html }\n"
128
+ end
129
+ end
130
+
131
+ def calculate_column_sizes(table)
132
+ column_sizes = table.search('tr').collect do |row|
133
+ row.search('th', 'td').collect do |node|
134
+ node.inner_html.length
135
+ end
136
+ end
137
+
138
+ column_sizes.transpose.collect(&:max)
139
+ end
140
+
141
+ def add_table_header_underline(table, column_sizes)
142
+ table.search('./thead').each do |row|
143
+ header_bottom = "|#{ column_sizes.collect { |len| ('-' * (len + 2)) }.join('|') }|"
144
+
145
+ row.inner_html = "#{ row.inner_html }#{ header_bottom }\n"
146
+ end
147
+ end
148
+
149
+ def replace_table_nodes(row, column_sizes)
150
+ row.search('th', 'td').each_with_index do |node, i|
151
+ new_content = "| #{ node.inner_html }".squeeze(' ')
152
+
153
+ # +2 for the extra spacing between text and pipe
154
+ node.inner_html = new_content.ljust(column_sizes[i] + 2)
155
+ end
156
+ end
157
+
158
+ def simple_replace(doc, tag, replacement)
159
+ doc.search(tag).each do |node|
160
+ node.replace(node.inner_html + replacement)
161
+ end
52
162
  end
53
163
  end
54
- end
164
+ end
metadata CHANGED
@@ -1,86 +1,119 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ghostwriter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.0
4
+ version: 1.0.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Robin Miller
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2016-03-07 00:00:00.000000000 Z
11
+ date: 2021-03-23 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - ~>
17
+ - - '='
18
18
  - !ruby/object:Gem::Version
19
- version: '1.6'
19
+ version: 1.8.4
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - ~>
24
+ - - '='
25
25
  - !ruby/object:Gem::Version
26
- version: '1.6'
26
+ version: 1.8.4
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: bundler
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - ~>
31
+ - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: '1.10'
33
+ version: '2.2'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - ~>
38
+ - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: '1.10'
40
+ version: '2.2'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rake
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - ~>
45
+ - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '10.0'
47
+ version: '13.0'
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - ~>
52
+ - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '10.0'
54
+ version: '13.0'
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: rspec
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
- - - ~>
59
+ - - "~>"
60
60
  - !ruby/object:Gem::Version
61
61
  version: '3.3'
62
62
  type: :development
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
- - - ~>
66
+ - - "~>"
67
67
  - !ruby/object:Gem::Version
68
68
  version: '3.3'
69
- description: ! 'Transforms HTML into plaintext while preserving legibility and functionality. '
69
+ - !ruby/object:Gem::Dependency
70
+ name: rubocop
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '1.11'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '1.11'
83
+ - !ruby/object:Gem::Dependency
84
+ name: rubocop-performance
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '1.10'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '1.10'
97
+ description: |
98
+ Converts HTML to plain text, preserving as much legibility and functionality as possible.
99
+
100
+ Ideal for providing a plaintext multipart segment of email messages.
70
101
  email:
71
102
  - robin@tenjin.ca
72
103
  executables: []
73
104
  extensions: []
74
105
  extra_rdoc_files: []
75
106
  files:
76
- - .gitignore
77
- - .rspec
78
- - .ruby-version
79
- - .travis.yml
107
+ - ".gitignore"
108
+ - ".rspec"
109
+ - ".rubocop.yml"
110
+ - ".ruby-version"
111
+ - ".travis.yml"
80
112
  - CODE_OF_CONDUCT.md
81
113
  - Gemfile
82
114
  - LICENSE.txt
83
115
  - README.md
116
+ - RELEASE_NOTES.md
84
117
  - Rakefile
85
118
  - bin/console
86
119
  - bin/setup
@@ -98,18 +131,17 @@ require_paths:
98
131
  - lib
99
132
  required_ruby_version: !ruby/object:Gem::Requirement
100
133
  requirements:
101
- - - ! '>='
134
+ - - "~>"
102
135
  - !ruby/object:Gem::Version
103
- version: '0'
136
+ version: '2.4'
104
137
  required_rubygems_version: !ruby/object:Gem::Requirement
105
138
  requirements:
106
- - - ! '>='
139
+ - - ">="
107
140
  - !ruby/object:Gem::Version
108
141
  version: '0'
109
142
  requirements: []
110
- rubyforge_project:
111
- rubygems_version: 2.4.3
143
+ rubygems_version: 3.1.2
112
144
  signing_key:
113
145
  specification_version: 4
114
- summary: Intelligently extracts plaintext from an HTML document.
146
+ summary: Converts HTML to plain text
115
147
  test_files: []