html_massage 0.2.1 → 0.3.0
- checksums.yaml +7 -0
- data/.travis.yml +6 -0
- data/README.md +103 -59
- data/lib/html_massage/version.rb +1 -1
- metadata +21 -34
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 9a3fc809a01adfac1d58cd81df25ac7aaea1ebc7
+  data.tar.gz: 00662d083333766c7b79f16e70b8ce5f565bebc3
+SHA512:
+  metadata.gz: 06ff063b79bc5f7a12058ff179d2d65c30f089d2225e8452c616393d9e27136a6f7b4924dfb4890e9d9ca4f75b1537c53b4d9d10fab1434f8e8eb828dcd5e4ee
+  data.tar.gz: de7cbde0140cfb12581f3bf97b3534742ce248a16d7efa92b84c2475625554efc0ef6cf894a78108b75c48d91a0397b5f1a15f2ee2490da0eafbc5d0abb6c1ac
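The SHA1 and SHA512 entries above are hex digests of the two archives packed inside the `.gem` file. As a minimal sketch of how a file can be checked against such a digest, the following uses only Ruby's standard `digest` library. The helper name `digest_matches?` and the temporary file standing in for `data.tar.gz` are assumptions made for illustration; they are not part of html_massage or RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file's SHA512 hex digest equals the
# expected value (as it would appear in checksums.yaml).
def digest_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end

# A temporary file stands in for data.tar.gz in this sketch.
sample = Tempfile.new('data.tar.gz')
sample.write('example archive bytes')
sample.close

# In practice the expected value would be read from checksums.yaml.
expected = Digest::SHA512.hexdigest('example archive bytes')

puts digest_matches?(sample.path, expected)
puts digest_matches?(sample.path, '0' * 128)
```

The first call compares the file against its own digest and the second against a deliberately wrong one, so the two results differ.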
data/.travis.yml
ADDED
data/README.md
CHANGED
@@ -1,81 +1,125 @@
-# html_massage
+# HTML Massage [](https://travis-ci.org/harlantwood/html_massage) [](http://badge.fury.io/rb/html_massage)
 
-
+## Supported Ruby versions
+
+- 1.9.2
+- 1.9.3
+- 2.0.0
+
+Note that ruby 1.8.x is _not_ supported.
+
+## Summary
 
 * Remove headers and footers and navigation, and strip to only the "content" part of the HTML
 * Sanitize tags, removing javascript and styling
-* Convert
+* Convert HTML to markdown, plain text, or sanitized HTML
+
+## Massaging from the command line
+
+html_massage html https://en.wikipedia.org/wiki/Technological_singularity > singularity.html
+html_massage text https://en.wikipedia.org/wiki/Technological_singularity > singularity.txt
+html_massage markdown https://en.wikipedia.org/wiki/Technological_singularity > singularity.md
+
+These files will look something like:
+
+==> singularity.html <==
+<h1 id="firstHeading" class="firstHeading"><span dir="auto">Technological singularity</span></h1>
 
-
+<p>The <b>technological singularity</b> is the theoretical emergence of greater-than-human <a href="/wiki/Superintelligence" title="Superintelligence">superintelligence</a> through technological means.<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span>[</span>1<span>]</span></a></sup> Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual <a href="/wiki/Event_horizon" title="Event horizon">event horizon</a>, beyond which events cannot be predicted or understood.</p>
+...
+
+==> singularity.md <==
+# Technological singularity
+
+The **technological singularity** is the theoretical emergence of greater-than-human [superintelligence](https://en.wikipedia.org/wiki/Superintelligence "Superintelligence") through technological means. [1] Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual [event horizon](https://en.wikipedia.org/wiki/Event_horizon "Event horizon") , beyond which events cannot be predicted or understood.
+...
+
+==> singularity.txt <==
+Technological singularity
+
+The technological singularity is the theoretical emergence of greater-than-human superintelligence through technological means.[1] Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual event horizon, beyond which events cannot be predicted or understood.
+...
+
+## Massaging from Ruby
 
 ### Full Massage
 
-
+* Use default whitelist of tags and attributes to sanitize HTML
+* Use default selectors (both include and exclude lists) to attempt to capture only the "content" part of the HTML page
+
+```ruby
+require 'html_massage'
 
-
-
-
-
-
-
-
-
-
-
-
+html = %{
+<html>
+<head>
+<script type="text/javascript">document.write('I am a bad script');</script>
+</head>
+<body>
+<div id="header">My Site</div>
+<div>This is some <i>great</i> content!</div>
+</body>
+</html>
+}
 
-
-
+HtmlMassage.html( html )
+# => "<div>This is some <i>great</i> content!</div>"
 
-
-
+HtmlMassage.markdown( html )
+# => "This is some _great_ content!"
 
-
-
+HtmlMassage.text( html )
+# => "This is some great content!"
+```
 
 ### Custom includes and excludes
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+html = %{
+<html>
+<body>
+<div class="custom_navigation">some links to other pages...</div>
+<div>This is some <i>great</i> content!</div>
+</body>
+</html>
+}
+
+html_massage = HtmlMassage.new( html )
+html_massage.exclude!( [ '.custom_navigation' ] )
+html_massage.include!( [ 'body' ] )
+html_massage.to_html
+# => <div>This is some <i>great</i> content!</div>
+```
 
 ### Sanitize HTML
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+html = %{
+<html>
+<head>
+<script type="text/javascript">document.write('I am a bad script');</script>
+</head>
+<body>
+<div>This is some <i>great</i> content!</div>
+</body>
+</html>
+}
+
+html_massage = HtmlMassage.new( html )
+html_massage.sanitize!( :elements => ['div'] )
+html_massage.to_html
+# => <div>This is some <i>great</i> content!</div>
+```
 
 ### Make Links Absolute
 
-
-
-
-
-html_massage = HtmlMassage.new( html )
-html_massage.absolutify_links!( 'http://example.com/joe/page1.html' )
-html_massage.to_html
-# <a href ="http://example.com/foo/bar.html">Click this link</a>
+```ruby
+html = %{
+<a href ="/foo/bar.html">Click this link</a>
+}
 
+html_massage = HtmlMassage.new( html )
+html_massage.absolutify_links!( 'http://example.com/joe/page1.html' )
+html_massage.to_html
+# => <a href ="http://example.com/foo/bar.html">Click this link</a>
+```
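The "Make Links Absolute" example above resolves a root-relative `href` against the URL of the page it came from. This diff does not show how `absolutify_links!` is implemented, so the following is only a rough sketch of the same resolution rule using `URI.join` from Ruby's standard library. The helper name `absolutize` is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the URL of the page it appeared on.
def absolutize(base_url, href)
  URI.join(base_url, href).to_s
end

base = 'http://example.com/joe/page1.html'

# A root-relative link replaces the whole path of the base URL.
puts absolutize(base, '/foo/bar.html')
# => http://example.com/foo/bar.html

# A document-relative link is resolved within the base URL's directory.
puts absolutize(base, 'page2.html')
# => http://example.com/joe/page2.html
```

An `href` that is already absolute is returned unchanged by `URI.join`, so the helper can be applied to every link on a page.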
data/lib/html_massage/version.rb
CHANGED
metadata
CHANGED
@@ -1,113 +1,100 @@
 --- !ruby/object:Gem::Specification
 name: html_massage
 version: !ruby/object:Gem::Version
-  version: 0.
-  prerelease:
+  version: 0.3.0
 platform: ruby
 authors:
 - Harlan T Wood
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2013-08-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '1.4'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '1.4'
 - !ruby/object:Gem::Dependency
   name: sanitize
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '2.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '2.0'
 - !ruby/object:Gem::Dependency
   name: thor
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: rest-client
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '1.6'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '1.6'
 - !ruby/object:Gem::Dependency
   name: reverse_markdown
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0.4'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0.4'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '2.5'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '2.5'
-description:
+description: 'Massages HTML how you want to: sanitize tags, remove headers and footers;
   output to html, markdown, or plain text.'
 email:
 - code@harlantwood.net
@@ -117,6 +104,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - .gitignore
+- .travis.yml
 - Gemfile
 - License-MIT
 - README.md
@@ -130,27 +118,26 @@ files:
 - spec/html_massage_spec.rb
 homepage: https://github.com/harlantwood/html_massage
 licenses: []
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.0.7
 signing_key:
-specification_version:
+specification_version: 4
 summary: Massages HTML how you want to.
 test_files:
 - spec/html_massage_spec.rb
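Each dependency in the metadata above pairs an operator with a version, for example `>= 1.4` for nokogiri. As a small sketch of how such a constraint is evaluated, the following uses the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The candidate version numbers are made up for illustration and do not come from this diff.

```ruby
require 'rubygems'

# The nokogiri constraint as it appears in the 0.3.0 metadata.
nokogiri_req = Gem::Requirement.new('>= 1.4')

# Illustrative candidate versions (assumed, not taken from the gemspec).
%w[1.3.9 1.4 1.6.0].each do |candidate|
  ok = nokogiri_req.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{ok}"
end
# Prints:
#   1.3.9: false
#   1.4: true
#   1.6.0: true

# Versions compare segment by segment, so 0.3.0 sorts after 0.2.1.
puts Gem::Version.new('0.3.0') > Gem::Version.new('0.2.1')
```

A requirement of `>= 0`, as recorded for thor and for `required_ruby_version`, is satisfied by every release version.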