hscrubber 0.0.1

data/.document ADDED
@@ -0,0 +1,4 @@
+ README.rdoc
+ lib/**/*.rb
+ bin/*
+ LICENSE
data/.gitignore ADDED
@@ -0,0 +1,4 @@
+ *.gem
+ .bundle
+ Gemfile.lock
+ pkg/*
data/CHANGES.md ADDED
@@ -0,0 +1,8 @@
+ Changes in hscrubber
+ ====================
+
+ Changes in each release of the hscrubber reha filter are listed here.
+
+ v0.0.1
+ ------
+ - Scrubs HTML code using a reha filter configuration
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "http://rubygems.org"
+
+ # Specify your gem's dependencies in hscrubber.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2011 Malo Skrylevo
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,70 @@
+ # HScrubber
+
+ HScrubber is an HTML scrubbing engine. It filters an input stream, removing unwanted items according to a reha, a filter template written as a YAML document that follows the rules described below.
+
+ # Reha
+ ## Description of the reha filter
+
+ A reha is written as a YAML document. The top level lists the HTML tags allowed in the output document. The next level lists the attributes allowed for each tag, together with the rule keys that control the tag and its content. The value of a rule key is the rule (a regular expression) that the tag's content is matched against. The keys and their meanings are:
+ * '%': the content of the tag is cleared if it matches the rule given as the key's value;
+ * '-': the tag is removed if its content matches the rule given as the key's value;
+ * '^': the content of the tag is added to the content of its parent tag if it matches the rule, or if no rule is given.
+ The keys are listed in the order in which they are checked. Each key must be preceded by the '@' symbol.
+
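The two-level layout described above can be explored with a small sketch that uses only Ruby's standard `yaml` library. It loads a reha fragment and separates each tag's allowed attributes from its '@' rule keys. The helper names `split_rules` and `filter_attributes` are hypothetical and are not part of hscrubber. The '@' keys are quoted here on the assumption that a strict YAML parser (such as Psych) may reject a plain scalar that starts with '@'.

```ruby
require 'yaml'

# A reha fragment; the '@' keys are quoted (an assumption, see the note above).
REHA_SOURCE = <<~'YAML'
  font:
    face:
    size:
    "@-": '^\s+$'
    "@%": '^[.,:!?#]+$'
  span:
    "@^":
    "@-": '^[.,:;!?\s]*$'
YAML

# Hypothetical helper: splits one tag's configuration into the names of its
# allowed attributes and a hash of its '@' rule keys.
def split_rules(tag_config)
  tag_config ||= {}
  rules, attributes = tag_config.partition { |key, _value| key.start_with?('@') }
  { attributes: attributes.map(&:first), rules: rules.to_h }
end

# Hypothetical helper: keeps only the attributes that the reha allows.
def filter_attributes(attrs, allowed)
  attrs.select { |name, _value| allowed.include?(name) }
end

reha = YAML.load(REHA_SOURCE)
font = split_rules(reha['font'])
p font[:attributes]
p filter_attributes({ 'size' => '5', 'color' => 'blue' }, font[:attributes])
```

A tag listed with no value at all, such as `p:` in a reha, loads as `nil`; the sketch treats that as a tag with no allowed attributes and no rules.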
+ ## Sample
+ A sample reha template file is shown below:
+
+     ---
+     html:
+     body:
+     p:
+     i:
+       @-: ^[.,:;!?\s]*$
+     font:
+       face:
+       size:
+       @-: ^\s+$
+       @%: ^[.,:!?#]+$
+     span:
+       @^:
+       @-: ^[.,:;!?\s]*$
+
+ Explanations:
+ * The 'i' tag has no allowed attributes, so any attributes found on it are removed. If the tag's content matches the remove rule, the tag is absent from the output.
+ Examples:
+ * <i id="i_id">Text</i> -> <i>Text</i>
+ * <font>Text<i>?</i></font> -> <font>Text</font>
+
+ * The allowed attributes for the 'font' tag are 'face' and 'size'. If the tag's content matches the remove rule, the tag is absent from the output; if it matches the cleanup rule, its content is emptied.
+ Examples:
+ * <font size="5" color="blue">Text</font> -> <font size="5">Text</font>
+ * <i>Text<font> </font></i> -> <i>Text</i>
+ * <i>Text<font>??</font></i> -> <i>Text<font></font></i>
+
+ * The 'span' tag has no allowed attributes, so any attributes found on it in the input stream are cut out. If the tag's content matches the remove rule, the tag is absent from the output altogether. In all other cases its content is added to the parent tag.
+ Examples:
+ * <span id="span_id">Text</span> -> <span>Text</span>
+ * <i>Text<span>?</span></i> -> <i>Text</i>
+
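To see how the sample rules apply to a tag's content, the following sketch evaluates the rule patterns with plain Ruby regular expressions. It is only an illustration of the prose description, not the library's code path: the helper `decide` and its result symbols are hypothetical, the '@-' key is assumed to be checked first, and the handling of tags that are not listed in the reha is left as a separate result because the README does not describe it.

```ruby
# Rules taken from the sample reha, written as a Ruby hash for brevity.
SAMPLE_RULES = {
  'i'    => { '@-' => '^[.,:;!?\s]*$' },
  'font' => { 'face' => nil, 'size' => nil, '@-' => '^\s+$', '@%' => '^[.,:!?#]+$' },
  'span' => { '@^' => nil, '@-' => '^[.,:;!?\s]*$' }
}.freeze

# Hypothetical helper: reports what would happen to a tag with the given
# text content under the given rules.
def decide(tag, content, rules = SAMPLE_RULES)
  return :not_listed unless rules.key?(tag)

  tag_rules = rules[tag] || {}
  remove = tag_rules['@-']
  clean  = tag_rules['@%']

  return :remove if remove && Regexp.new(remove).match?(content)
  return :clean  if clean && Regexp.new(clean).match?(content)

  if tag_rules.key?('@^')
    unwrap = tag_rules['@^']
    # With no pattern given, the content always moves to the parent tag.
    return :unwrap if unwrap.nil? || Regexp.new(unwrap).match?(content)
  end

  :keep
end

p decide('font', ' ')    # whitespace only, matches the remove rule
p decide('font', '??')   # punctuation only, matches the cleanup rule
p decide('span', 'Text') # no remove match, '@^' has no pattern
```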
+ # Copyright
+
+ Copyright (c) 2011 Malo Skrylevo
+ See LICENSE for details.
data/Rakefile ADDED
@@ -0,0 +1,3 @@
+ require 'rubygems'
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
data/bin/hscrub ADDED
@@ -0,0 +1,78 @@
+ #!/usr/bin/ruby -KU
+ #<Encoding:UTF-8>
+
+ require 'optparse'
+ # Try to load hscrubber from a local checkout first, then fall back to rubygems
+ tried_rubygems = false
+ begin
+   $: << '../lib' << './lib'
+   require 'hscrubber'
+ rescue LoadError
+   raise if tried_rubygems # give up instead of retrying forever
+   tried_rubygems = true
+   begin require 'rubygems'; rescue LoadError; end
+   retry
+ end
+
+ begin
+   dest = File.expand_path(File.dirname($0))
+   content = ofile = nil
+   reha = '.реха.yml'
+
+   ARGV.options do |o|
+     script_name = File.basename($0)
+
+     o.set_summary_indent(' ')
+     o.banner = "Usage: #{script_name} [OPTIONS] files"
+     o.separator ""
+     o.separator "Mandatory arguments to long options are mandatory for " +
+                 "short options too."
+
+     o.on("-o", "--output-target=val", String,
+          "Output file or folder to store a result") { |i| ofile = i }
+     o.on("-r", "--reha-filter-config=val", String,
+          "Reha filter configuration file") { |i| reha = i }
+
+     o.separator ""
+
+     o.on_tail("-h", "--help", "Show this help message") { $stderr.puts o; exit }
+
+     o.parse!
+
+   end
+
+   $hs = HScrubber.new( IO.read(reha) )
+
+   # Strips carriage returns, scrubs the content and writes the result to `of`
+   def scrub(content, of)
+     content = content.gsub(/\r/, '')
+     of.puts $hs.scrub_html(content)
+   end
+
+   if ofile
+     if ARGV.empty?
+       # stdin -> a single file (named 'stdin' when the target is a folder)
+       ofile = File.join(ofile, 'stdin') if File.exist?(ofile) and File.directory?(ofile)
+       File.open(ofile, 'w+') do |of| scrub( $stdin.read, of ) end
+     else
+       if File.exist?(ofile) and File.directory?(ofile)
+         # one output file per input file, stored in the target folder
+         ARGV.each do |fn|
+           content = IO.read(fn)
+           File.open(File.join(ofile, File.basename(fn)), 'w+') do |of|
+             scrub( content, of )
+           end
+         end
+       else
+         # all input files are written to one output file
+         File.open(ofile, 'w+') do |of|
+           ARGV.each do |fn| scrub( IO.read(fn), of ) end
+         end
+       end
+     end
+   else
+     if ARGV.empty?
+       scrub( $stdin.read, $stdout )
+     else
+       ARGV.each do |fn| scrub( IO.read(fn), $stdout ) end
+     end
+   end
+ rescue
+   # report errors on stderr so that they do not mix with scrubbed output
+   $stderr.puts $!.to_s + "\n\t#{$@.join("\n\t")}"
+   exit 1
+ end
+
+
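The branching at the end of bin/hscrub decides where each result is written, depending on whether `-o` is given, whether it names a folder, and whether any input files are passed. The sketch below restates that decision as a pure function so the cases can be checked in isolation. The helper name `output_targets` is hypothetical, and the conventions of using `:stdin` for standard input and `nil` for standard output are assumptions made for this illustration.

```ruby
require 'tmpdir'

# Hypothetical helper: pairs every input with the path its scrubbed result
# would be written to. nil stands for standard output.
def output_targets(ofile, inputs)
  inputs = [:stdin] if inputs.empty?
  return inputs.map { |input| [input, nil] } unless ofile

  if File.directory?(ofile)
    # A folder target: one output file per input, named after the input.
    inputs.map do |input|
      name = input == :stdin ? 'stdin' : File.basename(input)
      [input, File.join(ofile, name)]
    end
  else
    # A file target: every input ends up in the same output file.
    inputs.map { |input| [input, ofile] }
  end
end

Dir.mktmpdir do |dir|
  p output_targets(dir, ['pages/a.html', 'b.html'])
  p output_targets(File.join(dir, 'out.html'), ['pages/a.html', 'b.html'])
  p output_targets(nil, [])
end
```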
data/hscrubber.gemspec ADDED
@@ -0,0 +1,29 @@
+ # -*- encoding: utf-8 -*-
+ $:.push File.expand_path("../lib", __FILE__)
+ require "hscrubber/version"
+
+ Gem::Specification.new do |s|
+   s.name = "hscrubber"
+   s.version = HScrubber::VERSION
+   s.platform = Gem::Platform::RUBY
+   s.authors = [ 'Малъ Скрылёвъ (Malo Skrylevo)' ]
+   s.email = [ '3aHyga@gmail.com' ]
+   s.homepage = 'https://github.com/3aHyga/hscrubber'
+   s.summary = 'hscrubber is an HTML scrubber'
+   s.description = 'hscrubber is an HTML scrubber based on an HTML reha filter'
+
+   s.rubyforge_project = "hscrubber"
+
+   s.required_rubygems_version = '>= 1.6.0'
+
+   s.add_dependency 'hpricot', ">= 0.8.4"
+
+   s.add_development_dependency("bundler", ">= 1.0.0")
+   s.add_development_dependency("rspec", "~> 2.0.1")
+
+   s.files = `git ls-files`.split("\n")
+   s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+   # executables are collected from bin/ (this yields 'hscrub')
+   s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+   s.require_paths = ["lib"]
+ end
data/lib/hscrubber/version.rb ADDED
@@ -0,0 +1,3 @@
+ class HScrubber
+   VERSION = "0.0.1"
+ end
data/lib/hscrubber.rb ADDED
@@ -0,0 +1,243 @@
+ #!/usr/bin/ruby -KU
+ # encoding: utf-8
+
+ require 'yaml'
+ require 'hpricot'
+ require 'hscrubber/version'
+
+ class HScrubber
+   # Re-packs each byte of str as a Unicode code point
+   def self.fix(str)
+     str.unpack('C*').pack('U*') # Workaround to fix HPricot error
+   end
+
+   # Strips the \x1F placeholders left by the '@%' rule from text children.
+   # Always returns false, so it never triggers another scrubbing round.
+   def self.scrub_special(elem)
+     elem.each_child do |sub|
+       if sub.class == Hpricot::Text
+         sub.content = sub.content.gsub(/\x1F+/x, '')
+       end
+     end
+     false
+   end
+
+   # Collapses an element whose only child is an element of the same name,
+   # merging the child's attributes into the parent.
+   def self.scrub_follower(elem)
+     changed = false
+     if elem.children and elem.children.size == 1
+       sub = elem.children[0]
+       if sub.class == Hpricot::Elem and elem.name == sub.name
+         html = sub.inner_html
+         if elem.raw_attributes
+           if sub.raw_attributes
+             elem.raw_attributes.merge! sub.raw_attributes
+           end
+         else
+           elem.raw_attributes = sub.raw_attributes
+         end
+         elem.children.delete(sub)
+         elem.inner_html += html
+         changed = true
+       end
+     end
+     changed
+   end
+
+   # Scrubs child elements recursively, merges adjacent sibling elements that
+   # share a name and attributes, and drops childless elements.
+   def self.scrub_children(elem, verility)
+     changed = false
+     old = nil
+     elem.children.delete_if do |sub|
+       if sub.class == Hpricot::Elem and sub.parent == elem
+         self.scrub_elem(sub, verility)
+         if old and old.name == sub.name and
+            (old.raw_attributes.class == sub.raw_attributes.class and
+             old.raw_attributes.class == NilClass or
+             (old.raw_attributes.class == Hash and
+              old.raw_attributes == sub.raw_attributes))
+           sub_ch = sub.children
+
+           sub_ch.each do |x| x.parent = old end
+           if not old.children
+             old.children = sub_ch
+           elsif old.children.empty?
+             old.children.replace sub_ch
+           else
+             old.children.concat sub_ch
+           end
+           changed = true
+           true
+         elsif sub.children and sub.children.size != 0
+           old = sub
+           false
+         else
+           changed = true
+           true
+         end
+       else
+         old = nil
+         false
+       end
+     end if elem.children
+
+     changed
+   end
+
+   # Applies the reha rules to the direct children of elem: drops tags that
+   # are not listed, filters attributes and handles the '@-', '@^' and '@%' keys.
+   def self.scrub_remove_by_rule(elem, verility)
+     changed = nil
+     repeat = true
+
+     res = []
+
+     if elem.children and not elem.children.empty?
+       elem.children.each do |sub|
+         repeat = false
+         sub_ch = nil
+         case sub.class.to_s
+         when 'Hpricot::Elem'
+           if verility.key? sub.name
+             new_attrs = {}
+             # '@-' is checked first, the remaining keys in string order
+             verility[sub.name].keys.sort do |x,y|
+               x == '@-' ? -1 : y == '@-' ? 1 : x <=> y
+             end.each do |key|
+               value = verility[sub.name][key]
+               # TODO: match the value as a regexp against the attribute value
+               case key
+               when '@-'
+                 # delete the element if its content matches the regexp
+                 inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
+                 if self.fix(inner) =~ /#{value}/
+                   if elem.children.index(sub)
+                     repeat = true; changed = true
+                   end
+                 end
+               when '@^'
+                 # move the content to the parent if it matches the regexp,
+                 # or if no regexp is given
+                 inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
+                 unless value and self.fix(inner) !~ /#{value}/
+                   sub_ch = sub.children
+                   idx = elem.children.index(sub)
+                   if idx and not sub_ch.empty?
+                     sub_ch.each do |x| x.parent = elem end
+                     elem.children[idx] = sub_ch
+                   else
+                     repeat = true; changed = true
+                   end
+                 end
+               when '@%'
+                 # replace the content with a placeholder if it matches the regexp
+                 inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
+                 if self.fix(inner) =~ /#{value}/
+                   new = Hpricot::Text.new("\x1F")
+                   new.parent = sub
+                   sub.children.replace [ new ]
+                   changed = true
+                 end
+               else
+                 # any other key names an allowed attribute
+                 attr = sub.get_attribute(key)
+                 new_attrs[key] = attr if attr
+               end
+             end if verility[sub.name]
+             if sub.raw_attributes != new_attrs
+               sub.raw_attributes = new_attrs
+               changed = true
+             end
+           else
+             repeat = true; changed = true
+           end
+         when 'Hpricot::Text'
+         else
+           repeat = true; changed = true
+         end if sub.parent == elem
+         if not repeat
+           if sub_ch
+             res.concat sub_ch
+           else
+             res << sub
+           end
+         end
+       end
+
+       elem.children.replace res
+     end
+
+     changed
+   end
+
+   # Applies only the '@%' rule to the direct children of elem.
+   # This method is not called by scrub_elem.
+   def self.scrub_replace_by_rule(elem, verility)
+     changed = nil
+     repeat = true
+
+     res = []
+
+     elem.children.each do |sub|
+       repeat = false
+       sub_ch = nil
+       case sub.class.to_s
+       when 'Hpricot::Elem'
+         if sub.children and verility.key? sub.name
+           verility[sub.name].each do |key, value|
+             # TODO: match the value as a regexp against the attribute value
+             case key
+             when '@%'
+               # replace the content with a placeholder if it matches the regexp
+               inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
+               if self.fix(inner) =~ /#{value}/
+                 new = Hpricot::Text.new("\x1F")
+                 new.parent = sub
+                 sub.children.replace [ new ]
+                 changed = true
+               end
+             end
+           end if verility[sub.name]
+         else
+           repeat = true; changed = true
+         end
+       when 'Hpricot::Text'
+       else
+         repeat = true; changed = true
+       end if sub.parent == elem
+       if not repeat
+         if sub_ch
+           res.concat sub_ch
+         else
+           res << sub
+         end
+       end
+     end if elem.children and not elem.children.empty?
+
+     elem.children.replace res
+
+     changed
+   end
+
+   # Repeats the scrubbing passes until none of them reports a change
+   def self.scrub_elem(elem, verility)
+     while (
+         self.scrub_children(elem, verility) ||
+         self.scrub_remove_by_rule(elem, verility) ||
+         self.scrub_follower(elem) ||
+         self.scrub_special(elem))
+     end
+   end
+
+   def self.scrub_html(content, реха)
+     return content unless реха
+     реха = YAML.load(реха) if реха.class == String
+     doc = Hpricot(content)
+     self.scrub_elem(doc, реха)
+     doc.inner_html
+   end
+
+   def initialize(реха)
+     # Accept either YAML text or an already parsed Hash
+     @реха = реха.class == String ? YAML.load(реха) : реха
+   end
+
+   def scrub_html(content)
+     return content unless @реха
+     doc = Hpricot(content)
+     self.class.scrub_elem(doc, @реха)
+     doc.inner_html
+   end
+
+ end
+
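The sibling handling in `scrub_children` is easier to follow without the Hpricot tree around it. The sketch below is a simplified reading of that pass, not the library code: it uses a hypothetical `Node` struct for elements and plain strings for text nodes, and it returns a new list instead of editing the tree in place. As in the pass it illustrates, childless elements are dropped, adjacent elements with the same name and attributes are merged, and a text node between two elements prevents a merge.

```ruby
# Hypothetical stand-in for an element; plain strings stand in for text nodes.
Node = Struct.new(:name, :attrs, :children)

def merge_siblings(nodes)
  nodes.each_with_object([]) do |node, result|
    unless node.is_a?(Node)
      result << node # a text node ends the current run of mergeable elements
      next
    end
    next if node.children.empty? # childless elements are dropped

    previous = result.last
    if previous.is_a?(Node) && previous.name == node.name && previous.attrs == node.attrs
      # Same tag and same attributes: append the children to the earlier sibling.
      previous.children.concat(node.children)
    else
      # Copy the node so that the input list is left untouched.
      result << Node.new(node.name, node.attrs, node.children.dup)
    end
  end
end

siblings = [
  Node.new('i', nil, ['a']),
  Node.new('b', nil, []),
  Node.new('i', nil, ['b']),
  'text',
  Node.new('i', nil, ['c'])
]
merge_siblings(siblings).each { |node| p node }
```

Returning a new list keeps the sketch free of the iteration-while-mutating concerns that the in-place version has to deal with.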
metadata ADDED
@@ -0,0 +1,129 @@
+ --- !ruby/object:Gem::Specification
+ name: hscrubber
+ version: !ruby/object:Gem::Version
+   hash: 29
+   prerelease:
+   segments:
+   - 0
+   - 0
+   - 1
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - !binary |
+   0JzQsNC70Yog0KHQutGA0YvQu9GR0LLRiiAoTWFsbyBTa3J5bGV2byk=
+
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2011-04-05 00:00:00 +04:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: hpricot
+   prerelease: false
+   requirement: &id001 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         hash: 55
+         segments:
+         - 0
+         - 8
+         - 4
+         version: 0.8.4
+   type: :runtime
+   version_requirements: *id001
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   prerelease: false
+   requirement: &id002 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         hash: 23
+         segments:
+         - 1
+         - 0
+         - 0
+         version: 1.0.0
+   type: :development
+   version_requirements: *id002
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   prerelease: false
+   requirement: &id003 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         hash: 13
+         segments:
+         - 2
+         - 0
+         - 1
+         version: 2.0.1
+   type: :development
+   version_requirements: *id003
+ description: hscrubber is an HTML scrubber based on an HTML reha filter
+ email:
+ - 3aHyga@gmail.com
+ executables:
+ - hscrub
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - .document
+ - .gitignore
+ - CHANGES.md
+ - Gemfile
+ - LICENSE
+ - README.md
+ - Rakefile
+ - bin/hscrub
+ - hscrubber.gemspec
+ - lib/hscrubber.rb
+ - lib/hscrubber/version.rb
+ has_rdoc: true
+ homepage: https://github.com/3aHyga/hscrubber
+ licenses: []
+
+ post_install_message:
+ rdoc_options: []
+
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       hash: 3
+       segments:
+       - 0
+       version: "0"
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       hash: 15
+       segments:
+       - 1
+       - 6
+       - 0
+       version: 1.6.0
+ requirements: []
+
+ rubyforge_project: hscrubber
+ rubygems_version: 1.6.2
+ signing_key:
+ specification_version: 3
+ summary: hscrubber is an HTML scrubber
+ test_files: []