hscrubber 0.0.2 → 0.0.3

data/.document CHANGED
@@ -1,4 +1,4 @@
1
- README.rdoc
1
+ README.md
2
2
  lib/**/*.rb
3
3
  bin/*
4
4
  LICENSE
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format nested
2
+ --color
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
1
1
  source "http://rubygems.org"
2
2
 
3
- # Specify your gem's dependencies in priehlazx.gemspec
3
+ # Specify your gem's dependencies in hscrubber.gemspec
4
4
  gemspec
data/README.en.md ADDED
@@ -0,0 +1,85 @@
1
+ # HScrubber
2
+
3
+ HScrubber is an HTML reha engine. It filters an input stream according to a special reha template written as a YAML document.
4
+
5
+ # Reha
6
+ ## Description of reha filter
7
+
8
+ A reha is written as a YAML document. The top level of the document lists the HTML tags allowed in the output stream. The next level lists the allowed attributes of each tag, along with the rule keys that control the tag and its content. The keys and their values are as follows:
9
+
10
+ * '_' clears the content of the tag if the content matches the specified rule;
11
+
12
+ * '-' removes the tag if its content matches the specified rule;
13
+
14
+ * '^' moves the content of the tag into its parent tag if the content matches the specified rule, or if no rule is defined;
15
+
16
+ * '%' sets the order of the attributes in the output file. The attributes are listed separated by commas.
17
+
18
+ The keys are listed in the order of priority in which they are checked. Each key must be preceded by the '@' symbol.
19
+
20
+ ## Sample
21
+
22
+ A sample reha template is shown below:
23
+
24
+ ---
25
+ html:
26
+ body:
27
+ p:
28
+ i:
29
+ @-: ^[.,:;!?\s]*$
30
+ font:
31
+ face:
32
+ size:
33
+ @%: size,face
34
+ @-: ^\s+$
35
+ @_: ^[.,:!?#]+$
36
+ span:
37
+ @^:
38
+ @-: ^[.,:;!?\s]*$
39
+
40
+ Descriptions:
41
+
42
+ The 'i' tag has no allowed attributes, so any attributes are removed from the output stream. If the tag's content matches the remove rule, the tag is absent from the output:
43
+
44
+
45
+ <i id="i_id">Text</i> -> <i>Text</i>
46
+ <font>Text<i>?</i></font> -> <font>Text</font>
47
+
48
+ The allowed attributes for the 'font' tag are 'face' and 'size'. If the tag's content matches the remove rule, the tag is absent from the output; if it matches the cleanup rule, the content is purged. The attributes are written in the order 'size', then 'face':
49
+
50
+ <font size="5" color="blue">Text</font> -> <font size="5">Text</font>
51
+ <i>Text<font> </font></i> -> <i>Text</i>
52
+ <i>Text<font>??</font></i> -> <i>Text<font></font></i>
53
+ <font face="Arial" size="5">Text</font> -> <font size="5" face="Arial">Text</font>
54
+
55
+ The 'span' tag has no allowed attributes, so any attributes are removed from the output stream. If the tag's content matches the remove rule, the tag itself is absent from the output. Otherwise, its content is added to the parent tag:
56
+
57
+ <span id="span_id">Text</span> -> <span>Text</span>
58
+ <i>Text<span>?</span></i> -> <i>Text</i>
59
+
60
+ ## Usage
61
+ There are two ways to use the package in Ruby applications.
62
+
63
+ ### Using the class instance method
64
+ Create a class instance, passing a reha to its constructor. The reha must be loaded as a String or an IO object. Then filter the HTML:
65
+
66
+ рѣха = IO.read('.рѣха.yml.sample')
67
+ hs = HScrubber.new(рѣха)
68
+
69
+ html = IO.read('sample.html').gsub(/\r/, '')
70
+ new_html = hs.scrub_html(html)
71
+
72
+ puts new_html
73
+
74
+ ### Using the class method
75
+ You can filter the HTML document without creating a class instance. Do as follows:
76
+
77
+ рѣха = IO.read('.рѣха.yml.sample')
78
+ html = IO.read('sample.html').gsub(/\r/, '')
79
+ new_html = HScrubber.scrub_html(html, рѣха)
80
+
81
+ puts new_html
82
+
83
+ # Copyright
84
+ Copyright (c) 2011 Malo Skrylevo. See LICENSE for details.
85
+
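The remove ('@-') and cleanup ('@_') rules described in the README above can be tried out in isolation. The following is a minimal sketch in plain Ruby, not HScrubber's own code: `reha_action`, `FONT_RULES`, and `I_RULES` are hypothetical names, the rules are given as Ruby hashes rather than loaded from a reha file, and the cleanup rule is assumed to be checked before the remove rule, following the key order listed in the README. The '@^' and '@%' keys are left out.

```ruby
# Hypothetical helper for illustration only; HScrubber itself works on Hpricot nodes.
# Rules copied from the sample reha template in the README.
FONT_RULES = {
  '@-' => '^\s+$',        # remove the tag when its content is only whitespace
  '@_' => '^[.,:!?#]+$'   # empty the tag when its content is only punctuation
}.freeze

I_RULES = { '@-' => '^[.,:;!?\s]*$' }.freeze

# Returns :clear, :remove, or :keep for a tag's inner text.
def reha_action(rules, inner)
  # Assumption: '@_' is checked before '@-', as in the README's key list.
  [['@_', :clear], ['@-', :remove]].each do |key, action|
    pattern = rules[key]
    next if pattern.nil?
    return action if inner =~ Regexp.new(pattern)
  end
  :keep
end

p reha_action(FONT_RULES, ' ')     # => :remove
p reha_action(FONT_RULES, '??')    # => :clear
p reha_action(FONT_RULES, 'Text')  # => :keep
p reha_action(I_RULES, '?')        # => :remove
```

These four results correspond to the README examples `<font> </font>`, `<font>??</font>`, `<font>Text</font>`, and `<i>?</i>`.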
data/README.md CHANGED
@@ -1,37 +1,25 @@
1
- # HScrubber
1
+ # Урѣзчикъ (HScrubber)
2
2
 
3
3
  HScrubber есть движокъ для прорѣшиванія HTML-документа. Онъ позволяетъ процѣдить содержимое входного потока очистивъ его отъ ненужныхъ предмѣтовъ на основѣ рѣхи, являющейся YAML-документомъ, по опредѣлённымъ правиламъ состаленнымъ.
4
4
 
5
- HScrubber is HTML reha engine, and it allows filtering an input flow according to the special reha template that is formed as YAML-document.
6
-
7
- # Рѣха (Reha)
8
- ## Объясненіе рѣхи (Description of reha filter)
5
+ # Рѣха
6
+ ## Объясненіе рѣхи
9
7
 
10
8
  Рѣха задаётся въ видѣ YAML-документа. На самомъ верхнемъ уровнѣ описываются мѣты (HTML tags), допустимыя въ выходномъ документѣ. Слѣдующій уровень задаётъ допустимыя свойства (attributes) для опредѣлённой мѣты, а также ключи, управляющія мѣтою и её содержимымъ. Возможныя ключи и их значенія суть такія:
11
9
 
12
- * '%' содержимое мѣты будетъ очищено, если оно подпадаетъ подъ заданное въ значеніи ключа правило;
10
+ * '_' содержимое мѣты будетъ очищено, если оно подпадаетъ подъ заданное въ значеніи ключа правило;
13
11
 
14
12
  * '-' мѣта удаляется въ томъ случаѣ, если её содержимое подпадаетъ подъ заданное въ значеніи ключа правило;
15
13
 
16
- * '^' содержимое мѣты добавляется къ содержимому родителькой мѣты въ томъ случаѣ, если содержимое сей мѣты подпадаетъ подъ правило, или если правило не задано.
17
-
18
- Ключи здѣ расположены въ порядкѣ первичности ихъ провѣрки. Каджый изъ нихъ обязательно предваряется символомъ '@'
19
-
20
- Reha is set up as an YAML-document. The allowed in an output flow HTML tags is described at the top level of the document. The following level described allowed attributes of the specified tag, and also rule keys that controls the tag and ots containment. The keys, and its values are the following:
14
+ * '^' содержимое мѣты добавляется къ содержимому родителькой мѣты въ томъ случаѣ, если содержимое сей мѣты подпадаетъ подъ правило, или если правило не задано;
21
15
 
22
- * '%' declares that the containment of the tag will be cleaned up, if it matches to the specified rule;
16
+ * '%' задаётъ порядокъ слѣдованія ключей мѣты въ выходномъ файлѣ. Ключи пишутся черезъ запятую.
23
17
 
24
- * '-' a tag will be removed, if its containment matches to the specified rule;
25
-
26
- * '^' containment of a tag will be added to containment of the parent tag, if containment of the tag matches to the specified rule, or if the rule isn't defined.
27
-
28
- The keys are ranged according to priority their analysing. The '@' symbol necessarily outruns each of the keys.
18
+ Ключи здѣ расположены въ порядкѣ первичности ихъ провѣрки. Каджый изъ нихъ обязательно предваряется символомъ '@'
29
19
 
30
- ## Примѣръ (Sample)
20
+ ## Примѣръ
31
21
  Примѣрный шаблонъ файла рѣхи представленъ нижѣ:
32
22
 
33
- Sample reha template is described as follows:
34
-
35
23
  ---
36
24
  html:
37
25
  body:
@@ -41,8 +29,9 @@ Sample reha template is described as follows:
41
29
  font:
42
30
  face:
43
31
  size:
32
+ @%: size,face
44
33
  @-: ^\s+$
45
- @%: ^[.,:!?#]+$
34
+ @_: ^[.,:!?#]+$
46
35
  span:
47
36
  @^:
48
37
  @-: ^[.,:;!?\s]*$
@@ -54,33 +43,24 @@ Sample reha template is described as follows:
54
43
  <i id="i_id">Text</i> -> <i>Text</i>
55
44
  <font>Text<i>?</i></font> -> <font>Text</font>
56
45
 
57
- Допустимыми ключами для мѣты 'font' являются 'face' и 'size'. Въ случаѣ, если содержимое мѣты удовлѣтворяетъ правилу удаления, на выходѣ сія мѣта будетъ отсутствовать, а если правилу очищенія, то её содержимое станетъ порожнимъ. Примѣры:
46
+ Допустимыми ключами для мѣты 'font' являются 'face' и 'size'. Въ случаѣ, если содержимое мѣты удовлѣтворяетъ правилу удаления, на выходѣ сія мѣта будетъ отсутствовать, а если правилу очищенія, то её содержимое станетъ порожнимъ, ключи будутъ расположены въ порядкѣ size, face. Примѣры:
58
47
 
59
48
  <font size="5" color="blue">Text</font> -> <font size="5">Text</font>
60
49
  <i>Text<font> </font></i> -> <i>Text</i>
61
50
  <i>Text<font>??</font></i> -> <i>Text<font></font></i>
51
+ <font face="Arial" size="5">Text</font> -> <font size="5" face="Arial">Text</font>
62
52
 
63
53
  Допустимыя ключи для мѣты 'span' отсутствуютъ, и въ случаѣ ихъ обнаруженія въ входномъ потокѣ они будутъ вырѣзаны изъ него. Если содержимое мѣты удовлѣтворяетъ правилу удаления, на выходѣ сія мѣта будетъ отсутствовать какъ таковая. Въ остальныхъ же случаяхъ её содержимое будетъ добавлено къ мѣтѣ родительской. Примѣры:
64
54
 
65
55
  <span id="span_id">Text</span> -> <span>Text</span>
66
56
  <i>Text<span>?</span></i> -> <i>Text</i>
67
57
 
68
- Descriptions:
58
+ ## Использованіе
59
+ Суть 2 способа использованія пакета въ ruby-приложеніяхъ.
69
60
 
70
- Tag 'i' hasn't allowable attributes, so them will be removed from an output stream. In case, the tag containment meets a remove rule, the tag will be absent in the output;
71
-
72
- Allowable attributes for the 'font' tag are 'face', and 'size'. In case, if the tag containment meets a remove rule, the tag will be absent in the output, and if meets a cleanup rule, the containment will be purged;
73
-
74
- Tag 'span' hasn't allowable attributes, so them will be removed from an output stream. In case, the tag containment meets a remove rule, the tag will be absent in the output as it is. In other cases, its containment will be added to a parent tag.
75
-
76
- ## Использованіе (Usage)
77
- Суть 2 способа испозованія пакета въ ruby-приложеніяхъ.
78
-
79
- ### Используя методъ экземпляра класса (Using the class instance method)
61
+ ### Используя методъ экземпляра класса
80
62
  Создай экземпляръ класса, передавъ конструктору рѣху загруженную въ видѣ строки или IO-класса, а затѣмъ прорѣши HTML-документъ:
81
63
 
82
- Make a class instance, passing a reha to its initialize function. The reha must be loaded as a String, or an IO class. Then filter a HTML:
83
-
84
64
  рѣха = IO.read('.рѣха.yml.sample')
85
65
  hs = HScrubber.new(рѣха)
86
66
 
@@ -89,22 +69,15 @@ Make a class instance, passing a reha to its initialize function. The reha must
89
69
 
90
70
  puts html
91
71
 
92
- ### Используя методъ класса (Using the class method)
72
+ ### Используя методъ класса
93
73
  Можно прорѣшить HTML-документъ и не создавая экземпляръ класса. Тогда дѣлай такъ:
94
74
 
95
- We able to filter the HTML-document without a class instance creation. Do as follows:
96
-
97
75
  рѣха = IO.read('.рѣха.yml.sample')
98
76
  html = IO.read('sample.html').gsub(/\r/, '')
99
77
  new_html = HScrubber.scrub_html(html, рѣха)
100
78
 
101
79
  puts html
102
80
 
103
- # Права (Copyright)
104
-
105
- Авторскія и исключительныя права (а) 2011 Малъ Скрылевъ
106
- Зри LICENSE за подробностями.
107
-
108
- Copyright (c) 2011 Malo Skrylevo
109
- See LICENSE for details.
81
+ # Права
110
82
 
83
+ Авторскія и исключительныя права (а) 2011 Малъ Скрылевъ. Зри LICENSE за подробностями.
data/hscrubber.gemspec CHANGED
@@ -17,6 +17,7 @@ Gem::Specification.new do |s|
17
17
 
18
18
  s.required_rubygems_version = '>= 1.6.0'
19
19
 
20
+ s.add_dependency 'rdoba', ">= 0.1"
20
21
  s.add_dependency 'hpricot', ">= 0.8.4"
21
22
 
22
23
  s.add_development_dependency("bundler", ">= 1.0.0")
data/lib/hscrubber.rb CHANGED
@@ -1,6 +1,7 @@
1
1
  #!/usr/bin/ruby -KU
2
2
  # encoding: utf-8
3
3
 
4
+ require 'rdoba/hashorder'
4
5
  require 'yaml'
5
6
  require 'hpricot'
6
7
  require 'hscrubber/version'
@@ -82,7 +83,7 @@ class HScrubber
82
83
  changed
83
84
  end
84
85
 
85
- def self.scrub_remove_by_rule(elem, verility)
86
+ def self.process_specials(elem, verility)
86
87
  changed = nil
87
88
  repeat = true
88
89
 
@@ -111,6 +112,8 @@ class HScrubber
111
112
  repeat = true; changed = true
112
113
  end
113
114
  end
115
+ when '@%'
116
+ (sub.raw_attributes.order = value.split(',').map do |x| x.strip end) if value
114
117
  when '@^'
115
118
  inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
116
119
  unless value and self.fix(inner) !~ /#{value}/
@@ -123,8 +126,8 @@ class HScrubber
123
126
  repeat = true; changed = true
124
127
  end
125
128
  end
126
- when '@%'
127
- # replace elem if match to re
129
+ when '@_'
130
+ # clear elem if match to re
128
131
  inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
129
132
  if self.fix(inner) =~ /#{value}/
130
133
  new = Hpricot::Text.new("\x1F")
@@ -163,58 +166,10 @@ class HScrubber
163
166
  changed
164
167
  end
165
168
 
166
- def self.scrub_replace_by_rule(elem, verility)
167
- changed = nil
168
- repeat = true
169
-
170
- res = []
171
-
172
- elem.children.each do |sub|
173
- repeat = false
174
- sub_ch = nil
175
- case sub.class.to_s
176
- when 'Hpricot::Elem'
177
- if sub.children and verility.key? sub.name
178
- strip_tags = []
179
- verility[sub.name].each do |key, value|
180
- # TODO match value to as regexp to attr value
181
- case key
182
- when '@%'
183
- # replace elem if match to re
184
- inner = sub.inner_html.gsub(/(\r\n|\n)/,' ')
185
- if self.fix(inner) =~ /#{value}/
186
- new = Hpricot::Text.new("\x1F")
187
- new.parent = sub
188
- sub.children.replace [ new ]
189
- changed = true
190
- end
191
- end
192
- end if verility[sub.name]
193
- else
194
- repeat = true; changed = true
195
- end
196
- when 'Hpricot::Text'
197
- else
198
- repeat = true; changed = true
199
- end if sub.parent == elem
200
- if not repeat
201
- if sub_ch
202
- res.concat sub_ch
203
- else
204
- res << sub
205
- end
206
- end
207
- end if elem.children and not elem.children.empty?
208
-
209
- elem.children.replace res
210
-
211
- changed
212
- end
213
-
214
169
  def self.scrub_elem(elem, verility)
215
170
  while (
216
171
  self.scrub_children(elem, verility) ||
217
- self.scrub_remove_by_rule(elem, verility) ||
172
+ self.process_specials(elem, verility) ||
218
173
  self.scrub_follower(elem) ||
219
174
  self.scrub_special(elem))
220
175
  end
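The new '@%' branch splits the rule value on commas, strips each name, and hands the resulting list to the attribute hash through `rdoba/hashorder`. The sketch below shows the same idea in plain Ruby with no external gems. `order_attributes` is a hypothetical helper, and placing unlisted attributes after the listed ones is an assumption made here, not documented behaviour of rdoba or HScrubber.

```ruby
# Hypothetical helper: rebuilds an attribute hash in the order given by an '@%' value.
def order_attributes(attrs, order_value)
  # Same parsing as the '@%' branch: split on commas and strip whitespace.
  order = order_value.split(',').map { |name| name.strip }
  listed = order.select { |name| attrs.key?(name) }
  # Assumption: attributes not named in the rule keep their relative order at the end.
  rest = attrs.keys - listed
  (listed + rest).each_with_object({}) { |name, out| out[name] = attrs[name] }
end

attrs = { 'face' => 'Arial', 'size' => '5' }
p order_attributes(attrs, 'size, face').keys  # => ["size", "face"]
```

Ruby hashes preserve insertion order from 1.9 onward, so rebuilding the hash is enough to control the order in which the attributes are later written out.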
data/lib/hscrubber/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  class HScrubber
2
- VERSION = "0.0.2"
2
+ VERSION = "0.0.3"
3
3
  end
metadata CHANGED
@@ -1,129 +1,103 @@
1
- --- !ruby/object:Gem::Specification
1
+ --- !ruby/object:Gem::Specification
2
2
  name: hscrubber
3
- version: !ruby/object:Gem::Version
4
- hash: 27
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.3
5
5
  prerelease:
6
- segments:
7
- - 0
8
- - 0
9
- - 2
10
- version: 0.0.2
11
6
  platform: ruby
12
- authors:
13
- - !binary |
14
- 0JzQsNC70Yog0KHQutGA0YvQu9GR0LLRiiAoTWFsbyBTa3J5bGV2byk=
15
-
7
+ authors:
8
+ - Малъ Скрылёвъ (Malo Skrylevo)
16
9
  autorequire:
17
10
  bindir: bin
18
11
  cert_chain: []
19
-
20
- date: 2011-04-05 00:00:00 +04:00
21
- default_executable:
22
- dependencies:
23
- - !ruby/object:Gem::Dependency
24
- name: hpricot
12
+ date: 2011-05-18 00:00:00.000000000Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rdoba
16
+ requirement: &69581990 !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0.1'
22
+ type: :runtime
25
23
  prerelease: false
26
- requirement: &id001 !ruby/object:Gem::Requirement
24
+ version_requirements: *69581990
25
+ - !ruby/object:Gem::Dependency
26
+ name: hpricot
27
+ requirement: &69624550 !ruby/object:Gem::Requirement
27
28
  none: false
28
- requirements:
29
- - - ">="
30
- - !ruby/object:Gem::Version
31
- hash: 55
32
- segments:
33
- - 0
34
- - 8
35
- - 4
29
+ requirements:
30
+ - - ! '>='
31
+ - !ruby/object:Gem::Version
36
32
  version: 0.8.4
37
33
  type: :runtime
38
- version_requirements: *id001
39
- - !ruby/object:Gem::Dependency
40
- name: bundler
41
34
  prerelease: false
42
- requirement: &id002 !ruby/object:Gem::Requirement
35
+ version_requirements: *69624550
36
+ - !ruby/object:Gem::Dependency
37
+ name: bundler
38
+ requirement: &69624320 !ruby/object:Gem::Requirement
43
39
  none: false
44
- requirements:
45
- - - ">="
46
- - !ruby/object:Gem::Version
47
- hash: 23
48
- segments:
49
- - 1
50
- - 0
51
- - 0
40
+ requirements:
41
+ - - ! '>='
42
+ - !ruby/object:Gem::Version
52
43
  version: 1.0.0
53
44
  type: :development
54
- version_requirements: *id002
55
- - !ruby/object:Gem::Dependency
56
- name: rspec
57
45
  prerelease: false
58
- requirement: &id003 !ruby/object:Gem::Requirement
46
+ version_requirements: *69624320
47
+ - !ruby/object:Gem::Dependency
48
+ name: rspec
49
+ requirement: &69624090 !ruby/object:Gem::Requirement
59
50
  none: false
60
- requirements:
51
+ requirements:
61
52
  - - ~>
62
- - !ruby/object:Gem::Version
63
- hash: 13
64
- segments:
65
- - 2
66
- - 0
67
- - 1
53
+ - !ruby/object:Gem::Version
68
54
  version: 2.0.1
69
55
  type: :development
70
- version_requirements: *id003
56
+ prerelease: false
57
+ version_requirements: *69624090
71
58
  description: hscrubber is HTML scrubber based on a HTML reha filter
72
- email:
59
+ email:
73
60
  - 3aHyga@gmail.com
74
- executables:
61
+ executables:
75
62
  - hscrub
76
63
  extensions: []
77
-
78
64
  extra_rdoc_files: []
79
-
80
- files:
65
+ files:
81
66
  - .document
82
67
  - .gitignore
68
+ - .rspec
83
69
  - CHANGES.md
84
70
  - Gemfile
85
71
  - LICENSE
72
+ - README.en.md
86
73
  - README.md
87
74
  - Rakefile
88
75
  - bin/hscrub
89
76
  - hscrubber.gemspec
90
77
  - lib/hscrubber.rb
91
78
  - lib/hscrubber/version.rb
92
- has_rdoc: true
93
79
  homepage: https://github.com/3aHyga/hscrubber
94
80
  licenses: []
95
-
96
81
  post_install_message:
97
82
  rdoc_options: []
98
-
99
- require_paths:
83
+ require_paths:
100
84
  - lib
101
- required_ruby_version: !ruby/object:Gem::Requirement
85
+ required_ruby_version: !ruby/object:Gem::Requirement
102
86
  none: false
103
- requirements:
104
- - - ">="
105
- - !ruby/object:Gem::Version
106
- hash: 3
107
- segments:
108
- - 0
109
- version: "0"
110
- required_rubygems_version: !ruby/object:Gem::Requirement
87
+ requirements:
88
+ - - ! '>='
89
+ - !ruby/object:Gem::Version
90
+ version: '0'
91
+ required_rubygems_version: !ruby/object:Gem::Requirement
111
92
  none: false
112
- requirements:
113
- - - ">="
114
- - !ruby/object:Gem::Version
115
- hash: 15
116
- segments:
117
- - 1
118
- - 6
119
- - 0
93
+ requirements:
94
+ - - ! '>='
95
+ - !ruby/object:Gem::Version
120
96
  version: 1.6.0
121
97
  requirements: []
122
-
123
98
  rubyforge_project: hscrubber
124
- rubygems_version: 1.6.2
99
+ rubygems_version: 1.7.2
125
100
  signing_key:
126
101
  specification_version: 3
127
102
  summary: hscrubber is HTML scrubber
128
103
  test_files: []
129
-