dq-readability 0.2.0

data/Gemfile ADDED
@@ -0,0 +1,10 @@
1
+ source "https://rubygems.org"
2
+
3
+ gem 'fastimage', '~> 1.2.13'
4
+ gem 'rake'
5
+
6
+ group :test do
7
+ gem "fakeweb", "~> 1.3.0"
8
+ end
9
+
10
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,202 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [yyyy] [name of copyright owner]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
data/README.md ADDED
@@ -0,0 +1 @@
1
+
data/Rakefile ADDED
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require 'rspec/core/rake_task'
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
data/bin/readability ADDED
@@ -0,0 +1,39 @@
1
+ #!/usr/bin/env ruby
2
+ require 'rubygems'
3
+ require 'open-uri'
4
+ require 'optparse'
5
+ require File.dirname(__FILE__) + '/../lib/dq-readability'
6
+
7
+ options = { :debug => false, :images => false }
8
+ options_parser = OptionParser.new do |opts|
9
+ opts.banner = "Usage: #{File.basename($0)} [options] URL"
10
+
11
+ opts.on("-d", "--debug", "Show debug output") do |v|
12
+ options[:debug] = v
13
+ end
14
+
15
+ opts.on("-i", "--images", "Keep images and links") do |i|
16
+ options[:images] = i
17
+ end
18
+
19
+ opts.on_tail("-h", "--help", "Show this message") do
20
+ puts opts
21
+ exit
22
+ end
23
+ end
24
+ options_parser.parse!
25
+
26
+ if ARGV.length != 1
27
+ STDERR.puts options_parser
28
+ exit 1
29
+ end
30
+
31
+ text = open(ARGV.first).read
32
+ if options[:images]
33
+ puts Readability::Document.new(text, :tags => %w[div p img a],
34
+ :attributes => %w[src href],
35
+ :remove_empty_nodes => false,
36
+ :debug => options[:debug]).content
37
+ else
38
+ puts Readability::Document.new(text, :debug => options[:debug]).content
39
+ end
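The option handling in the script above can be sketched in isolation as follows. This is a minimal illustration, not code from the gem: `document_options_for` is a hypothetical helper name, and it only builds the option hash that the script would hand to `Readability::Document.new`, without fetching or parsing anything.

```ruby
require 'optparse'

# Hypothetical helper (not part of the gem): maps CLI arguments to the URL
# and the option hash the script above passes to Readability::Document.new.
def document_options_for(argv)
  flags = { :debug => false, :images => false }
  parser = OptionParser.new do |opts|
    opts.on("-d", "--debug", "Show debug output") { |v| flags[:debug] = v }
    opts.on("-i", "--images", "Keep images and links") { |v| flags[:images] = v }
  end
  rest = parser.parse(argv) # non-destructive; returns the leftover arguments

  doc_options = { :debug => flags[:debug] }
  if flags[:images]
    # Same extra options the script uses when --images is given.
    doc_options.merge!(:tags => %w[div p img a],
                       :attributes => %w[src href],
                       :remove_empty_nodes => false)
  end
  [rest.first, doc_options]
end

url, doc_options = document_options_for(%w[-i http://example.com/post])
```

With `--images`, the whitelist grows from the default `div` and `p` tags to include `img` and `a`, and the `src` and `href` attributes survive sanitizing. Without it, only the `:debug` flag is forwarded.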
@@ -0,0 +1,25 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+
4
+ Gem::Specification.new do |s|
5
+ s.name = "dq-readability"
6
+ s.version = '0.2.0'
7
+ s.authors = ["Prateek Papriwal"]
8
+ s.email = ["papriwalprateek@gmail.com"]
9
+ s.homepage = "http://github.com/DaQwest/dq-readability"
10
+ s.summary = %q{Port of arc90's readability project to ruby}
11
+ s.description = %q{Port of arc90's readability project to ruby. The base code is derived from https://github.com/cantino/ruby-readability}
12
+
13
+ s.rubyforge_project = "dq-readability"
14
+
15
+ s.files = `git ls-files`.split("\n")
16
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
17
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
18
+ s.require_paths = ["lib"]
19
+
20
+ s.add_development_dependency "rspec", ">= 2.8"
21
+ s.add_development_dependency "rspec-expectations", ">= 2.8"
22
+ s.add_development_dependency "rr", ">= 1.0"
23
+ s.add_dependency 'nokogiri', '>= 1.4.2'
24
+ s.add_dependency 'guess_html_encoding', '>= 0.0.4'
25
+ end
@@ -0,0 +1,515 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+ require 'nokogiri'
5
+ require 'guess_html_encoding'
6
+
7
+ module Readability
8
+ class Document
9
+ DEFAULT_OPTIONS = {
10
+ :retry_length => 250,
11
+ :min_text_length => 25,
12
+ :remove_unlikely_candidates => true,
13
+ :weight_classes => true,
14
+ :clean_conditionally => true,
15
+ :remove_empty_nodes => true,
16
+ :min_image_width => 130,
17
+ :min_image_height => 80,
18
+ :ignore_image_format => []
19
+ }.freeze
20
+
21
+ REGEXES = {
22
+ :unlikelyCandidatesRe => /combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup/i,
23
+ :okMaybeItsACandidateRe => /and|article|body|column|main|shadow/i,
24
+ :positiveRe => /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i,
25
+ :negativeRe => /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget/i,
26
+ :divToPElementsRe => /<(a|blockquote|dl|div|img|ol|p|pre|table|ul)/i,
27
+ :replaceBrsRe => /(<br[^>]*>[ \n\r\t]*){2,}/i,
28
+ :replaceFontsRe => /<(\/?)font[^>]*>/i,
29
+ :trimRe => /^\s+|\s+$/,
30
+ :normalizeRe => /\s{2,}/,
31
+ :killBreaksRe => /(<br\s*\/?>(\s|&nbsp;?)*){1,}/,
32
+ :videoRe => /http:\/\/(www\.)?(youtube|vimeo)\.com/i
33
+ }
34
+
35
+ attr_accessor :options, :html, :best_candidate, :candidates, :best_candidate_has_image
36
+
37
+ def initialize(input, options = {})
38
+ @options = DEFAULT_OPTIONS.merge(options)
39
+ @input = input
40
+
41
+ if RUBY_VERSION =~ /^(1\.9|[2-9])/ && !@options[:encoding]
42
+ @input = GuessHtmlEncoding.encode(@input, @options[:html_headers]) unless @options[:do_not_guess_encoding]
43
+ @options[:encoding] = @input.encoding.to_s
44
+ end
45
+
46
+ @input = @input.gsub(REGEXES[:replaceBrsRe], '</p><p>').gsub(REGEXES[:replaceFontsRe], '<\1span>')
47
+ @remove_unlikely_candidates = @options[:remove_unlikely_candidates]
48
+ @weight_classes = @options[:weight_classes]
49
+ @clean_conditionally = @options[:clean_conditionally]
50
+ @best_candidate_has_image = true
51
+ make_html
52
+ end
53
+
54
+ def prepare_candidates
55
+ @html.css("script, style").each { |i| i.remove }
56
+ remove_unlikely_candidates! if @remove_unlikely_candidates
57
+ transform_misused_divs_into_paragraphs!
58
+
59
+ @candidates = score_paragraphs(options[:min_text_length])
60
+ @best_candidate = select_best_candidate(@candidates)
61
+ end
62
+
63
+ def make_html
64
+ @html = Nokogiri::HTML(@input, nil, @options[:encoding])
65
+ # In case document has no body, such as from empty string or redirect
66
+ @html = Nokogiri::HTML('<body />', nil, @options[:encoding]) if @html.css('body').length == 0
67
+
68
+ # Remove html comment tags
69
+ @html.xpath('//comment()').each { |i| i.remove }
70
+ end
71
+
72
+ def images(content=nil, reload=false)
73
+ begin
74
+ require 'fastimage'
75
+ rescue LoadError
76
+ raise "Please install fastimage in order to use the #images feature."
77
+ end
78
+
79
+ @best_candidate_has_image = false if reload
80
+
81
+ prepare_candidates
82
+ list_images = []
83
+ tested_images = []
84
+ content = @best_candidate[:elem] unless reload
85
+
86
+ return list_images if content.nil?
87
+ elements = content.css("img").map(&:attributes)
88
+
89
+ elements.each do |element|
90
+ next unless element["src"]
91
+
92
+ url = element["src"].value
93
+ height = element["height"].nil? ? 0 : element["height"].value.to_i
94
+ width = element["width"].nil? ? 0 : element["width"].value.to_i
95
+
96
+ if url =~ /\Ahttps?:\/\//i && (height.zero? || width.zero?)
97
+ image = get_image_size(url)
98
+ next unless image
99
+ else
100
+ image = {:width => width, :height => height}
101
+ end
102
+
103
+ image[:format] = File.extname(url).gsub(".", "")
104
+
105
+ if tested_images.include?(url)
106
+ debug("Image was tested: #{url}")
107
+ next
108
+ end
109
+
110
+ tested_images.push(url)
111
+ if image_meets_criteria?(image)
112
+ list_images << url
113
+ else
114
+ debug("Image discarded: #{url} - height: #{image[:height]} - width: #{image[:width]} - format: #{image[:format]}")
115
+ end
116
+ end
117
+
118
+ (list_images.empty? and content != @html) ? images(@html, true) : list_images
119
+ end
120
+
121
+ def images_with_fqdn_uris!(source_uri)
122
+ images_with_fqdn_uris(@html, source_uri)
123
+ end
124
+
125
+ def images_with_fqdn_uris(document = @html.dup, source_uri)
126
+ uri = URI.parse(source_uri)
127
+ host = uri.host
128
+ scheme = uri.scheme
129
+ port = uri.port # falls back to the scheme's default port (80 for http, 443 for https)
130
+
131
+ base = "#{scheme}://#{host}:#{port}/"
132
+
133
+ images = []
134
+ document.css("img").each do |elem|
135
+ begin
136
+ elem['src'] = URI.join(base,elem['src']).to_s if URI.parse(elem['src']).host == nil
137
+ images << elem['src'].to_s
138
+ rescue URI::InvalidURIError => exc
139
+ elem.remove
140
+ end
141
+ end
142
+
143
+ images(document,true)
144
+ end
145
+
146
+ def get_image_size(url)
147
+ begin
148
+ w, h = FastImage.size(url)
149
+ raise "Couldn't get size." if w.nil? || h.nil?
150
+ {:width => w, :height => h}
151
+ rescue => e
152
+ debug("Image error: #{e}")
153
+ nil
154
+ end
155
+ end
156
+
157
+ def image_meets_criteria?(image)
158
+ return false if options[:ignore_image_format].include?(image[:format].downcase)
159
+ image[:width] >= (options[:min_image_width] || 0) && image[:height] >= (options[:min_image_height] || 0)
160
+ end
161
+
162
+ def title
163
+ title = @html.css("title").first
164
+ title ? title.text : nil
165
+ end
166
+
167
+ # Search the @html document for the author.
168
+ # Precedence information is on the wiki (TODO: attach wiki URL if it is accepted).
169
+ # Returns nil if no author is detected
170
+ def author
171
+ # Let's grab this author:
172
+ # <meta name="dc.creator" content="Finch - http://www.getfinch.com" />
173
+ author_elements = @html.xpath('//meta[@name = "dc.creator"]')
174
+ unless author_elements.empty?
175
+ author_elements.each do |element|
176
+ if element['content']
177
+ return element['content'].strip
178
+ end
179
+ end
180
+ end
181
+
182
+ # Now let's try to grab this
183
+ # <span class="byline author vcard"><span>By</span><cite class="fn">Austin Fonacier</cite></span>
184
+ # <div class="author">By</div><div class="author vcard"><a class="url fn" href="http://austinlivesinyoapp.com/">Austin Fonacier</a></div>
185
+ author_elements = @html.xpath('//*[contains(@class, "vcard")]//*[contains(@class, "fn")]')
186
+ unless author_elements.empty?
187
+ author_elements.each do |element|
188
+ if element.text
189
+ return element.text.strip
190
+ end
191
+ end
192
+ end
193
+
194
+ # Now let's try to grab this
195
+ # <a rel="author" href="http://dbanksdesign.com">Danny Banks (rel)</a>
196
+ # TODO: strip out the (rel)?
197
+ author_elements = @html.xpath('//a[@rel = "author"]')
198
+ unless author_elements.empty?
199
+ author_elements.each do |element|
200
+ if element.text
201
+ return element.text.strip
202
+ end
203
+ end
204
+ end
205
+
206
+ author_elements = @html.xpath('//*[@id = "author"]')
207
+ unless author_elements.empty?
208
+ author_elements.each do |element|
209
+ if element.text
210
+ return element.text.strip
211
+ end
212
+ end
213
+ end
214
+ end
215
+
216
+ def content(remove_unlikely_candidates = :default)
217
+ @remove_unlikely_candidates = false if remove_unlikely_candidates == false
218
+
219
+ prepare_candidates
220
+ article = get_article(@candidates, @best_candidate)
221
+
222
+ cleaned_article = sanitize(article, @candidates, options)
223
+ if article.text.strip.length < options[:retry_length]
224
+ if @remove_unlikely_candidates
225
+ @remove_unlikely_candidates = false
226
+ elsif @weight_classes
227
+ @weight_classes = false
228
+ elsif @clean_conditionally
229
+ @clean_conditionally = false
230
+ else
231
+ # nothing we can do
232
+ return cleaned_article
233
+ end
234
+
235
+ make_html
236
+ content
237
+ else
238
+ cleaned_article
239
+ end
240
+ end
241
+
242
+ def get_article(candidates, best_candidate)
243
+ # Now that we have the top candidate, look through its siblings for content that might also be related.
244
+ # Things like preambles, content split by ads that we removed, etc.
245
+
246
+ sibling_score_threshold = [10, best_candidate[:content_score] * 0.2].max
247
+ output = Nokogiri::XML::Node.new('div', @html)
248
+ best_candidate[:elem].parent.children.each do |sibling|
249
+ append = false
250
+ append = true if sibling == best_candidate[:elem]
251
+ append = true if candidates[sibling] && candidates[sibling][:content_score] >= sibling_score_threshold
252
+
253
+ if sibling.name.downcase == "p"
254
+ link_density = get_link_density(sibling)
255
+ node_content = sibling.text
256
+ node_length = node_content.length
257
+
258
+ if node_length > 80 && link_density < 0.25
259
+ append = true
260
+ elsif node_length < 80 && link_density == 0 && node_content =~ /\.( |$)/
261
+ append = true
262
+ end
263
+ end
264
+
265
+ if append
266
+ sibling_dup = sibling.dup # otherwise the state of the document in processing will change, thus creating side effects
267
+ sibling_dup.name = "div" unless %w[div p].include?(sibling.name.downcase)
268
+ output << sibling_dup
269
+ end
270
+ end
271
+
272
+ output
273
+ end
274
+
275
+ def select_best_candidate(candidates)
276
+ sorted_candidates = candidates.values.sort { |a, b| b[:content_score] <=> a[:content_score] }
277
+
278
+ debug("Top 5 candidates:")
279
+ sorted_candidates[0...5].each do |candidate|
280
+ debug("Candidate #{candidate[:elem].name}##{candidate[:elem][:id]}.#{candidate[:elem][:class]} with score #{candidate[:content_score]}")
281
+ end
282
+
283
+ best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }
284
+ debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
285
+
286
+ best_candidate
287
+ end
288
+
289
+ def get_link_density(elem)
290
+ link_length = elem.css("a").map(&:text).join("").length
291
+ text_length = elem.text.length
292
+ link_length / text_length.to_f
293
+ end
294
+
295
+ def score_paragraphs(min_text_length)
296
+ candidates = {}
297
+ @html.css("p,td").each do |elem|
298
+ parent_node = elem.parent
299
+ grand_parent_node = parent_node.respond_to?(:parent) ? parent_node.parent : nil
300
+ inner_text = elem.text
301
+
302
+ # Skip paragraphs shorter than min_text_length (25 characters by default).
303
+ next if inner_text.length < min_text_length
304
+
305
+ candidates[parent_node] ||= score_node(parent_node)
306
+ candidates[grand_parent_node] ||= score_node(grand_parent_node) if grand_parent_node
307
+
308
+ content_score = 1
309
+ content_score += inner_text.split(',').length
310
+ content_score += [(inner_text.length / 100).to_i, 3].min
311
+
312
+ candidates[parent_node][:content_score] += content_score
313
+ candidates[grand_parent_node][:content_score] += content_score / 2.0 if grand_parent_node
314
+ end
315
+
316
+ # Scale the final candidates score based on link density. Good content should have a
317
+ # relatively small link density (5% or less) and be mostly unaffected by this operation.
318
+ candidates.each do |elem, candidate|
319
+ candidate[:content_score] = candidate[:content_score] * (1 - get_link_density(elem))
320
+ end
321
+
322
+ candidates
323
+ end
324
+
325
+ def class_weight(e)
326
+ weight = 0
327
+ return weight unless @weight_classes
328
+
329
+ if e[:class] && e[:class] != ""
330
+ if e[:class] =~ REGEXES[:negativeRe]
331
+ weight -= 25
332
+ end
333
+
334
+ if e[:class] =~ REGEXES[:positiveRe]
335
+ weight += 25
336
+ end
337
+ end
338
+
339
+ if e[:id] && e[:id] != ""
340
+ if e[:id] =~ REGEXES[:negativeRe]
341
+ weight -= 25
342
+ end
343
+
344
+ if e[:id] =~ REGEXES[:positiveRe]
345
+ weight += 25
346
+ end
347
+ end
348
+
349
+ weight
350
+ end
351
+
352
+ def score_node(elem)
353
+ content_score = class_weight(elem)
354
+ case elem.name.downcase
355
+ when "div"
356
+ content_score += 5
357
+ when "blockquote"
358
+ content_score += 3
359
+ when "form"
360
+ content_score -= 3
361
+ when "th"
362
+ content_score -= 5
363
+ end
364
+ { :content_score => content_score, :elem => elem }
365
+ end
366
+
367
+ def debug(str)
368
+ puts str if options[:debug]
369
+ end
370
+
371
+ def remove_unlikely_candidates!
372
+ @html.css("*").each do |elem|
373
+ str = "#{elem[:class]}#{elem[:id]}"
374
+ if str =~ REGEXES[:unlikelyCandidatesRe] && str !~ REGEXES[:okMaybeItsACandidateRe] && (elem.name.downcase != 'html') && (elem.name.downcase != 'body')
375
+ debug("Removing unlikely candidate - #{str}")
376
+ elem.remove
377
+ end
378
+ end
379
+ end
380
+
381
+ def transform_misused_divs_into_paragraphs!
382
+ @html.css("*").each do |elem|
383
+ if elem.name.downcase == "div"
384
+ # transform <div>s that do not contain other block elements into <p>s
385
+ if elem.inner_html !~ REGEXES[:divToPElementsRe]
386
+ debug("Altering div(##{elem[:id]}.#{elem[:class]}) to p")
387
+ elem.name = "p"
388
+ end
389
+ else
390
+ # wrap text nodes in p tags
391
+ # elem.children.each do |child|
392
+ # if child.text?
393
+ # debug("wrapping text node with a p")
394
+ # child.swap("<p>#{child.text}</p>")
395
+ # end
396
+ # end
397
+ end
398
+ end
399
+ end
400
+
401
+ def sanitize(node, candidates, options = {})
402
+ node.css("h1, h2, h3, h4, h5, h6").each do |header|
403
+ header.remove if class_weight(header) < 0 || get_link_density(header) > 0.33
404
+ end
405
+
406
+ node.css("form, object, iframe, embed").each do |elem|
407
+ elem.remove
408
+ end
409
+
410
+ if @options[:remove_empty_nodes]
411
+ # remove <p> tags that have no text content - this will also remove p tags that contain only images.
412
+ node.css("p").each do |elem|
413
+ elem.remove if elem.content.strip.empty?
414
+ end
415
+ end
416
+
417
+ # Conditionally clean <table>s, <ul>s, and <div>s
418
+ clean_conditionally(node, candidates, "table, ul, div")
419
+
420
+ # We'll sanitize all elements using a whitelist
421
+ base_whitelist = @options[:tags] || %w[div p]
422
+ # We'll add whitespace instead of block elements,
423
+ # so a<br>b will have a nice space between them
424
+ base_replace_with_whitespace = %w[br hr h1 h2 h3 h4 h5 h6 dl dd ol li ul address blockquote center]
425
+
426
+ # Use a hash for speed (don't want to make a million calls to include?)
427
+ whitelist = Hash.new
428
+ base_whitelist.each {|tag| whitelist[tag] = true }
429
+ replace_with_whitespace = Hash.new
430
+ base_replace_with_whitespace.each { |tag| replace_with_whitespace[tag] = true }
431
+
432
+ ([node] + node.css("*")).each do |el|
433
+ # If element is in whitelist, delete all its attributes
434
+ if whitelist[el.node_name]
435
+ el.attributes.each { |a, x| el.delete(a) unless @options[:attributes] && @options[:attributes].include?(a.to_s) }
436
+
437
+ # Otherwise, replace the element with its contents
438
+ else
439
+ # If element is root, replace the node as a text node
440
+ if el.parent.nil?
441
+ node = Nokogiri::XML::Text.new(el.text, el.document)
442
+ break
443
+ else
444
+ if replace_with_whitespace[el.node_name]
445
+ el.swap(Nokogiri::XML::Text.new(' ' << el.text << ' ', el.document))
446
+ else
447
+ el.swap(Nokogiri::XML::Text.new(el.text, el.document))
448
+ end
449
+ end
450
+ end
451
+
452
+ end
453
+
454
+ s = Nokogiri::XML::Node::SaveOptions
455
+ save_opts = s::NO_DECLARATION | s::NO_EMPTY_TAGS | s::AS_XHTML
456
+ html = node.serialize(:save_with => save_opts)
457
+
458
+ # Collapse runs of line breaks and form feeds into a single newline
459
+ return html.gsub(/[\r\n\f]+/, "\n" )
460
+ end
461
+
462
+ def clean_conditionally(node, candidates, selector)
463
+ return unless @clean_conditionally
464
+ node.css(selector).each do |el|
465
+ weight = class_weight(el)
466
+ content_score = candidates[el] ? candidates[el][:content_score] : 0
467
+ name = el.name.downcase
468
+
469
+ if weight + content_score < 0
470
+ el.remove
471
+ debug("Conditionally cleaned #{name}##{el[:id]}.#{el[:class]} with weight #{weight} and content score #{content_score} because weight + content score was less than zero.")
472
+ elsif el.text.count(",") < 10
473
+ counts = %w[p img li a embed input].inject({}) { |m, kind| m[kind] = el.css(kind).length; m }
474
+ counts["li"] -= 100
475
+
476
+ # For every img under a noscript tag discount one from the count to avoid double counting
477
+ counts["img"] -= el.css("noscript").css("img").length
478
+
479
+ content_length = el.text.strip.length # Count the text length excluding any surrounding whitespace
480
+ link_density = get_link_density(el)
481
+ to_remove = false
482
+ reason = ""
483
+
484
+ if (counts["img"] > counts["p"]) && (counts["img"] > 1)
485
+ reason = "too many images"
486
+ to_remove = true
487
+ elsif counts["li"] > counts["p"] && name != "ul" && name != "ol"
488
+ reason = "more <li>s than <p>s"
489
+ to_remove = true
490
+ elsif counts["input"] > (counts["p"] / 3).to_i
491
+ reason = "less than 3x <p>s than <input>s"
492
+ to_remove = true
493
+ elsif (content_length < options[:min_text_length]) && (counts["img"] != 1)
494
+ reason = "too short a content length without a single image"
495
+ to_remove = true
496
+ elsif weight < 25 && link_density > 0.2
497
+ reason = "too many links for its weight (#{weight})"
498
+ to_remove = true
499
+ elsif weight >= 25 && link_density > 0.5
500
+ reason = "too many links for its weight (#{weight})"
501
+ to_remove = true
502
+ elsif (counts["embed"] == 1 && content_length < 75) || counts["embed"] > 1
503
+ reason = "<embed>s with too short a content length, or too many <embed>s"
504
+ to_remove = true
505
+ end
506
+
507
+ if to_remove
508
+ debug("Conditionally cleaned #{name}##{el[:id]}.#{el[:class]} with weight #{weight} and content score #{content_score} because it has #{reason}.")
509
+ el.remove
510
+ end
511
+ end
512
+ end
513
+ end
514
+ end
515
+ end
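The scoring arithmetic in `score_paragraphs`, `class_weight`, and `get_link_density` can be tried without Nokogiri. The sketch below restates it on plain strings and numbers. The helper names are hypothetical, the two regexes are copied from `REGEXES` above, and the zero-length guard in `scale_by_link_density` is an added assumption that the original code does not have.

```ruby
# Regexes copied from REGEXES[:positiveRe] and REGEXES[:negativeRe] above.
POSITIVE_RE = /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i
NEGATIVE_RE = /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget/i

# Hypothetical helper: the per-paragraph score added to the parent node
# (the grandparent receives half of it).
def paragraph_score(text)
  score = 1
  score += text.split(',').length      # one point per comma-separated segment
  score += [text.length / 100, 3].min  # up to 3 points for length
  score
end

# Hypothetical helper: the class/id weighting, applied to raw attribute strings.
def attribute_weight(css_class, id)
  [css_class, id].inject(0) do |weight, value|
    next weight if value.nil? || value.empty?
    weight -= 25 if value =~ NEGATIVE_RE
    weight += 25 if value =~ POSITIVE_RE
    weight
  end
end

# Hypothetical helper: the final scaling by link density.
# Assumption: an empty node scores 0.0 instead of dividing by zero.
def scale_by_link_density(score, link_chars, total_chars)
  return 0.0 if total_chars.zero?
  score * (1 - link_chars / total_chars.to_f)
end

base = attribute_weight("article-body", nil) + 5   # +5 is the bonus for a <div>
base += paragraph_score("One, two, three.")
final = scale_by_link_density(base, 25, 100)
```

A node whose class or id matches both regexes nets zero for that attribute, so the weighting only moves the score when the signals agree. The link-density scaling is applied once, after all paragraph scores have been accumulated.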