extractula 0.0.1 → 0.0.2
- data/README.textile +97 -0
- data/lib/extractula.rb +16 -4
- data/lib/extractula/custom_extractors.rb +5 -0
- data/lib/extractula/custom_extractors/dinosaur_comics.rb +9 -0
- data/lib/extractula/custom_extractors/flickr.rb +8 -0
- data/lib/extractula/custom_extractors/you_tube.rb +8 -0
- data/lib/extractula/extracted_content.rb +1 -1
- data/lib/extractula/extractor.rb +151 -0
- data/lib/extractula/oembed.rb +124 -0
- data/spec/extractula/extracted_content_spec.rb +8 -0
- data/spec/extractula_spec.rb +5 -5
- data/spec/spec_helper.rb +1 -0
- metadata +7 -3
- data/lib/extractula/dom_extractor.rb +0 -68
- data/spec/extractula/dom_extractor_spec.rb +0 -109
data/README.textile
CHANGED
@@ -0,0 +1,97 @@
|
|
1
|
+
h1. Extractula
|
2
|
+
|
3
|
+
"http://github.com/pauldix/extractula":http://github.com/pauldix/extractula
|
4
|
+
|
5
|
+
h2. Summary
|
6
|
+
|
7
|
+
Extracts content like title, summary, and images from web pages like Dracula extracts blood: with care and finesse.
|
8
|
+
|
9
|
+
h2. Description
|
10
|
+
|
11
|
+
Extractula attempts to extract the core content from a web page. For a news article or blog post this would be the content of the article itself. For a GitHub project this would be the main README file. The library also lets you write your own custom extractors, which is useful when you want dedicated support for popular sites.
|
12
|
+
|
13
|
+
h2. Installation
|
14
|
+
|
15
|
+
<pre>
|
16
|
+
gem install extractula --source http://gemcutter.org
|
17
|
+
</pre>
|
18
|
+
|
19
|
+
h2. Use
|
20
|
+
|
21
|
+
<pre>
|
22
|
+
require 'extractula'
|
23
|
+
some_html = "..." # get some html to extract, yo!
|
24
|
+
|
25
|
+
extracted_content = Extractula.extract(url, some_html)
|
26
|
+
extracted_content.title # pulled from the page
|
27
|
+
extracted_content.url # what you passed in
|
28
|
+
extracted_content.content # the main content body (article, blog post, etc)
|
29
|
+
extracted_content.summary # an automatically generated plain text summary of the content
|
30
|
+
extracted_content.image_urls # the urls for images that appear in the content
|
31
|
+
extracted_content.video_embed # the embed code if a video is embedded in the content
|
32
|
+
|
33
|
+
Extractula.add_extractor(SomeClass) # so you can add a custom extractor
|
34
|
+
</pre>
|
35
|
+
|
36
|
+
h3. Custom Extractors
|
37
|
+
|
38
|
+
The "Use" section showed adding a custom extractor. This should be a class that at a minimum implements the following methods.
|
39
|
+
|
40
|
+
<pre>
|
41
|
+
class MyCustomExtractor
|
42
|
+
def self.can_extract?(url, html)
|
43
|
+
end
|
44
|
+
|
45
|
+
def extract(url, html)
|
46
|
+
# should return an Extractula::ExtractedContent object
|
47
|
+
end
|
48
|
+
end
|
49
|
+
</pre>
|
50
|
+
|
51
|
+
Notice that can_extract? is a class method while extract is an instance method. Extract should return an ExtractedContent object.
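The contract above can be sketched without the gem itself. In this sketch `MiniRegistry`, `Content`, and `ExampleComExtractor` are hypothetical stand-ins, and plain strings are used where the library passes parsed objects. The dispatch follows the `Extractula.extract` method in the `data/lib/extractula.rb` hunk of this diff, which instantiates the matching class with the URL and HTML and then calls `extract` with no arguments (the updated specs also define `extract` without parameters).

```ruby
# Stand-in for Extractula::ExtractedContent (hypothetical, for illustration only).
Content = Struct.new(:url, :title, :content)

# Stand-in registry that mimics the dispatch in Extractula.extract.
module MiniRegistry
  @extractors = []

  def self.add_extractor(klass)
    @extractors << klass
  end

  # Picks the first registered class whose can_extract? returns true.
  # Returns nil when nothing matches (the gem falls back to a default extractor).
  def self.extract(url, html)
    klass = @extractors.detect { |e| e.can_extract?(url, html) }
    klass && klass.new(url, html).extract
  end
end

class ExampleComExtractor
  # Class method: decides whether this extractor applies to the page.
  def self.can_extract?(url, html)
    url.include?("example.com")
  end

  def initialize(url, html)
    @url = url
    @html = html
  end

  # Instance method: builds the result object from the stored url and html.
  def extract
    title = @html[/<title>(.*?)<\/title>/m, 1]
    Content.new(@url, title, @html)
  end
end

MiniRegistry.add_extractor(ExampleComExtractor)
result = MiniRegistry.extract("http://example.com/post", "<title>Hello</title><p>Body</p>")
puts result.title
```

Keeping `can_extract?` on the class lets the registry test candidates cheaply before any instance is created.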
|
52
|
+
|
53
|
+
h3. ExtractedContent
|
54
|
+
|
55
|
+
The ExtractedContent object holds the results of an extraction. It additionally has methods to automatically generate a summary, image_urls, and video_embed code from the content. If you implement a custom extractor and want to provide the summary, image_urls, and video_embed, simply pass those values into the constructor for ExtractedContent. Here are some examples:
|
56
|
+
|
57
|
+
<pre>
|
58
|
+
extracted_content = ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...")
|
59
|
+
extracted_content.summary # auto-generated from content
|
60
|
+
extracted_content.image_urls # auto-generated from content
|
61
|
+
extracted_content.video_embed # auto-generated from content
|
62
|
+
|
63
|
+
extracted_content = ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...",
|
64
|
+
:summary => "a summary", :image_urls => ["foo.jpg"], :video_embed => "blah")
|
65
|
+
extracted_content.summary # "a summary"
|
66
|
+
extracted_content.image_urls # ["foo.jpg"]
|
67
|
+
extracted_content.video_embed # "blah"
|
68
|
+
</pre>
|
69
|
+
|
70
|
+
Zero, one, or more of the values can be passed into the ExtractedContent constructor. It will auto-generate ones not passed in and keep the others.
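The "use what was passed in, otherwise generate it" behaviour can be sketched as follows. The diff does not show how ExtractedContent derives its values, so `SketchContent` and its generation rules (tag stripping for the summary, a regular expression for image sources) are illustrative assumptions, not the gem's algorithm.

```ruby
# Hypothetical stand-in for ExtractedContent; the generation rules below are
# illustrative assumptions, not the gem's actual algorithm.
class SketchContent
  attr_reader :url, :content

  def initialize(attrs = {})
    @url        = attrs[:url]
    @content    = attrs[:content]
    @summary    = attrs[:summary]
    @image_urls = attrs[:image_urls]
  end

  # Returns the supplied summary, or derives one by stripping tags,
  # collapsing whitespace, and keeping the first 80 characters.
  def summary
    @summary ||= @content.to_s.gsub(/<[^>]+>/, " ").split.join(" ")[0, 80]
  end

  # Returns the supplied list, or collects src attributes from img tags.
  def image_urls
    @image_urls ||= @content.to_s.scan(/<img[^>]*\ssrc="([^"]+)"/).flatten
  end
end

generated = SketchContent.new(:url => "http://pauldix.net",
                              :content => '<p>Hi <img src="foo.jpg"> there</p>')
puts generated.summary
puts generated.image_urls.inspect

supplied = SketchContent.new(:content => "<p>ignored</p>", :summary => "a summary")
puts supplied.summary
```

Memoizing with `||=` means a value passed to the constructor is never overwritten, and a generated value is computed at most once.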
|
71
|
+
|
72
|
+
h2. LICENSE
|
73
|
+
|
74
|
+
(The MIT License)
|
75
|
+
|
76
|
+
Copyright (c) 2009:
|
77
|
+
|
78
|
+
"Paul Dix":http://pauldix.net
|
79
|
+
|
80
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
81
|
+
a copy of this software and associated documentation files (the
|
82
|
+
'Software'), to deal in the Software without restriction, including
|
83
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
84
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
85
|
+
permit persons to whom the Software is furnished to do so, subject to
|
86
|
+
the following conditions:
|
87
|
+
|
88
|
+
The above copyright notice and this permission notice shall be
|
89
|
+
included in all copies or substantial portions of the Software.
|
90
|
+
|
91
|
+
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
|
92
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
93
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
|
94
|
+
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
|
95
|
+
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
|
96
|
+
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
|
97
|
+
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/lib/extractula.rb
CHANGED
@@ -3,10 +3,13 @@ $LOAD_PATH.unshift(File.dirname(__FILE__)) unless $LOAD_PATH.include?(File.dirna
|
|
3
3
|
module Extractula; end
|
4
4
|
|
5
5
|
require 'nokogiri'
|
6
|
+
require 'domainatrix'
|
6
7
|
require 'extractula/extracted_content'
|
7
|
-
require 'extractula/
|
8
|
+
require 'extractula/extractor'
|
8
9
|
|
9
10
|
module Extractula
|
11
|
+
VERSION = "0.0.2"
|
12
|
+
|
10
13
|
@extractors = []
|
11
14
|
|
12
15
|
def self.add_extractor(extractor_class)
|
@@ -18,7 +21,16 @@ module Extractula
|
|
18
21
|
end
|
19
22
|
|
20
23
|
def self.extract(url, html)
|
21
|
-
|
22
|
-
|
24
|
+
parsed_url = Domainatrix.parse(url)
|
25
|
+
parsed_html = Nokogiri::HTML(html)
|
26
|
+
extractor = @extractors.detect {|e| e.can_extract? parsed_url, parsed_html} || Extractor
|
27
|
+
extractor.new(parsed_url, parsed_html).extract
|
28
|
+
end
|
29
|
+
|
30
|
+
def self.custom_extractor(config = {})
|
31
|
+
klass = Class.new(Extractula::Extractor)
|
32
|
+
klass.include(Extractula::OEmbed) if config.delete(:oembed)
|
33
|
+
config.each { |option, args| klass.__send__(option, *args) }
|
34
|
+
klass
|
23
35
|
end
|
24
|
-
end
|
36
|
+
end
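The `custom_extractor` method above configures an anonymous subclass by treating each config key as the name of a class-level method and splatting the value as its arguments. A minimal standalone sketch of that pattern, assuming a hypothetical base class `SketchExtractor` and helper `build_extractor` (neither is part of the gem):

```ruby
# Hypothetical base class with two class-level setters, standing in for
# Extractula::Extractor.
class SketchExtractor
  class << self
    attr_reader :extractable_domain, :title_selector, :title_attr

    def domain(name)
      @extractable_domain = name
    end

    def title_path(path, attrib = :text)
      @title_selector = path
      @title_attr = attrib
    end
  end
end

# Builds an anonymous subclass and applies each config entry by calling the
# class method named by the key, splatting the value as its arguments.
def build_extractor(config = {})
  klass = Class.new(SketchExtractor)
  config.each { |option, args| klass.__send__(option, *args) }
  klass
end

klass = build_extractor(:domain => "example", :title_path => ["//h1", :text])
puts klass.extractable_domain
puts klass.title_selector
puts klass.title_attr.inspect
```

Because `*args` wraps a non-array value in a one-element list, a key can map either to a single value or to an array of arguments.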
|
data/lib/extractula/extractor.rb
@@ -0,0 +1,151 @@
|
|
1
|
+
# Abstract (more or less) extractor class from which custom extractor
|
2
|
+
# classes should descend. Subclasses of Extractula::Extractor will be
|
3
|
+
# automatically added to the Extractula module.
|
4
|
+
|
5
|
+
class Extractula::Extractor
|
6
|
+
def self.inherited subclass
|
7
|
+
Extractula.add_extractor subclass
|
8
|
+
end
|
9
|
+
|
10
|
+
def self.domain domain
|
11
|
+
@extractable_domain = domain
|
12
|
+
end
|
13
|
+
|
14
|
+
def self.can_extract? url, html
|
15
|
+
@extractable_domain ? @extractable_domain == url.domain : false
|
16
|
+
end
|
17
|
+
|
18
|
+
%w{title content summary image_urls video_embed }.each do |field|
|
19
|
+
class_eval <<-EOS
|
20
|
+
def self.#{field}_path(path = nil, attrib = nil)
|
21
|
+
if path
|
22
|
+
@#{field}_path = path
|
23
|
+
@#{field}_attr = attrib || :text
|
24
|
+
end
|
25
|
+
@#{field}_path
|
26
|
+
end
|
27
|
+
|
28
|
+
def self.#{field}_attr(attrib = nil)
|
29
|
+
@#{field}_attr = attrib if attrib
|
30
|
+
@#{field}_attr
|
31
|
+
end
|
32
|
+
|
33
|
+
def #{field}_path
|
34
|
+
self.class.#{field}_path
|
35
|
+
end
|
36
|
+
|
37
|
+
def #{field}_attr
|
38
|
+
self.class.#{field}_attr
|
39
|
+
end
|
40
|
+
EOS
|
41
|
+
end
|
42
|
+
|
43
|
+
attr_reader :url, :html
|
44
|
+
|
45
|
+
def initialize url, html
|
46
|
+
@url = url.is_a?(Domainatrix::Url) ? url : Domainatrix.parse(url)
|
47
|
+
@html = html.is_a?(Nokogiri::HTML::Document) ? html : Nokogiri::HTML(html)
|
48
|
+
end
|
49
|
+
|
50
|
+
def extract
|
51
|
+
Extractula::ExtractedContent.new({
|
52
|
+
:url => url.url,
|
53
|
+
:title => title,
|
54
|
+
:content => content,
|
55
|
+
:summary => summary,
|
56
|
+
:image_urls => image_urls,
|
57
|
+
:video_embed => video_embed
|
58
|
+
})
|
59
|
+
end
|
60
|
+
|
61
|
+
def title
|
62
|
+
content_at(title_path, title_attr) || content_at("//title")
|
63
|
+
end
|
64
|
+
|
65
|
+
def content
|
66
|
+
content_at(content_path, content_attr) || extract_content
|
67
|
+
end
|
68
|
+
|
69
|
+
def summary
|
70
|
+
content_at(summary_path, summary_attr)
|
71
|
+
end
|
72
|
+
|
73
|
+
def image_urls
|
74
|
+
if image_urls_path
|
75
|
+
html.search(image_urls_path).collect { |img| img['src'].strip }
|
76
|
+
end
|
77
|
+
end
|
78
|
+
|
79
|
+
def video_embed
|
80
|
+
if video_embed_path
|
81
|
+
html.search(video_embed_path).collect { |embed| embed.to_html }.first
|
82
|
+
end
|
83
|
+
end
|
84
|
+
|
85
|
+
private
|
86
|
+
|
87
|
+
def content_at(path, attrib = :text)
|
88
|
+
if path
|
89
|
+
if node = html.at(path)
|
90
|
+
attrib == :text ? node.text.strip : node[attrib].strip
|
91
|
+
end
|
92
|
+
end
|
93
|
+
end
|
94
|
+
|
95
|
+
def extract_content
|
96
|
+
candidate_nodes = html.search("//div|//p|//br").collect do |node|
|
97
|
+
parent = node.parent
|
98
|
+
if node.node_name == 'div'
|
99
|
+
text_size = calculate_children_text_size(parent, "div")
|
100
|
+
|
101
|
+
if text_size > 0
|
102
|
+
{:text_size => text_size, :parent => parent}
|
103
|
+
else
|
104
|
+
nil
|
105
|
+
end
|
106
|
+
elsif node.node_name == "p"
|
107
|
+
text_size = calculate_children_text_size(parent, "p")
|
108
|
+
|
109
|
+
if text_size > 0
|
110
|
+
{:text_size => text_size, :parent => parent}
|
111
|
+
else
|
112
|
+
nil
|
113
|
+
end
|
114
|
+
elsif node.node_name == "br"
|
115
|
+
begin
|
116
|
+
if node.previous.node_name == "text" && node.next.node_name == "text"
|
117
|
+
text_size = 0
|
118
|
+
parent.children.each do |child|
|
119
|
+
text_size += child.text.strip.size if child.node_name == "text"
|
120
|
+
end
|
121
|
+
|
122
|
+
if text_size > 0
|
123
|
+
{:text_size => text_size, :parent => parent}
|
124
|
+
else
|
125
|
+
nil
|
126
|
+
end
|
127
|
+
else
|
128
|
+
nil
|
129
|
+
end
|
130
|
+
rescue => e
|
131
|
+
nil
|
132
|
+
end
|
133
|
+
else
|
134
|
+
nil
|
135
|
+
end
|
136
|
+
end.compact.uniq
|
137
|
+
|
138
|
+
fragment = candidate_nodes.detect {|n| n[:text_size] > 140}[:parent].inner_html.strip rescue ""
|
139
|
+
end
|
140
|
+
|
141
|
+
def calculate_children_text_size(parent, node_type)
|
142
|
+
text_size = 0
|
143
|
+
parent.children.each do |child|
|
144
|
+
if child.node_name == node_type
|
145
|
+
child.children.each {|c| text_size += c.text.strip.size if c.node_name == "text"}
|
146
|
+
end
|
147
|
+
end
|
148
|
+
|
149
|
+
text_size
|
150
|
+
end
|
151
|
+
end
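Two mechanisms in the class above are worth isolating: the `inherited` hook that registers every subclass, and the domain stored in a class-level instance variable so that each subclass keeps its own value. The sketch below uses hypothetical names (`SketchRegistry`, `SketchBase`, `FlickrSketch`, `YouTubeSketch`) and simplifies `can_extract?` to take a bare domain string instead of a parsed URL and document.

```ruby
# Hypothetical registry standing in for the Extractula module.
module SketchRegistry
  def self.extractors
    @extractors ||= []
  end
end

class SketchBase
  # Ruby calls this hook whenever a class inherits from SketchBase.
  def self.inherited(subclass)
    super
    SketchRegistry.extractors << subclass
  end

  # Stores the domain on the receiving class itself (a class-level instance
  # variable), so each subclass keeps its own value.
  def self.domain(name)
    @extractable_domain = name
  end

  # False when no domain was declared, as on the base class.
  def self.can_extract?(url_domain)
    @extractable_domain ? @extractable_domain == url_domain : false
  end
end

class FlickrSketch < SketchBase
  domain "flickr"
end

class YouTubeSketch < SketchBase
  domain "youtube"
end

puts SketchRegistry.extractors.inspect
puts FlickrSketch.can_extract?("flickr")
puts SketchBase.can_extract?("flickr")
```

Since the base class never declares a domain, it never claims a page through `can_extract?`, which fits its role as the fallback in the dispatch code.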
|
data/lib/extractula/oembed.rb
@@ -0,0 +1,124 @@
|
|
1
|
+
require 'typhoeus'
|
2
|
+
require 'json'
|
3
|
+
|
4
|
+
module Extractula
|
5
|
+
module OEmbed
|
6
|
+
|
7
|
+
def self.included(base)
|
8
|
+
base.class_eval {
|
9
|
+
extend Extractula::OEmbed::ClassMethods
|
10
|
+
include Extractula::OEmbed::InstanceMethods
|
11
|
+
}
|
12
|
+
end
|
13
|
+
|
14
|
+
def self.request(request)
|
15
|
+
http_response = Typhoeus::Request.get(request)
|
16
|
+
if http_response.code == 200
|
17
|
+
Extractula::OEmbed::Response.new(http_response.body)
|
18
|
+
else
|
19
|
+
# do something
|
20
|
+
end
|
21
|
+
end
|
22
|
+
|
23
|
+
module ClassMethods
|
24
|
+
def oembed_endpoint(url = nil)
|
25
|
+
if url
|
26
|
+
@oembed_endpoint = url
|
27
|
+
if @oembed_endpoint.match(/\.(xml|json)$/)
|
28
|
+
@oembed_format_param_required = false
|
29
|
+
@oembed_endpoint.sub!(/\.xml$/, '.json') if $1 == 'xml'
|
30
|
+
else
|
31
|
+
@oembed_format_param_required = true
|
32
|
+
end
|
33
|
+
end
|
34
|
+
@oembed_endpoint
|
35
|
+
end
|
36
|
+
|
37
|
+
def oembed_max_width(width = nil)
|
38
|
+
@oembed_max_width = width if width
|
39
|
+
@oembed_max_width
|
40
|
+
end
|
41
|
+
|
42
|
+
def oembed_max_height(height = nil)
|
43
|
+
@oembed_max_height = height if height
|
44
|
+
@oembed_max_height
|
45
|
+
end
|
46
|
+
|
47
|
+
def oembed_format_param_required?
|
48
|
+
@oembed_format_param_required
|
49
|
+
end
|
50
|
+
end
|
51
|
+
|
52
|
+
module InstanceMethods
|
53
|
+
def initialize(*args)
|
54
|
+
super
|
55
|
+
@oembed = Extractula::OEmbed.request(oembed_request)
|
56
|
+
end
|
57
|
+
|
58
|
+
def oembed_endpoint
|
59
|
+
self.class.oembed_endpoint
|
60
|
+
end
|
61
|
+
|
62
|
+
def oembed_max_width
|
63
|
+
self.class.oembed_max_width
|
64
|
+
end
|
65
|
+
|
66
|
+
def oembed_max_height
|
67
|
+
self.class.oembed_max_height
|
68
|
+
end
|
69
|
+
|
70
|
+
def oembed_format_param_required?
|
71
|
+
self.class.oembed_format_param_required?
|
72
|
+
end
|
73
|
+
|
74
|
+
def oembed
|
75
|
+
@oembed
|
76
|
+
end
|
77
|
+
|
78
|
+
def oembed_request
|
79
|
+
request = "#{oembed_endpoint}?url=#{url.url}"
|
80
|
+
request += "&format=json" if oembed_format_param_required?
|
81
|
+
request += "&maxwidth=#{oembed_max_width}" if oembed_max_width
|
82
|
+
request += "&maxheight=#{oembed_max_height}" if oembed_max_height
|
83
|
+
request
|
84
|
+
end
|
85
|
+
|
86
|
+
def title
|
87
|
+
oembed.title
|
88
|
+
end
|
89
|
+
|
90
|
+
def image_urls
|
91
|
+
[ oembed.url ] if oembed.type == 'photo'
|
92
|
+
end
|
93
|
+
|
94
|
+
def video_embed
|
95
|
+
oembed.html
|
96
|
+
end
|
97
|
+
end
|
98
|
+
|
99
|
+
class Response
|
100
|
+
|
101
|
+
FIELDS = %w{ type version title author_name author_url
|
102
|
+
provider_name provider_url cache_age thumbnail_url
|
103
|
+
thumbnail_width thumbnail_height }
|
104
|
+
|
105
|
+
FIELDS.each { |field| attr_reader field.to_sym }
|
106
|
+
attr_reader :width, :height, :url, :html
|
107
|
+
|
108
|
+
def initialize response
|
109
|
+
@doc = ::JSON.parse(response)
|
110
|
+
FIELDS.each { |field| instance_variable_set "@#{field}", @doc[field] }
|
111
|
+
unless @type == 'link'
|
112
|
+
@width = @doc['width']
|
113
|
+
@height = @doc['height']
|
114
|
+
if @type == 'photo'
|
115
|
+
@url = @doc['url']
|
116
|
+
else
|
117
|
+
@html = @doc['html']
|
118
|
+
end
|
119
|
+
end
|
120
|
+
end
|
121
|
+
|
122
|
+
end
|
123
|
+
end
|
124
|
+
end
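The request building and response handling above can be tried in isolation with the standard library. `sketch_oembed_request` is a hypothetical helper, and the endpoint and page URLs are placeholders. Unlike `oembed_request` in the diff, this variant percent-encodes the page URL, which is an added assumption: without encoding, a page URL that carries its own query string would have its `&` and `=` characters read as oEmbed parameters.

```ruby
require 'uri'
require 'json'

# Builds an oEmbed request URL. Unlike the oembed_request method in the diff,
# this variant percent-encodes the page URL (an added assumption).
def sketch_oembed_request(endpoint, page_url, options = {})
  params = { "url" => page_url }
  params["format"]    = "json"               if options[:format_param]
  params["maxwidth"]  = options[:max_width]  if options[:max_width]
  params["maxheight"] = options[:max_height] if options[:max_height]
  "#{endpoint}?#{URI.encode_www_form(params)}"
end

request = sketch_oembed_request("http://example.com/oembed",
                                "http://example.com/photos/1?size=large",
                                :format_param => true, :max_width => 500)
puts request

# Reading fields from a JSON body shaped like an oEmbed photo response.
doc = JSON.parse('{"type":"photo","title":"A photo","url":"http://example.com/1.jpg"}')
puts doc["title"]
puts doc["url"] if doc["type"] == "photo"
```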
|
data/spec/extractula/extracted_content_spec.rb
@@ -1,3 +1,4 @@
|
|
1
|
+
# coding: utf-8
|
1
2
|
require File.dirname(__FILE__) + '/../spec_helper'
|
2
3
|
|
3
4
|
describe "extracted content" do
|
@@ -45,4 +46,11 @@ describe "extracted content" do
|
|
45
46
|
extracted.video_embed.should == "<object width=\"425\" height=\"344\"><param name=\"movie\" value=\"http://www.youtube.com/v/0dHTIGas4CA&color1=0x3a3a3a&color2=0x999999&hl=en_US&feature=player_embedded&fs=1\">\n<param name=\"allowFullScreen\" value=\"true\">\n<param name=\"allowScriptAccess\" value=\"always\">\n<embed wmode=\"opaque\" src=\"http://www.youtube.com/v/0dHTIGas4CA&color1=0x3a3a3a&color2=0x999999&hl=en_US&feature=player_embedded&fs=1\" type=\"application/x-shockwave-flash\" allowfullscreen=\"true\" allowscriptaccess=\"always\" width=\"425\" height=\"344\"></embed></object>"
|
46
47
|
end
|
47
48
|
end
|
49
|
+
|
50
|
+
describe "some regressions" do
|
51
|
+
it "doesn't error with undefined method 'node_name' for nil:NilClass when looking at <br /> elements" do
|
52
|
+
extracted = Extractula::Extractor.new("http://viceland.com/caprica/", read_test_file("node-name-error.html")).extract
|
53
|
+
extracted.title.should == "Syfy + Motherboard.tv Caprica Screenings Contest"
|
54
|
+
end
|
55
|
+
end
|
48
56
|
end
|
data/spec/extractula_spec.rb
CHANGED
@@ -2,12 +2,12 @@ require File.dirname(__FILE__) + '/spec_helper'
|
|
2
2
|
|
3
3
|
describe "extractula" do
|
4
4
|
it "can add custom extractors" do
|
5
|
-
custom_extractor = Class.new do
|
5
|
+
custom_extractor = Class.new(Extractula::Extractor) do
|
6
6
|
def self.can_extract? url, html
|
7
7
|
true
|
8
8
|
end
|
9
9
|
|
10
|
-
def extract
|
10
|
+
def extract
|
11
11
|
Extractula::ExtractedContent.new :url => "custom extractor url", :summary => "my custom extractor"
|
12
12
|
end
|
13
13
|
end
|
@@ -20,12 +20,12 @@ describe "extractula" do
|
|
20
20
|
end
|
21
21
|
|
22
22
|
it "skips custom extractors that can't extract the passed url and html" do
|
23
|
-
custom_extractor = Class.new do
|
23
|
+
custom_extractor = Class.new(Extractula::Extractor) do
|
24
24
|
def self.can_extract? url, html
|
25
25
|
false
|
26
26
|
end
|
27
27
|
|
28
|
-
def extract
|
28
|
+
def extract
|
29
29
|
Extractula::ExtractedContent.new :url => "this url", :summary => "this summary"
|
30
30
|
end
|
31
31
|
end
|
@@ -42,4 +42,4 @@ describe "extractula" do
|
|
42
42
|
result.should be_a Extractula::ExtractedContent
|
43
43
|
result.url.should == "http://pauldix.net"
|
44
44
|
end
|
45
|
-
end
|
45
|
+
end
|
data/spec/spec_helper.rb
CHANGED
@@ -8,6 +8,7 @@ path = File.expand_path(File.dirname(__FILE__) + "/../lib/")
|
|
8
8
|
$LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)
|
9
9
|
|
10
10
|
require "lib/extractula"
|
11
|
+
require "lib/extractula/custom_extractors"
|
11
12
|
|
12
13
|
def read_test_file(file_name)
|
13
14
|
File.read("#{File.dirname(__FILE__)}/test-files/#{file_name}")
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: extractula
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Paul Dix
|
@@ -32,14 +32,18 @@ extra_rdoc_files: []
|
|
32
32
|
|
33
33
|
files:
|
34
34
|
- lib/extractula.rb
|
35
|
+
- lib/extractula/custom_extractors.rb
|
35
36
|
- lib/extractula/extracted_content.rb
|
36
|
-
- lib/extractula/
|
37
|
+
- lib/extractula/extractor.rb
|
38
|
+
- lib/extractula/oembed.rb
|
39
|
+
- lib/extractula/custom_extractors/dinosaur_comics.rb
|
40
|
+
- lib/extractula/custom_extractors/flickr.rb
|
41
|
+
- lib/extractula/custom_extractors/you_tube.rb
|
37
42
|
- README.textile
|
38
43
|
- spec/spec.opts
|
39
44
|
- spec/spec_helper.rb
|
40
45
|
- spec/extractula_spec.rb
|
41
46
|
- spec/extractula/extracted_content_spec.rb
|
42
|
-
- spec/extractula/dom_extractor_spec.rb
|
43
47
|
- spec/test-files/10-stunning-web-site-prototype-sketches.html
|
44
48
|
- spec/test-files/totlol-youtube.html
|
45
49
|
- spec/test-files/typhoeus-the-best-ruby-http-client-just-got-better.html
|
data/lib/extractula/dom_extractor.rb
@@ -1,68 +0,0 @@
|
|
1
|
-
# a basic dom based extractor. it's a generic catch all
|
2
|
-
class Extractula::DomExtractor
|
3
|
-
def extract url, html
|
4
|
-
@doc = Nokogiri::HTML(html)
|
5
|
-
extracted = Extractula::ExtractedContent.new :url => url, :title => title, :content => content
|
6
|
-
end
|
7
|
-
|
8
|
-
def title
|
9
|
-
@title ||= @doc.search("//title").first.text.strip rescue nil
|
10
|
-
end
|
11
|
-
|
12
|
-
def content
|
13
|
-
candidate_nodes = @doc.search("//div|//p|//br").collect do |node|
|
14
|
-
parent = node.parent
|
15
|
-
if node.node_name == 'div'
|
16
|
-
text_size = calculate_children_text_size(parent, "div")
|
17
|
-
|
18
|
-
if text_size > 0
|
19
|
-
{:text_size => text_size, :parent => parent}
|
20
|
-
else
|
21
|
-
nil
|
22
|
-
end
|
23
|
-
elsif node.node_name == "p"
|
24
|
-
text_size = calculate_children_text_size(parent, "p")
|
25
|
-
|
26
|
-
if text_size > 0
|
27
|
-
{:text_size => text_size, :parent => parent}
|
28
|
-
else
|
29
|
-
nil
|
30
|
-
end
|
31
|
-
elsif node.node_name == "br"
|
32
|
-
if node.previous.node_name == "text" && node.next.node_name == "text"
|
33
|
-
text_size = 0
|
34
|
-
parent.children.each do |child|
|
35
|
-
text_size += child.text.strip.size if child.node_name == "text"
|
36
|
-
end
|
37
|
-
|
38
|
-
if text_size > 0
|
39
|
-
{:text_size => text_size, :parent => parent}
|
40
|
-
else
|
41
|
-
nil
|
42
|
-
end
|
43
|
-
else
|
44
|
-
nil
|
45
|
-
end
|
46
|
-
else
|
47
|
-
nil
|
48
|
-
end
|
49
|
-
end.compact.uniq
|
50
|
-
|
51
|
-
fragment = candidate_nodes.detect {|n| n[:text_size] > 140}[:parent].inner_html.strip rescue ""
|
52
|
-
# Loofah.fragment(fragment).scrub!(:prune).to_s
|
53
|
-
end
|
54
|
-
|
55
|
-
def summary
|
56
|
-
end
|
57
|
-
|
58
|
-
def calculate_children_text_size(parent, node_type)
|
59
|
-
text_size = 0
|
60
|
-
parent.children.each do |child|
|
61
|
-
if child.node_name == node_type
|
62
|
-
child.children.each {|c| text_size += c.text.strip.size if c.node_name == "text"}
|
63
|
-
end
|
64
|
-
end
|
65
|
-
|
66
|
-
text_size
|
67
|
-
end
|
68
|
-
end
|
data/spec/extractula/dom_extractor_spec.rb
@@ -1,109 +0,0 @@
|
|
1
|
-
require File.dirname(__FILE__) + '/../spec_helper'
|
2
|
-
|
3
|
-
describe "dom extractor" do
|
4
|
-
it "returns an extracted content object with the url set" do
|
5
|
-
result = Extractula::DomExtractor.new.extract("http://pauldix.net", "")
|
6
|
-
result.should be_a Extractula::ExtractedContent
|
7
|
-
result.url.should == "http://pauldix.net"
|
8
|
-
end
|
9
|
-
end
|
10
|
-
|
11
|
-
describe "extraction cases" do
|
12
|
-
describe "extracting from a typepad blog" do
|
13
|
-
before(:all) do
|
14
|
-
@extracted_content = Extractula::DomExtractor.new.extract(
|
15
|
-
"http://www.pauldix.net/2009/10/typhoeus-the-best-ruby-http-client-just-got-better.html",
|
16
|
-
read_test_file("typhoeus-the-best-ruby-http-client-just-got-better.html"))
|
17
|
-
end
|
18
|
-
|
19
|
-
it "extracts the title" do
|
20
|
-
@extracted_content.title.should == "Paul Dix Explains Nothing: Typhoeus, the best Ruby HTTP client just got better"
|
21
|
-
end
|
22
|
-
|
23
|
-
it "extracts the content" do
|
24
|
-
@extracted_content.content.should == "<p>I've been quietly working on Typhoeus for the last few months. With the help of <a href=\"http://metaclass.org/\">Wilson Bilkovich</a> and <a href=\"http://github.com/dbalatero\">David Balatero</a> I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity.</p>\n<p>It's really slick and includes improved stubing support, caching, memoization, and (of course) parallelism. The <a href=\"http://github.com/pauldix/typhoeus/\">Typhoeus readme</a> highlights all of the awesomeness. It should be noted that the old interface of including Typhoeus into classes and defining remote methods has been deprecated. I'll be removing that sometime in the next six months.</p>\n<p>In addition to thanking everyone using the library and everyone contributing, I should also thank my employer kgbweb. If you're a solid Rubyist that likes parsing, crawling, and stuff, or a machine learning guy, or a Solr/Lucene/indexing bad ass, let me know. We need you and we're doing some crazy awesome stuff.</p>"
|
25
|
-
end
|
26
|
-
end
|
27
|
-
|
28
|
-
describe "extracting from wordpress - techcrunch" do
|
29
|
-
before(:all) do
|
30
|
-
@extracted_content = Extractula::DomExtractor.new.extract(
|
31
|
-
"http://www.techcrunch.com/2009/12/29/totlol-youtube/",
|
32
|
-
read_test_file("totlol-youtube.html"))
|
33
|
-
end
|
34
|
-
|
35
|
-
it "extracts the title" do
|
36
|
-
@extracted_content.title.should == "The Sad Tale Of Totlol And How YouTube’s Changing TOS Made It Hard To Make A Buck"
|
37
|
-
end
|
38
|
-
|
39
|
-
it "extracts the content" do
|
40
|
-
@extracted_content.content.should == Nokogiri::HTML(read_test_file("totlol-youtube.html")).css("div.entry").first.inner_html.strip
|
41
|
-
end
|
42
|
-
end
|
43
|
-
|
44
|
-
describe "extracting from wordpress - mashable" do
|
45
|
-
before(:all) do
|
46
|
-
@extracted_content = Extractula::DomExtractor.new.extract(
|
47
|
-
"http://mashable.com/2009/12/29/ustream-new-years-eve/",
|
48
|
-
read_test_file("ustream-new-years-eve.html"))
|
49
|
-
end
|
50
|
-
|
51
|
-
it "extracts the title" do
|
52
|
-
@extracted_content.title.should == "New Years Eve: Watch Live Celebrations on Ustream"
|
53
|
-
end
|
54
|
-
|
55
|
-
it "extracts the content" do
|
56
|
-
@extracted_content.content.should == Nokogiri::HTML(read_test_file("ustream-new-years-eve.html")).css("div.text-content").first.inner_html.strip
|
57
|
-
end
|
58
|
-
|
59
|
-
it "extracts content with a video embed" do
|
60
|
-
extracted = Extractula::DomExtractor.new.extract(
|
61
|
-
"http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/",
|
62
|
-
read_test_file("weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video.html"))
|
63
|
-
extracted.content.should == "<div style=\"float: left; margin-right: 10px; margin-bottom: 4px;\">\n<div class=\"wdt_button\"><iframe scrolling=\"no\" height=\"61\" frameborder=\"0\" width=\"50\" src=\"http://api.tweetmeme.com/widget.js?url=http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/&style=normal&source=mashable&service=bit.ly\"></iframe></div>\n<div class=\"wdt_button\" style=\"height:59px;\">\n<a name=\"fb_share\" type=\"box_count\" share_url=\"http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/\"></a>\n</div>\n</div>\n<p><a href=\"http://mashable.com/wp-content/uploads/2009/12/weather.jpg\"><img src=\"http://mashable.com/wp-content/uploads/2009/12/weather.jpg\" alt=\"\" title=\"weather\" width=\"266\" height=\"184\" class=\"alignright size-full wp-image-174336\"></a>First <a href=\"http://mashable.com/tag/twitter/\">Twitter</a>, then Foursquare, now the Weather Channel? People are broadcasting their wedding proposals all over the place these days. </p>\n<p>That’s right, the other night Weather Channel meteorologist Kim Perez’s beau, police Sgt. Marty Cunningham (best name EVER), asked her to marry him during a routine forecast. Good thing she said yes, otherwise Cunningham’s disposition would have been cloudy with a serious chance of all-out mortification.<br><span id=\"more-174310\"></span></p>\n<p>Social media and viral videos have taken the place of the jumbotron when it comes to marriage proposals, allowing one to sound one’s not-so barbaric yawp over the roofs of the world. In today’s look-at-me society, public proposals are probably the least offensive byproduct. 
Meaning that even the most hardened of cynics can admit that they’re kind of sweet.</p>\n<p>Check out Cunningham’s proposal below (I personally enjoy that the weather map reads “<em>ring</em>ing in the New Year”), and then dive right into our list of even more social media wooers. What’s next? Entire domains dedicated to popping the question?</p>\n<p></p>\n<center>\n<object width=\"425\" height=\"344\"><param name=\"movie\" value=\"http://www.youtube.com/v/0dHTIGas4CA&color1=0x3a3a3a&color2=0x999999&hl=en_US&feature=player_embedded&fs=1\">\n<param name=\"allowFullScreen\" value=\"true\">\n<param name=\"allowScriptAccess\" value=\"always\">\n<embed wmode=\"opaque\" src=\"http://www.youtube.com/v/0dHTIGas4CA&color1=0x3a3a3a&color2=0x999999&hl=en_US&feature=player_embedded&fs=1\" type=\"application/x-shockwave-flash\" allowfullscreen=\"true\" allowscriptaccess=\"always\" width=\"425\" height=\"344\"></embed></object>\n<p></p>\n</center>\n<hr>\n<h2>More Wedding Bells and Whistles</h2>\n<hr>\n<p><a href=\"http://mashable.com/2009/08/28/mashable-marriage-proposal/\">CONGRATS: Mashable Marriage Proposal Live at #SocialGood [Video]</a></p>\n<p><a href=\"http://mashable.com/2009/12/19/foursquare-proposal/\">Man Proposes Marriage via Foursquare Check-In</a></p>\n<p><a href=\"http://mashable.com/2008/03/21/max-emily-twitter-proposal/\">Did We Just Witness a Twitter Marriage Proposal?</a></p>\n<p><a href=\"http://mashable.com/2009/06/30/twitter-marriage/\">Successful Marriage Proposal on Twitter Today: We #blamedrewscancer</a></p>\n<p><a href=\"http://mashable.com/2009/12/01/groom-facebook-update/\">Just Married: Groom Changes Facebook Relationship Status at the Altar [VIDEO]</a></p>"
|
64
|
-
end
|
65
|
-
end
|
66
|
-
|
67
|
-
describe "extracting from alleyinsider" do
|
68
|
-
before(:all) do
|
69
|
-
@extracted_content = Extractula::DomExtractor.new.extract(
|
70
|
-
"http://www.businessinsider.com/10-stunning-web-site-prototype-sketches-2009-12",
|
71
|
-
read_test_file("10-stunning-web-site-prototype-sketches.html"))
|
72
|
-
end
|
73
|
-
|
74
|
-
it "extracts the title" do
|
75
|
-
@extracted_content.title.should == "10 Stunning Web Site Prototype Sketches"
|
76
|
-
end
|
77
|
-
|
78
|
-
it "extracts the content" do
|
79
|
-
@extracted_content.content.should == Nokogiri::HTML(read_test_file("10-stunning-web-site-prototype-sketches.html")).css("div.KonaBody").first.inner_html.strip
|
80
|
-
end
|
81
|
-
end
|
82
|
-
|
83
|
-
describe "extracting from nytimes" do
|
84
|
-
before(:all) do
|
85
|
-
@front_page = Extractula::DomExtractor.new.extract(
|
86
|
-
"http://www.nytimes.com/",
|
87
|
-
read_test_file("nytimes.html"))
|
88
|
-
@story_page = Extractula::DomExtractor.new.extract(
|
89
|
-
"http://www.nytimes.com/2009/12/31/world/asia/31history.html?_r=1&hp",
|
90
|
-
read_test_file("nytimes_story.html"))
|
91
|
-
end
|
92
|
-
|
93
|
-
it "extracts the title" do
|
94
|
-
@front_page.title.should == "The New York Times - Breaking News, World News & Multimedia"
|
95
|
-
end
|
96
|
-
|
97
|
-
it "extracts the content" do
|
98
|
-
@front_page.content.should == Nokogiri::HTML(read_test_file("nytimes.html")).css("div.story").first.inner_html.strip
|
99
|
-
end
|
100
|
-
|
101
|
-
it "extracts a story title" do
|
102
|
-
@story_page.title.should == "Army Historians Document Early Missteps in Afghanistan - NYTimes.com"
|
103
|
-
end
|
104
|
-
|
105
|
-
it "extracts the story content" do
|
106
|
-
@story_page.content.should == Nokogiri::HTML(read_test_file("nytimes_story.html")).css("nyt_text").first.inner_html.strip
|
107
|
-
end
|
108
|
-
end
|
109
|
-
end
|