extractula 0.0.1 → 0.0.2

@@ -0,0 +1,97 @@
1
+ h1. Extractula
2
+
3
+ "http://github.com/pauldix/extractula":http://github.com/pauldix/extractula
4
+
5
+ h2. Summary
6
+
7
+ Extracts content such as the title, summary, and images from web pages the way Dracula extracts blood: with care and finesse.
8
+
9
+ h2. Description
10
+
11
+ Extractula attempts to extract the core content from a web page. For a news article or blog post this would be the content of the article itself; for a GitHub project it would be the main README file. The library also lets you write your own custom extractors, which is useful for popular sites you want to support with site-specific rules.
12
+
13
+ h2. Installation
14
+
15
+ <pre>
16
+ gem install extractula --source http://gemcutter.org
17
+ </pre>
18
+
19
+ h2. Use
20
+
21
+ <pre>
22
+ require 'extractula'
23
+ url = "http://pauldix.net" # the url the html came from
+ some_html = "..." # get some html to extract, yo!
24
+
25
+ extracted_content = Extractula.extract(url, some_html)
26
+ extracted_content.title # pulled from the page
27
+ extracted_content.url # what you passed in
28
+ extracted_content.content # the main content body (article, blog post, etc)
29
+ extracted_content.summary # an automatically generated plain text summary of the content
30
+ extracted_content.image_urls # the urls for images that appear in the content
31
+ extracted_content.video_embed # the embed code if a video is embedded in the content
32
+
33
+ Extractula.add_extractor(SomeClass) # so you can add a custom extractor
34
+ </pre>
35
+
36
+ h3. Custom Extractors
37
+
38
+ The "Use" section showed adding a custom extractor. This should be a class that at a minimum implements the following methods.
39
+
40
+ <pre>
41
+ class MyCustomExtractor < Extractula::Extractor
42
+ def self.can_extract?(url, html)
43
+ end
44
+
45
+ def extract
46
+ # should return an Extractula::ExtractedContent object
47
+ end
48
+ end
49
+ </pre>
50
+
51
+ Notice that can_extract? is a class method while extract is an instance method that takes no arguments; the url and html passed to Extractula.extract are available inside the extractor via its url and html readers. extract should return an ExtractedContent object.
52
+
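+ Subclassing Extractula::Extractor registers the extractor automatically (via an inherited hook) and gives you a small declarative DSL, so simple extractors often need no methods at all. A minimal sketch, where the domain and CSS selector are only illustrative:
+
+ <pre>
+ class MyBlogExtractor < Extractula::Extractor
+   domain 'myblog'               # matched against the parsed url's domain
+   content_path 'div.entry-body' # node whose text becomes the content
+ end
+ </pre>
+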
53
+ h3. ExtractedContent
54
+
55
+ The ExtractedContent object holds the results of an extraction. It additionally has methods to automatically generate a summary, image_urls, and video_embed code from the content. If you implement a custom extractor and want to provide the summary, image_urls, and video_embed, simply pass those values into the constructor for ExtractedContent. Here are some examples:
56
+
57
+ <pre>
58
+ extracted_content = ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...")
59
+ extracted_content.summary # auto-generated from content
60
+ extracted_content.image_urls # auto-generated from content
61
+ extracted_content.video_embed # auto-generated from content
62
+
63
+ extracted_content = ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...",
64
+ :summary => "a summary", :image_urls => ["foo.jpg"], :video_embed => "blah")
65
+ extracted_content.summary # "a summary"
66
+ extracted_content.image_urls # ["foo.jpg"]
67
+ extracted_content.video_embed # "blah"
68
+ </pre>
69
+
70
+ Any combination of these values can be passed to the ExtractedContent constructor: values you pass in are kept as-is, and the rest are auto-generated from the content.
71
+
72
+ h2. LICENSE
73
+
74
+ (The MIT License)
75
+
76
+ Copyright (c) 2009:
77
+
78
+ "Paul Dix":http://pauldix.net
79
+
80
+ Permission is hereby granted, free of charge, to any person obtaining
81
+ a copy of this software and associated documentation files (the
82
+ 'Software'), to deal in the Software without restriction, including
83
+ without limitation the rights to use, copy, modify, merge, publish,
84
+ distribute, sublicense, and/or sell copies of the Software, and to
85
+ permit persons to whom the Software is furnished to do so, subject to
86
+ the following conditions:
87
+
88
+ The above copyright notice and this permission notice shall be
89
+ included in all copies or substantial portions of the Software.
90
+
91
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
92
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
93
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
94
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
95
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
96
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
97
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -3,10 +3,13 @@ $LOAD_PATH.unshift(File.dirname(__FILE__)) unless $LOAD_PATH.include?(File.dirna
3
3
  module Extractula; end
4
4
 
5
5
  require 'nokogiri'
6
+ require 'domainatrix'
6
7
  require 'extractula/extracted_content'
7
- require 'extractula/dom_extractor'
8
+ require 'extractula/extractor'
8
9
 
9
10
  module Extractula
11
+ VERSION = "0.0.2"
12
+
10
13
  @extractors = []
11
14
 
12
15
  def self.add_extractor(extractor_class)
@@ -18,7 +21,16 @@ module Extractula
18
21
  end
19
22
 
20
23
  def self.extract(url, html)
21
- extractor = @extractors.detect {|e| e.can_extract? url, html} || DomExtractor
22
- extractor.new.extract(url, html)
24
+ parsed_url = Domainatrix.parse(url)
25
+ parsed_html = Nokogiri::HTML(html)
26
+ extractor = @extractors.detect {|e| e.can_extract? parsed_url, parsed_html} || Extractor
27
+ extractor.new(parsed_url, parsed_html).extract
28
+ end
29
+
30
+ def self.custom_extractor(config = {})
31
+ klass = Class.new(Extractula::Extractor)
32
+ klass.send(:include, Extractula::OEmbed) if config.delete(:oembed)
33
+ config.each { |option, args| klass.__send__(option, *args) }
34
+ klass
23
35
  end
24
- end
36
+ end
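The new Extractula.custom_extractor helper shown above builds a subclass of Extractula::Extractor from a hash: each key is sent to the generated class as a DSL call (so the subclass is registered automatically), and :oembed => true mixes in Extractula::OEmbed. A rough sketch of how it could be used, where the domain and selector are hypothetical:

<pre>
# Equivalent to subclassing Extractula::Extractor and calling the DSL by hand.
Extractula.custom_extractor(
  :domain       => 'myblog',
  :content_path => 'div.entry-body'
)
</pre>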
@@ -0,0 +1,5 @@
1
+ require File.dirname(__FILE__) + '/oembed'
2
+
3
+ Dir.glob(File.dirname(__FILE__) + '/custom_extractors/*.rb').each do |lib|
4
+ require File.expand_path(lib).chomp('.rb')
5
+ end
@@ -0,0 +1,9 @@
1
+ # This is mostly a proof-of-concept.
2
+
3
+ module Extractula
4
+ class DinosaurComics < Extractula::Extractor
5
+ domain 'qwantz'
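+ # the second argument to content_path reads the img's title attribute
+ # rather than its text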
6
+ content_path 'img.comic', 'title'
7
+ image_urls_path 'img.comic'
8
+ end
9
+ end
@@ -0,0 +1,8 @@
1
+ module Extractula
2
+ class Flickr < Extractula::Extractor
3
+ include Extractula::OEmbed
4
+ domain 'flickr'
5
+ content_path 'div.photoDescription'
6
+ oembed_endpoint 'http://www.flickr.com/services/oembed/'
7
+ end
8
+ end
@@ -0,0 +1,8 @@
1
+ module Extractula
2
+ class YouTube < Extractula::Extractor
3
+ include Extractula::OEmbed
4
+ domain 'youtube'
5
+ content_path '.description'
6
+ oembed_endpoint 'http://www.youtube.com/oembed'
7
+ end
8
+ end
@@ -1,5 +1,5 @@
1
1
  class Extractula::ExtractedContent
2
- attr_reader :url, :title, :content
2
+ attr_reader :url, :title, :content, :summary, :image_urls, :video_embed
3
3
 
4
4
  def initialize(attributes = {})
5
5
  attributes.each_pair {|k, v| instance_variable_set("@#{k}", v)}
@@ -0,0 +1,151 @@
1
+ # Abstract (more or less) extractor class from which custom extractor
2
+ # classes should descend. Subclasses of Extractula::Extractor will be
3
+ # automatically added to the Extractula module.
4
+
5
+ class Extractula::Extractor
6
+ def self.inherited subclass
7
+ Extractula.add_extractor subclass
8
+ end
9
+
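+ # Declares the domain this extractor handles. It is compared against the
+ # Domainatrix-parsed url's domain, so it's the bare name ('flickr'), not
+ # 'flickr.com'.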
10
+ def self.domain domain
11
+ @extractable_domain = domain
12
+ end
13
+
14
+ def self.can_extract? url, html
15
+ @extractable_domain ? @extractable_domain == url.domain : false
16
+ end
17
+
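+ # For each field, define class-level <field>_path / <field>_attr macros that
+ # subclasses use to declare a selector (and, optionally, an attribute to read
+ # instead of the node text), plus instance-level readers that delegate back
+ # to the class.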
18
+ %w{title content summary image_urls video_embed }.each do |field|
19
+ class_eval <<-EOS
20
+ def self.#{field}_path(path = nil, attrib = nil)
21
+ if path
22
+ @#{field}_path = path
23
+ @#{field}_attr = attrib || :text
24
+ end
25
+ @#{field}_path
26
+ end
27
+
28
+ def self.#{field}_attr(attrib = nil)
29
+ @#{field}_attr = attrib if attrib
30
+ @#{field}_attr
31
+ end
32
+
33
+ def #{field}_path
34
+ self.class.#{field}_path
35
+ end
36
+
37
+ def #{field}_attr
38
+ self.class.#{field}_attr
39
+ end
40
+ EOS
41
+ end
42
+
43
+ attr_reader :url, :html
44
+
45
+ def initialize url, html
46
+ @url = url.is_a?(Domainatrix::Url) ? url : Domainatrix.parse(url)
47
+ @html = html.is_a?(Nokogiri::HTML::Document) ? html : Nokogiri::HTML(html)
48
+ end
49
+
50
+ def extract
51
+ Extractula::ExtractedContent.new({
52
+ :url => url.url,
53
+ :title => title,
54
+ :content => content,
55
+ :summary => summary,
56
+ :image_urls => image_urls,
57
+ :video_embed => video_embed
58
+ })
59
+ end
60
+
61
+ def title
62
+ content_at(title_path, title_attr) || content_at("//title")
63
+ end
64
+
65
+ def content
66
+ content_at(content_path, content_attr) || extract_content
67
+ end
68
+
69
+ def summary
70
+ content_at(summary_path, summary_attr)
71
+ end
72
+
73
+ def image_urls
74
+ if image_urls_path
75
+ html.search(image_urls_path).collect { |img| img['src'].strip }
76
+ end
77
+ end
78
+
79
+ def video_embed
80
+ if video_embed_path
81
+ html.search(video_embed_path).collect { |embed| embed.to_html }.first
82
+ end
83
+ end
84
+
85
+ private
86
+
87
+ def content_at(path, attrib = :text)
88
+ if path
89
+ if node = html.at(path)
90
+ attrib == :text ? node.text.strip : node[attrib].strip
91
+ end
92
+ end
93
+ end
94
+
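+ # Heuristic fallback used when no content_path is configured or it matches
+ # nothing: walk every div, p, and br node, estimate how much plain text its
+ # parent holds (text inside the parent's div or p children, or the text
+ # nodes surrounding a br), and return the inner_html of the first parent
+ # whose total exceeds 140 characters.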
95
+ def extract_content
96
+ candidate_nodes = html.search("//div|//p|//br").collect do |node|
97
+ parent = node.parent
98
+ if node.node_name == 'div'
99
+ text_size = calculate_children_text_size(parent, "div")
100
+
101
+ if text_size > 0
102
+ {:text_size => text_size, :parent => parent}
103
+ else
104
+ nil
105
+ end
106
+ elsif node.node_name == "p"
107
+ text_size = calculate_children_text_size(parent, "p")
108
+
109
+ if text_size > 0
110
+ {:text_size => text_size, :parent => parent}
111
+ else
112
+ nil
113
+ end
114
+ elsif node.node_name == "br"
115
+ begin
116
+ if node.previous.node_name == "text" && node.next.node_name == "text"
117
+ text_size = 0
118
+ parent.children.each do |child|
119
+ text_size += child.text.strip.size if child.node_name == "text"
120
+ end
121
+
122
+ if text_size > 0
123
+ {:text_size => text_size, :parent => parent}
124
+ else
125
+ nil
126
+ end
127
+ else
128
+ nil
129
+ end
130
+ rescue => e
131
+ nil
132
+ end
133
+ else
134
+ nil
135
+ end
136
+ end.compact.uniq
137
+
138
+ fragment = candidate_nodes.detect {|n| n[:text_size] > 140}[:parent].inner_html.strip rescue ""
139
+ end
140
+
141
+ def calculate_children_text_size(parent, node_type)
142
+ text_size = 0
143
+ parent.children.each do |child|
144
+ if child.node_name == node_type
145
+ child.children.each {|c| text_size += c.text.strip.size if c.node_name == "text"}
146
+ end
147
+ end
148
+
149
+ text_size
150
+ end
151
+ end
@@ -0,0 +1,124 @@
1
+ require 'typhoeus'
2
+ require 'json'
3
+
4
+ module Extractula
5
+ module OEmbed
6
+
7
+ def self.included(base)
8
+ base.class_eval {
9
+ extend Extractula::OEmbed::ClassMethods
10
+ include Extractula::OEmbed::InstanceMethods
11
+ }
12
+ end
13
+
14
+ def self.request(request)
15
+ http_response = Typhoeus::Request.get(request)
16
+ if http_response.code == 200
17
+ Extractula::OEmbed::Response.new(http_response.body)
18
+ else
19
+ # do something
20
+ end
21
+ end
22
+
23
+ module ClassMethods
24
+ def oembed_endpoint(url = nil)
25
+ if url
26
+ @oembed_endpoint = url
27
+ if @oembed_endpoint.match(/\.(xml|json)$/)
28
+ @oembed_format_param_required = false
29
+ @oembed_endpoint.sub!(/\.xml$/, '.json') if $1 == 'xml'
30
+ else
31
+ @oembed_format_param_required = true
32
+ end
33
+ end
34
+ @oembed_endpoint
35
+ end
36
+
37
+ def oembed_max_width(width = nil)
38
+ @oembed_max_width = width if width
39
+ @oembed_max_width
40
+ end
41
+
42
+ def oembed_max_height(height = nil)
43
+ @oembed_max_height = height if height
44
+ @oembed_max_height
45
+ end
46
+
47
+ def oembed_format_param_required?
48
+ @oembed_format_param_required
49
+ end
50
+ end
51
+
52
+ module InstanceMethods
53
+ def initialize(*args)
54
+ super
55
+ @oembed = Extractula::OEmbed.request(oembed_request)
56
+ end
57
+
58
+ def oembed_endpoint
59
+ self.class.oembed_endpoint
60
+ end
61
+
62
+ def oembed_max_width
63
+ self.class.oembed_max_width
64
+ end
65
+
66
+ def oembed_max_height
67
+ self.class.oembed_max_height
68
+ end
69
+
70
+ def oembed_format_param_required?
71
+ self.class.oembed_format_param_required?
72
+ end
73
+
74
+ def oembed
75
+ @oembed
76
+ end
77
+
78
+ def oembed_request
79
+ request = "#{oembed_endpoint}?url=#{url.url}"
80
+ request += "&format=json" if oembed_format_param_required?
81
+ request += "&maxwidth=#{oembed_max_width}" if oembed_max_width
82
+ request += "&maxheight=#{oembed_max_height}" if oembed_max_height
83
+ request
84
+ end
85
+
86
+ def title
87
+ oembed.title
88
+ end
89
+
90
+ def image_urls
91
+ [ oembed.url ] if oembed.type == 'photo'
92
+ end
93
+
94
+ def video_embed
95
+ oembed.html
96
+ end
97
+ end
98
+
99
+ class Response
100
+
101
+ FIELDS = %w{ type version title author_name author_url
102
+ provider_name provider_url cache_age thumbnail_url
103
+ thumbnail_width thumbnail_height }
104
+
105
+ FIELDS.each { |field| attr_reader field.to_sym }
106
+ attr_reader :width, :height, :url, :html
107
+
108
+ def initialize response
109
+ @doc = ::JSON.parse(response)
110
+ FIELDS.each { |field| instance_variable_set "@#{field}", @doc[field] }
111
+ unless @type == 'link'
112
+ @width = @doc['width']
113
+ @height = @doc['height']
114
+ if @type == 'photo'
115
+ @url = @doc['url']
116
+ else
117
+ @html = @doc['html']
118
+ end
119
+ end
120
+ end
121
+
122
+ end
123
+ end
124
+ end
@@ -1,3 +1,4 @@
1
+ # coding: utf-8
1
2
  require File.dirname(__FILE__) + '/../spec_helper'
2
3
 
3
4
  describe "extracted content" do
@@ -45,4 +46,11 @@ describe "extracted content" do
45
46
  extracted.video_embed.should == "<object width=\"425\" height=\"344\"><param name=\"movie\" value=\"http://www.youtube.com/v/0dHTIGas4CA&amp;color1=0x3a3a3a&amp;color2=0x999999&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1\">\n<param name=\"allowFullScreen\" value=\"true\">\n<param name=\"allowScriptAccess\" value=\"always\">\n<embed wmode=\"opaque\" src=\"http://www.youtube.com/v/0dHTIGas4CA&amp;color1=0x3a3a3a&amp;color2=0x999999&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1\" type=\"application/x-shockwave-flash\" allowfullscreen=\"true\" allowscriptaccess=\"always\" width=\"425\" height=\"344\"></embed></object>"
46
47
  end
47
48
  end
49
+
50
+ describe "some regressions" do
51
+ it "doesn't error with undefined method 'node_name' for nil:NilClass when looking at <br /> elements" do
52
+ extracted = Extractula::Extractor.new("http://viceland.com/caprica/", read_test_file("node-name-error.html")).extract
53
+ extracted.title.should == "Syfy + Motherboard.tv Caprica Screenings Contest"
54
+ end
55
+ end
48
56
  end
@@ -2,12 +2,12 @@ require File.dirname(__FILE__) + '/spec_helper'
2
2
 
3
3
  describe "extractula" do
4
4
  it "can add custom extractors" do
5
- custom_extractor = Class.new do
5
+ custom_extractor = Class.new(Extractula::Extractor) do
6
6
  def self.can_extract? url, html
7
7
  true
8
8
  end
9
9
 
10
- def extract url, html
10
+ def extract
11
11
  Extractula::ExtractedContent.new :url => "custom extractor url", :summary => "my custom extractor"
12
12
  end
13
13
  end
@@ -20,12 +20,12 @@ describe "extractula" do
20
20
  end
21
21
 
22
22
  it "skips custom extractors that can't extract the passed url and html" do
23
- custom_extractor = Class.new do
23
+ custom_extractor = Class.new(Extractula::Extractor) do
24
24
  def self.can_extract? url, html
25
25
  false
26
26
  end
27
27
 
28
- def extract url, html
28
+ def extract
29
29
  Extractula::ExtractedContent.new :url => "this url", :summary => "this summary"
30
30
  end
31
31
  end
@@ -42,4 +42,4 @@ describe "extractula" do
42
42
  result.should be_a Extractula::ExtractedContent
43
43
  result.url.should == "http://pauldix.net"
44
44
  end
45
- end
45
+ end
@@ -8,6 +8,7 @@ path = File.expand_path(File.dirname(__FILE__) + "/../lib/")
8
8
  $LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)
9
9
 
10
10
  require "lib/extractula"
11
+ require "lib/extractula/custom_extractors"
11
12
 
12
13
  def read_test_file(file_name)
13
14
  File.read("#{File.dirname(__FILE__)}/test-files/#{file_name}")
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: extractula
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Paul Dix
@@ -32,14 +32,18 @@ extra_rdoc_files: []
32
32
 
33
33
  files:
34
34
  - lib/extractula.rb
35
+ - lib/extractula/custom_extractors.rb
35
36
  - lib/extractula/extracted_content.rb
36
- - lib/extractula/dom_extractor.rb
37
+ - lib/extractula/extractor.rb
38
+ - lib/extractula/oembed.rb
39
+ - lib/extractula/custom_extractors/dinosaur_comics.rb
40
+ - lib/extractula/custom_extractors/flickr.rb
41
+ - lib/extractula/custom_extractors/you_tube.rb
37
42
  - README.textile
38
43
  - spec/spec.opts
39
44
  - spec/spec_helper.rb
40
45
  - spec/extractula_spec.rb
41
46
  - spec/extractula/extracted_content_spec.rb
42
- - spec/extractula/dom_extractor_spec.rb
43
47
  - spec/test-files/10-stunning-web-site-prototype-sketches.html
44
48
  - spec/test-files/totlol-youtube.html
45
49
  - spec/test-files/typhoeus-the-best-ruby-http-client-just-got-better.html
@@ -1,68 +0,0 @@
1
- # a basic dom based extractor. it's a generic catch all
2
- class Extractula::DomExtractor
3
- def extract url, html
4
- @doc = Nokogiri::HTML(html)
5
- extracted = Extractula::ExtractedContent.new :url => url, :title => title, :content => content
6
- end
7
-
8
- def title
9
- @title ||= @doc.search("//title").first.text.strip rescue nil
10
- end
11
-
12
- def content
13
- candidate_nodes = @doc.search("//div|//p|//br").collect do |node|
14
- parent = node.parent
15
- if node.node_name == 'div'
16
- text_size = calculate_children_text_size(parent, "div")
17
-
18
- if text_size > 0
19
- {:text_size => text_size, :parent => parent}
20
- else
21
- nil
22
- end
23
- elsif node.node_name == "p"
24
- text_size = calculate_children_text_size(parent, "p")
25
-
26
- if text_size > 0
27
- {:text_size => text_size, :parent => parent}
28
- else
29
- nil
30
- end
31
- elsif node.node_name == "br"
32
- if node.previous.node_name == "text" && node.next.node_name == "text"
33
- text_size = 0
34
- parent.children.each do |child|
35
- text_size += child.text.strip.size if child.node_name == "text"
36
- end
37
-
38
- if text_size > 0
39
- {:text_size => text_size, :parent => parent}
40
- else
41
- nil
42
- end
43
- else
44
- nil
45
- end
46
- else
47
- nil
48
- end
49
- end.compact.uniq
50
-
51
- fragment = candidate_nodes.detect {|n| n[:text_size] > 140}[:parent].inner_html.strip rescue ""
52
- # Loofah.fragment(fragment).scrub!(:prune).to_s
53
- end
54
-
55
- def summary
56
- end
57
-
58
- def calculate_children_text_size(parent, node_type)
59
- text_size = 0
60
- parent.children.each do |child|
61
- if child.node_name == node_type
62
- child.children.each {|c| text_size += c.text.strip.size if c.node_name == "text"}
63
- end
64
- end
65
-
66
- text_size
67
- end
68
- end
@@ -1,109 +0,0 @@
1
- require File.dirname(__FILE__) + '/../spec_helper'
2
-
3
- describe "dom extractor" do
4
- it "returns an extracted content object with the url set" do
5
- result = Extractula::DomExtractor.new.extract("http://pauldix.net", "")
6
- result.should be_a Extractula::ExtractedContent
7
- result.url.should == "http://pauldix.net"
8
- end
9
- end
10
-
11
- describe "extraction cases" do
12
- describe "extracting from a typepad blog" do
13
- before(:all) do
14
- @extracted_content = Extractula::DomExtractor.new.extract(
15
- "http://www.pauldix.net/2009/10/typhoeus-the-best-ruby-http-client-just-got-better.html",
16
- read_test_file("typhoeus-the-best-ruby-http-client-just-got-better.html"))
17
- end
18
-
19
- it "extracts the title" do
20
- @extracted_content.title.should == "Paul Dix Explains Nothing: Typhoeus, the best Ruby HTTP client just got better"
21
- end
22
-
23
- it "extracts the content" do
24
- @extracted_content.content.should == "<p>I've been quietly working on Typhoeus for the last few months. With the help of <a href=\"http://metaclass.org/\">Wilson Bilkovich</a> and <a href=\"http://github.com/dbalatero\">David Balatero</a> I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity.</p>\n<p>It's really slick and includes improved stubing support, caching, memoization, and (of course) parallelism. The <a href=\"http://github.com/pauldix/typhoeus/\">Typhoeus readme</a> highlights all of the awesomeness. It should be noted that the old interface of including Typhoeus into classes and defining remote methods has been deprecated. I'll be removing that sometime in the next six months.</p>\n<p>In addition to thanking everyone using the library and everyone contributing, I should also thank my employer kgbweb. If you're a solid Rubyist that likes parsing, crawling, and stuff, or a machine learning guy, or a Solr/Lucene/indexing bad ass, let me know. We need you and we're doing some crazy awesome stuff.</p>"
25
- end
26
- end
27
-
28
- describe "extracting from wordpress - techcrunch" do
29
- before(:all) do
30
- @extracted_content = Extractula::DomExtractor.new.extract(
31
- "http://www.techcrunch.com/2009/12/29/totlol-youtube/",
32
- read_test_file("totlol-youtube.html"))
33
- end
34
-
35
- it "extracts the title" do
36
- @extracted_content.title.should == "The Sad Tale Of Totlol And How YouTube’s Changing TOS Made It Hard To Make A Buck"
37
- end
38
-
39
- it "extracts the content" do
40
- @extracted_content.content.should == Nokogiri::HTML(read_test_file("totlol-youtube.html")).css("div.entry").first.inner_html.strip
41
- end
42
- end
43
-
44
- describe "extracting from wordpress - mashable" do
45
- before(:all) do
46
- @extracted_content = Extractula::DomExtractor.new.extract(
47
- "http://mashable.com/2009/12/29/ustream-new-years-eve/",
48
- read_test_file("ustream-new-years-eve.html"))
49
- end
50
-
51
- it "extracts the title" do
52
- @extracted_content.title.should == "New Years Eve: Watch Live Celebrations on Ustream"
53
- end
54
-
55
- it "extracts the content" do
56
- @extracted_content.content.should == Nokogiri::HTML(read_test_file("ustream-new-years-eve.html")).css("div.text-content").first.inner_html.strip
57
- end
58
-
59
- it "extracts content with a video embed" do
60
- extracted = Extractula::DomExtractor.new.extract(
61
- "http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/",
62
- read_test_file("weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video.html"))
63
- extracted.content.should == "<div style=\"float: left; margin-right: 10px; margin-bottom: 4px;\">\n<div class=\"wdt_button\"><iframe scrolling=\"no\" height=\"61\" frameborder=\"0\" width=\"50\" src=\"http://api.tweetmeme.com/widget.js?url=http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/&amp;style=normal&amp;source=mashable&amp;service=bit.ly\"></iframe></div>\n<div class=\"wdt_button\" style=\"height:59px;\">\n<a name=\"fb_share\" type=\"box_count\" share_url=\"http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/\"></a>\n</div>\n</div>\n<p><a href=\"http://mashable.com/wp-content/uploads/2009/12/weather.jpg\"><img src=\"http://mashable.com/wp-content/uploads/2009/12/weather.jpg\" alt=\"\" title=\"weather\" width=\"266\" height=\"184\" class=\"alignright size-full wp-image-174336\"></a>First <a href=\"http://mashable.com/tag/twitter/\">Twitter</a>, then Foursquare, now the Weather Channel? People are broadcasting their wedding proposals all over the place these days. </p>\n<p>That’s right, the other night Weather Channel meteorologist Kim Perez’s beau, police Sgt. Marty Cunningham (best name EVER), asked her to marry him during a routine forecast. Good thing she said yes, otherwise Cunningham’s disposition would have been cloudy with a serious chance of all-out mortification.<br><span id=\"more-174310\"></span></p>\n<p>Social media and viral videos have taken the place of the jumbotron when it comes to marriage proposals, allowing one to sound one’s not-so barbaric yawp over the roofs of the world. In today’s look-at-me society, public proposals are probably the least offensive byproduct. Meaning that even the most hardened of cynics can admit that they’re kind of sweet.</p>\n<p>Check out Cunningham’s proposal below (I personally enjoy that the weather map reads “<em>ring</em>ing in the New Year”), and then dive right into our list of even more social media wooers. What’s next? Entire domains dedicated to popping the question?</p>\n<p></p>\n<center>\n<object width=\"425\" height=\"344\"><param name=\"movie\" value=\"http://www.youtube.com/v/0dHTIGas4CA&amp;color1=0x3a3a3a&amp;color2=0x999999&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1\">\n<param name=\"allowFullScreen\" value=\"true\">\n<param name=\"allowScriptAccess\" value=\"always\">\n<embed wmode=\"opaque\" src=\"http://www.youtube.com/v/0dHTIGas4CA&amp;color1=0x3a3a3a&amp;color2=0x999999&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1\" type=\"application/x-shockwave-flash\" allowfullscreen=\"true\" allowscriptaccess=\"always\" width=\"425\" height=\"344\"></embed></object>\n<p></p>\n</center>\n<hr>\n<h2>More Wedding Bells and Whistles</h2>\n<hr>\n<p><a href=\"http://mashable.com/2009/08/28/mashable-marriage-proposal/\">CONGRATS: Mashable Marriage Proposal Live at #SocialGood [Video]</a></p>\n<p><a href=\"http://mashable.com/2009/12/19/foursquare-proposal/\">Man Proposes Marriage via Foursquare Check-In</a></p>\n<p><a href=\"http://mashable.com/2008/03/21/max-emily-twitter-proposal/\">Did We Just Witness a Twitter Marriage Proposal?</a></p>\n<p><a href=\"http://mashable.com/2009/06/30/twitter-marriage/\">Successful Marriage Proposal on Twitter Today: We #blamedrewscancer</a></p>\n<p><a href=\"http://mashable.com/2009/12/01/groom-facebook-update/\">Just Married: Groom Changes Facebook Relationship Status at the Altar [VIDEO]</a></p>"
64
- end
65
- end
66
-
67
- describe "extracting from alleyinsider" do
68
- before(:all) do
69
- @extracted_content = Extractula::DomExtractor.new.extract(
70
- "http://www.businessinsider.com/10-stunning-web-site-prototype-sketches-2009-12",
71
- read_test_file("10-stunning-web-site-prototype-sketches.html"))
72
- end
73
-
74
- it "extracts the title" do
75
- @extracted_content.title.should == "10 Stunning Web Site Prototype Sketches"
76
- end
77
-
78
- it "extracts the content" do
79
- @extracted_content.content.should == Nokogiri::HTML(read_test_file("10-stunning-web-site-prototype-sketches.html")).css("div.KonaBody").first.inner_html.strip
80
- end
81
- end
82
-
83
- describe "extracting from nytimes" do
84
- before(:all) do
85
- @front_page = Extractula::DomExtractor.new.extract(
86
- "http://www.nytimes.com/",
87
- read_test_file("nytimes.html"))
88
- @story_page = Extractula::DomExtractor.new.extract(
89
- "http://www.nytimes.com/2009/12/31/world/asia/31history.html?_r=1&hp",
90
- read_test_file("nytimes_story.html"))
91
- end
92
-
93
- it "extracts the title" do
94
- @front_page.title.should == "The New York Times - Breaking News, World News & Multimedia"
95
- end
96
-
97
- it "extracts the content" do
98
- @front_page.content.should == Nokogiri::HTML(read_test_file("nytimes.html")).css("div.story").first.inner_html.strip
99
- end
100
-
101
- it "extracts a story title" do
102
- @story_page.title.should == "Army Historians Document Early Missteps in Afghanistan - NYTimes.com"
103
- end
104
-
105
- it "extracts the story content" do
106
- @story_page.content.should == Nokogiri::HTML(read_test_file("nytimes_story.html")).css("nyt_text").first.inner_html.strip
107
- end
108
- end
109
- end