scrapifier 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: d645984640446b98bdc5bdf71972c991b361b75e
-   data.tar.gz: 9bd772bf8ab26ab4dda602fa69eee8c12ae45e39
+   metadata.gz: aa0d714c01fc436bf4f90f4baaf50e98bcb9197a
+   data.tar.gz: b9337a1a690c0a7f0c57b237f327eec80d9bbdda
  SHA512:
-   metadata.gz: ff5dd829fd8e41af883fccd65ade03bb44d968f71a5457a0f2ceeb3afb0e389f71b44c0b559b352793293f4675cd9556c6e0bbab9799b637f1c4e6e7bdbb61ee
-   data.tar.gz: 3b974c372000f5d4f795f32074bd4da64e4fd910e8623f8446169c18b6880b1967aee105bf5df4e9de36fd054697c567727d4a8eee48def87f51b93b4117f556
+   metadata.gz: 75baa6ed1838759bd6c0ebc8a453120b6c2cfe8f484f4396a5401a1d1acd66eebb0377423750a647ca9417e0fb8f4677ba688d40a4bc7e642cee653a6a76131a
+   data.tar.gz: a77ab52807dbcf3a9846226641a00b210324e2d64db0d18324597be380e576faea9c65eb97b00ea0a98e9de6359329f3b295f168fd001b59a4b9b8390ef3badc
data/.travis.yml ADDED
@@ -0,0 +1,4 @@
+ language: ruby
+ rvm:
+   - 2.0.0
+   - 1.9.3
data/README.md CHANGED
@@ -1,83 +1,88 @@
- # Scrapifier
-
- It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
-
- ## Installation
-
- Compatible with Ruby 1.9.3+
-
- Add this line to your application's Gemfile:
-
-     gem 'scrapifier'
-
- And then execute:
-
-     $ bundle
-
- Or install it yourself as:
-
-     $ gem install scrapifier
-
- ## Usage
-
- The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
-
- #### Default usage.
-
- ``` ruby
- 'Wow! What an awesome site: http://adtangerine.com!'.scrapify
- #=> {
- # title: "AdTangerine | Advertising Platform for Social Media",
- # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
- # images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
- # uri: "http://adtangerine.com"
- # }
- ```
-
- #### Allow only certain image types.
-
- ``` ruby
- 'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
- #=> {
- # title: "AdTangerine | Advertising Platform for Social Media",
- # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
- # images: ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
- # uri: "http://adtangerine.com"
- # }
-
- 'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
- #=> {
- # title: "AdTangerine | Advertising Platform for Social Media",
- # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
- # images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
- # uri: "http://adtangerine.com"
- # }
- ```
-
- #### Choose which URI you want it to be scraped.
-
- ``` ruby
- 'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
- #=> {
- # title: "TwitFlink | Find a link!",
- # description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
- # images: ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
- # uri: "http://www.twitflink.com"
- # }
-
- 'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
- #=> {
- # title: "AdTangerine | Advertising Platform for Social Media",
- # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
- # images: ["http://adtangerine.com/assets/foobar.gif"],
- # uri: "http://adtangerine.com"
- # }
- ```
-
- ## Contributing
-
- 1. Fork it
- 2. Create your feature branch (`git checkout -b my-new-feature`)
- 3. Commit your changes (`git commit -am 'Add some feature'`)
- 4. Push to the branch (`git push origin my-new-feature`)
- 5. Create new Pull Request
+ # Scrapifier
+
+ [![Build Status](https://travis-ci.org/tiagopog/scrapifier.svg?branch=master)](https://travis-ci.org/tiagopog/scrapifier)
+ [![Code Climate](https://codeclimate.com/github/tiagopog/scrapifier.png)](https://codeclimate.com/github/tiagopog/scrapifier)
+ [![Dependency Status](https://gemnasium.com/tiagopog/scrapifier.svg)](https://gemnasium.com/tiagopog/scrapifier)
+ [![Gem Version](https://badge.fury.io/rb/scrapifier.svg)](http://badge.fury.io/rb/scrapifier)
+
+ It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
+
+ ## Installation
+
+ Compatible with Ruby 1.9.3+
+
+ Add this line to your application's Gemfile:
+
+     gem 'scrapifier'
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install scrapifier
+
+ ## Usage
+
+ The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
+
+ #### Default usage.
+
+ ``` ruby
+ 'Wow! What an awesome site: http://adtangerine.com!'.scrapify
+ #=> {
+ # title: "AdTangerine | Advertising Platform for Social Media",
+ # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+ # images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
+ # uri: "http://adtangerine.com"
+ # }
+ ```
+
+ #### Allow only certain image types.
+
+ ``` ruby
+ 'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
+ #=> {
+ # title: "AdTangerine | Advertising Platform for Social Media",
+ # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+ # images: ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
+ # uri: "http://adtangerine.com"
+ # }
+
+ 'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
+ #=> {
+ # title: "AdTangerine | Advertising Platform for Social Media",
+ # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+ # images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
+ # uri: "http://adtangerine.com"
+ # }
+ ```
+
+ #### Choose which URI you want it to be scraped.
+
+ ``` ruby
+ 'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
+ #=> {
+ # title: "TwitFlink | Find a link!",
+ # description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
+ # images: ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
+ # uri: "http://www.twitflink.com"
+ # }
+
+ 'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
+ #=> {
+ # title: "AdTangerine | Advertising Platform for Social Media",
+ # description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+ # images: ["http://adtangerine.com/assets/foobar.gif"],
+ # uri: "http://adtangerine.com"
+ # }
+ ```
+
+ ## Contributing
+
+ 1. Fork it
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create new Pull Request
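Note: to make the `images:` option from the README concrete without touching the network, the filtering step can be sketched offline. `filter_images` below is a hypothetical helper written for this note, not part of the gem. It assumes the image type can be read from the extension of the URI path and, like the helper in this release, treats `:jpg` as also covering `.jpeg`.

```ruby
require 'uri'

# Hypothetical helper (not part of scrapifier): keep only the image URIs
# whose path extension is in the allowed list. A nil or empty list falls
# back to the four default types.
def filter_images(images, allowed = nil)
  exts = Array(allowed).map(&:to_s)
  exts = %w[jpg jpeg png gif] if exts.empty?
  exts << 'jpeg' if exts.include?('jpg') && !exts.include?('jpeg')

  images.select do |img|
    ext = File.extname(URI.parse(img).path.to_s).delete('.').downcase
    exts.include?(ext)
  end
end

# Sample data; example.com is a placeholder host.
images = [
  'http://adtangerine.com/assets/logo_adt_og.png',
  'http://example.com/avatar.jpg',
  'http://adtangerine.com/assets/foobar.gif'
]

p filter_images(images, :jpg)          # => ["http://example.com/avatar.jpg"]
p filter_images(images, [:png, :gif])  # the .png and .gif entries
```

Accepting either a Symbol or an Array through `Array(...)` keeps the call sites as short as the README examples.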
data/lib/scrapifier/methods.rb CHANGED
@@ -4,67 +4,57 @@ require 'open-uri'
  require 'scrapifier/support'

  module Scrapifier
+   # Methods which will be included into the String class.
    module Methods
      include Scrapifier::Support

-     # Gets meta data from an URI using the screen scraping technique.
-     #
+     # Get metadata from an URI using the screen scraping technique.
+     #
      # Example:
      # >> 'Wow! What an awesome site: http://adtangerine.com!'.scrapify
      # => {
      # :title => "AdTangerine | Advertising Platform for Social Media",
-     # :description => "AdTangerine is an advertising platform that uses the tangerine as a virtual currency...",
-     # :images => ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png"],
+     # :description => "AdTangerine is an advertising platform that...",
+     # :images => [
+     # "http://adtangerine.com/assets/logo_adt_og.png",
+     # "http://adtangerine.com/assets/logo_adt_og.png"
+     # ],
      # :uri => "http://adtangerine.com"
      # }
      # Arguments:
      # options: (Hash)
-     # - which: (Integer) Indicates which URI in the String will be used. It starts from 0 to N.
-     # - images: (Symbol or Array) Indicates the image extensions which are allowed to be returned as result.
-
+     # - which: (Integer)
+     # Which URI in the String will be used. It starts from 0 to N.
+     # - images: (Symbol or Array)
+     # Image extensions which are allowed to be returned as result.
      def scrapify(options = {})
-       meta, uri = {}, find_uri(options[:which])
-
-       begin
-         if uri.nil?
-           raise
-         elsif uri =~ sf_regex(:image)
-           uri = (sf_check_img_ext(uri, options[:images])[0] rescue [])
-           raise if uri.empty?
-           [:title, :description, :uri, :images].each { |key| meta[key] = uri }
-         else
-           doc = Nokogiri::HTML(open(uri).read)
-           doc.encoding = 'utf-8'
-
-           [:title, :description].each do |key|
-             meta[key] = (doc.xpath(sf_paths[key])[0].text rescue '-')
-           end
+       uri, meta = find_uri(options[:which]), {}
+       return meta if uri.nil?

-           meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, options[:images])
-           meta[:uri] = uri
-         end
-       rescue
-         meta = {}
+       if !(uri =~ sf_regex(:image))
+         meta = sf_eval_uri(uri, options[:images])
+       elsif !sf_check_img_ext(uri, options[:images]).empty?
+         [:title, :description, :uri, :images].each { |k| meta[k] = uri }
        end

        meta
      end
 
-     # Looks for URIs in the String.
-     #
+     # Find URIs in the String.
+     #
      # Example:
      # >> 'Wow! What an awesome site: http://adtangerine.com!'.find_uri
      # => 'http://adtangerine.com'
-     # >> 'Wow! What an awesome sites: http://adtangerine.com and www.twitflink.com'.find_uri 1
+     # >> 'Very cool: http://adtangerine.com and www.twitflink.com'.find_uri 1
      # => 'http://www.twitflink.com'
      # Arguments:
      # which: (Integer)
      # - Which URI in the String: first (0), second (1) and so on.
-
      def find_uri(which = 0)
-       which ||= which.to_i
-       which = self.scan(sf_regex(:uri))[which][0] rescue nil
-       (which.nil? or which =~ sf_regex(:protocol)) ? which : 'http://' << which
+       which = scan(sf_regex(:uri))[which.to_i][0]
+       which =~ sf_regex(:protocol) ? which : "http://#{which}"
+     rescue NoMethodError
+       nil
      end
    end
  end
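Note: the reworked `find_uri` relies on two things: `nil.to_i` is `0`, and indexing past the end of the `scan` result yields `nil`, so the following `[0]` raises `NoMethodError`, which the method-level `rescue` turns into `nil`. A minimal sketch of that flow follows. `DemoMethods`, `demo_find_uri`, `DEMO_URI` and `DEMO_PROTOCOL` are hypothetical names, and the regexes are deliberately much simpler than the gem's `sf_regex(:uri)`.

```ruby
# Hypothetical stand-ins for the gem's regexes, simplified for illustration.
DEMO_URI = %r{((?:https?://|www\.)[^\s!,]+)}i
DEMO_PROTOCOL = %r{\Ahttps?://}i

module DemoMethods
  # Return the URI at position `which` (0-based), or nil when there is none.
  def demo_find_uri(which = 0)
    # scan returns [[uri], [uri], ...]; a missing index gives nil, and
    # nil[0] raises NoMethodError, which is rescued below.
    found = scan(DEMO_URI)[which.to_i][0]
    found =~ DEMO_PROTOCOL ? found : "http://#{found}"
  rescue NoMethodError
    nil
  end
end

# Mix the method into String, as the module comment in the diff describes.
String.send(:include, DemoMethods)

text = 'Very cool: http://adtangerine.com and www.twitflink.com'
p text.demo_find_uri       # => "http://adtangerine.com"
p text.demo_find_uri(1)    # => "http://www.twitflink.com"
p text.demo_find_uri(nil)  # => "http://adtangerine.com"
p text.demo_find_uri(5)    # => nil
```

Rescuing only `NoMethodError` is narrower than the bare `rescue nil` of the previous version, so unrelated failures are no longer silently swallowed.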
data/lib/scrapifier/support.rb CHANGED
@@ -1,144 +1,168 @@
  module Scrapifier
    module Support
-     private
-       # Filters images returning those with the allowed extensions.
-       #
-       # Example:
-       # >> sf_check_img_ext('http://source.com/image.gif', :jpg)
-       # => []
-       # >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
-       # => ['http://source.com/image.jpg']
-       # Arguments:
-       # images: (String or Array)
-       # - Images which will be checked.
-       # allowed: (String, Symbol or Array)
-       # - Allowed types of image extension.
-
-       def sf_check_img_ext(images, allowed = [])
-         allowed ||= []
-         if images.is_a?(String)
-           images = images.split
-         elsif !images.is_a?(Array)
-           images = []
-         end
-         images.select { |i| i =~ sf_regex(:image, allowed) }
-       end
+     module_function
 
-       # Selects regexes for URIs, protocols and image extensions.
-       #
-       # Example:
-       # >> sf_regex(:uri)
-       # => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i
-       # >> sf_regex(:image, :jpg)
-       # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg)(\?.+)?$)/i
-       # Arguments:
-       # type: (Symbol or String)
-       # - Regex type.
-       # args: (*)
-       # - Anything.
+     # Evaluate the URI's HTML document and get its metadata.
+     #
+     # Example:
+     # >> sf_eval_uri('http://adtangerine.com', [:png])
+     # => {
+     # :title => "AdTangerine | Advertising Platform for Social Media",
+     # :description => "AdTangerine is an advertising platform that...",
+     # :images => [
+     # "http://adtangerine.com/assets/logo_adt_og.png",
+     # "http://adtangerine.com/assets/logo_adt_og.png"
+     # ],
+     # :uri => "http://adtangerine.com"
+     # }
+     # Arguments:
+     # uri: (String)
+     # - URI.
+     # imgs: (Array)
+     # - Allowed type of images.
+     def sf_eval_uri(uri, imgs = [])
+       doc = Nokogiri::HTML(open(uri).read)
+       doc.encoding, meta = 'utf-8', { uri: uri }
 
-       def sf_regex(type, *args)
-         type = type.to_sym unless type.is_a? Symbol
-         if type == :image
-           sf_img_regex args.flatten
-         else
-           regexes = {
-             uri: /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
-             protocol: /((ht|f)tp[s]?)/i
-           }
-           regexes[type]
-         end
-       end
+       %i(title description).each { |k| meta[k] = (doc.xpath(sf_paths[k])[0].text rescue '-') }
+       meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, imgs)
+
+       meta
+     rescue SocketError
+       {}
+     end
 
-       # Builds image regexes according to the required extensions.
-       #
-       # Example:
-       # >> sf_img_regex
-       # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
-       # >> sf_img_regex([:jpg, :png])
-       # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png|jpeg)(\?.+)?$)/i
-       # Arguments:
-       # exts: (Array)
-       # - Image extensions which will be included in the regex.
-
-       def sf_img_regex(exts = [])
-         exts = [exts].flatten unless exts.is_a?(Array)
-         if exts.nil? or exts.empty?
-           exts = %w(jpg jpeg png gif)
-         elsif exts.include?(:jpg) and !exts.include?(:jpeg)
-           exts.push :jpeg
-         end
-         eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+     # Filter images returning those with the allowed extensions.
+     #
+     # Example:
+     # >> sf_check_img_ext('http://source.com/image.gif', :jpg)
+     # => []
+     # >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
+     # => ['http://source.com/image.jpg']
+     # Arguments:
+     # images: (String or Array)
+     # - Images which will be checked.
+     # allowed: (String, Symbol or Array)
+     # - Allowed types of image extension.
+     def sf_check_img_ext(images, allowed = [])
+       allowed ||= []
+       if images.is_a?(String)
+         images = images.split
+       elsif !images.is_a?(Array)
+         images = []
        end
+       images.select { |i| i =~ sf_regex(:image, allowed) }
+     end
 
-       # Collection of paths used to get content from HTML tags via Node#xpath method.
-       # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
-       #
-       # Example:
-       # >> sf_paths[:title]
-       # => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
-
-       def sf_paths
-         {
-           title: '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
-           description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
-           image: '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+     # Select regexes for URIs, protocols and image extensions.
+     #
+     # Example:
+     # >> sf_regex(:uri)
+     # => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i
+     # >> sf_regex(:image, :jpg)
+     # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg)(\?.+)?$)/i
+     # Arguments:
+     # type: (Symbol or String)
+     # - Regex type.
+     # args: (*)
+     # - Anything.
+     def sf_regex(type, *args)
+       type = type.to_sym unless type.is_a? Symbol
+       if type == :image
+         sf_img_regex args.flatten
+       else
+         regexes = {
+           uri: /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
+           protocol: /((ht|f)tp[s]?)/i
          }
+         regexes[type]
        end
+     end
 
-       # Checks and returns only the valid image URIs.
-       #
-       # Example:
-       # >> sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
-       # => ['http://adtangerine.com/assets/image.jpg']
-       # Arguments:
-       # imgs: (Array)
-       # - Image URIs got from the HTML doc.
-       # uri: (String)
-       # - Used as basis to the URIs that don't have any protocol/domain set.
-       # exts: (Symbol or Array)
-       # - Allowed image extensions.
+     # Build image regexes according to the required extensions.
+     #
+     # Example:
+     # >> sf_img_regex
+     # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
+     # >> sf_img_regex([:jpg, :png])
+     # => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png|jpeg)(\?.+)?$)/i
+     # Arguments:
+     # exts: (Array)
+     # - Image extensions which will be included in the regex.
+     def sf_img_regex(exts = [])
+       exts = [exts].flatten unless exts.is_a?(Array)
+       if exts.nil? or exts.empty?
+         exts = %w(jpg jpeg png gif)
+       elsif exts.include?(:jpg) and !exts.include?(:jpeg)
+         exts.push :jpeg
+       end
+       eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+     end
 
-       def sf_fix_imgs(imgs, uri, exts = [])
-         sf_check_img_ext(imgs.map do |img|
-           img = img.to_s
-           img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
-           img if (img =~ sf_regex(:image))
-         end.compact, exts)
-       end
+     # Collection of paths used to get content from HTML tags via Node#xpath method.
+     # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
+     #
+     # Example:
+     # >> sf_paths[:title]
+     # => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
+     def sf_paths
+       {
+         title: '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
+         description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
+         image: '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+       }
+     end
 
-       # Fixes image URIs that doesn't present protocol/domain.
-       #
-       # Example:
-       # >> sf_fix_protocol('/assets/image.jpg', 'adtangerine.com')
-       # => 'http://adtangerine.com/assets/image.jpg'
-       # >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'youtube.com')
-       # => 'http://s.ytimg.com/yts/img/youtub_img.png'
-       # Arguments:
-       # path: (String)
-       # - URI path having no protocol/domain set.
-       # domain: (String)
-       # - Domain that will be prepended into the path.
+     # Check and return only the valid image URIs.
+     #
+     # Example:
+     # >> sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
+     # => ['http://adtangerine.com/assets/image.jpg']
+     # Arguments:
+     # imgs: (Array)
+     # - Image URIs got from the HTML doc.
+     # uri: (String)
+     # - Used as basis to the URIs that don't have any protocol/domain set.
+     # exts: (Symbol or Array)
+     # - Allowed image extensions.
+     def sf_fix_imgs(imgs, uri, exts = [])
+       sf_check_img_ext(imgs.map do |img|
+         img = img.to_s
+         img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
+         img if (img =~ sf_regex(:image))
+       end.compact, exts)
+     end
 
-       def sf_fix_protocol(path, domain)
-         if path =~ /^\/\/[^\/]+/
-           'http:' << path
-         else
-           "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
-         end
-       end
+     # Fix image URIs that don't have a protocol/domain set.
+     #
+     # Example:
+     # >> sf_fix_protocol('/assets/image.jpg', 'adtangerine.com')
+     # => 'http://adtangerine.com/assets/image.jpg'
+     # >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'youtube.com')
+     # => 'http://s.ytimg.com/yts/img/youtub_img.png'
+     # Arguments:
+     # path: (String)
+     # - URI path having no protocol/domain set.
+     # domain: (String)
+     # - Domain that will be prepended into the path.
+     def sf_fix_protocol(path, domain)
+       if path =~ /^\/\/[^\/]+/
+         'http:' << path
+       else
+         "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
+       end
+     end
 
-       # Returns the domain from an URI
-       #
-       # Example:
-       # >> sf_domain('http://adtangerine.com')
-       # => 'adtangerine.com'
-       # Arguments:
-       # uri: (String)
-       # - URI.
-
-       def sf_domain(uri)
-         (uri.split('/')[2] rescue '')
-       end
+     # Return the URI domain.
+     #
+     # Example:
+     # >> sf_domain('http://adtangerine.com')
+     # => 'adtangerine.com'
+     # Arguments:
+     # uri: (String)
+     # - URI.
+     def sf_domain(uri)
+       (uri.split('/')[2] rescue '')
+     end
    end
  end
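Note: `sf_img_regex` builds its pattern by passing an interpolated string to `eval`. One possible alternative, sketched here under assumptions, interpolates an escaped alternation into a regex literal so that extension values are never executed as code. `demo_img_regex` and `demo_fix_protocol` are hypothetical names; the pattern is a simplified variant (anchored with `\A`/`\z`, without the optional `www.` group), not the gem's exact regex. The second helper mirrors the protocol fix from the diff and expects a bare domain, which is what `sf_domain` returns.

```ruby
# Hypothetical eval-free variant of the image regex builder.
def demo_img_regex(exts = [])
  exts = Array(exts).map(&:to_s)
  exts = %w[jpg jpeg png gif] if exts.empty?
  exts << 'jpeg' if exts.include?('jpg') && !exts.include?('jpeg')
  # Escape each extension so it is matched literally.
  alternation = exts.map { |e| Regexp.escape(e) }.join('|')
  %r{\Ahttps?://.+\.(?:#{alternation})(?:\?.+)?\z}i
end

# Mirror of the protocol fix in the diff: scheme-relative paths get "http:",
# anything else is joined onto the bare domain.
def demo_fix_protocol(path, domain)
  return "http:#{path}" if path =~ %r{\A//[^/]+}

  slash = path =~ %r{\A/[^/]+} ? '' : '/'
  "http://#{domain}#{slash}#{path}"
end

p 'http://source.com/image.jpeg' =~ demo_img_regex(:jpg)         # => 0
p 'http://source.com/image.gif' =~ demo_img_regex([:jpg, :png])  # => nil
p demo_fix_protocol('/assets/image.jpg', 'adtangerine.com')
# => "http://adtangerine.com/assets/image.jpg"
p demo_fix_protocol('//s.ytimg.com/yts/img/logo.png', 'youtube.com')
# => "http://s.ytimg.com/yts/img/logo.png"
```

As in the diffed code, scheme-relative paths always receive `http:`, regardless of the protocol of the page they came from.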
data/lib/scrapifier/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Scrapifier
-   VERSION = '0.0.1'
+   VERSION = '0.0.2'
  end
File without changes
data/spec/spec_helper.rb CHANGED
@@ -1,5 +1,5 @@
  require 'rubygems'
  require 'bundler/setup'
  require 'scrapifier'
- require 'factories/uris'
+ require 'factories/samples'
 
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: scrapifier
  version: !ruby/object:Gem::Version
-   version: 0.0.1
+   version: 0.0.2
  platform: ruby
  authors:
  - Tiago Guedes
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-04-07 00:00:00.000000000 Z
+ date: 2014-04-30 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -75,6 +75,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".travis.yml"
  - Gemfile
  - LICENSE.txt
  - README.md
@@ -84,7 +85,7 @@ files:
  - lib/scrapifier/support.rb
  - lib/scrapifier/version.rb
  - scrapifier.gemspec
- - spec/factories/uris.rb
+ - spec/factories/samples.rb
  - spec/scrapifier_spec.rb
  - spec/spec_helper.rb
  homepage: https://github.com/tiagopog/scrapifier
@@ -112,6 +113,6 @@ signing_key:
  specification_version: 4
  summary: Extends the Ruby String class with a screen scraping method.
  test_files:
- - spec/factories/uris.rb
+ - spec/factories/samples.rb
  - spec/scrapifier_spec.rb
  - spec/spec_helper.rb