scrapifier 0.0.5 → 0.0.6
- checksums.yaml +4 -4
- data/README.md +119 -94
- data/lib/scrapifier/version.rb +1 -1
- data/lib/scrapifier/xpath.rb +3 -3
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66a555f6eaeccf042961100999adeeee7a4a084a
+  data.tar.gz: 0517ecc6004507c2c4a479c187d24d559d6cd2e0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 001a68b1afe7f2b88d7c8c6fd9155ef6754468b78bd66fc6be2c038e291e38ee7aac9a04e9d4f51ce87579a0590ec04b25dd07d6701b317ef37bb02b611b8a37
+  data.tar.gz: 7330af461a9585ecf840c054a7f96556036e1189af32198a86e2020cac90b9305f3596fdffc33a3d837a403129d31d085c424542502786da5c33f5e8fac13d70
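The two digest families recorded in checksums.yaml can be recomputed locally with Ruby's standard `digest` library. The sketch below is only an illustration: the helper name `gem_checksums` is hypothetical, and it assumes you already have the raw bytes of `metadata.gz` or `data.tar.gz` in hand.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the hex digests that checksums.yaml stores
# for one archive member, given that member's raw bytes.
def gem_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),   # 40 hex characters
    'SHA512' => Digest::SHA512.hexdigest(bytes)  # 128 hex characters
  }
end
```

A call such as `gem_checksums(File.binread('data.tar.gz'))` would then be compared against the `data.tar.gz:` entries above; the file path is an assumption about where the archive was unpacked.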
data/README.md
CHANGED
@@ -1,94 +1,119 @@
-# Scrapifier
-
-[![Build Status](https://travis-ci.org/tiagopog/scrapifier.svg?branch=master)](https://travis-ci.org/tiagopog/scrapifier)
-[![Code Climate](https://codeclimate.com/github/tiagopog/scrapifier.png)](https://codeclimate.com/github/tiagopog/scrapifier)
-[![Dependency Status](https://gemnasium.com/tiagopog/scrapifier.svg)](https://gemnasium.com/tiagopog/scrapifier)
-[![Gem Version](https://badge.fury.io/rb/scrapifier.svg)](http://badge.fury.io/rb/scrapifier)
-
-It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
-
-Note: This gem is mainly focused on screen scraping URLs (presence of protocol, such as: "http", "https" and "ftp"), but it also works with URIs which have the "www" without any protocol defined, like: "www.google.com".
-
-## Installation
-
-Compatible with Ruby 1.9.3+
-
-Add this line to your application's Gemfile:
-
-gem 'scrapifier'
-
-And then execute:
-
-$ bundle
-
-Or install it yourself as:
-
-$ gem install scrapifier
-
-An then require the gem:
-
-$ require 'scrapifier'
-
-## Usage
-
-The String#scrapify method finds URIs in a string and then gets their metadata, e.g., the page's title, description, images and URI. All the data is returned in a well-formatted hash.
-
-#### Default usage.
-
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify
-#=> {
-# title:
-# description: "
-#
-#
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-#
-#
-#
-#
-#
-
-
-
-
-
-
-
-#
-#
-#
-#
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+# Scrapifier
+
+[![Build Status](https://travis-ci.org/tiagopog/scrapifier.svg?branch=master)](https://travis-ci.org/tiagopog/scrapifier)
+[![Code Climate](https://codeclimate.com/github/tiagopog/scrapifier.png)](https://codeclimate.com/github/tiagopog/scrapifier)
+[![Dependency Status](https://gemnasium.com/tiagopog/scrapifier.svg)](https://gemnasium.com/tiagopog/scrapifier)
+[![Gem Version](https://badge.fury.io/rb/scrapifier.svg)](http://badge.fury.io/rb/scrapifier)
+
+It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
+
+Note: This gem is mainly focused on screen scraping URLs (presence of protocol, such as: "http", "https" and "ftp"), but it also works with URIs which have the "www" without any protocol defined, like: "www.google.com".
+
+## Installation
+
+Compatible with Ruby 1.9.3+
+
+Add this line to your application's Gemfile:
+
+gem 'scrapifier'
+
+And then execute:
+
+$ bundle
+
+Or install it yourself as:
+
+$ gem install scrapifier
+
+An then require the gem:
+
+$ require 'scrapifier'
+
+## Usage
+
+The String#scrapify method finds URIs in a string and then gets their metadata, e.g., the page's title, description, images, keywords, language, encode, "reply to" email, author and URI. All the data is returned in a well-formatted hash.
+
+#### Default usage.
+
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
+
+#### Allow only certain image types.
+
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify(images: :jpg)
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
+# uri: "http://adtangerine.com"
+# }
+
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify(images: [:png, :gif])
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
+
+#### Choose which URI you want it to be scraped.
+
+``` ruby
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify(which: 1)
+#=> {
+# title: "TwitFlink | Find a link!",
+# description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted...",
+# keywords: "search, searching tool, link, twitter, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes",
+# images: ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
+# uri: "http://www.twitflink.com"
+# }
+
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify(which: 0, images: :gif)
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
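The `which:` and `images:` options documented in the README can be illustrated without touching the network. The following is a minimal sketch under stated assumptions: `find_uris`, `pick_uri` and `filter_images` are hypothetical helper names, the regular expression is a simplification, and none of this is Scrapifier's actual implementation.

```ruby
# Hypothetical helpers that mimic the README's documented behaviour.

# Finds URI-like tokens (http, https, ftp, or a bare "www." host), drops
# trailing punctuation, and prepends "http://" to bare "www." hosts.
def find_uris(text)
  pattern = %r{(?:(?:https?|ftp)://|www\.)\S+}
  text.scan(pattern).map do |raw|
    uri = raw.sub(/[.,!?;:]+\z/, '')
    uri.start_with?('www.') ? "http://#{uri}" : uri
  end
end

# Picks the URI at the 0-based index +which+, as in scrapify(which: 1).
def pick_uri(text, which: 0)
  find_uris(text)[which]
end

# Keeps only the URLs whose file extension is listed in +allowed+,
# which may be a single symbol (:jpg) or an array ([:png, :gif]).
def filter_images(urls, allowed)
  extensions = Array(allowed).map(&:to_s)
  urls.select { |url| extensions.include?(File.extname(url).delete('.').downcase) }
end
```

With the README's sample string, `pick_uri('Check out: http://adtangerine.com and www.twitflink.com', which: 1)` selects the second URI, and `filter_images(urls, [:png, :gif])` keeps duplicates, which matches the repeated `logo_adt_og.png` entry in the README output.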
data/lib/scrapifier/version.rb
CHANGED
data/lib/scrapifier/xpath.rb
CHANGED
@@ -41,14 +41,14 @@ module Scrapifier
 
 REPLY_TO =
   <<-END.gsub(/^\s+\|/, '')
-    |//meta[@name="reply_to"]/@content
+    |//meta[@name="reply_to"]/@content|
+    |//meta[@name="Reply_to"]/@content
   END
 
 AUTHOR =
   <<-END.gsub(/^\s+\|/, '')
     |//meta[@name="author"]/@content|
-    |//meta[@name="Author"]/@content
-    |//meta[@name="reply_to"]/@content
+    |//meta[@name="Author"]/@content
   END
 
 IMG =
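The constants in xpath.rb rely on a heredoc idiom: every line starts with a `|` margin marker that `gsub(/^\s+\|/, '')` strips, while a `|` at the end of a line survives and acts as the XPath union operator. The sketch below reproduces the new `REPLY_TO` value inside a hypothetical method (`reply_to_xpath` is not part of the gem) so the resulting string can be inspected.

```ruby
# Hypothetical wrapper around the same heredoc idiom used in xpath.rb.
def reply_to_xpath
  # The leading whitespace plus "|" on each line is removed by gsub;
  # the trailing "|" on the first line stays and joins the two paths.
  <<-END.gsub(/^\s+\|/, '')
    |//meta[@name="reply_to"]/@content|
    |//meta[@name="Reply_to"]/@content
  END
end

# Splitting on the union operator lists the alternatives being matched.
def reply_to_alternatives
  reply_to_xpath.split('|').map(&:strip)
end
```

Here `reply_to_alternatives` returns the lower-case and capitalised `reply_to` paths, which is the case variation this release adds to `REPLY_TO`.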
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scrapifier
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.6
 platform: ruby
 authors:
 - Tiago Guedes
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-
+date: 2014-06-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri