scrapifier 0.0.5 → 0.0.6
- checksums.yaml +4 -4
- data/README.md +119 -94
- data/lib/scrapifier/version.rb +1 -1
- data/lib/scrapifier/xpath.rb +3 -3
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66a555f6eaeccf042961100999adeeee7a4a084a
+  data.tar.gz: 0517ecc6004507c2c4a479c187d24d559d6cd2e0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 001a68b1afe7f2b88d7c8c6fd9155ef6754468b78bd66fc6be2c038e291e38ee7aac9a04e9d4f51ce87579a0590ec04b25dd07d6701b317ef37bb02b611b8a37
+  data.tar.gz: 7330af461a9585ecf840c054a7f96556036e1189af32198a86e2020cac90b9305f3596fdffc33a3d837a403129d31d085c424542502786da5c33f5e8fac13d70
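The checksums hunk only swaps digest values. As a rough illustration of what those values are for, the sketch below recomputes SHA1 and SHA512 digests for a byte string and compares them with a hash laid out like checksums.yaml. The helper name `checksums_match?` and the sample payload are assumptions made for this example; neither is part of scrapifier or RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not a RubyGems API): true when both digests recorded
# for `name` match the digests recomputed from `bytes`.
def checksums_match?(checksums, name, bytes)
  checksums.fetch('SHA1').fetch(name) == Digest::SHA1.hexdigest(bytes) &&
    checksums.fetch('SHA512').fetch(name) == Digest::SHA512.hexdigest(bytes)
end

# Stand-in payload; a real check would read metadata.gz or data.tar.gz
# from the unpacked .gem archive instead.
payload = 'example payload'

# Same layout as checksums.yaml: algorithm => { file name => hex digest }.
recorded = {
  'SHA1'   => { 'data.tar.gz' => Digest::SHA1.hexdigest(payload) },
  'SHA512' => { 'data.tar.gz' => Digest::SHA512.hexdigest(payload) }
}

puts checksums_match?(recorded, 'data.tar.gz', payload)
puts checksums_match?(recorded, 'data.tar.gz', payload + 'x')
```

Any change to the packaged bytes changes both digests, which is why every release rewrites all four values even when only a few source lines differ.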
data/README.md
CHANGED
@@ -1,94 +1,119 @@
-# Scrapifier
-
-[](https://travis-ci.org/tiagopog/scrapifier)
-[](https://codeclimate.com/github/tiagopog/scrapifier)
-[](https://gemnasium.com/tiagopog/scrapifier)
-[](http://badge.fury.io/rb/scrapifier)
-
-It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
-
-Note: This gem is mainly focused on screen scraping URLs (presence of protocol, such as: "http", "https" and "ftp"), but it also works with URIs which have the "www" without any protocol defined, like: "www.google.com".
-
-## Installation
-
-Compatible with Ruby 1.9.3+
-
-Add this line to your application's Gemfile:
-
-    gem 'scrapifier'
-
-And then execute:
-
-    $ bundle
-
-Or install it yourself as:
-
-    $ gem install scrapifier
-
-An then require the gem:
-
-    $ require 'scrapifier'
-
-## Usage
-
-The String#scrapify method finds URIs in a string and then gets their metadata, e.g., the page's title, description, images and URI. All the data is returned in a well-formatted hash.
-
-#### Default usage.
-
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify
-#=> {
-# title:
-# description: "
-#
-#
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-#
-#
-#
-#
-#
-
-
-
-
-
-
-
-#
-#
-#
-#
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+# Scrapifier
+
+[](https://travis-ci.org/tiagopog/scrapifier)
+[](https://codeclimate.com/github/tiagopog/scrapifier)
+[](https://gemnasium.com/tiagopog/scrapifier)
+[](http://badge.fury.io/rb/scrapifier)
+
+It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
+
+Note: This gem is mainly focused on screen scraping URLs (presence of protocol, such as: "http", "https" and "ftp"), but it also works with URIs which have the "www" without any protocol defined, like: "www.google.com".
+
+## Installation
+
+Compatible with Ruby 1.9.3+
+
+Add this line to your application's Gemfile:
+
+    gem 'scrapifier'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install scrapifier
+
+An then require the gem:
+
+    $ require 'scrapifier'
+
+## Usage
+
+The String#scrapify method finds URIs in a string and then gets their metadata, e.g., the page's title, description, images, keywords, language, encode, "reply to" email, author and URI. All the data is returned in a well-formatted hash.
+
+#### Default usage.
+
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
+
+#### Allow only certain image types.
+
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify(images: :jpg)
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
+# uri: "http://adtangerine.com"
+# }
+
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify(images: [:png, :gif])
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
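The `images:` option in the README examples narrows the `images` array by file type. The sketch below shows one way such a filter could behave, using the image URLs from the README output. The helper name `filter_images` and the extension-matching approach are assumptions for illustration; the diff does not show how the gem implements the option.

```ruby
# Hypothetical helper: keep only image URLs whose extension is allowed.
# `types` may be a single symbol (:jpg) or an array ([:png, :gif]).
def filter_images(urls, types)
  allowed = Array(types).map(&:to_s)
  urls.select do |url|
    # Drop any query string, then compare the lowercased extension.
    ext = File.extname(url.split('?').first).delete('.').downcase
    allowed.include?(ext)
  end
end

images = [
  'http://adtangerine.com/assets/logo_adt_og.png',
  'http://adtangerine.com/assets/logo_adt_og.png',
  'http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg',
  'http://adtangerine.com/assets/foobar.gif'
]

p filter_images(images, :jpg)
p filter_images(images, [:png, :gif])
```

This sketch compares extensions literally, so `:jpg` would not match a `.jpeg` file; whether the gem treats the two as equivalent is not visible in this diff.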
+
+#### Choose which URI you want it to be scraped.
+
+``` ruby
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify(which: 1)
+#=> {
+# title: "TwitFlink | Find a link!",
+# description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted...",
+# keywords: "search, searching tool, link, twitter, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes",
+# images: ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
+# uri: "http://www.twitflink.com"
+# }
+
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify(which: 0, images: :gif)
+#=> {
+# title: "AdTangerine | Boosting great ideas",
+# description: "Advertising social network that uses tangerines as a virtual currency..." ,
+# keywords: "ad network, ad, advertising, advertiser, publisher, social media",
+# lang: "en-us",
+# encode: "utf-8",
+# reply_to: "sayhello@adtangerine.com",
+# author: "Tiago Guedes, Jonatas de Paula, Raphael da Costa",
+# images: ["http://adtangerine.com/assets/foobar.gif"],
+# uri: "http://adtangerine.com"
+# }
+```
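The `which:` option selects a URI by its zero-based position in the string, and the README output suggests a bare "www." address is reported with "http://" in front. A minimal sketch of that selection step follows. The names `URI_PATTERN` and `pick_uri`, the pattern itself, and the trailing-punctuation cleanup are all assumptions for illustration, not the gem's actual code.

```ruby
# Hypothetical pattern: a protocol (http, https, ftp) or a bare "www." prefix,
# followed by any run of non-whitespace characters.
URI_PATTERN = %r{(?:(?:https?|ftp)://|www\.)\S+}

# Hypothetical helper: return the URI at zero-based position `which`,
# or nil when the text contains fewer URIs than that.
def pick_uri(text, which: 0)
  # Strip sentence punctuation that the \S+ run swallowed at the end.
  candidates = text.scan(URI_PATTERN).map { |uri| uri.sub(/[!?.,;:]+\z/, '') }
  uri = candidates[which]
  return nil if uri.nil?

  # Assumption based on the README output: bare "www." URIs get "http://".
  uri.match?(%r{\A[a-z]+://}i) ? uri : "http://#{uri}"
end

text = 'Check out: http://adtangerine.com and www.twitflink.com'
puts pick_uri(text, which: 1)
puts pick_uri(text, which: 0)
```

Returning `nil` for an out-of-range index is a choice made for this sketch; the diff does not say what `scrapify` does in that case.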
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
data/lib/scrapifier/version.rb
CHANGED
data/lib/scrapifier/xpath.rb
CHANGED
@@ -41,14 +41,14 @@ module Scrapifier
 
 REPLY_TO =
 <<-END.gsub(/^\s+\|/, '')
-|//meta[@name="reply_to"]/@content
+|//meta[@name="reply_to"]/@content|
+|//meta[@name="Reply_to"]/@content
 END
 
 AUTHOR =
 <<-END.gsub(/^\s+\|/, '')
 |//meta[@name="author"]/@content|
-|//meta[@name="Author"]/@content
-|//meta[@name="reply_to"]/@content
+|//meta[@name="Author"]/@content
 END
 
 IMG =
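Judging from the added lines, `REPLY_TO` now tries both `reply_to` and `Reply_to`, and the `reply_to` expression no longer appears under `AUTHOR`. The sketch below reuses the heredoc idiom from xpath.rb with the 0.0.6 values to show what string it produces: the leading `|` on each line is a margin marker that `gsub` strips, while a trailing `|` survives as the XPath union operator. The `alternatives` variable is only there for illustration.

```ruby
# Values copied from the 0.0.6 side of the xpath.rb hunk.
REPLY_TO =
  <<-END.gsub(/^\s+\|/, '')
    |//meta[@name="reply_to"]/@content|
    |//meta[@name="Reply_to"]/@content
  END

# Splitting on the union operator lists the alternatives the expression tries.
alternatives = REPLY_TO.split('|').map(&:strip)
puts alternatives
```

Because the margin regex only matches a `|` at the start of a line, forgetting the trailing `|` on any line but the last would leave two expressions separated only by a newline rather than joined as a union.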
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scrapifier
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.6
 platform: ruby
 authors:
 - Tiago Guedes
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-
+date: 2014-06-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri