super_crawler 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +67 -98
- data/lib/super_crawler/{crawl_site.rb → crawl.rb} +20 -21
- data/lib/super_crawler/{crawl_page.rb → scrap.rb} +14 -25
- data/lib/super_crawler/version.rb +1 -1
- data/lib/super_crawler.rb +2 -2
- data/super_crawler.gemspec +0 -1
- metadata +4 -18
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ffb65575970b9bd45f3ac9fa80b004c96df3492a
+  data.tar.gz: 1a72997367a389dcfb67dcd913554207604ef3e2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 67dd33ed9ee8965a84cdcc22187212ea97c47c6a29bc868818b31246c827e342c486d0b12270d52eac1c1c8071cab51fef4dd441470abd1702905e84c814c31c
+  data.tar.gz: bd45255d527837c177b788f81412a82e0609a015a65e9d9423841e80bef362b62f854c55e1e1637cf8fc45bbf19d256bac4a66da9e804b8850128e0ca6f8d9a8
data/README.md
CHANGED
@@ -6,23 +6,16 @@ Easy (yet efficient) ruby gem to crawl your favorite website.
 
 Open your terminal, then:
 
-
-
+git clone https://github.com/htaidirt/super_crawler
+cd super_crawler
+bundle
+./bin/console
 
-
+Then
 
-
-
-
-```
-
-```ruby
-> sc = SuperCrawler::CrawlSite.new('https://gocardless.com')
-
-> sc.start # => Start crawling the website
-
-> sc.render(5) # => Show first 5 results of the crawling as sitemap
-```
+sc = SuperCrawler::Crawl.new('https://gocardless.com')
+sc.start(10) # => Start crawling the website using 10 threads
+sc.render(5) # => Show the first 5 results of the crawling as sitemap
 
 ## Installation
 
@@ -52,64 +45,57 @@ This gem is an experiment and can't be used for production purposes. Please, use
 
 There are also a lot of limitations that weren't handled due to time. You'll find more information on the limitations below.
 
-SuperCrawler gem was only tested on MRI and
+The SuperCrawler gem was only tested on MRI 2.3.1 and Rubinius 2.5.8.
 
 ## Philosophy
 
-Starting from a URL,
+Starting from a given URL, the crawler extracts all the internal links and assets within the page. The links are added to a list of unique links for further exploration. The crawler repeats the exploration on every link until no new link is found.
 
-Due to the heavy operations, and the time to access each page content, we will use threads to perform near-parallel processing.
+Due to the heavy operations (thousands of pages) and the network time to access each page's content, we use threads to perform near-parallel processing.
 
-In order to keep the code readable and structured,
+In order to keep the code readable and structured, we created two classes:
 
-- `SuperCrawler::
-- `SuperCrawler::
+- `SuperCrawler::Scrap` is responsible for scraping a single page and extracting all relevant information (internal links and assets).
+- `SuperCrawler::Crawl` is responsible for crawling a whole website by collecting and managing links (using `SuperCrawler::Scrap` on every internal link found). This class is also responsible for rendering results.
 
 ## More detailed use
 
 Open your favorite ruby console and require the gem:
 
-
-require 'super_crawler'
-```
+require 'super_crawler'
 
-###
+### Scraping a single web page
 
 Read the following if you would like to crawl a single web page and extract relevant information (internal links and assets).
 
-
-page = SuperCrawler::CrawlPage.new( url )
-```
+page = SuperCrawler::Scrap.new( url )
 
-Where `url` should be the URL of the page you would like to
+Where `url` should be the URL of the page you would like to scrape.
 
-**Nota:**
+**Note:** If the given URL is missing its scheme (`http://` or `https://`), SuperCrawler will prepend `http://` to the URL.
 
 #### Get the encoded URL
 
 Run
 
-
-
-
-
-to get the encoded URL provided.
+page.url
+
+to get the encoded URL.
 
 #### Get internal links of a page
 
 Run
 
-
-
-
-
-to get a list of internal links within the crawled page. An internal link is a link that _has the same host than the page URL_. Subdomains are rejected.
+page.get_links
+
+to get the list of internal links in the page. An internal link is a link that _has the same scheme and host as the provided URL_. Subdomains are rejected.
 
 This method searches in the `href` attribute of all `<a>` anchor tags.
 
-**Nota:**
+**Note:**
 
-
+- This method returns an array of absolute URLs (all internal links).
+- Bad links and special links (like mailto and javascript) are discarded.
 
 #### Get images of a page
 
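The Philosophy paragraph in the hunk above describes the whole algorithm: extract a page's internal links, keep only the ones not seen before, and repeat until no new link turns up. A minimal, single-threaded sketch of that loop (illustrative only, not the gem's code; `extract_links` is a hypothetical callable standing in for `SuperCrawler::Scrap#get_links`):

```ruby
require 'set'

# Breadth-first crawl loop: explore every unique internal link exactly once.
# `extract_links` is a hypothetical callable returning a page's internal links.
def crawl(start_url, extract_links)
  visited = Set.new     # Unique links already explored
  queue   = [start_url] # Links waiting to be explored
  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    extract_links.call(url).each do |link|
      queue << link unless visited.include?(link) # Only enqueue new links
    end
  end
  visited.to_a
end
```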
@@ -129,92 +115,75 @@ to get a list of images links within the page. The images links are extracted fr
 
 Run
 
-
-page.get_stylesheets
-```
+page.get_stylesheets
 
-to get a list of
+to get a list of stylesheet links within the page. The links are extracted from the `href="..."` attribute of all `<link rel="stylesheet">` tags.
 
-**Nota:**
+**Note:**
 
-
+- Inline styling isn't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### Get scripts of a page
 
 Run
 
-
-page.get_scripts
-```
+page.get_scripts
 
-to get a list of
+to get a list of script links within the page. The links are extracted from the `src="..."` attribute of all `<script>` tags.
 
-**Nota:**
+**Note:**
 
-
+- Inline scripts aren't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### List all assets of a page
 
 Run
 
-
-page.get_assets
-```
+page.get_assets
 
-to get a list of all assets (images, stylesheets and scripts
+to get a list of all assets (links of images, stylesheets and scripts) as a hash of arrays.
 
 ### Crawling a whole web site
 
-
-
-```ruby
-sc = SuperCrawler::CrawlSite.new(url, count_threads)
-```
+sc = SuperCrawler::Crawl.new(url)
 
-where `url` is the URL of the
+where `url` is the URL of the website to crawl.
 
 Next, start the crawler:
 
-
-
-
+sc.start(number_of_threads)
+
+where `number_of_threads` is the number of threads that will perform the job (10 by default). **This can take some time, depending on the site to crawl.**
 
-
+To access the crawl results, use the following:
 
-
-
-```ruby
-sc.links # The array of internal links
-
-sc.crawl_results # Array of hashes containing links and assets for every link crawled
-```
+sc.links # The array of unique internal links
+sc.crawl_results # Array of hashes containing links and assets for every unique internal link found
 
 To see the crawling as a sitemap, use:
 
-
-sc.render(5) # Will render the sitemap of the first 5 pages
-```
+sc.render(5) # Will render the sitemap of the first 5 pages
 
-
+_TODO: Create a separate and more sophisticated rendering class that can render to files of different formats (HTML, XML, JSON, ...)_
 
 #### Tips on searching assets and links
 
 After `sc.start`, you can access all collected resources (links and assets) using `sc.crawl_results`. This has the following structure:
 
-
-
-
-
-
-
-
-
-
-
-
-
-]
-```
+[
+  {
+    url: 'http://example.com/',
+    links: [...array of internal links...],
+    assets: {
+      images: [...array of image links...],
+      stylesheets: [...array of stylesheet links...],
+      scripts: [...array of script links...]
+    }
+  },
+  ...
+]
 
 You can use `sc.crawl_results.select{ |resource| ... }` to select a particular resource.
 
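Since `crawl_results` is a plain array of hashes, ordinary Enumerable calls are all you need. Two hedged examples, assuming `sc` has finished `sc.start` and using the keys shown in the structure above:

```ruby
# Pages that reference at least one script:
pages_with_scripts = sc.crawl_results.select { |resource| resource[:assets][:scripts].any? }

# The crawl result for one specific page:
home = sc.crawl_results.find { |resource| resource[:url] == 'http://example.com/' }
```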
@@ -223,12 +192,12 @@ You can use `sc.crawl_results.select{ |resource| ... }` to select a particular r
 
 Currently, the gem has the following limitations:
 
 - Subdomains are not considered as internal links
-
+- A link with the same domain but a different scheme is ignored (http -> https, or the opposite)
 - Only links within `<a href="...">` tags are extracted
 - Only image links within `<img src="..."/>` tags are extracted
 - Only stylesheet links within `<link rel="stylesheet" href="..." />` tags are extracted
 - Only script links within `<script src="...">` tags are extracted
-- A page that is not accessible (
+- A page that is not accessible (non-200 status) is not checked later
 
 ## Development
 
@@ -238,11 +207,11 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/htaidirt/super_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+Bug reports and pull requests are welcome on GitHub at [https://github.com/htaidirt/super_crawler](https://github.com/htaidirt/super_crawler). This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
-
+Please follow this process:
 
-1. Fork
+1. Fork the project
 2. Create your feature branch (git checkout -b my-new-feature)
 3. Commit your changes (git commit -am 'Add some feature')
 4. Push to the branch (git push origin my-new-feature)
data/lib/super_crawler/{crawl_site.rb → crawl.rb}
CHANGED
@@ -1,33 +1,32 @@
 require 'thread'
 
-require 'super_crawler/crawl_page'
+require 'super_crawler/scrap'
 
 module SuperCrawler
 
   ###
   # Crawl a whole website
   #
-  class CrawlSite
+  class Crawl
 
     attr_reader :links, :crawl_results
 
-    def initialize start_url, threads = 10, options = {}
+    def initialize start_url, options = {}
       @start_url = URI(URI.encode start_url).normalize().to_s # Normalize the given URL
       @links = [@start_url] # Will contain the list of all links found
       @crawl_results = [] # Will contain the crawl results (links and assets), as array of hashes
-      @threads = threads # How many threads to use? Default: 10
 
       @option_debug = options[:debug].nil? ? true : !!(options[:debug]) # Debug by default
     end
 
     ###
     # Start crawling the site
-    # Could take a while
+    # Could take a while! Use threads to speed up crawling and log to inform the user.
     #
-    def start
+    def start threads_count = 10
 
-      crawling_start_notice # Show message on what will happen
-      threads = [] # Will contain our threads
+      crawling_start_notice( @start_url, threads_count ) # Show message on what will happen
+      threads = [] # Will contain our worker threads
       @links_queue = Queue.new # Will contain the links queue that the threads will use
       @links = [@start_url] # Re-init the links list
       @crawl_results = [] # Re-init the crawling results
@@ -38,12 +37,12 @@ module SuperCrawler
       process_page( @start_url )
 
       # Create threads to handle new links
-
+      threads_count.times do # Create threads_count threads
 
-        threads << Thread.new do #
+        threads << Thread.new do # Instantiate a new thread
           begin
-            while current_link = @links_queue.pop(true) #
-              process_page( current_link ) # Get links and assets
+            while current_link = @links_queue.pop(true) # Pop one link after another
+              process_page( current_link ) # Get links and assets of the popped link
             end
           rescue ThreadError # Stop when empty links queue
           end
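The hunk above is the heart of the 0.2.0 threading change: `start` now takes the thread count, and each worker drains a shared `Queue` with a non-blocking `pop(true)`, exiting via `ThreadError` once the queue is empty. A standalone sketch of that pattern (illustrative, not the gem's code):

```ruby
require 'thread'

queue = Queue.new
%w[/a /b /c /d].each { |path| queue.push(path) }

workers = 4.times.map do
  Thread.new do
    begin
      while (item = queue.pop(true)) # true = non-blocking pop
        puts "processing #{item}"    # stand-in for process_page
      end
    rescue ThreadError               # raised once the queue is empty
    end
  end
end

workers.each(&:join) # wait for every worker to finish
```

Note the tradeoff of a non-blocking pop: a worker exits as soon as the queue is momentarily empty, which is why the crawler processes the start page first to seed the queue before any thread spins up.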
@@ -52,7 +51,7 @@ module SuperCrawler
       end
 
       threads.map(&:join) # Wait for all threads to finish
-      crawling_summary_notice(start_time, Time.now) if @option_debug # Display crawling summary
+      crawling_summary_notice(start_time, Time.now, threads_count) if @option_debug # Display crawling summary
 
       return true
     end
@@ -64,7 +63,7 @@ module SuperCrawler
     #
     def render max_pages = 10
       draw_line
-      puts "Showing first #{
+      puts "Showing first #{max_pages} crawled pages and their contents:\n\n"
       @crawl_results[0..(max_pages-1)].each_with_index do |result, index|
         puts "[#{index+1}] Content of #{result[:url]}\n"
 
@@ -90,13 +89,13 @@ module SuperCrawler
     # Process a page by extracting information and updating links queue, links list and results.
     #
     def process_page page_url
-      page = SuperCrawler::CrawlPage.new(page_url)
+      page = SuperCrawler::Scrap.new(page_url) # Scrape the current page
 
       current_page_links = page.get_links # Get current page internal links
       new_links = current_page_links - @links # Select new links
 
       new_links.each { |link| @links_queue.push(link) } # Add new links to the queue
-      @links += new_links # Add new links to the
+      @links += new_links # Add new links to the links list
       @crawl_results << { # Provide current page crawl result as a hash
         url: page.url, # The crawled page
         links: current_page_links, # Its internal links
@@ -109,11 +108,11 @@ module SuperCrawler
     ###
     # Display a notice when starting a site crawl
     #
-    def crawling_start_notice
+    def crawling_start_notice start_url, threads
       draw_line
-      puts "Start crawling #{
+      puts "Start crawling #{start_url} using #{threads} threads. Crawling rules:"
       puts "1. Keep only internal links"
-      puts "2.
+      puts "2. Links with a different scheme are ignored"
       puts "3. Remove the fragment part from the links (#...)"
       puts "4. Keep paths with different parameters (?...)"
       draw_line
@@ -132,11 +131,11 @@ module SuperCrawler
     ###
     # Display the final crawling summary after the site crawl completes
     #
-    def crawling_summary_notice time_start, time_end
+    def crawling_summary_notice time_start, time_end, threads
       total_time = time_end - time_start
       puts ""
       draw_line
-      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{
+      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{threads} threads."
       puts "Use .crawl_results to access the crawl results as an array of hashes."
       puts "Use .render to see the crawl_results as a sitemap."
       draw_line
data/lib/super_crawler/{crawl_page.rb → scrap.rb}
CHANGED
@@ -1,20 +1,19 @@
 require "open-uri"
-require "open_uri_redirections"
 require "nokogiri"
 
 module SuperCrawler
 
   ###
-  #
+  # Scrape a single HTML page
   # Responsible for extracting all relevant information within a page
+  # (internal links and assets)
   #
-  class CrawlPage
+  class Scrap
 
     attr_reader :url
 
     def initialize url
-      # Normalize the URL, by adding http
-      # NOTA: By default, add http:// scheme to an URL that doesn't have one
+      # Normalize the URL by adding a scheme (http) if not already present
       @url = URI.encode( !!(url =~ /^(http(s)?:\/\/)/) ? url : ('http://' + url) )
     end
 
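The initializer's normalization rule is easy to verify in isolation; a sketch mirroring its regexp check (illustrative, not the gem's code):

```ruby
# Prepend "http://" when the URL has no scheme, as the initializer above does.
def normalize(url)
  url =~ %r{^https?://} ? url : "http://#{url}"
end

normalize('example.com/about')   # => "http://example.com/about"
normalize('https://example.com') # => "https://example.com"
```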
@@ -28,7 +27,7 @@ module SuperCrawler
       links = get_doc.css('a').map{ |link| link['href'] }.compact
 
       # Select only internal links (relative links, or absolute links with the same host)
-      links.select!{ |link| URI.parse(URI.encode link).host.nil? ||
+      links.select!{ |link| URI.parse(URI.encode link).host.nil? || link.start_with?( @url ) }
 
       # Reject badly matched links (like mailto, tel and javascript)
       links.reject!{ |link| !!(link =~ /^(mailto:|tel:|javascript:)/) }
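Worked through on a small input, the two filters above behave like this (a sketch with made-up links, using the same predicates):

```ruby
require 'uri'

site  = 'http://example.com'
links = ['/about', 'http://example.com/blog', 'http://other.com/x', 'mailto:hi@example.com']

# Keep relative links (no host) and absolute links on the site itself.
internal = links.select { |link| URI.parse(link).host.nil? || link.start_with?(site) }
# Drop special schemes that slipped through as "relative" links.
internal.reject! { |link| link =~ /^(mailto:|tel:|javascript:)/ }

internal # => ["/about", "http://example.com/blog"]
```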
@@ -97,9 +96,9 @@ module SuperCrawler
     #
     def get_assets
       {
-        'images'
-        'stylesheets'
-        'scripts'
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -109,10 +108,10 @@ module SuperCrawler
     #
     def get_all
       {
-        'links'
-        'images'
-        'stylesheets'
-        'scripts'
+        :'links' => get_links,
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -131,28 +130,18 @@ module SuperCrawler
     #
     def get_doc
       begin
-        @doc ||= Nokogiri(open( @url
+        @doc ||= Nokogiri(open( @url ))
       rescue Exception => e
         raise "Problem with URL #{@url}: #{e}"
       end
     end
 
-    ###
-    # Extract the base URL (scheme and host only)
-    #
-    # eg:
-    # http://mysite.com/abc -> http://mysite.com
-    # https://dev.mysite.co.uk/mylink -> https://dev.mysite.co.uk
-    def base_url
-      "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}"
-    end
-
     ###
     # Given a URL, return the absolute URL
     #
     def create_absolute_url url
       # Append the base URL (scheme+host) if the provided URL is relative
-      URI.parse(URI.encode url).host.nil? ? (
+      URI.parse(URI.encode url).host.nil? ? "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}#{url}" : url
     end
 
   end
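`create_absolute_url` now inlines what the removed `base_url` helper used to compute. A sketch of the same rule, with hypothetical names:

```ruby
require 'uri'

# Relative paths get the page's scheme and host prepended; absolute URLs pass through.
def absolute_url(page_url, link)
  URI.parse(link).host.nil? ? "#{page_url.scheme}://#{page_url.host}#{link}" : link
end

page = URI.parse('http://example.com/blog/post')
absolute_url(page, '/assets/app.js')           # => "http://example.com/assets/app.js"
absolute_url(page, 'http://cdn.example.com/x') # => "http://cdn.example.com/x"
```

Note that plain concatenation only resolves root-relative paths like `/x`; `URI.join(page, link).to_s` would also handle document-relative links like `img/a.png`.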
data/lib/super_crawler.rb
CHANGED
data/super_crawler.gemspec
CHANGED
@@ -28,7 +28,6 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
 
   spec.add_dependency "nokogiri", "~> 1"
-  spec.add_dependency "open_uri_redirections", "~> 0.2"
   spec.add_dependency "thread", "~> 0.2"
 
   spec.add_development_dependency "bundler", "~> 1.10"
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: super_crawler
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Hassen Taidirt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-07-
+date: 2016-07-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -24,20 +24,6 @@ dependencies:
   - - "~>"
     - !ruby/object:Gem::Version
       version: '1'
-- !ruby/object:Gem::Dependency
-  name: open_uri_redirections
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
 - !ruby/object:Gem::Dependency
   name: thread
   requirement: !ruby/object:Gem::Requirement
@@ -113,8 +99,8 @@ files:
 - bin/console
 - bin/setup
 - lib/super_crawler.rb
-- lib/super_crawler/crawl_page.rb
-- lib/super_crawler/crawl_site.rb
+- lib/super_crawler/crawl.rb
+- lib/super_crawler/scrap.rb
 - lib/super_crawler/version.rb
 - super_crawler.gemspec
 homepage: https://github.com/htaidirt/super_crawler