super_crawler 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +67 -98
- data/lib/super_crawler/{crawl_site.rb → crawl.rb} +20 -21
- data/lib/super_crawler/{crawl_page.rb → scrap.rb} +14 -25
- data/lib/super_crawler/version.rb +1 -1
- data/lib/super_crawler.rb +2 -2
- data/super_crawler.gemspec +0 -1
- metadata +4 -18
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ffb65575970b9bd45f3ac9fa80b004c96df3492a
+  data.tar.gz: 1a72997367a389dcfb67dcd913554207604ef3e2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 67dd33ed9ee8965a84cdcc22187212ea97c47c6a29bc868818b31246c827e342c486d0b12270d52eac1c1c8071cab51fef4dd441470abd1702905e84c814c31c
+  data.tar.gz: bd45255d527837c177b788f81412a82e0609a015a65e9d9423841e80bef362b62f854c55e1e1637cf8fc45bbf19d256bac4a66da9e804b8850128e0ca6f8d9a8
data/README.md
CHANGED
@@ -6,23 +6,16 @@ Easy (yet efficient) ruby gem to crawl your favorite website.
 
 Open your terminal, then:
 
-
-
+git clone https://github.com/htaidirt/super_crawler
+cd super_crawler
+bundle
+./bin/console
 
-
+Then
 
-
-
-
-```
-
-```ruby
-> sc = SuperCrawler::CrawlSite.new('https://gocardless.com')
-
-> sc.start # => Start crawling the website
-
-> sc.render(5) # => Show first 5 results of the crawling as sitemap
-```
+sc = SuperCrawler::Crawl.new('https://gocardless.com')
+sc.start(10) # => Start crawling the website using 10 threads
+sc.render(5) # => Show the first 5 results of the crawling as sitemap
 
 ## Installation
 
@@ -52,64 +45,57 @@ This gem is an experiment and can't be used for production purposes. Please, use
 
 There are also a lot of limitations that weren't handled due to time. You'll find more information on the limitations below.
 
-SuperCrawler gem was only tested on MRI and
+SuperCrawler gem was only tested on MRI 2.3.1 and Rubinius 2.5.8.
 
 ## Philosophy
 
-Starting from a URL,
+Starting from a given URL, the crawler extracts all the internal links and assets within the page. The links are added to a list of unique links for further exploration. The crawler repeats the exploration, visiting all the links, until no new link is found.
 
-Due to the heavy operations, and the time to access each page content, we will use threads to perform near-parallel processing.
+Due to the heavy operations (thousands of pages) and the network time needed to access each page's content, we use threads to perform near-parallel processing.
 
-In order to keep the code readable and structured,
+In order to keep the code readable and structured, we created two classes:
 
-- `SuperCrawler::
-- `SuperCrawler::
+- `SuperCrawler::Scrap` is responsible for scraping a single page and extracting all relevant information (internal links and assets)
+- `SuperCrawler::Crawl` is responsible for crawling a whole website by collecting and managing links (using `SuperCrawler::Scrap` on every internal link found). This class is also responsible for rendering results.
 
 ## More detailed use
 
 Open your favorite ruby console and require the gem:
 
-
-require 'super_crawler'
-```
+require 'super_crawler'
 
-###
+### Scraping a single web page
 
 Read the following if you would like to crawl a single web page and extract relevant information (internal links and assets).
 
-
-page = SuperCrawler::CrawlPage.new( url )
-```
+page = SuperCrawler::Scrap.new( url )
 
-Where `url` should be the URL of the page you would like to
+Where `url` should be the URL of the page you would like to scrape.
 
-**Nota:**
+**Nota:** If the given URL has a missing scheme (`http://` or `https://`), SuperCrawler will prepend `http://` to the URL.
 
 #### Get the encoded URL
 
 Run
 
-
-
-
-
-to get the encoded URL provided.
+page.url
+
+to get the encoded URL.
 
 #### Get internal links of a page
 
 Run
 
-
-
-
-
-to get a list of internal links within the crawled page. An internal link is a link that _has the same host than the page URL_. Subdomains are rejected.
+page.get_links
+
+to get the list of internal links in the page. An internal link is a link that _has the same scheme and host as the provided URL_. Subdomains are rejected.
 
 This method searches in the `href` attribute of all `<a>` anchor tags.
 
-**Nota:**
+**Nota:**
 
-
+- This method returns an array of absolute URLs (all internal links).
+- Bad links and special links (like mailto and javascript) are discarded.
 
 #### Get images of a page
 
@@ -129,92 +115,75 @@ to get a list of images links within the page. The images links are extracted fr
 
 Run
 
-
-page.get_stylesheets
-```
+page.get_stylesheets
 
-to get a list of
+to get a list of stylesheet links within the page. The links are extracted from the `href="..."` attribute of all `<link rel="stylesheet">` tags.
 
-**Nota:**
+**Nota:**
 
-
+- Inline styling isn't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### Get scripts of a page
 
 Run
 
-
-page.get_scripts
-```
+page.get_scripts
 
-to get a list of
+to get a list of script links within the page. The links are extracted from the `src="..."` attribute of all `<script>` tags.
 
-**Nota:**
+**Nota:**
 
-
+- Inline script isn't yet detected by the method.
+- This method returns an array of absolute URLs.
 
 #### List all assets of a page
 
 Run
 
-
-page.get_assets
-```
+page.get_assets
 
-to get a list of all assets (images, stylesheets and scripts
+to get a list of all assets (links of images, stylesheets and scripts) as a hash of arrays.
 
 ### Crawling a whole web site
 
-
-
-```ruby
-sc = SuperCrawler::CrawlSite.new(url, count_threads)
-```
+sc = SuperCrawler::Crawl.new(url)
 
-where `url` is the URL of the
+where `url` is the URL of the website to crawl.
 
 Next, start the crawler:
 
-
-
-
+sc.start(number_of_threads)
+
+where `number_of_threads` is the number of threads that will perform the job (10 by default). **This can take some time, depending on the site to crawl.**
 
-
+To access the crawl results, use the following:
 
-
-
-```ruby
-sc.links # The array of internal links
-
-sc.crawl_results # Array of hashes containing links and assets for every link crawled
-```
+sc.links # The array of unique internal links
+sc.crawl_results # Array of hashes containing links and assets for every unique internal link found
 
 To see the crawling as a sitemap, use:
 
-
-sc.render(5) # Will render the sitemap of the first 5 pages
-```
+sc.render(5) # Will render the sitemap of the first 5 pages
 
-
+_TODO: Create a separate and more sophisticated rendering class that can render to files in different formats (HTML, XML, JSON, ...)_
 
 #### Tips on searching assets and links
 
 After `sc.start`, you can access all collected resources (links and assets) using `sc.crawl_results`. This has the following structure:
 
-
-
-
-
-
-
-
-
-
-
-
-
-]
-```
+[
+  {
+    url: 'http://example.com/',
+    links: [...array of internal links...],
+    assets: {
+      images: [...array of images links],
+      stylesheets: [...array of stylesheets links],
+      scripts: [...array of scripts links],
+    }
+  },
+  ...
+]
 
 You can use `sc.crawl_results.select{ |resource| ... }` to select a particular resource.
 
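For example, here is a minimal sketch of filtering those results. The start URL and the exact normalized URL stored in `crawl_results` are illustrative assumptions; only the hash keys (`:url`, `:links`, `:assets`) follow the structure shown above.

```ruby
require 'super_crawler'

sc = SuperCrawler::Crawl.new('https://gocardless.com')
sc.start(10)

# Pick the result of one crawled page (the exact URL string is whatever
# normalized form the crawler stored; adjust it to your own run).
home = sc.crawl_results.find { |resource| resource[:url] == 'https://gocardless.com/' }

if home
  puts home[:links].count            # number of internal links found on that page
  puts home[:assets][:images].first  # first image URL collected on that page
end
```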
@@ -223,12 +192,12 @@ You can use `sc.crawl_results.select{ |resource| ... }` to select a particular r
 
 Currently, the gem has the following limitations:
 
 - Subdomains are not considered as internal links
--
+- A link with the same domain but different scheme is ignored (http -> https, or the opposite)
 - Only links within `<a href="...">` tags are extracted
 - Only images links within `<img src="..."/>` tags are extracted
 - Only stylesheets links within `<link rel="stylesheet" href="..." />` tags are extracted
 - Only scripts links within `<script src="...">` tags are extracted
-- A page that is not accessible (
+- A page that is not accessible (status other than 200) is not checked later
 
 ## Development
 
@@ -238,11 +207,11 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/htaidirt/super_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+Bug reports and pull requests are welcome on GitHub at [https://github.com/htaidirt/super_crawler](https://github.com/htaidirt/super_crawler). This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
-
+Please, follow this process:
 
-1. Fork
+1. Fork the project
 2. Create your feature branch (git checkout -b my-new-feature)
 3. Commit your changes (git commit -am 'Add some feature')
 4. Push to the branch (git push origin my-new-feature)
data/lib/super_crawler/{crawl_site.rb → crawl.rb}
CHANGED
@@ -1,33 +1,32 @@
 require 'thread'
 
-require 'super_crawler/
+require 'super_crawler/scrap'
 
 module SuperCrawler
 
   ###
   # Crawl a whole website
   #
-  class
+  class Crawl
 
     attr_reader :links, :crawl_results
 
-    def initialize start_url,
+    def initialize start_url, options = {}
       @start_url = URI(URI.encode start_url).normalize().to_s # Normalize the given URL
       @links = [@start_url] # Will contain the list of all links found
       @crawl_results = [] # Will contain the crawl results (links and assets), as array of hashes
-      @threads = threads # How many threads to use? Default: 10
 
       @option_debug = options[:debug].nil? ? true : !!(options[:debug]) # Debug by default
     end
 
     ###
     # Start crawling site
-    # Could take a while
+    # Could take a while! Use threads to speed up crawling and log to inform user.
     #
-    def start
+    def start threads_count = 10
 
-      crawling_start_notice # Show message on what will happen
-      threads = [] # Will contain our threads
+      crawling_start_notice( @start_url, threads_count ) # Show message on what will happen
+      threads = [] # Will contain our n-threads
       @links_queue = Queue.new # Will contain the links queue that the threads will use
       @links = [@start_url] # Re-init the links list
       @crawl_results = [] # Re-init the crawling results
@@ -38,12 +37,12 @@ module SuperCrawler
       process_page( @start_url )
 
       # Create threads to handle new links
-
+      threads_count.times do # Create threads_count threads
 
-        threads << Thread.new do #
+        threads << Thread.new do # Instantiate a new thread
           begin
-            while current_link = @links_queue.pop(true) #
-              process_page( current_link ) # Get links and assets
+            while current_link = @links_queue.pop(true) # Pop one link after another
+              process_page( current_link ) # Get links and assets of the popped link
             end
           rescue ThreadError # Stop when empty links queue
           end
@@ -52,7 +51,7 @@ module SuperCrawler
       end
 
       threads.map(&:join) # Activate the threads
-      crawling_summary_notice(start_time, Time.now) if @option_debug # Display crawling summary
+      crawling_summary_notice(start_time, Time.now, threads_count) if @option_debug # Display crawling summary
 
       return true
     end
@@ -64,7 +63,7 @@ module SuperCrawler
     #
     def render max_pages = 10
       draw_line
-      puts "Showing first #{
+      puts "Showing first #{max_pages} crawled pages and their contents:\n\n"
       @crawl_results[0..(max_pages-1)].each_with_index do |result, index|
         puts "[#{index+1}] Content of #{result[:url]}\n"
 
@@ -90,13 +89,13 @@ module SuperCrawler
     # Process a page by extracting information and updating links queue, links list and results.
     #
     def process_page page_url
-      page = SuperCrawler::
+      page = SuperCrawler::Scrap.new(page_url) # Scrap the current page
 
       current_page_links = page.get_links # Get current page internal links
       new_links = current_page_links - @links # Select new links
 
       new_links.each { |link| @links_queue.push(link) } # Add new links to the queue
-      @links += new_links # Add new links to the
+      @links += new_links # Add new links to the links list
       @crawl_results << { # Provide current page crawl result as a hash
         url: page.url, # The crawled page
         links: current_page_links, # Its internal links
@@ -109,11 +108,11 @@ module SuperCrawler
     ###
     # Display a notice when starting a site crawl
     #
-    def crawling_start_notice
+    def crawling_start_notice start_url, threads
       draw_line
-      puts "Start crawling #{
+      puts "Start crawling #{start_url} using #{threads} threads. Crawling rules:"
       puts "1. Keep only internal links"
-      puts "2.
+      puts "2. Links with different scheme are ignored"
       puts "3. Remove the fragment part from the links (#...)"
       puts "4. Keep paths with different parameters (?...)"
       draw_line
@@ -132,11 +131,11 @@ module SuperCrawler
     ###
     # Display final crawling summary after site crawling complete
     #
-    def crawling_summary_notice time_start, time_end
+    def crawling_summary_notice time_start, time_end, threads
      total_time = time_end - time_start
      puts ""
      draw_line
-      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{
+      puts "Crawled #{@links.count} links in #{total_time.to_f.to_s} seconds using #{threads} threads."
      puts "Use .crawl_results to access the crawl results as an array of hashes."
      puts "Use .render to see the crawl_results as a sitemap."
      draw_line
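The `start` method above seeds a links `Queue`, processes the start URL, then drains the queue with `threads_count` worker threads that exit on `ThreadError` once the queue is empty. A minimal standalone sketch of that same queue-and-workers pattern (generic string items rather than the gem's API; the `Mutex` around the shared array is an addition for safety in this sketch):

```ruby
require 'thread'

queue   = Queue.new
results = []
mutex   = Mutex.new

20.times { |i| queue.push("page-#{i}") }   # seed the queue with work items

workers = 4.times.map do
  Thread.new do
    begin
      while item = queue.pop(true)         # non-blocking pop
        mutex.synchronize { results << "processed #{item}" }
      end
    rescue ThreadError                     # raised when the queue is empty: stop this worker
    end
  end
end

workers.each(&:join)
puts "#{results.size} items processed by #{workers.size} threads"
```

The gem's version differs in that each worker can also push newly discovered links back onto the queue while draining it.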
data/lib/super_crawler/{crawl_page.rb → scrap.rb}
CHANGED
@@ -1,20 +1,19 @@
 require "open-uri"
-require "open_uri_redirections"
 require "nokogiri"
 
 module SuperCrawler
 
   ###
-  #
+  # Scrap a single HTML page
   # Responsible for extracting all relevant information within a page
+  # (internal links and assets)
   #
-  class
+  class Scrap
 
     attr_reader :url
 
     def initialize url
-      # Normalize the URL, by adding http
-      # NOTA: By default, add http:// scheme to an URL that doesn't have one
+      # Normalize the URL, by adding a scheme (http) if not present in the URL
       @url = URI.encode( !!(url =~ /^(http(s)?:\/\/)/) ? url : ('http://' + url) )
     end
 
@@ -28,7 +27,7 @@ module SuperCrawler
       links = get_doc.css('a').map{ |link| link['href'] }.compact
 
       # Select only internal links (relative links, or absolute links with the same host)
-      links.select!{ |link| URI.parse(URI.encode link).host.nil? ||
+      links.select!{ |link| URI.parse(URI.encode link).host.nil? || link.start_with?( @url ) }
 
       # Reject bad matches links (like mailto, tel and javascript)
       links.reject!{ |link| !!(link =~ /^(mailto:|tel:|javascript:)/) }
@@ -97,9 +96,9 @@ module SuperCrawler
     #
     def get_assets
       {
-        'images'
-        'stylesheets'
-        'scripts'
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -109,10 +108,10 @@ module SuperCrawler
     #
     def get_all
       {
-        'links'
-        'images'
-        'stylesheets'
-        'scripts'
+        :'links' => get_links,
+        :'images' => get_images,
+        :'stylesheets' => get_stylesheets,
+        :'scripts' => get_scripts
       }
     end
 
@@ -131,28 +130,18 @@ module SuperCrawler
     #
     def get_doc
       begin
-        @doc ||= Nokogiri(open( @url
+        @doc ||= Nokogiri(open( @url ))
       rescue Exception => e
         raise "Problem with URL #{@url}: #{e}"
       end
     end
 
-    ###
-    # Extract the base URL (scheme and host only)
-    #
-    # eg:
-    # http://mysite.com/abc -> http://mysite.com
-    # https://dev.mysite.co.uk/mylink -> https://dev.mysite.co.uk
-    def base_url
-      "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}"
-    end
-
     ###
     # Given a URL, return the absolute URL
     #
     def create_absolute_url url
       # Append the base URL (scheme+host) if the provided URL is relative
-      URI.parse(URI.encode url).host.nil? ? (
+      URI.parse(URI.encode url).host.nil? ? "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}#{url}" : url
     end
 
   end
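Putting the renamed class together, a minimal usage sketch (the domain is illustrative; the symbol keys follow the `get_assets`/`get_all` hashes above):

```ruby
require 'super_crawler'

page = SuperCrawler::Scrap.new('example.com')  # no scheme given: 'http://' gets prepended

page.url                   # => "http://example.com" (the encoded URL)
page.get_links             # => array of absolute internal links from <a href="...">
page.get_assets[:images]   # => array of image URLs from <img src="...">
page.get_all.keys          # => [:links, :images, :stylesheets, :scripts]
```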
data/lib/super_crawler.rb
CHANGED
data/super_crawler.gemspec
CHANGED
@@ -28,7 +28,6 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
 
   spec.add_dependency "nokogiri", "~> 1"
-  spec.add_dependency "open_uri_redirections", "~> 0.2"
   spec.add_dependency "thread", "~> 0.2"
 
   spec.add_development_dependency "bundler", "~> 1.10"
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: super_crawler
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Hassen Taidirt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-07-
+date: 2016-07-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -24,20 +24,6 @@ dependencies:
   - - "~>"
     - !ruby/object:Gem::Version
       version: '1'
-- !ruby/object:Gem::Dependency
-  name: open_uri_redirections
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.2'
 - !ruby/object:Gem::Dependency
   name: thread
   requirement: !ruby/object:Gem::Requirement
@@ -113,8 +99,8 @@ files:
 - bin/console
 - bin/setup
 - lib/super_crawler.rb
-- lib/super_crawler/
-- lib/super_crawler/
+- lib/super_crawler/crawl.rb
+- lib/super_crawler/scrap.rb
 - lib/super_crawler/version.rb
 - super_crawler.gemspec
 homepage: https://github.com/htaidirt/super_crawler