rubyretriever 1.2.2 → 1.2.3
- checksums.yaml +4 -4
- data/lib/retriever/fetch.rb +1 -1
- data/lib/retriever/fetchfiles.rb +1 -1
- data/lib/retriever/version.rb +1 -1
- data/readme.md +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 3bb32aa2e9c8317d2f3cb13572e2cdecb1da24a9
+ data.tar.gz: 732e5610104345efed80651929cb9a050e01d9be
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 3d4e109785452db3906dc7b66158846cda24e4c3e1b942f600918338e141d6a337f1f9b3087b94b2561c64095fcdc2f2fb439d29b73574a2ddae501a8f0d965b
+ data.tar.gz: 2e0befea22dfc2bc689d15ad3c33efaf015f7b1ee5c53a322cccca7f6394a4def445e362585600e1934f770377ce193c5a762caa3639ccda480f7c481ce64d64
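The checksums.yaml entries are SHA1 and SHA512 hex digests of the gem's metadata.gz and data.tar.gz parts. A minimal sketch of how such digests can be computed with Ruby's standard `digest` library; the helper name `gem_part_digests` and the sample bytes are hypothetical, used only for illustration.

```ruby
require 'digest'

# Hypothetical helper: returns the two digest kinds that checksums.yaml records.
def gem_part_digests(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Stand-in input. For a real check one would pass the raw bytes of
# metadata.gz or data.tar.gz, for example via File.binread(path).
digests = gem_part_digests('example bytes')

puts digests['SHA1'].length   # SHA1 hex digests are 40 characters long
puts digests['SHA512'].length # SHA512 hex digests are 128 characters long
```

Comparing the computed strings against the values listed in the hunk is then a plain string equality check.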
data/lib/retriever/fetch.rb
CHANGED
@@ -91,7 +91,7 @@ module Retriever
  @sitemap = options['sitemap']
  @seo = options['seo']
  @autodown = options['autodown']
- @file_re = Regexp.new(
+ @file_re = Regexp.new(/.#{@fileharvest}\z/).freeze if @fileharvest
  end

  def setup_bloom_filter
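The added line builds `@file_re` from the `@fileharvest` option only when that option is set. A minimal sketch of what a pattern of that shape matches, using a hypothetical extension value and made-up links. Note that the leading `.` in the pattern is a regex wildcard, not a literal dot. The stricter variant below is an assumption added for comparison and is not part of the gem.

```ruby
# Hypothetical stand-in for the @fileharvest option (a file extension).
fileharvest = 'pdf'

# Same pattern shape as the added line: any one character,
# then the extension, then end of string.
file_re = Regexp.new(/.#{fileharvest}\z/).freeze if fileharvest

# Stricter variant (assumption, not in the gem): require a literal dot
# and escape the extension.
strict_re = /\.#{Regexp.escape(fileharvest)}\z/

# Made-up links for illustration.
links = [
  'http://example.com/a.pdf',
  'http://example.com/a.pdf?x=1',
  'http://example.com/notapdf'
]

matched = links.select { |link| file_re =~ link }
strict  = links.select { |link| strict_re =~ link }

puts matched.inspect
puts strict.inspect
```

In this sketch the wildcard form also accepts a link ending in `apdf`, and both forms skip links that carry a query string after the extension.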
data/lib/retriever/fetchfiles.rb
CHANGED
@@ -6,7 +6,7 @@ module Retriever
  def initialize(url, options)
  super
  temp_file_collection = @page_one.parse_files(@page_one.parse_internal)
- @data.concat(
+ @data.concat(temp_file_collection) if temp_file_collection.size > 0
  lg("#{@data.size} new files found")

  async_crawl_and_collect
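The added line appends the collected files to `@data` only when the collection is non-empty. A small sketch of how `Array#concat` behaves with and without such a guard, using made-up values in place of the crawler's data. Concatenating an empty array already leaves the receiver unchanged, so the guard skips a no-op call. It does not protect against a `nil` collection, since `nil.size` would raise.

```ruby
# Made-up stand-ins for @data and temp_file_collection.
data = ['http://example.com/a.pdf']
empty_collection = []
found_collection = ['http://example.com/b.pdf']

# Same guard shape as the added line.
data.concat(empty_collection) if empty_collection.size > 0
size_after_empty = data.size

data.concat(found_collection) if found_collection.size > 0
size_after_found = data.size

# Without the guard, concatenating an empty array changes nothing either.
unguarded = ['x'].concat([])

puts size_after_empty
puts size_after_found
puts unguarded.inspect
```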
data/lib/retriever/version.rb
CHANGED
data/readme.md
CHANGED
@@ -6,7 +6,7 @@ By Joe Norton

  RubyRetriever is a Web Crawler, Site Mapper, File Harvester & Autodownloader.

- RubyRetriever (RR) uses asynchronous HTTP requests, thanks to [Eventmachine](https://github.com/eventmachine/eventmachine) & [Synchrony](https://github.com/igrigorik/em-synchrony), to crawl webpages *very quickly*. Another neat thing about RR, is it uses a ruby implementation of the [bloomfilter](https://github.com/igrigorik/bloomfilter-rb) in order to keep track of
+ RubyRetriever (RR) uses asynchronous HTTP requests, thanks to [Eventmachine](https://github.com/eventmachine/eventmachine) & [Synchrony](https://github.com/igrigorik/em-synchrony), to crawl webpages *very quickly*. Another neat thing about RR, is it uses a ruby implementation of the [bloomfilter](https://github.com/igrigorik/bloomfilter-rb) in order to keep track of pages it has already crawled.

  **v1.0 Update (6/07/2014)** - Includes major code changes, a lot of bug fixes. Much better in dealing with redirects, and issues with the host changing, etc. Also, added the SEO mode -- which grabs a number of key SEO components from every page on a site. Lastly, this update was so extensive that I could not ensure backward compatibility -- and thus, this was update 1.0!
  mission