rubyretriever 0.0.11 → 0.0.12
- checksums.yaml +4 -4
- data/lib/retriever.rb +0 -1
- data/lib/retriever/version.rb +1 -1
- data/readme.md +8 -22
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: efc429906131b363741d6560e37cb095f905b48e
+data.tar.gz: 85f320d55600f007315941b6c3213c8f04b70515
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 1cdeb51c607ee23b662128ae7b1071085314c9c04626fdfaf708ef9be7224e1bd83652e9bffb64175da480f7830af223a6e8a2a846cb429af3a4c58a71472941
+data.tar.gz: 437ee738e18d69600897512e0dd047da23166b2c59ad5f70ae8336532ecfa73e85399e9092b5d7b11895ddc23cffd46dd9465c2c6859bbf0890c80a32e15218b
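The digests recorded in checksums.yaml can be recomputed locally for comparison. The following is a minimal sketch using Ruby's standard `digest` library; the helper names `gem_checksums` and `checksums_match?` are hypothetical and are not part of rubyretriever or RubyGems.

```ruby
require "digest"

# Compute the SHA1 and SHA512 hex digests of a file's bytes, the two
# algorithms listed in checksums.yaml.
def gem_checksums(path)
  bytes = File.binread(path)
  {
    "SHA1"   => Digest::SHA1.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: true when every expected digest matches the file.
# `expected` maps an algorithm name ("SHA1", "SHA512") to a hex string.
def checksums_match?(path, expected)
  actual = gem_checksums(path)
  expected.all? { |algo, hex| actual[algo] == hex }
end
```

Assuming the .gem archive has been unpacked so that `metadata.gz` sits in the current directory, `checksums_match?("metadata.gz", "SHA1" => "efc429906131b363741d6560e37cb095f905b48e")` would compare the file against the SHA1 value added in this release.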
data/lib/retriever.rb
CHANGED
data/lib/retriever/version.rb
CHANGED
data/readme.md
CHANGED
@@ -1,37 +1,23 @@
-RubyRetriever [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
+[RubyRetriever] (http://www.softwarebyjoe.com/rubyretriever/) [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
 ==============
 
-
-```sh
-gem install rubyretriever
-```
-Update (5/26):
-Version 0.0.10 - fixes a bug that wouldn't allow sitemaps to write out to file correctly.
-
-Update (5/25):
-Version 0.0.6 - Switches to using a Bloom Filter to keep track of past 'visited pages'. I saw this in [Arachnid] (https://github.com/dchuk/Arachnid) and realized it's a much better idea for performance and implemented it immediately. Hat tip [dchuk] (https://github.com/dchuk/)
-
-About
-=====
+By Joe Norton
 
 RubyRetriever is a Web Crawler, Site Mapper, File Harvester & Autodownloader, and all around nice buddy to have around.
-Soon to add some high level scraping options.
 
 RubyRetriever uses aynchronous HTTP requests, thanks to eventmachine and Synchrony fibers, to crawl webpages *very quickly*.
 
-
-
-RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
+RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
 
 
-
+getting started
 -----------
+Install the gem
 ```sh
-gem install rubyretriever
-rr [MODE] [OPTIONS] Target_URL
+gem install rubyretriever
 ```
 
-**
+**Example: Sitemap mode**
 ```sh
 rr --sitemap --progress --limit 1000 --output cnet http://www.cnet.com
 ```
@@ -42,7 +28,7 @@ rr -s -p -l 1000 -o cnet http://www.cnet.com
 
 This would go to http://www.cnet.com and map it until it crawled a max of 1,000 pages, and then it would write it out to a csv named cnet.
 
-**File Harvesting**
+**Example: File Harvesting mode**
 ```sh
 rr --files --ext pdf --progress --limit 1000 --output hubspot http://www.hubspot.com
 ```
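The readme text in this diff says the crawler launches up to 10 parallel GET requests by default. As a rough illustration of capping concurrency, here is a sketch built on plain Ruby threads and a `Queue`. It is not the gem's implementation, which the readme describes as using eventmachine and Synchrony fibers; the method name `each_in_parallel` and its `limit:` parameter are hypothetical.

```ruby
require "thread"

# Run the block over every item using at most `limit` worker threads,
# and return a hash mapping each item to the block's result.
# This is a thread-pool stand-in, not the fiber-based approach the gem uses.
def each_in_parallel(items, limit: 10, &work)
  pending = Queue.new
  items.each { |item| pending << item }
  results = Queue.new

  workers = Array.new([limit, items.size].min) do
    Thread.new do
      loop do
        item = begin
          pending.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break             # no work left, so this worker exits
        end
        results << [item, work.call(item)]
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```

In a crawler the block would issue the HTTP request for each URL; any callable works, which keeps the sketch independent of a particular HTTP client.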
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rubyretriever
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.12
 platform: ruby
 authors:
 - Joe Norton
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-05-
+date: 2014-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: em-synchrony
@@ -126,7 +126,7 @@ files:
 - readme.md
 - spec/retriever_spec.rb
 - spec/spec_helper.rb
-homepage: http://
+homepage: http://www.softwarebyjoe.com/rubyretriever/
 licenses:
 - MIT
 metadata: {}