rubyretriever 0.0.11 → 0.0.12
- checksums.yaml +4 -4
- data/lib/retriever.rb +0 -1
- data/lib/retriever/version.rb +1 -1
- data/readme.md +8 -22
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: efc429906131b363741d6560e37cb095f905b48e
+  data.tar.gz: 85f320d55600f007315941b6c3213c8f04b70515
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1cdeb51c607ee23b662128ae7b1071085314c9c04626fdfaf708ef9be7224e1bd83652e9bffb64175da480f7830af223a6e8a2a846cb429af3a4c58a71472941
+  data.tar.gz: 437ee738e18d69600897512e0dd047da23166b2c59ad5f70ae8336532ecfa73e85399e9092b5d7b11895ddc23cffd46dd9465c2c6859bbf0890c80a32e15218b
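The SHA1 and SHA512 entries in checksums.yaml can be compared against the digests of the corresponding archive members. A minimal Ruby sketch, assuming the member bytes have already been read into memory: `checksum_mismatches` is a hypothetical helper name, and the sample payload stands in for a real `data.tar.gz`, so the digests here are computed from the sample rather than taken from this release.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: returns the names of the digest algorithms whose
# recorded value in a parsed checksums.yaml does not match the given bytes.
def checksum_mismatches(checksums, name, bytes)
  actual = {
    "SHA1"   => Digest::SHA1.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes),
  }
  actual.reject { |algo, hex| checksums.dig(algo, name).to_s == hex }.keys
end

# Stand-in payload (assumption): a real check would read data.tar.gz instead.
sample = "example payload"

# Same layout as the checksums.yaml shown in the diff, filled with the
# sample's own digests so the example is self-contained.
checksums = YAML.safe_load(<<~YAML)
  ---
  SHA1:
    data.tar.gz: #{Digest::SHA1.hexdigest(sample)}
  SHA512:
    data.tar.gz: #{Digest::SHA512.hexdigest(sample)}
YAML

p checksum_mismatches(checksums, "data.tar.gz", sample)        # no mismatches
p checksum_mismatches(checksums, "data.tar.gz", sample + "x")  # both digests differ
```

An empty result means both recorded digests match; any altered byte in the payload makes both algorithms report a mismatch.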
data/lib/retriever.rb
CHANGED
data/lib/retriever/version.rb
CHANGED
data/readme.md
CHANGED
@@ -1,37 +1,23 @@
-RubyRetriever [](http://badge.fury.io/rb/rubyretriever)
+[RubyRetriever] (http://www.softwarebyjoe.com/rubyretriever/) [](http://badge.fury.io/rb/rubyretriever)
 ==============
 
-
-```sh
-gem install rubyretriever
-```
-Update (5/26):
-Version 0.0.10 - fixes a bug that wouldn't allow sitemaps to write out to file correctly.
-
-Update (5/25):
-Version 0.0.6 - Switches to using a Bloom Filter to keep track of past 'visited pages'. I saw this in [Arachnid] (https://github.com/dchuk/Arachnid) and realized it's a much better idea for performance and implemented it immediately. Hat tip [dchuk] (https://github.com/dchuk/)
-
-About
-=====
+By Joe Norton
 
 RubyRetriever is a Web Crawler, Site Mapper, File Harvester & Autodownloader, and all around nice buddy to have around.
-Soon to add some high level scraping options.
 
 RubyRetriever uses aynchronous HTTP requests, thanks to eventmachine and Synchrony fibers, to crawl webpages *very quickly*.
 
-
-
-RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
+RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
 
 
-
+getting started
 -----------
+Install the gem
 ```sh
-gem install rubyretriever
-rr [MODE] [OPTIONS] Target_URL
+gem install rubyretriever
 ```
 
-**
+**Example: Sitemap mode**
 ```sh
 rr --sitemap --progress --limit 1000 --output cnet http://www.cnet.com
 ```
@@ -42,7 +28,7 @@ rr -s -p -l 1000 -o cnet http://www.cnet.com
 
 This would go to http://www.cnet.com and map it until it crawled a max of 1,000 pages, and then it would write it out to a csv named cnet.
 
-**File Harvesting**
+**Example: File Harvesting mode**
 ```sh
 rr --files --ext pdf --progress --limit 1000 --output hubspot http://www.hubspot.com
 ```
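The `--limit` behavior described in the readme (stop once a maximum number of pages has been crawled) can be illustrated with a small breadth-first crawl. This is a sketch under stated assumptions, not RubyRetriever's implementation: `LINKS` is a hypothetical in-memory link graph that replaces HTTP requests, `crawl` is an illustrative name, and a plain `Set` tracks visited pages where the readme's earlier notes mention a Bloom filter.

```ruby
require "set"

# Hypothetical link graph standing in for pages fetched over HTTP.
LINKS = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => ["/d"],
  "/d" => [],
}.freeze

# Breadth-first crawl that stops once `limit` pages have been visited.
def crawl(start, limit:, links: LINKS)
  visited = Set.new # exact membership; a Bloom filter would trade accuracy for memory
  queue = [start]
  until queue.empty? || visited.size >= limit
    page = queue.shift
    next unless visited.add?(page) # add? returns nil if the page was already seen
    queue.concat(links.fetch(page, []))
  end
  visited.to_a
end

p crawl("/", limit: 3)  # => ["/", "/a", "/b"]
p crawl("/", limit: 10) # => ["/", "/a", "/b", "/c", "/d"]
```

The limit is checked before each dequeue, so the crawl never visits more than `limit` pages even when the queue still holds undiscovered links.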
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rubyretriever
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.12
 platform: ruby
 authors:
 - Joe Norton
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-05-
+date: 2014-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: em-synchrony
@@ -126,7 +126,7 @@ files:
 - readme.md
 - spec/retriever_spec.rb
 - spec/spec_helper.rb
-homepage: http://
+homepage: http://www.softwarebyjoe.com/rubyretriever/
 licenses:
 - MIT
 metadata: {}