rubyretriever 0.0.11 → 0.0.12

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: ffb93b0faa77d73f014f67be6dbb6320233a5497
- data.tar.gz: 920547b074b92a01b164e2f27130010773a55e0b
+ metadata.gz: efc429906131b363741d6560e37cb095f905b48e
+ data.tar.gz: 85f320d55600f007315941b6c3213c8f04b70515
  SHA512:
- metadata.gz: b3c36ff313a381ec3d1950abf1c148faed90aa99a0658741ab4533f15d6b6afd2e6dc95caa0be5afd231125099126c24aeee36b3a873c33d8a81c6f42dace510
- data.tar.gz: 31bb5aa05f6354f083fae15b3351059d28073c449125081a2c043b7087340d48294541a11951cb2f13f67ec9b8944a029071bcf24e1421b358f64ad458a31d85
+ metadata.gz: 1cdeb51c607ee23b662128ae7b1071085314c9c04626fdfaf708ef9be7224e1bd83652e9bffb64175da480f7830af223a6e8a2a846cb429af3a4c58a71472941
+ data.tar.gz: 437ee738e18d69600897512e0dd047da23166b2c59ad5f70ae8336532ecfa73e85399e9092b5d7b11895ddc23cffd46dd9465c2c6859bbf0890c80a32e15218b
data/lib/retriever.rb CHANGED
@@ -10,7 +10,6 @@ require 'em-synchrony/fiber_iterator'
  require 'ruby-progressbar'
  require 'open-uri'
  require 'optparse'
- require 'uri'
  require 'csv'
  require 'bloomfilter-rb'
 
@@ -1,3 +1,3 @@
  module Retriever
- VERSION = '0.0.11'
+ VERSION = '0.0.12'
  end
data/readme.md CHANGED
@@ -1,37 +1,23 @@
- RubyRetriever [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
+ [RubyRetriever] (http://www.softwarebyjoe.com/rubyretriever/) [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
  ==============

- Now an official RubyGem!
- ```sh
- gem install rubyretriever
- ```
- Update (5/26):
- Version 0.0.10 - fixes a bug that wouldn't allow sitemaps to write out to file correctly.
-
- Update (5/25):
- Version 0.0.6 - Switches to using a Bloom Filter to keep track of past 'visited pages'. I saw this in [Arachnid] (https://github.com/dchuk/Arachnid) and realized it's a much better idea for performance and implemented it immediately. Hat tip [dchuk] (https://github.com/dchuk/)
-
- About
- =====
+ By Joe Norton

  RubyRetriever is a Web Crawler, Site Mapper, File Harvester & Autodownloader, and all around nice buddy to have around.
- Soon to add some high level scraping options.

  RubyRetriever uses aynchronous HTTP requests, thanks to eventmachine and Synchrony fibers, to crawl webpages *very quickly*.

- This is the 2nd or 3rd reincarnation of the RubyRetriever autodownloader project. It started out as a executable autodownloader, intended for malware research. From there it has morphed to become a more well-rounded web-crawler and general purpose file harvesting utility.
-
- RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
+ RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.


- HOW IT WORKS
+ getting started
  -----------
+ Install the gem
  ```sh
- gem install rubyretriever
- rr [MODE] [OPTIONS] Target_URL
+ gem install rubyretriever
  ```

- **Site Mapper**
+ **Example: Sitemap mode**
  ```sh
  rr --sitemap --progress --limit 1000 --output cnet http://www.cnet.com
  ```
@@ -42,7 +28,7 @@ rr -s -p -l 1000 -o cnet http://www.cnet.com
 
  This would go to http://www.cnet.com and map it until it crawled a max of 1,000 pages, and then it would write it out to a csv named cnet.

- **File Harvesting**
+ **Example: File Harvesting mode**
  ```sh
  rr --files --ext pdf --progress --limit 1000 --output hubspot http://www.hubspot.com
  ```
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: rubyretriever
  version: !ruby/object:Gem::Version
- version: 0.0.11
+ version: 0.0.12
  platform: ruby
  authors:
  - Joe Norton
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-05-25 00:00:00.000000000 Z
+ date: 2014-05-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: em-synchrony
@@ -126,7 +126,7 @@ files:
  - readme.md
  - spec/retriever_spec.rb
  - spec/spec_helper.rb
- homepage: http://github.com/joenorton/rubyretriever
+ homepage: http://www.softwarebyjoe.com/rubyretriever/
  licenses:
  - MIT
  metadata: {}