rubyretriever 0.0.11 → 0.0.12

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: ffb93b0faa77d73f014f67be6dbb6320233a5497
- data.tar.gz: 920547b074b92a01b164e2f27130010773a55e0b
+ metadata.gz: efc429906131b363741d6560e37cb095f905b48e
+ data.tar.gz: 85f320d55600f007315941b6c3213c8f04b70515
  SHA512:
- metadata.gz: b3c36ff313a381ec3d1950abf1c148faed90aa99a0658741ab4533f15d6b6afd2e6dc95caa0be5afd231125099126c24aeee36b3a873c33d8a81c6f42dace510
- data.tar.gz: 31bb5aa05f6354f083fae15b3351059d28073c449125081a2c043b7087340d48294541a11951cb2f13f67ec9b8944a029071bcf24e1421b358f64ad458a31d85
+ metadata.gz: 1cdeb51c607ee23b662128ae7b1071085314c9c04626fdfaf708ef9be7224e1bd83652e9bffb64175da480f7830af223a6e8a2a846cb429af3a4c58a71472941
+ data.tar.gz: 437ee738e18d69600897512e0dd047da23166b2c59ad5f70ae8336532ecfa73e85399e9092b5d7b11895ddc23cffd46dd9465c2c6859bbf0890c80a32e15218b
data/lib/retriever.rb CHANGED
@@ -10,7 +10,6 @@ require 'em-synchrony/fiber_iterator'
  require 'ruby-progressbar'
  require 'open-uri'
  require 'optparse'
- require 'uri'
  require 'csv'
  require 'bloomfilter-rb'
 
@@ -1,3 +1,3 @@
  module Retriever
- VERSION = '0.0.11'
+ VERSION = '0.0.12'
  end
data/readme.md CHANGED
@@ -1,37 +1,23 @@
- RubyRetriever [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
+ [RubyRetriever] (http://www.softwarebyjoe.com/rubyretriever/) [![Gem Version](https://badge.fury.io/rb/rubyretriever.svg)](http://badge.fury.io/rb/rubyretriever)
  ==============

- Now an official RubyGem!
- ```sh
- gem install rubyretriever
- ```
- Update (5/26):
- Version 0.0.10 - fixes a bug that wouldn't allow sitemaps to write out to file correctly.
-
- Update (5/25):
- Version 0.0.6 - Switches to using a Bloom Filter to keep track of past 'visited pages'. I saw this in [Arachnid] (https://github.com/dchuk/Arachnid) and realized it's a much better idea for performance and implemented it immediately. Hat tip [dchuk] (https://github.com/dchuk/)
-
- About
- =====
+ By Joe Norton

  RubyRetriever is a Web Crawler, Site Mapper, File Harvester & Autodownloader, and all around nice buddy to have around.
- Soon to add some high level scraping options.

  RubyRetriever uses aynchronous HTTP requests, thanks to eventmachine and Synchrony fibers, to crawl webpages *very quickly*.

- This is the 2nd or 3rd reincarnation of the RubyRetriever autodownloader project. It started out as a executable autodownloader, intended for malware research. From there it has morphed to become a more well-rounded web-crawler and general purpose file harvesting utility.
-
- RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.
+ RubyRetriever does NOT respect robots.txt, and RubyRetriever currently - by default - launches up to 10 parallel GET requests at once. This is a feature, do not abuse it. Use at own risk.


- HOW IT WORKS
+ getting started
  -----------
+ Install the gem
  ```sh
- gem install rubyretriever
- rr [MODE] [OPTIONS] Target_URL
+ gem install rubyretriever
  ```

- **Site Mapper**
+ **Example: Sitemap mode**
  ```sh
  rr --sitemap --progress --limit 1000 --output cnet http://www.cnet.com
  ```
@@ -42,7 +28,7 @@ rr -s -p -l 1000 -o cnet http://www.cnet.com

  This would go to http://www.cnet.com and map it until it crawled a max of 1,000 pages, and then it would write it out to a csv named cnet.

- **File Harvesting**
+ **Example: File Harvesting mode**
  ```sh
  rr --files --ext pdf --progress --limit 1000 --output hubspot http://www.hubspot.com
  ```
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: rubyretriever
  version: !ruby/object:Gem::Version
- version: 0.0.11
+ version: 0.0.12
  platform: ruby
  authors:
  - Joe Norton
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-05-25 00:00:00.000000000 Z
+ date: 2014-05-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: em-synchrony
@@ -126,7 +126,7 @@ files:
  - readme.md
  - spec/retriever_spec.rb
  - spec/spec_helper.rb
- homepage: http://github.com/joenorton/rubyretriever
+ homepage: http://www.softwarebyjoe.com/rubyretriever/
  licenses:
  - MIT
  metadata: {}