elasticrawl 1.1.0 → 1.1.1
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/README.md +33 -46
- data/bin/elasticrawl +8 -0
- data/lib/elasticrawl/config.rb +6 -6
- data/lib/elasticrawl/crawl.rb +10 -4
- data/lib/elasticrawl/error.rb +7 -1
- data/lib/elasticrawl/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ba9231bff7a4f4782280baaa82987d0942452b76
+  data.tar.gz: c7fa24226b647a63c48c299cd6be1cf664ed1a84
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c815de487d1a2a31654b583ed867b97774bbe2dace47e56c9a22886b45e1d651625b3ab3b11ea77bf7630cd5850d311503a43daa240005999349e5df5c4f53c1
+  data.tar.gz: b1af2574b6ef416e80726c7bb6ff461f2b7baeaf8e888c55598aba3b5f3be301a3961630bacbb646746b68cec843cb11182ce394872ebffc3008af0f53f03fb6
```
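For reference, the SHA1/SHA512 values above are hex digests of the gem's inner archives (`metadata.gz` and `data.tar.gz`). A minimal Ruby sketch of how such digests can be recomputed; the file paths in the usage comment are assumptions:

```ruby
require 'digest'

# Returns the two digests recorded in checksums.yaml for a blob of bytes.
def gem_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical usage against an unpacked gem:
#   sums = gem_checksums(File.binread('metadata.gz'))
#   sums['SHA1']   # 40 hex chars
#   sums['SHA512'] # 128 hex chars
```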
data/.travis.yml
CHANGED
data/README.md
CHANGED
````diff
@@ -16,58 +16,34 @@ Ships with a default configuration that launches the
 [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
 This is an implementation of the standard Hadoop Word Count example.
 
-
-
-* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
+This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) has a walkthrough of running the example jobs on the November 2014 crawl.
 
 ## Installation
 
-
-
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
-Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
-and the ruby-build plugin is recommended.
-
-A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
-gem requires the development headers to be installed.
-
-```bash
-# OS X
-brew install sqlite3
-
-# CentOS
-sudo yum install sqlite-devel
-
-# Ubuntu
-sudo apt-get install libsqlite3-dev
-```
-
-### Install elasticrawl
-
-[![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
-[![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
-[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory.
 
 ```bash
-
-
+# OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-osx.tar.gz
+# Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86_64.tar.gz
+# Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86.tar.gz
 
-
-to your path.
+# e.g.
 
-
-
+curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.0-osx.tar.gz
+tar -xzf elasticrawl-1.1.0-osx.tar.gz
+cd elasticrawl-1.1.0-osx/
+./elasticrawl --help
 ```
 
 ## Commands
 
 ### elasticrawl init
 
-
+The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
 and will store your data and logs.
 
 ```bash
-~$ elasticrawl init your-s3-bucket
+~$ ./elasticrawl init your-s3-bucket
 
 Enter AWS Access Key ID: ************
 Enter AWS Secret Access Key: ************
@@ -81,10 +57,10 @@ Config complete
 
 ### elasticrawl parse
 
-
+The parse command takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+~$ ./elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
 Segments
 Segment: 1416400372202.67 Files: 150
 Segment: 1416400372490.23 Files: 124
@@ -104,10 +80,10 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
 
 ### elasticrawl combine
 
-
+The combine command takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs 1420124830792
+~$ ./elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -123,10 +99,10 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 
 ### elasticrawl status
 
-
+The status command shows crawls and your job history.
 
 ```bash
-~$ elasticrawl status
+~$ ./elasticrawl status
 Crawl Status
 CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
 
@@ -136,10 +112,10 @@ Job History (last 10)
 
 ### elasticrawl reset
 
-
+The reset command resets a crawl so it is parsed again.
 
 ```bash
-~$ elasticrawl reset CC-MAIN-2014-49
+~$ ./elasticrawl reset CC-MAIN-2014-49
 Reset crawl? (y/n)
 y
 CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
@@ -147,10 +123,10 @@ y
 
 ### elasticrawl destroy
 
-
+The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
-~$ elasticrawl destroy
+~$ ./elasticrawl destroy
 
 WARNING:
 Bucket s3://elasticrawl-test and its data will be deleted
@@ -178,6 +154,16 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
+## Development
+
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
+
+[![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+[![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5, 2.2.0
+
+The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.1 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
+
 ## TODO
 
 * Add support for Streaming and Pig jobs
@@ -187,6 +173,7 @@ stores your S3 bucket name and the config for the parse and combine jobs
 * Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
+* Thanks to Phusion for creating Traveling Ruby.
 
 ## Contributing
 
````
data/bin/elasticrawl
CHANGED
```diff
@@ -138,4 +138,12 @@ rescue Thor::Error => e
 # Show elasticrawl errors.
 rescue Elasticrawl::Error => e
   puts("ERROR: #{e.message}")
+  puts e.backtrace
+
+  if e.http_response.present?
+    response = e.http_response
+
+    puts "HTTP Response: #{response.status}"
+    puts response.body if response.body.present?
+  end
 end
```
data/lib/elasticrawl/config.rb
CHANGED
```diff
@@ -100,8 +100,8 @@ module Elasticrawl
 
     rescue AWS::S3::Errors::SignatureDoesNotMatch => e
       raise AWSCredentialsInvalidError, 'AWS access credentials are invalid'
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
@@ -158,8 +158,8 @@ module Elasticrawl
       s3 = AWS::S3.new
       s3.buckets.create(bucket_name)
 
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
@@ -170,8 +170,8 @@ module Elasticrawl
       bucket = s3.buckets[bucket_name]
      bucket.delete!
 
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
```
data/lib/elasticrawl/crawl.rb
CHANGED
```diff
@@ -91,12 +91,18 @@ module Elasticrawl
       s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
                  crawl_name,
                  Elasticrawl::WARC_PATHS].join('/')
+      begin
+        s3 = AWS::S3.new
+        bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+        object = bucket.objects[s3_path]
 
-
-      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
-      object = bucket.objects[s3_path]
+        uncompress_file(object)
 
-
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), 'Failed to get WARC paths'
+      rescue Exception => e
+        raise S3AccessError, 'Failed to get WARC paths'
+      end
     end
 
     # Takes in a S3 object and returns the contents as an uncompressed string.
```
data/lib/elasticrawl/error.rb
CHANGED
```diff
@@ -1,6 +1,12 @@
 module Elasticrawl
   # Base error class extends standard error.
-  class Error < StandardError
+  class Error < StandardError
+    attr_reader :http_response
+
+    def initialize(response = nil)
+      @http_response = response
+    end
+  end
 
   # AWS access credentials are invalid.
   class AWSCredentialsInvalidError < Error; end
```
data/lib/elasticrawl/version.rb
CHANGED
metadata
CHANGED
```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: elasticrawl
 version: !ruby/object:Gem::Version
-  version: 1.1.
+  version: 1.1.1
 platform: ruby
 authors:
 - Ross Fairbanks
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-01-
+date: 2015-01-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
```