elasticrawl 1.1.0 → 1.1.1
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/README.md +33 -46
- data/bin/elasticrawl +8 -0
- data/lib/elasticrawl/config.rb +6 -6
- data/lib/elasticrawl/crawl.rb +10 -4
- data/lib/elasticrawl/error.rb +7 -1
- data/lib/elasticrawl/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ba9231bff7a4f4782280baaa82987d0942452b76
+  data.tar.gz: c7fa24226b647a63c48c299cd6be1cf664ed1a84
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c815de487d1a2a31654b583ed867b97774bbe2dace47e56c9a22886b45e1d651625b3ab3b11ea77bf7630cd5850d311503a43daa240005999349e5df5c4f53c1
+  data.tar.gz: b1af2574b6ef416e80726c7bb6ff461f2b7baeaf8e888c55598aba3b5f3be301a3961630bacbb646746b68cec843cb11182ce394872ebffc3008af0f53f03fb6
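The checksums.yaml entries are SHA1 and SHA512 hex digests of the two archives inside the gem. As a minimal sketch, assuming the archives have already been extracted to local files, the digests could be recomputed with Ruby's standard `digest` library. The helper names `checksums_for` and `checksum_matches?` are hypothetical and not part of elasticrawl or RubyGems.

```ruby
require 'digest'

# Hypothetical helper: compute the two digest types listed in checksums.yaml
# for a file on disk.
def checksums_for(path)
  {
    'SHA1' => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compare a recomputed digest with an expected hex string.
def checksum_matches?(path, algorithm, expected)
  checksums_for(path).fetch(algorithm) == expected.downcase
end
```

For example, `checksum_matches?('data.tar.gz', 'SHA1', expected_hex)` would return true only when the local file hashes to the recorded value.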
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -16,58 +16,34 @@ Ships with a default configuration that launches the
 [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
 This is an implementation of the standard Hadoop Word Count example.
 
-
-
-* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
+This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) has a walkthrough of running the example jobs on the November 2014 crawl.
 
 ## Installation
 
-
-
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
-Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
-and the ruby-build plugin is recommended.
-
-A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
-gem requires the development headers to be installed.
-
-```bash
-# OS X
-brew install sqlite3
-
-# CentOS
-sudo yum install sqlite-devel
-
-# Ubuntu
-sudo apt-get install libsqlite3-dev
-```
-
-### Install elasticrawl
-
-[](http://badge.fury.io/rb/elasticrawl)
-[](https://codeclimate.com/github/rossf7/elasticrawl)
-[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory.
 
 ```bash
-
-
+# OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-osx.tar.gz
+# Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86_64.tar.gz
+# Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86.tar.gz
 
-
-to your path.
+# e.g.
 
-
-
+curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.0-osx.tar.gz
+tar -xzf elasticrawl-1.1.0-osx.tar.gz
+cd elasticrawl-1.1.0-osx/
+./elasticrawl --help
 ```
 
 ## Commands
 
 ### elasticrawl init
 
-
+The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
 and will store your data and logs.
 
 ```bash
-~$ elasticrawl init your-s3-bucket
+~$ ./elasticrawl init your-s3-bucket
 
 Enter AWS Access Key ID: ************
 Enter AWS Secret Access Key: ************
@@ -81,10 +57,10 @@ Config complete
 
 ### elasticrawl parse
 
-
+The parse command takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+~$ ./elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
 Segments
 Segment: 1416400372202.67 Files: 150
 Segment: 1416400372490.23 Files: 124
@@ -104,10 +80,10 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
 
 ### elasticrawl combine
 
-
+The combine command takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs 1420124830792
+~$ ./elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -123,10 +99,10 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 
 ### elasticrawl status
 
-
+The status command shows crawls and your job history.
 
 ```bash
-~$ elasticrawl status
+~$ ./elasticrawl status
 Crawl Status
 CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
 
@@ -136,10 +112,10 @@ Job History (last 10)
 
 ### elasticrawl reset
 
-
+The reset comment resets a crawl so it is parsed again.
 
 ```bash
-~$ elasticrawl reset CC-MAIN-2014-49
+~$ ./elasticrawl reset CC-MAIN-2014-49
 Reset crawl? (y/n)
 y
 CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
@@ -147,10 +123,10 @@ y
 
 ### elasticrawl destroy
 
-
+The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
-~$ elasticrawl destroy
+~$ ./elasticrawl destroy
 
 WARNING:
 Bucket s3://elasticrawl-test and its data will be deleted
@@ -178,6 +154,16 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
+## Development
+
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
+
+[](http://badge.fury.io/rb/elasticrawl)
+[](https://codeclimate.com/github/rossf7/elasticrawl)
+[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5, 2.2.0
+
+The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.1 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
+
 ## TODO
 
 * Add support for Streaming and Pig jobs
@@ -187,6 +173,7 @@ stores your S3 bucket name and the config for the parse and combine jobs
 * Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
+* Thanks to Phusion for creating Traveling Ruby.
 
 ## Contributing
 
data/bin/elasticrawl
CHANGED
@@ -138,4 +138,12 @@ rescue Thor::Error => e
 # Show elasticrawl errors.
 rescue Elasticrawl::Error => e
   puts("ERROR: #{e.message}")
+  puts e.backtrace
+
+  if e.http_response.present?
+    response = e.http_response
+
+    puts "HTTP Response: #{response.status}"
+    puts response.body if response.body.present?
+  end
 end
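The added lines print the HTTP status and body whenever the rescued error carries a response. The reporting logic can be sketched on its own, without the AWS SDK or ActiveSupport. Everything named here is an assumption for illustration: `StubResponse`, `StubError`, `filled?` and `error_report` are hypothetical, and `filled?` only approximates ActiveSupport's `present?` with plain nil and blank checks. The backtrace output is left out to keep the sketch short.

```ruby
# Hypothetical stand-in for an HTTP response with a status and a body.
StubResponse = Struct.new(:status, :body)

# Hypothetical stand-in for an error that can carry a response.
class StubError < StandardError
  attr_reader :http_response

  def initialize(message = nil, response = nil)
    super(message)
    @http_response = response
  end
end

# Plain-Ruby approximation of present?: not nil and not blank.
def filled?(value)
  !value.nil? && !value.to_s.strip.empty?
end

# Builds the lines to print for an error, adding HTTP details when available.
def error_report(error)
  lines = ["ERROR: #{error.message}"]
  response = error.respond_to?(:http_response) ? error.http_response : nil
  return lines if response.nil?

  lines << "HTTP Response: #{response.status}"
  lines << response.body if filled?(response.body)
  lines
end

error = StubError.new('Failed to get WARC paths',
                      StubResponse.new(403, '<Error>AccessDenied</Error>'))
puts error_report(error)
```

Returning the lines as an array, rather than printing inside the helper, keeps the formatting easy to check in isolation.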
data/lib/elasticrawl/config.rb
CHANGED
@@ -100,8 +100,8 @@ module Elasticrawl
 
       rescue AWS::S3::Errors::SignatureDoesNotMatch => e
         raise AWSCredentialsInvalidError, 'AWS access credentials are invalid'
-      rescue
-        raise S3AccessError, e.message
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), e.message
       end
     end
 
@@ -158,8 +158,8 @@ module Elasticrawl
         s3 = AWS::S3.new
         s3.buckets.create(bucket_name)
 
-      rescue
-        raise S3AccessError, e.message
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), e.message
       end
     end
 
@@ -170,8 +170,8 @@ module Elasticrawl
         bucket = s3.buckets[bucket_name]
         bucket.delete!
 
-      rescue
-        raise S3AccessError, e.message
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), e.message
       end
     end
 
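In the added lines of these hunks the rescue clause binds the exception to `s3e`, while the message is read from `e`. How Ruby resolves a name like that depends on whether it was declared elsewhere in the same scope. The sketch below shows both cases with stand-in error classes. `SketchSignatureError`, `SketchServiceError` and the two methods are hypothetical and only illustrate rescue-variable scoping, not the surrounding elasticrawl methods, which are not fully visible in the diff.

```ruby
# Stand-in error classes; the code in the hunks rescues AWS SDK errors instead.
class SketchSignatureError < StandardError; end
class SketchServiceError < StandardError; end

# Two rescue clauses with different variable names, as in the first hunk.
def message_via_other_clause
  raise SketchServiceError, 'service failed'
rescue SketchSignatureError => e
  "signature: #{e.message}"
rescue SketchServiceError => s3e
  # `e` is declared by the clause above but never assigned on this path,
  # so it evaluates to nil here.
  e.nil? ? "e is nil, s3e says: #{s3e.message}" : e.message
end

# A single rescue clause where `e` is not declared anywhere in the method.
def message_via_undeclared_name
  begin
    raise SketchServiceError, 'service failed'
  rescue SketchServiceError => s3e
    e.message
  end
rescue NameError => name_error
  # Referencing the undeclared name raises NameError inside the rescue body.
  "NameError: #{name_error.name}"
end

puts message_via_other_clause
puts message_via_undeclared_name
```

Reading the message from the variable that the matching clause actually binds (`s3e.message` in this sketch) avoids both outcomes.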
data/lib/elasticrawl/crawl.rb
CHANGED
@@ -91,12 +91,18 @@ module Elasticrawl
       s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
                  crawl_name,
                  Elasticrawl::WARC_PATHS].join('/')
+      begin
+        s3 = AWS::S3.new
+        bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+        object = bucket.objects[s3_path]
 
-
-      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
-      object = bucket.objects[s3_path]
+        uncompress_file(object)
 
-
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), 'Failed to get WARC paths'
+      rescue Exception => e
+        raise S3AccessError, 'Failed to get WARC paths'
+      end
     end
 
     # Takes in a S3 object and returns the contents as an uncompressed string.
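The trailing comment describes a helper that turns an S3 object into an uncompressed string. Its body is not part of the diff, so the following is only a sketch of the general idea, assuming the file is gzip-compressed and its bytes are already in memory. The `gzip` and `gunzip` helpers and the placeholder paths are hypothetical; they use Ruby's standard `zlib` and `stringio` libraries rather than the AWS SDK.

```ruby
require 'stringio'
require 'zlib'

# Hypothetical helper: gzip a string in memory, used here to build sample input.
def gzip(text)
  buffer = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(text)
  writer.close
  buffer.string
end

# Hypothetical helper: return the uncompressed contents of gzip-compressed bytes.
def gunzip(compressed)
  reader = Zlib::GzipReader.new(StringIO.new(compressed))
  reader.read
ensure
  reader.close if reader
end

# Placeholder paths, not real Common Crawl file names.
listing = "placeholder/one.warc.gz\nplaceholder/two.warc.gz\n"
paths = gunzip(gzip(listing)).split("\n")
puts paths
```

Splitting the uncompressed text on newlines gives one entry per listed file, which is the shape a paths listing would need for later per-file processing.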
data/lib/elasticrawl/error.rb
CHANGED
@@ -1,6 +1,12 @@
 module Elasticrawl
   # Base error class extends standard error.
-  class Error < StandardError
+  class Error < StandardError
+    attr_reader :http_response
+
+    def initialize(response = nil)
+      @http_response = response
+    end
+  end
 
   # AWS access credentials are invalid.
   class AWSCredentialsInvalidError < Error; end
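The new `initialize` takes the response as its only argument and does not call `super`, which affects the two `raise` forms used elsewhere in this release differently. The sketch below reproduces the class shape with a stand-in and shows what each form yields. `SketchError`, `SketchResponse`, `rescue_and_describe` and `SketchErrorWithMessage` are hypothetical names. The last class is one possible alternative shape, not something the release contains.

```ruby
# Stand-in with the same shape as the Error class in the hunk.
class SketchError < StandardError
  attr_reader :http_response

  def initialize(response = nil)
    @http_response = response
  end
end

# Hypothetical response object; the real one would come from the AWS SDK.
SketchResponse = Struct.new(:status, :body)

# Runs the block and reports what a rescuer would see on the error.
def rescue_and_describe
  yield
rescue SketchError => e
  { message: e.message, response: e.http_response }
end

# Form 1: instance plus message. raise calls #exception(message) on the
# instance, which copies it (keeping @http_response) and sets the message.
with_response = rescue_and_describe do
  raise SketchError.new(SketchResponse.new(403, 'Forbidden')), 'Access denied'
end

# Form 2: class plus message. raise calls SketchError.new('Access denied'),
# so the string lands in the response slot and, because super is never
# called, the message falls back to the class name.
class_and_message = rescue_and_describe do
  raise SketchError, 'Access denied'
end

# Possible alternative: message first, response second, with super called.
class SketchErrorWithMessage < StandardError
  attr_reader :http_response

  def initialize(message = nil, response = nil)
    super(message)
    @http_response = response
  end
end

p with_response[:message]
p class_and_message
```

With the alternative shape, `raise SketchErrorWithMessage, 'Access denied'` keeps the message and leaves `http_response` as nil, so a caller that checks for a response before printing it is not handed a string.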
data/lib/elasticrawl/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: elasticrawl
 version: !ruby/object:Gem::Version
-  version: 1.1.
+  version: 1.1.1
 platform: ruby
 authors:
 - Ross Fairbanks
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-01-
+date: 2015-01-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord