elasticrawl 1.1.0 → 1.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
- data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+ metadata.gz: ba9231bff7a4f4782280baaa82987d0942452b76
+ data.tar.gz: c7fa24226b647a63c48c299cd6be1cf664ed1a84
  SHA512:
- metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
- data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
+ metadata.gz: c815de487d1a2a31654b583ed867b97774bbe2dace47e56c9a22886b45e1d651625b3ab3b11ea77bf7630cd5850d311503a43daa240005999349e5df5c4f53c1
+ data.tar.gz: b1af2574b6ef416e80726c7bb6ff461f2b7baeaf8e888c55598aba3b5f3be301a3961630bacbb646746b68cec843cb11182ce394872ebffc3008af0f53f03fb6
data/.travis.yml CHANGED
@@ -3,3 +3,4 @@ rvm:
  - 1.9.3
  - 2.0.0
  - 2.1.5
+ - 2.2.0
data/README.md CHANGED
@@ -16,58 +16,34 @@ Ships with a default configuration that launches the
  [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
  This is an implementation of the standard Hadoop Word Count example.
 
- ## More Information
-
- * [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
+ This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) has a walkthrough of running the example jobs on the November 2014 crawl.
 
  ## Installation
 
- ### Dependencies
-
- Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
- Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
- and the ruby-build plugin is recommended.
-
- A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
- gem requires the development headers to be installed.
-
- ```bash
- # OS X
- brew install sqlite3
-
- # CentOS
- sudo yum install sqlite-devel
-
- # Ubuntu
- sudo apt-get install libsqlite3-dev
- ```
-
- ### Install elasticrawl
-
- [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
- [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
- [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+ Deployment packages are available for Linux and OS X; unfortunately, Windows isn't supported yet. Download the package, extract it, and run the elasticrawl command from the package directory.
 
  ```bash
- ~$ gem install elasticrawl --no-rdoc --no-ri
- ```
+ # OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-osx.tar.gz
+ # Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86_64.tar.gz
+ # Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86.tar.gz
 
- If you're using rbenv you need to do a rehash to add the elasticrawl executable
- to your path.
+ # e.g.
 
- ```bash
- ~$ rbenv rehash
+ curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-osx.tar.gz
+ tar -xzf elasticrawl-1.1.1-osx.tar.gz
+ cd elasticrawl-1.1.1-osx/
+ ./elasticrawl --help
  ```
 
  ## Commands
 
  ### elasticrawl init
 
- Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+ The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
  and will store your data and logs.
 
  ```bash
- ~$ elasticrawl init your-s3-bucket
+ ~$ ./elasticrawl init your-s3-bucket
 
  Enter AWS Access Key ID: ************
  Enter AWS Secret Access Key: ************
@@ -81,10 +57,10 @@ Config complete
 
  ### elasticrawl parse
 
- Parse takes in the crawl name and an optional number of segments and files to parse.
+ The parse command takes in the crawl name and an optional number of segments and files to parse.
 
  ```bash
- ~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+ ~$ ./elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
  Segments
  Segment: 1416400372202.67 Files: 150
  Segment: 1416400372490.23 Files: 124
@@ -104,10 +80,10 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
 
  ### elasticrawl combine
 
- Combine takes in the results of previous parse jobs and produces a combined set of results.
+ The combine command takes in the results of previous parse jobs and produces a combined set of results.
 
  ```bash
- ~$ elasticrawl combine --input-jobs 1420124830792
+ ~$ ./elasticrawl combine --input-jobs 1420124830792
  Job configuration
  Combining: 2 segments
 
@@ -123,10 +99,10 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 
  ### elasticrawl status
 
- Status shows crawls and your job history.
+ The status command shows crawls and your job history.
 
  ```bash
- ~$ elasticrawl status
+ ~$ ./elasticrawl status
  Crawl Status
  CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
 
@@ -136,10 +112,10 @@ Job History (last 10)
 
  ### elasticrawl reset
 
- Reset a crawl so it is parsed again.
+ The reset command resets a crawl so it is parsed again.
 
  ```bash
- ~$ elasticrawl reset CC-MAIN-2014-49
+ ~$ ./elasticrawl reset CC-MAIN-2014-49
  Reset crawl? (y/n)
  y
  CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
@@ -147,10 +123,10 @@ y
 
  ### elasticrawl destroy
 
- Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
+ The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
 
  ```bash
- ~$ elasticrawl destroy
+ ~$ ./elasticrawl destroy
 
  WARNING:
  Bucket s3://elasticrawl-test and its data will be deleted
@@ -178,6 +154,16 @@ configures the EC2 instances that are launched to form your EMR cluster
  * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
  stores your S3 bucket name and the config for the parse and combine jobs
 
+ ## Development
+
+ Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers.
+
+ [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+ [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+ [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5, 2.2.0
+
+ The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). Each package contains a Ruby 2.1 interpreter, gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
+
  ## TODO
 
  * Add support for Streaming and Pig jobs
@@ -187,6 +173,7 @@ stores your S3 bucket name and the config for the parse and combine jobs
  * Thanks to everyone at Common Crawl for making this awesome dataset available!
  * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
  gem which provides a nice Ruby wrapper for the EMR REST API.
+ * Thanks to Phusion for creating Traveling Ruby.
 
  ## Contributing
 
data/bin/elasticrawl CHANGED
@@ -138,4 +138,12 @@ rescue Thor::Error => e
  # Show elasticrawl errors.
  rescue Elasticrawl::Error => e
  puts("ERROR: #{e.message}")
+ puts e.backtrace
+
+ if e.http_response.present?
+ response = e.http_response
+
+ puts "HTTP Response: #{response.status}"
+ puts response.body if response.body.present?
+ end
  end
@@ -100,8 +100,8 @@ module Elasticrawl
 
  rescue AWS::S3::Errors::SignatureDoesNotMatch => e
  raise AWSCredentialsInvalidError, 'AWS access credentials are invalid'
- rescue StandardError => e
- raise S3AccessError, e.message
+ rescue AWS::Errors::Base => s3e
+ raise S3AccessError.new(s3e.http_response), s3e.message
  end
  end
 
@@ -158,8 +158,8 @@ module Elasticrawl
  s3 = AWS::S3.new
  s3.buckets.create(bucket_name)
 
- rescue StandardError => e
- raise S3AccessError, e.message
+ rescue AWS::Errors::Base => s3e
+ raise S3AccessError.new(s3e.http_response), s3e.message
  end
  end
 
@@ -170,8 +170,8 @@ module Elasticrawl
  bucket = s3.buckets[bucket_name]
  bucket.delete!
 
- rescue StandardError => e
- raise S3AccessError, e.message
+ rescue AWS::Errors::Base => s3e
+ raise S3AccessError.new(s3e.http_response), s3e.message
  end
  end
 
@@ -91,12 +91,18 @@ module Elasticrawl
  s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
  crawl_name,
  Elasticrawl::WARC_PATHS].join('/')
+ begin
+ s3 = AWS::S3.new
+ bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+ object = bucket.objects[s3_path]
 
- s3 = AWS::S3.new
- bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
- object = bucket.objects[s3_path]
+ uncompress_file(object)
 
- uncompress_file(object)
+ rescue AWS::Errors::Base => s3e
+ raise S3AccessError.new(s3e.http_response), 'Failed to get WARC paths'
+ rescue Exception => e
+ raise S3AccessError, 'Failed to get WARC paths'
+ end
  end
 
  # Takes in a S3 object and returns the contents as an uncompressed string.
@@ -1,6 +1,12 @@
  module Elasticrawl
  # Base error class extends standard error.
- class Error < StandardError; end
+ class Error < StandardError
+ attr_reader :http_response
+
+ def initialize(response = nil)
+ @http_response = response
+ end
+ end
 
  # AWS access credentials are invalid.
  class AWSCredentialsInvalidError < Error; end
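
Taken together, the changes above let an `Elasticrawl::Error` carry the AWS HTTP response, which the new rescue block in bin/elasticrawl then prints. Below is a minimal editor's sketch of that pattern, not code from the gem: the response object is a stand-in `OpenStruct` for whatever `AWS::Errors::Base#http_response` returns, and plain truthiness replaces ActiveSupport's `present?` to keep the example dependency-free.

```ruby
# Sketch of the enriched-error pattern introduced in this release (illustrative only).
require 'ostruct'

module Elasticrawl
  # Mirrors the new base error class from the diff above.
  class Error < StandardError
    attr_reader :http_response

    def initialize(response = nil)
      @http_response = response
    end
  end

  class S3AccessError < Error; end
end

begin
  # Hypothetical stand-in for the AWS HTTP response; only #status and #body are used.
  fake_response = OpenStruct.new(status: 403,
                                 body: '<Error><Code>AccessDenied</Code></Error>')
  # Attach the response and supply a separate message, as the lib code now does.
  raise Elasticrawl::S3AccessError.new(fake_response), 'Failed to get WARC paths'
rescue Elasticrawl::Error => e
  puts "ERROR: #{e.message}"
  if e.http_response
    puts "HTTP Response: #{e.http_response.status}"
    puts e.http_response.body if e.http_response.body
  end
end
```

Because `raise exc, message` copies the exception and only replaces its message, the attached `http_response` survives and is available to the rescue clause, which is what lets the CLI surface the status code and body alongside the error message.
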
@@ -1,3 +1,3 @@
  module Elasticrawl
- VERSION = "1.1.0"
+ VERSION = '1.1.1'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: elasticrawl
  version: !ruby/object:Gem::Version
- version: 1.1.0
+ version: 1.1.1
  platform: ruby
  authors:
  - Ross Fairbanks
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-01-03 00:00:00.000000000 Z
+ date: 2015-01-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activerecord