elasticrawl 1.1.0 → 1.1.1
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/README.md +33 -46
- data/bin/elasticrawl +8 -0
- data/lib/elasticrawl/config.rb +6 -6
- data/lib/elasticrawl/crawl.rb +10 -4
- data/lib/elasticrawl/error.rb +7 -1
- data/lib/elasticrawl/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ba9231bff7a4f4782280baaa82987d0942452b76
+  data.tar.gz: c7fa24226b647a63c48c299cd6be1cf664ed1a84
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c815de487d1a2a31654b583ed867b97774bbe2dace47e56c9a22886b45e1d651625b3ab3b11ea77bf7630cd5850d311503a43daa240005999349e5df5c4f53c1
+  data.tar.gz: b1af2574b6ef416e80726c7bb6ff461f2b7baeaf8e888c55598aba3b5f3be301a3961630bacbb646746b68cec843cb11182ce394872ebffc3008af0f53f03fb6
```
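For reference, the SHA1/SHA512 values above are hex digests of the gem's inner archives (`metadata.gz` and `data.tar.gz`). A minimal Ruby sketch of how such digests can be recomputed; the file paths in the usage comment are assumptions:

```ruby
require 'digest'

# Returns the two digests recorded in checksums.yaml for a blob of bytes.
def gem_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical usage against an unpacked gem:
#   sums = gem_checksums(File.binread('metadata.gz'))
#   sums['SHA1']   # 40 hex chars
#   sums['SHA512'] # 128 hex chars
```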
data/.travis.yml
CHANGED
data/README.md
CHANGED
````diff
@@ -16,58 +16,34 @@ Ships with a default configuration that launches the
 [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
 This is an implementation of the standard Hadoop Word Count example.
 
-
-
-* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
+This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) has a walkthrough of running the example jobs on the November 2014 crawl.
 
 ## Installation
 
-
-
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
-Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
-and the ruby-build plugin is recommended.
-
-A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
-gem requires the development headers to be installed.
-
-```bash
-# OS X
-brew install sqlite3
-
-# CentOS
-sudo yum install sqlite-devel
-
-# Ubuntu
-sudo apt-get install libsqlite3-dev
-```
-
-### Install elasticrawl
-
-[![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
-[![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
-[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory.
 
 ```bash
-
-
+# OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-osx.tar.gz
+# Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86_64.tar.gz
+# Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.1-linux-x86.tar.gz
 
-
-to your path.
+# e.g.
 
-
-
+curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.0-osx.tar.gz
+tar -xzf elasticrawl-1.1.0-osx.tar.gz
+cd elasticrawl-1.1.0-osx/
+./elasticrawl --help
 ```
 
 ## Commands
 
 ### elasticrawl init
 
-
+The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
 and will store your data and logs.
 
 ```bash
-~$ elasticrawl init your-s3-bucket
+~$ ./elasticrawl init your-s3-bucket
 
 Enter AWS Access Key ID: ************
 Enter AWS Secret Access Key: ************
@@ -81,10 +57,10 @@ Config complete
 
 ### elasticrawl parse
 
-
+The parse command takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+~$ ./elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
 Segments
 Segment: 1416400372202.67 Files: 150
 Segment: 1416400372490.23 Files: 124
@@ -104,10 +80,10 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
 
 ### elasticrawl combine
 
-
+The combine command takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs 1420124830792
+~$ ./elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -123,10 +99,10 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 
 ### elasticrawl status
 
-
+The status command shows crawls and your job history.
 
 ```bash
-~$ elasticrawl status
+~$ ./elasticrawl status
 Crawl Status
 CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
 
@@ -136,10 +112,10 @@ Job History (last 10)
 
 ### elasticrawl reset
 
-
+The reset command resets a crawl so it is parsed again.
 
 ```bash
-~$ elasticrawl reset CC-MAIN-2014-49
+~$ ./elasticrawl reset CC-MAIN-2014-49
 Reset crawl? (y/n)
 y
 CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
@@ -147,10 +123,10 @@ y
 
 ### elasticrawl destroy
 
-
+The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
-~$ elasticrawl destroy
+~$ ./elasticrawl destroy
 
 WARNING:
 Bucket s3://elasticrawl-test and its data will be deleted
@@ -178,6 +154,16 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
+## Development
+
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
+
+[![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+[![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5, 2.2.0
+
+The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.1 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
+
 ## TODO
 
 * Add support for Streaming and Pig jobs
@@ -187,6 +173,7 @@ stores your S3 bucket name and the config for the parse and combine jobs
 * Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
+* Thanks to Phusion for creating Traveling Ruby.
 
 ## Contributing
 
````
data/bin/elasticrawl
CHANGED
```diff
@@ -138,4 +138,12 @@ rescue Thor::Error => e
 # Show elasticrawl errors.
 rescue Elasticrawl::Error => e
   puts("ERROR: #{e.message}")
+  puts e.backtrace
+
+  if e.http_response.present?
+    response = e.http_response
+
+    puts "HTTP Response: #{response.status}"
+    puts response.body if response.body.present?
+  end
 end
```
data/lib/elasticrawl/config.rb
CHANGED
```diff
@@ -100,8 +100,8 @@ module Elasticrawl
 
     rescue AWS::S3::Errors::SignatureDoesNotMatch => e
       raise AWSCredentialsInvalidError, 'AWS access credentials are invalid'
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
@@ -158,8 +158,8 @@ module Elasticrawl
       s3 = AWS::S3.new
       s3.buckets.create(bucket_name)
 
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
@@ -170,8 +170,8 @@ module Elasticrawl
       bucket = s3.buckets[bucket_name]
      bucket.delete!
 
-    rescue
-      raise S3AccessError, e.message
+    rescue AWS::Errors::Base => s3e
+      raise S3AccessError.new(s3e.http_response), e.message
     end
   end
 
```
data/lib/elasticrawl/crawl.rb
CHANGED
```diff
@@ -91,12 +91,18 @@ module Elasticrawl
       s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
                  crawl_name,
                  Elasticrawl::WARC_PATHS].join('/')
+      begin
+        s3 = AWS::S3.new
+        bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+        object = bucket.objects[s3_path]
 
-
-      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
-      object = bucket.objects[s3_path]
+        uncompress_file(object)
 
-
+      rescue AWS::Errors::Base => s3e
+        raise S3AccessError.new(s3e.http_response), 'Failed to get WARC paths'
+      rescue Exception => e
+        raise S3AccessError, 'Failed to get WARC paths'
+      end
     end
 
     # Takes in a S3 object and returns the contents as an uncompressed string.
```
data/lib/elasticrawl/error.rb
CHANGED
```diff
@@ -1,6 +1,12 @@
 module Elasticrawl
   # Base error class extends standard error.
-  class Error < StandardError
+  class Error < StandardError
+    attr_reader :http_response
+
+    def initialize(response = nil)
+      @http_response = response
+    end
+  end
 
   # AWS access credentials are invalid.
   class AWSCredentialsInvalidError < Error; end
```
data/lib/elasticrawl/version.rb
CHANGED
metadata
CHANGED
```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: elasticrawl
 version: !ruby/object:Gem::Version
-  version: 1.1.
+  version: 1.1.1
 platform: ruby
 authors:
 - Ross Fairbanks
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-01-
+date: 2015-01-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
```