RubyGems - elasticrawl - Versions diffs - 1.1.5 → 1.1.6 - Mend

elasticrawl 1.1.5 → 1.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +4 -4
data/.travis.yml +3 -4
data/CHANGELOG.md +3 -0
data/Gemfile +1 -1
data/README.md +11 -21
data/elasticrawl.gemspec +1 -1
data/lib/elasticrawl.rb +2 -2
data/lib/elasticrawl/version.rb +1 -1
data/spec/unit/crawl_segment_spec.rb +1 -1
data/spec/unit/crawl_spec.rb +1 -1
metadata +5 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: a7a988f505c699d078fa4cc15981b77f7924e907
-  data.tar.gz: db7ee7ae05ccec51b0ffde16900656c859523001
+  metadata.gz: 0d1d5521ae9a6e0762c0b057bbd828f6c15be0c4
+  data.tar.gz: df0386cf340ac6aff20bc95bffcfa0f0fb3995ff
 SHA512:
-  metadata.gz: 12949f3230c3a0f7d08d4e02a9b296968616fa634072db882fe26809c58791d06ebf48d21fbc3c703ab1f858fdf12cfa9c324a6a97a7e896a5916679c7d6de06
-  data.tar.gz: ab9ec066ecb469707241751087ff6d25ff89e9013fdb0a8505bca904fcaa3fed2b331ef289fa0ffb71dd970261ec7c0028ca955d94642fad7093191723d8c4e3
+  metadata.gz: 58f9f46e73d3bf03da4bfad2260cca0663430bd50187ed4927f564d15ac2b2acd377a76dd9016978eae8eb80caee3a76d610d0b524b010bbaa5a7cd953fdbbc9
+  data.tar.gz: e415324ccadc507ac37ddeaca0ae0ca71e9b6b93b008d23b2edefb985bd98928afe65ae508ad9d7ad68d532e936259bc36c6fdebce3806db2b6d39215d4dbf6e
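
The digests above are the SHA1 and SHA512 hashes that RubyGems records for the `metadata.gz` and `data.tar.gz` members of the packaged gem. As a minimal sketch of how such a value could be checked by hand, the helper below hashes a byte string and compares it with an expected hex digest. The helper name `digest_matches?` is hypothetical and is not part of RubyGems or elasticrawl.

```ruby
require 'digest'

# Hypothetical helper (not a RubyGems API): returns true when the digest of
# `bytes` under the named algorithm equals `expected_hex`.
def digest_matches?(bytes, expected_hex, algorithm: :sha512)
  digest_class = algorithm == :sha1 ? Digest::SHA1 : Digest::SHA512
  digest_class.hexdigest(bytes) == expected_hex.downcase
end

# Usage, assuming data.tar.gz has already been extracted from the .gem archive
# and `expected` was copied from checksums.yaml:
#   bytes = File.binread('data.tar.gz')
#   digest_matches?(bytes, expected, algorithm: :sha512)
```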

data/.travis.yml CHANGED

@@ -1,6 +1,5 @@
 language: ruby
 rvm:
-  - 2.0.0-p648
-  - 2.1.8
-  - 2.2.4
-  - 2.3.0
+  - 2.1.9
+  - 2.2.5
+  - 2.3.1

data/CHANGELOG.md CHANGED

@@ -1,3 +1,6 @@
+## v1.1.6 / 2016-06-26
+* Change CommonCrawl bucket to s3://commoncrawl
 ## v1.1.3 / 2015-02-04
 * Upgrade Traveling Ruby to 20150204-2.1.5

data/Gemfile CHANGED

@@ -1,3 +1,3 @@
-source 'http://rubygems.org'
+source 'https://rubygems.org'
 gemspec

data/README.md CHANGED

@@ -13,19 +13,11 @@ This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using
 ## Installation
-Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory.
+* Elasticrawl needs a [Ruby installation](https://www.ruby-lang.org/en/documentation/installation/) (2.1 or higher).
+* Install Ruby from RubyGems.
-```bash
-# OS X            https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-osx.tar.gz
-# Linux (64-bit)  https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-linux-x86_64.tar.gz
-# Linux (32-bit)  https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-linux-x86.tar.gz
-# e.g.
-curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-osx.tar.gz
-tar -xzf elasticrawl-1.1.5-osx.tar.gz
-cd elasticrawl-1.1.5-osx/
-./elasticrawl --help
+```
+gem install elasticrawl --no-rdoc --no-ri
 ```
 ### Troubleshooting
@@ -45,7 +37,7 @@ The init command takes in an S3 bucket name and your AWS credentials. The S3 buc
 and will store your data and logs.
 ```bash
-~$ ./elasticrawl init your-s3-bucket
+~$ elasticrawl init your-s3-bucket
 Enter AWS Access Key ID: ************
 Enter AWS Secret Access Key: ************
@@ -62,7 +54,7 @@ Config complete
 The parse command takes in the crawl name and an optional number of segments and files to parse.
 ```bash
-~$ ./elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
+~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
 Segments
 Segment: 1416400372202.67 Files: 150
 Segment: 1416400372490.23 Files: 124
@@ -85,7 +77,7 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
 The combine command takes in the results of previous parse jobs and produces a combined set of results.
 ```bash
-~$ ./elasticrawl combine --input-jobs 1420124830792
+~$ elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
@@ -104,7 +96,7 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 The status command shows crawls and your job history.
 ```bash
-~$ ./elasticrawl status
+~$ elasticrawl status
 Crawl Status
 CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100
@@ -117,7 +109,7 @@ Job History (last 10)
 The reset comment resets a crawl so it is parsed again.
 ```bash
-~$ ./elasticrawl reset CC-MAIN-2015-48
+~$ elasticrawl reset CC-MAIN-2015-48
 Reset crawl? (y/n)
 y
  CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100
@@ -128,7 +120,7 @@ y
 The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.
 ```bash
-~$ ./elasticrawl destroy
+~$ elasticrawl destroy
 WARNING:
 Bucket s3://elasticrawl-test and its data will be deleted
@@ -158,14 +150,12 @@ stores your S3 bucket name and the config for the parse and combine jobs
 ## Development
-Elasticrawl is developed in Ruby and requires Ruby 2.0.0 or later (Ruby 2.2 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
+Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
 [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
 [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
 [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 2.0.0, 2.1.8, 2.2.4, 2.3.0
-The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.2 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
 ## TODO
 * Add support for Streaming and Pig jobs

data/elasticrawl.gemspec CHANGED

@@ -27,7 +27,7 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'thor', '~> 0.19.1'
   spec.add_development_dependency 'rake', '~> 10.4.2'
-  spec.add_development_dependency 'bundler', '~> 1.11.2'
+  spec.add_development_dependency 'bundler', '~> 1.12.5'
   spec.add_development_dependency 'rspec', '~> 3.4.0'
   spec.add_development_dependency 'database_cleaner', '~> 1.5.1'
   spec.add_development_dependency 'shoulda-matchers', '~> 3.0.1'
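
The bundler constraint uses the pessimistic operator, so `~> 1.12.5` accepts 1.12.5 and later 1.12.x releases but not 1.13.0. The sketch below, which uses only the standard `Gem::Requirement` and `Gem::Version` classes, illustrates what the old and new constraints admit; the version numbers tested are chosen for illustration.

```ruby
require 'rubygems'

old_requirement = Gem::Requirement.new('~> 1.11.2')
new_requirement = Gem::Requirement.new('~> 1.12.5')

# Illustrative bundler versions to check against each constraint.
%w[1.11.2 1.12.5 1.12.9 1.13.0].each do |candidate|
  version = Gem::Version.new(candidate)
  puts format('%-7s old: %-5s new: %s',
              candidate,
              old_requirement.satisfied_by?(version),
              new_requirement.satisfied_by?(version))
end
```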

data/lib/elasticrawl.rb CHANGED

@@ -7,8 +7,8 @@ require 'thor'
 module Elasticrawl
   # S3 locations
-  COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
-  COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+  COMMON_CRAWL_BUCKET = 'commoncrawl'
+  COMMON_CRAWL_PATH = 'crawl-data'
   SEGMENTS_PATH = 'segments'
   WARC_PATHS = 'warc.paths.gz'
   MAX_SEGMENTS = 256
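
To show how the changed constants relate to the segment URIs expected in the spec files of this diff, here is a minimal sketch. The module name `ElasticrawlSketch` and the helper `segment_s3_uri_for` are hypothetical; the gem's own implementation may compose the URI differently. Only the constant values are taken from the 1.1.6 side of the hunk above.

```ruby
# Hypothetical sketch, not the gem's code: constants copied from the 1.1.6
# side of the diff, helper name invented for illustration.
module ElasticrawlSketch
  COMMON_CRAWL_BUCKET = 'commoncrawl'
  COMMON_CRAWL_PATH = 'crawl-data'
  SEGMENTS_PATH = 'segments'

  # Builds the S3 URI of one crawl segment, with a trailing slash.
  def self.segment_s3_uri_for(crawl_name, segment_name)
    "s3://#{COMMON_CRAWL_BUCKET}/#{COMMON_CRAWL_PATH}/" \
      "#{crawl_name}/#{SEGMENTS_PATH}/#{segment_name}/"
  end
end

puts ElasticrawlSketch.segment_s3_uri_for('CC-MAIN-2014-49', '1416400372202.67')
```

With the 1.1.5 values (`aws-publicdatasets` and `common-crawl/crawl-data`), the same composition would yield the older URI shown on the removed spec lines.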

data/lib/elasticrawl/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Elasticrawl
-  VERSION = '1.1.5'
+  VERSION = '1.1.6'
 end

data/spec/unit/crawl_segment_spec.rb CHANGED

@@ -22,7 +22,7 @@ describe Elasticrawl::CrawlSegment, type: :model do
     it 'should have an s3 uri' do
       expect(subject.segment_s3_uri).to eq \
-        "s3://aws-publicdatasets/common-crawl/crawl-data/#{crawl.crawl_name}/segments/#{segment_name}/"
+        "s3://commoncrawl/crawl-data/#{crawl.crawl_name}/segments/#{segment_name}/"
     end
     it 'should have a file count' do

data/spec/unit/crawl_spec.rb CHANGED

@@ -35,7 +35,7 @@ describe Elasticrawl::Crawl, type: :model do
     it 'should create segment s3 uris' do
       expect(subject.crawl_segments[0].segment_s3_uri).to eq \
-        's3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/'
+        's3://commoncrawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/'
     end
     it 'should set file counts' do

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: elasticrawl
 version: !ruby/object:Gem::Version
-  version: 1.1.5
+  version: 1.1.6
 platform: ruby
 authors:
 - Ross Fairbanks
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-01-05 00:00:00.000000000 Z
+date: 2016-06-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -128,14 +128,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.11.2
+        version: 1.12.5
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.11.2
+        version: 1.12.5
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
@@ -251,7 +251,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.5.1
+rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
 summary: Launch AWS Elastic MapReduce jobs that process Common Crawl data.