elasticrawl 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.travis.yml +1 -1
- data/README.md +77 -108
- data/Vagrantfile +5 -5
- data/db/migrate/201401051536_create_crawls.rb +1 -1
- data/db/migrate/201401051855_create_crawl_segments.rb +1 -1
- data/db/migrate/201401101723_create_jobs.rb +1 -1
- data/db/migrate/201401141606_create_job_steps.rb +1 -1
- data/db/migrate/201412311554_add_file_count_to_crawl_segments.rb +5 -0
- data/elasticrawl.gemspec +10 -11
- data/lib/elasticrawl.rb +7 -0
- data/lib/elasticrawl/cluster.rb +1 -1
- data/lib/elasticrawl/crawl.rb +49 -31
- data/lib/elasticrawl/crawl_segment.rb +30 -0
- data/lib/elasticrawl/job.rb +13 -6
- data/lib/elasticrawl/job_step.rb +5 -3
- data/lib/elasticrawl/parse_job.rb +14 -0
- data/lib/elasticrawl/version.rb +1 -1
- data/spec/fixtures/warc.paths +6 -0
- data/spec/spec_helper.rb +8 -14
- data/spec/unit/cluster_spec.rb +2 -2
- data/spec/unit/combine_job_spec.rb +4 -4
- data/spec/unit/crawl_segment_spec.rb +19 -10
- data/spec/unit/crawl_spec.rb +21 -16
- data/spec/unit/job_step_spec.rb +4 -4
- data/spec/unit/parse_job_spec.rb +20 -14
- metadata +56 -101
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+  data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+SHA512:
+  metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+  data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
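These digests let consumers verify the published payload. A minimal verification sketch in Ruby, assuming `metadata.gz` and `data.tar.gz` have been extracted from a downloaded `elasticrawl-1.1.0.gem` (a `.gem` file is a tar archive):

```ruby
require 'digest'

# Recompute the digests recorded in checksums.yaml and compare by eye.
# Assumes metadata.gz and data.tar.gz sit in the working directory.
%w[metadata.gz data.tar.gz].each do |file|
  puts "#{file} SHA1:   #{Digest::SHA1.file(file).hexdigest}"
  puts "#{file} SHA512: #{Digest::SHA512.file(file).hexdigest}"
end
```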
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -1,42 +1,53 @@
 # Elasticrawl
 
-
-Elasticrawl
-([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
-Ships with a default configuration that launches the
-[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
-This is an implementation of the standard Hadoop Word Count example.
+Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.
 
-
+| Crawl Name | Month | Web Pages |
+| -------------- |:--------:|:--------:|
+| [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion |
+| [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion |
+| [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion |
+| [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion |
 
-Common Crawl
-during 2014. Each crawl is split into multiple segments that contain 3 file types.
+Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).
 
-
-
-
-
-| Crawl Name | Date | Segments | Pages | Size (uncompressed) |
-| -------------- |:--------:|:--------:|:-------------:|:-------------------:|
-| CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
-| CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+Ships with a default configuration that launches the
+[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+This is an implementation of the standard Hadoop Word Count example.
 
-
-jobs against this data.
+## More Information
 
-[
-[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
 
 ## Installation
 
 ### Dependencies
 
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
 Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
 and the ruby-build plugin is recommended.
 
+A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+gem requires the development headers to be installed.
+
+```bash
+# OS X
+brew install sqlite3
+
+# CentOS
+sudo yum install sqlite-devel
+
+# Ubuntu
+sudo apt-get install libsqlite3-dev
+```
+
 ### Install elasticrawl
 
+[](http://badge.fury.io/rb/elasticrawl)
+[](https://codeclimate.com/github/rossf7/elasticrawl)
+[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
 ```bash
 ~$ gem install elasticrawl --no-rdoc --no-ri
 ```
@@ -48,21 +59,12 @@ to your path.
 ~$ rbenv rehash
 ```
 
-##
-
-In this example you'll launch 2 EMR jobs against a small portion of the Nov
-2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
-time while your EC2 spot instances are provisioned and your Hadoop cluster is
-configured.
+## Commands
 
-
-to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+### elasticrawl init
 
-
-
-You'll need to choose an S3 bucket name and enter your AWS access key and
-secret key. The S3 bucket will be used for storing data and logs. S3 bucket
-names must be unique, using hyphens rather than underscores is recommended.
+Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+and will store your data and logs.
 
 ```bash
 ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
 Config complete
 ```
 
-###
+### elasticrawl parse
 
-
-of the Nov 2013 crawl.
+Parse takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-
+~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+Segments
+Segment: 1416400372202.67 Files: 150
+Segment: 1416400372490.23 Files: 124
 
 Job configuration
-Crawl: CC-MAIN-
+Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
 
 Cluster configuration
 Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
-```
 
-
-
+Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+```
 
-###
+### elasticrawl combine
 
-
-a single set of files.
+Combine takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs
-
+~$ elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-
+
+Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 ```
 
-
-
+### elasticrawl status
+
+Status shows crawls and your job history.
 
-
+```bash
+~$ elasticrawl status
+Crawl Status
+CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+Job History (last 10)
+1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+```
 
-###
+### elasticrawl reset
 
-
-
+Reset a crawl so it is parsed again.
+
+```bash
+~$ elasticrawl reset CC-MAIN-2014-49
+Reset crawl? (y/n)
+y
+CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+```
+
+### elasticrawl destroy
+
+Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
 ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
 The elasticrawl init command creates the ~/.elasticrawl/ directory which
 contains
 
-* [aws.yml](https://github.com/rossf7
+* [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
 stores your AWS access credentials. Or you can set the environment
 variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
 
@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
-## Managing Segments
-
-Each Common Crawl segment is parsed as a separate EMR job step. This avoids
-overloading the job tracker and means if a job fails then only data from the
-current segment is lost. However an EMR job flow can only contain 256 steps.
-So to process an entire crawl multiple parse jobs must be combined.
-
-```bash
-~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
-```
-
-You can use the status command to see details of crawls and jobs.
-
-```bash
-~$ elasticrawl status
-
-Crawl Status
-CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
-Job History (last 10)
-1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
-1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
-```
-
-You can use the reset command to parse a crawl again.
-
-```bash
-~$ elasticrawl reset CC-MAIN-2013-48
-
-Reset crawl? (y/n)
-y
-CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
-```
-
-To parse the same segments multiple times.
-
-```bash
-~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
-```
-
-## Running your own Jobs
-
-1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
-2. Make your changes
-3. Compile your changes into a JAR using Maven
-4. Upload your JAR to your own S3 bucket
-5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
 ## TODO
 
 * Add support for Streaming and Pig jobs
 
 ## Thanks
 
-* Thanks to everyone at Common Crawl for making this awesome dataset available
+* Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
 
data/Vagrantfile
CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
 "user_installs" => [
   {
     "user" => "vagrant",
-    "rubies" => ["1.9.3-
-    "global" => "1.
+    "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+    "global" => "2.1.5",
     "gems" => {
-      "1.9.3-
+      "1.9.3-p551" => [
         { "name" => "bundler" }
       ],
-      "2.0.0-
+      "2.0.0-p598" => [
         { "name" => "bundler" }
       ],
-      "2.1.
+      "2.1.5" => [
         { "name" => "bundler" }
       ]
     }
data/elasticrawl.gemspec
CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ['lib']
 
-  spec.add_dependency 'activerecord', '~> 4.
-  spec.add_dependency 'activesupport', '~> 4.
-  spec.add_dependency 'aws-sdk', '~> 1.
-  spec.add_dependency 'elasticity', '~>
-  spec.add_dependency 'highline', '~> 1.6
-  spec.add_dependency 'sqlite3', '~> 1.3
-  spec.add_dependency 'thor', '~> 0.
+  spec.add_dependency 'activerecord', '~> 4.2'
+  spec.add_dependency 'activesupport', '~> 4.2'
+  spec.add_dependency 'aws-sdk', '~> 1.60'
+  spec.add_dependency 'elasticity', '~> 4.0'
+  spec.add_dependency 'highline', '~> 1.6'
+  spec.add_dependency 'sqlite3', '~> 1.3'
+  spec.add_dependency 'thor', '~> 0.19'
 
   spec.add_development_dependency 'rake'
   spec.add_development_dependency 'bundler', '~> 1.3'
-  spec.add_development_dependency 'rspec', '~>
-  spec.add_development_dependency '
-  spec.add_development_dependency '
-  spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+  spec.add_development_dependency 'rspec', '~> 3.1'
+  spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+  spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
 end
data/lib/elasticrawl.rb
CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
 require 'thor'
 
 module Elasticrawl
+  # S3 locations
+  COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+  COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+  SEGMENTS_PATH = 'segments'
+  WARC_PATHS = 'warc.paths.gz'
+  MAX_SEGMENTS = 256
+
   require 'elasticrawl/version'
 
   require 'elasticrawl/config'
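These module-level constants replace the ones previously defined on the Crawl class (removed in the crawl.rb diff below), so all classes can share them. A minimal sketch of the S3 location they compose to, using CC-MAIN-2014-49 from the README as the example crawl name; the joining logic mirrors the `warc_paths` method added further down:

```ruby
# Sketch only: compose the warc.paths location for one crawl.
crawl_name = 'CC-MAIN-2014-49'
s3_path = [Elasticrawl::COMMON_CRAWL_PATH, crawl_name, Elasticrawl::WARC_PATHS].join('/')
# => "common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz"
# i.e. s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz
```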
data/lib/elasticrawl/cluster.rb
CHANGED
@@ -13,7 +13,7 @@ module Elasticrawl
       config = Config.new
       job_flow = Elasticity::JobFlow.new(config.access_key_id,
                                          config.secret_access_key)
-      job_flow.name = "Job
+      job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
       job_flow.log_uri = job.log_uri
 
       configure_job_flow(job_flow)
data/lib/elasticrawl/crawl.rb
CHANGED
@@ -5,11 +5,6 @@ module Elasticrawl
   class Crawl < ActiveRecord::Base
     has_many :crawl_segments
 
-    COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
-    COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
-    SEGMENTS_PATH = '/segments/'
-    MAX_SEGMENTS = 256
-
     # Returns the status of all saved crawls and the current job history.
     def self.status(show_all = false)
       status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
       end
     end
 
-    # Creates crawl segments from
+    # Creates crawl segments from the warc.paths file for this crawl.
     def create_segments
-
-
-
+      file_paths = warc_paths(self.crawl_name)
+
+      segments = parse_segments(file_paths)
+      save if segments.count > 0
+
+      segments.keys.each do |segment_name|
+        file_count = segments[segment_name]
+        CrawlSegment.create_segment(self, segment_name, file_count)
+      end
 
-
+      segments.count
     end
 
     # Returns the list of segments from the database.
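For orientation, a hypothetical call into the model above; the elasticrawl parse command normally drives this, `find_or_create_by` is standard ActiveRecord, and AWS credentials must already be configured because `warc_paths` reads from S3:

```ruby
# Hypothetical usage of Crawl#create_segments.
crawl = Crawl.find_or_create_by(:crawl_name => 'CC-MAIN-2014-49')
count = crawl.create_segments  # fetches warc.paths.gz, saves the crawl and its segments
puts "#{count} segments created"
```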
@@ -68,8 +69,8 @@ module Elasticrawl
     # Returns next # segments to be parsed. The maximum is 256
     # as this is the maximum # of steps for an Elastic MapReduce job flow.
     def next_segments(max_segments = nil)
-      max_segments = MAX_SEGMENTS if max_segments.nil?
-      max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS
 
       self.crawl_segments.where(:parse_time => nil).limit(max_segments)
     end
@@ -85,30 +86,47 @@ module Elasticrawl
     end
 
     private
-    #
-    def
-
-
-
-
-
-
-
-
+    # Gets the WARC file paths from S3 for this crawl if it exists.
+    def warc_paths(crawl_name)
+      s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+                 crawl_name,
+                 Elasticrawl::WARC_PATHS].join('/')
+
+      s3 = AWS::S3.new
+      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+      object = bucket.objects[s3_path]
+
+      uncompress_file(object)
     end
 
-    #
-    def
-
+    # Takes in a S3 object and returns the contents as an uncompressed string.
+    def uncompress_file(s3_object)
+      result = ''
+
+      if s3_object.exists?
+        io = StringIO.new
+        io.write(s3_object.read)
+        io.rewind
+
+        gz = Zlib::GzipReader.new(io)
+        result = gz.read
+
+        gz.close
+      end
+
+      result
     end
 
-    #
-    def
-
+    # Parses the segment names and file counts from the WARC file paths.
+    def parse_segments(warc_paths)
+      segments = Hash.new 0
 
-
-
-
+      warc_paths.split.each do |warc_path|
+        segment_name = warc_path.split('/')[4]
+        segments[segment_name] += 1 if segment_name.present?
+      end
+
+      segments
     end
   end
 end