elasticrawl 1.0.0 → 1.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+ data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+ SHA512:
+ metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+ data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
data/.travis.yml CHANGED
@@ -2,4 +2,4 @@ language: ruby
  rvm:
  - 1.9.3
  - 2.0.0
- - 2.1.0
+ - 2.1.5
data/README.md CHANGED
@@ -1,42 +1,53 @@
  # Elasticrawl
 
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.
- Elasticrawl works with the latest Common Crawl data structure and file formats
- ([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
- Ships with a default configuration that launches the
- [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
- This is an implementation of the standard Hadoop Word Count example.
+ Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+ Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.
 
- ## Overview
+ | Crawl Name | Month | Web Pages
+ | -------------- |:--------:|:--------:|
+ | [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion
+ | [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion
+ | [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion
+ | [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion
 
- Common Crawl have released 2 web crawls of 2013 data. Further crawls will be released
- during 2014. Each crawl is split into multiple segments that contain 3 file types.
+ Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).
 
- * WARC - WARC files with the HTTP request and response for each fetch
- * WAT - WARC encoded files containing JSON metadata
- * WET - WARC encoded text extractions of the HTTP responses
-
- | Crawl Name | Date | Segments | Pages | Size (uncompressed) |
- | -------------- |:--------:|:--------:|:-------------:|:-------------------:|
- | CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
- | CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+ Ships with a default configuration that launches the
+ [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+ This is an implementation of the standard Hadoop Word Count example.
 
- Elasticrawl is a command line tool that automates launching Elastic MapReduce
- jobs against this data.
+ ## More Information
 
- [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
- [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+ * [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
 
  ## Installation
 
  ### Dependencies
 
- Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+ Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
  Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
  and the ruby-build plugin is recommended.
 
+ A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+ gem requires the development headers to be installed.
+
+ ```bash
+ # OS X
+ brew install sqlite3
+
+ # CentOS
+ sudo yum install sqlite-devel
+
+ # Ubuntu
+ sudo apt-get install libsqlite3-dev
+ ```
+
  ### Install elasticrawl
 
+ [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+ [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+ [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
  ```bash
  ~$ gem install elasticrawl --no-rdoc --no-ri
  ```
@@ -48,21 +59,12 @@ to your path.
  ~$ rbenv rehash
  ```
 
- ## Quick Start
-
- In this example you'll launch 2 EMR jobs against a small portion of the Nov
- 2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
- time while your EC2 spot instances are provisioned and your Hadoop cluster is
- configured.
+ ## Commands
 
- You'll need to have an [AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html)
- to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+ ### elasticrawl init
 
- ### Setup
-
- You'll need to choose an S3 bucket name and enter your AWS access key and
- secret key. The S3 bucket will be used for storing data and logs. S3 bucket
- names must be unique, using hyphens rather than underscores is recommended.
+ Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+ and will store your data and logs.
 
  ```bash
  ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
  Config complete
  ```
 
- ### Parse Job
+ ### elasticrawl parse
 
- For this example you'll parse the first 2 WET files in the first 2 segments
- of the Nov 2013 crawl.
+ Parse takes in the crawl name and an optional number of segments and files to parse.
 
  ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --max-segments 2 --max-files 2
+ ~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+ Segments
+ Segment: 1416400372202.67 Files: 150
+ Segment: 1416400372490.23 Files: 124
 
  Job configuration
- Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
+ Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
 
  Cluster configuration
  Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
- ```
 
- You can monitor the progress of your job in the Elastic MapReduce section
- of the AWS web console.
+ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+ ```
 
- ### Combine Job
+ ### elasticrawl combine
 
- The combine job will aggregate the word count results from both segments into
- a single set of files.
+ Combine takes in the results of previous parse jobs and produces a combined set of results.
 
  ```bash
- ~$ elasticrawl combine --input-jobs 1391458746774
-
+ ~$ elasticrawl combine --input-jobs 1420124830792
  Job configuration
  Combining: 2 segments
 
@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391459918730 Job Flow ID: j-GTJ2M7D1TXO6
+
+ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
  ```
 
- Once the combine job is complete you can download your results from the
- S3 section of the AWS web console. Your data will be stored in
+ ### elasticrawl status
+
+ Status shows crawls and your job history.
 
- [your S3 bucket]/data/2-combine/[job name]
+ ```bash
+ ~$ elasticrawl status
+ Crawl Status
+ CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+ Job History (last 10)
+ 1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+ ```
 
- ### Cleaning Up
+ ### elasticrawl reset
 
- You'll be charged by AWS for any data stored in your S3 bucket. The destroy
- command deletes your S3 bucket and the ~/.elasticrawl/ directory.
+ Reset a crawl so it is parsed again.
+
+ ```bash
+ ~$ elasticrawl reset CC-MAIN-2014-49
+ Reset crawl? (y/n)
+ y
+ CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+ ```
+
+ ### elasticrawl destroy
+
+ Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
 
  ```bash
  ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
  The elasticrawl init command creates the ~/.elasticrawl/ directory which
  contains
 
- * [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
+ * [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
  stores your AWS access credentials. Or you can set the environment
  variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
 
@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
  * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
  stores your S3 bucket name and the config for the parse and combine jobs
 
- ## Managing Segments
-
- Each Common Crawl segment is parsed as a separate EMR job step. This avoids
- overloading the job tracker and means if a job fails then only data from the
- current segment is lost. However an EMR job flow can only contain 256 steps.
- So to process an entire crawl multiple parse jobs must be combined.
-
- ```bash
- ~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
- ```
-
- You can use the status command to see details of crawls and jobs.
-
- ```bash
- ~$ elasticrawl status
-
- Crawl Status
- CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
- Job History (last 10)
- 1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
- 1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
- ```
-
- You can use the reset command to parse a crawl again.
-
- ```bash
- ~$ elasticrawl reset CC-MAIN-2013-48
-
- Reset crawl? (y/n)
- y
- CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
- ```
-
- To parse the same segments multiple times.
-
- ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
- ```
-
- ## Running your own Jobs
-
- 1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
- 2. Make your changes
- 3. Compile your changes into a JAR using Maven
- 4. Upload your JAR to your own S3 bucket
- 5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
  ## TODO
 
  * Add support for Streaming and Pig jobs
 
  ## Thanks
 
- * Thanks to everyone at Common Crawl for making this awesome dataset available.
+ * Thanks to everyone at Common Crawl for making this awesome dataset available!
  * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
  gem which provides a nice Ruby wrapper for the EMR REST API.
 
data/Vagrantfile CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
  "user_installs" => [
  {
  "user" => "vagrant",
- "rubies" => ["1.9.3-p484", "2.0.0-p353", "2.1.0"],
- "global" => "1.9.3-p484",
+ "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+ "global" => "2.1.5",
  "gems" => {
- "1.9.3-p484" => [
+ "1.9.3-p551" => [
  { "name" => "bundler" }
  ],
- "2.0.0-p353" => [
+ "2.0.0-p598" => [
  { "name" => "bundler" }
  ],
- "2.1.0" => [
+ "2.1.5" => [
  { "name" => "bundler" }
  ]
  }
@@ -2,7 +2,7 @@ class CreateCrawls < ActiveRecord::Migration
  def change
  create_table :crawls do |t|
  t.string :crawl_name
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:crawls, :crawl_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateCrawlSegments < ActiveRecord::Migration
  t.string :segment_name
  t.string :segment_s3_uri
  t.datetime :parse_time
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:crawl_segments, :segment_name, :unique => true)
@@ -6,7 +6,7 @@ class CreateJobs < ActiveRecord::Migration
  t.string :job_desc
  t.integer :max_files
  t.string :job_flow_id
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:jobs, :job_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateJobSteps < ActiveRecord::Migration
  t.references :crawl_segment
  t.text :input_paths
  t.text :output_path
- t.timestamps
+ t.timestamps(:null => false)
  end
  end
  end
@@ -0,0 +1,5 @@
+ class AddFileCountToCrawlSegments < ActiveRecord::Migration
+ def change
+ add_column(:crawl_segments, :file_count, :integer)
+ end
+ end
data/elasticrawl.gemspec CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ['lib']
 
- spec.add_dependency 'activerecord', '~> 4.0.2'
- spec.add_dependency 'activesupport', '~> 4.0.2'
- spec.add_dependency 'aws-sdk', '~> 1.0'
- spec.add_dependency 'elasticity', '~> 2.7'
- spec.add_dependency 'highline', '~> 1.6.20'
- spec.add_dependency 'sqlite3', '~> 1.3.8'
- spec.add_dependency 'thor', '~> 0.18.1'
+ spec.add_dependency 'activerecord', '~> 4.2'
+ spec.add_dependency 'activesupport', '~> 4.2'
+ spec.add_dependency 'aws-sdk', '~> 1.60'
+ spec.add_dependency 'elasticity', '~> 4.0'
+ spec.add_dependency 'highline', '~> 1.6'
+ spec.add_dependency 'sqlite3', '~> 1.3'
+ spec.add_dependency 'thor', '~> 0.19'
 
  spec.add_development_dependency 'rake'
  spec.add_development_dependency 'bundler', '~> 1.3'
- spec.add_development_dependency 'rspec', '~> 2.14.1'
- spec.add_development_dependency 'mocha', '~> 1.0.0'
- spec.add_development_dependency 'database_cleaner', '~> 1.2.0'
- spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+ spec.add_development_dependency 'rspec', '~> 3.1'
+ spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+ spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
  end
data/lib/elasticrawl.rb CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
  require 'thor'
 
  module Elasticrawl
+ # S3 locations
+ COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+ COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+ SEGMENTS_PATH = 'segments'
+ WARC_PATHS = 'warc.paths.gz'
+ MAX_SEGMENTS = 256
+
  require 'elasticrawl/version'
 
  require 'elasticrawl/config'
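
The constants added here are used by the new `Crawl#warc_paths` method further down in this diff. As a rough illustration only (not part of the gem's code), this is how they compose the S3 key of a crawl's WARC path listing; the crawl name is the one used in the README examples above:

```ruby
# Illustrative sketch: shows how the new module constants combine into the
# S3 key that Crawl#warc_paths (below) looks up in the public Common Crawl bucket.
require 'elasticrawl'

crawl_name = 'CC-MAIN-2014-49' # crawl used in the README examples above

s3_key = [Elasticrawl::COMMON_CRAWL_PATH,
          crawl_name,
          Elasticrawl::WARC_PATHS].join('/')
# => "common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz"

# Full location: "s3://#{Elasticrawl::COMMON_CRAWL_BUCKET}/#{s3_key}"
```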
@@ -13,7 +13,7 @@ module Elasticrawl
  config = Config.new
  job_flow = Elasticity::JobFlow.new(config.access_key_id,
  config.secret_access_key)
- job_flow.name = "Job Name: #{job.job_name} #{job.job_desc}"
+ job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
  job_flow.log_uri = job.log_uri
 
  configure_job_flow(job_flow)
@@ -5,11 +5,6 @@ module Elasticrawl
  class Crawl < ActiveRecord::Base
  has_many :crawl_segments
 
- COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
- COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
- SEGMENTS_PATH = '/segments/'
- MAX_SEGMENTS = 256
-
  # Returns the status of all saved crawls and the current job history.
  def self.status(show_all = false)
  status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
  end
  end
 
- # Creates crawl segments from their S3 paths and returns the segment count.
+ # Creates crawl segments from the warc.paths file for this crawl.
  def create_segments
- segment_paths = s3_segment_paths(self.crawl_name)
- save if segment_paths.count > 0
- segment_paths.map { |s3_path| create_segment(s3_path) }
+ file_paths = warc_paths(self.crawl_name)
+
+ segments = parse_segments(file_paths)
+ save if segments.count > 0
+
+ segments.keys.each do |segment_name|
+ file_count = segments[segment_name]
+ CrawlSegment.create_segment(self, segment_name, file_count)
+ end
 
- segment_paths.count
+ segments.count
  end
  end
  # Returns the list of segments from the database.
@@ -68,8 +69,8 @@ module Elasticrawl
  # Returns next # segments to be parsed. The maximum is 256
  # as this is the maximum # of steps for an Elastic MapReduce job flow.
  def next_segments(max_segments = nil)
- max_segments = MAX_SEGMENTS if max_segments.nil?
- max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS
 
  self.crawl_segments.where(:parse_time => nil).limit(max_segments)
  end
@@ -85,30 +86,47 @@ module Elasticrawl
  end
 
  private
- # Creates a crawl segment based on its S3 path if it does not exist.
- def create_segment(s3_path)
- segment_name = s3_path.split('/').last
- segment_s3_uri = URI::Generic.build(:scheme => 's3',
- :host => COMMON_CRAWL_BUCKET,
- :path => "/#{s3_path}").to_s
-
- segment = CrawlSegment.where(:crawl_id => self.id,
- :segment_name => segment_name,
- :segment_s3_uri => segment_s3_uri).first_or_create
+ # Gets the WARC file paths from S3 for this crawl if it exists.
+ def warc_paths(crawl_name)
+ s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+ crawl_name,
+ Elasticrawl::WARC_PATHS].join('/')
+
+ s3 = AWS::S3.new
+ bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+ object = bucket.objects[s3_path]
+
+ uncompress_file(object)
  end
 
- # Returns a list of S3 paths for the crawl name.
- def s3_segment_paths(crawl_name)
- s3_segment_tree(crawl_name).children.collect(&:prefix)
+ # Takes in a S3 object and returns the contents as an uncompressed string.
+ def uncompress_file(s3_object)
+ result = ''
+
+ if s3_object.exists?
+ io = StringIO.new
+ io.write(s3_object.read)
+ io.rewind
+
+ gz = Zlib::GzipReader.new(io)
+ result = gz.read
+
+ gz.close
+ end
+
+ result
  end
 
- # Calls the S3 API and returns the tree structure for the crawl name.
- def s3_segment_tree(crawl_name)
- crawl_path = [COMMON_CRAWL_PATH, crawl_name, SEGMENTS_PATH].join
+ # Parses the segment names and file counts from the WARC file paths.
+ def parse_segments(warc_paths)
+ segments = Hash.new 0
 
- s3 = AWS::S3.new
- bucket = s3.buckets[COMMON_CRAWL_BUCKET]
- bucket.as_tree(:prefix => crawl_path)
+ warc_paths.split.each do |warc_path|
+ segment_name = warc_path.split('/')[4]
+ segments[segment_name] += 1 if segment_name.present?
+ end
+
+ segments
  end
  end
  end
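
For readers skimming the diff, here is a small standalone sketch (not part of the gem) of what the new `parse_segments` method does with the uncompressed warc.paths listing. The segment IDs match the README output above; the trailing file names are made up for illustration:

```ruby
# Plain-Ruby illustration of the parse_segments logic above.
# Each warc.paths line has the form:
#   common-crawl/crawl-data/<crawl>/segments/<segment>/warc/<file>.warc.gz
# so index 4 of the '/'-split path is the segment name.
warc_paths = <<-PATHS
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00000.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00001.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372490.23/warc/example-00000.warc.gz
PATHS

segments = Hash.new 0

warc_paths.split.each do |warc_path|
  segment_name = warc_path.split('/')[4]
  segments[segment_name] += 1 unless segment_name.nil? # gem uses ActiveSupport's present?
end

segments
# => {"1416400372202.67"=>2, "1416400372490.23"=>1}
```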