elasticrawl 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+ data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+ SHA512:
+ metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+ data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
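For anyone who wants to check these digests against a locally downloaded copy of the gem, a minimal Ruby sketch follows. It assumes the gem has already been unpacked (a .gem file is a tar archive containing metadata.gz, data.tar.gz and checksums.yaml.gz, e.g. `tar -xf elasticrawl-1.1.0.gem`); the file names and YAML keys mirror the entries above.

```ruby
# Sketch: compare the SHA512 digests recorded in checksums.yaml.gz with the
# metadata.gz and data.tar.gz files unpacked from elasticrawl-1.1.0.gem.
require 'digest'
require 'yaml'
require 'zlib'

checksums = YAML.load(Zlib::GzipReader.open('checksums.yaml.gz') { |gz| gz.read })

%w[metadata.gz data.tar.gz].each do |file|
  expected = checksums['SHA512'][file]
  actual   = Digest::SHA512.file(file).hexdigest
  puts "#{file}: #{actual == expected ? 'OK' : 'MISMATCH'}"
end
```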
data/.travis.yml CHANGED
@@ -2,4 +2,4 @@ language: ruby
  rvm:
  - 1.9.3
  - 2.0.0
- - 2.1.0
+ - 2.1.5
data/README.md CHANGED
@@ -1,42 +1,53 @@
  # Elasticrawl

- Launch AWS Elastic MapReduce jobs that process Common Crawl data.
- Elasticrawl works with the latest Common Crawl data structure and file formats
- ([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
- Ships with a default configuration that launches the
- [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
- This is an implementation of the standard Hadoop Word Count example.
+ Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+ Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.

- ## Overview
+ | Crawl Name | Month | Web Pages
+ | -------------- |:--------:|:--------:|
+ | [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion
+ | [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion
+ | [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion
+ | [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion

- Common Crawl have released 2 web crawls of 2013 data. Further crawls will be released
- during 2014. Each crawl is split into multiple segments that contain 3 file types.
+ Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).

- * WARC - WARC files with the HTTP request and response for each fetch
- * WAT - WARC encoded files containing JSON metadata
- * WET - WARC encoded text extractions of the HTTP responses
-
- | Crawl Name | Date | Segments | Pages | Size (uncompressed) |
- | -------------- |:--------:|:--------:|:-------------:|:-------------------:|
- | CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
- | CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+ Ships with a default configuration that launches the
+ [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+ This is an implementation of the standard Hadoop Word Count example.

- Elasticrawl is a command line tool that automates launching Elastic MapReduce
- jobs against this data.
+ ## More Information

- [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
- [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+ * [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.

  ## Installation

  ### Dependencies

- Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+ Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
  Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
  and the ruby-build plugin is recommended.

+ A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+ gem requires the development headers to be installed.
+
+ ```bash
+ # OS X
+ brew install sqlite3
+
+ # CentOS
+ sudo yum install sqlite-devel
+
+ # Ubuntu
+ sudo apt-get install libsqlite3-dev
+ ```
+
  ### Install elasticrawl

+ [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+ [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+ [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
  ```bash
  ~$ gem install elasticrawl --no-rdoc --no-ri
  ```
@@ -48,21 +59,12 @@ to your path.
  ~$ rbenv rehash
  ```

- ## Quick Start
-
- In this example you'll launch 2 EMR jobs against a small portion of the Nov
- 2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
- time while your EC2 spot instances are provisioned and your Hadoop cluster is
- configured.
+ ## Commands

- You'll need to have an [AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html)
- to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+ ### elasticrawl init

- ### Setup
-
- You'll need to choose an S3 bucket name and enter your AWS access key and
- secret key. The S3 bucket will be used for storing data and logs. S3 bucket
- names must be unique, using hyphens rather than underscores is recommended.
+ Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+ and will store your data and logs.

  ```bash
  ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
  Config complete
  ```

- ### Parse Job
+ ### elasticrawl parse

- For this example you'll parse the first 2 WET files in the first 2 segments
- of the Nov 2013 crawl.
+ Parse takes in the crawl name and an optional number of segments and files to parse.

  ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --max-segments 2 --max-files 2
+ ~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+ Segments
+ Segment: 1416400372202.67 Files: 150
+ Segment: 1416400372490.23 Files: 124

  Job configuration
- Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
+ Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment

  Cluster configuration
  Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
- ```

- You can monitor the progress of your job in the Elastic MapReduce section
- of the AWS web console.
+ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+ ```

- ### Combine Job
+ ### elasticrawl combine

- The combine job will aggregate the word count results from both segments into
- a single set of files.
+ Combine takes in the results of previous parse jobs and produces a combined set of results.

  ```bash
- ~$ elasticrawl combine --input-jobs 1391458746774
-
+ ~$ elasticrawl combine --input-jobs 1420124830792
  Job configuration
  Combining: 2 segments

@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391459918730 Job Flow ID: j-GTJ2M7D1TXO6
+
+ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
  ```

- Once the combine job is complete you can download your results from the
- S3 section of the AWS web console. Your data will be stored in
+ ### elasticrawl status
+
+ Status shows crawls and your job history.

- [your S3 bucket]/data/2-combine/[job name]
+ ```bash
+ ~$ elasticrawl status
+ Crawl Status
+ CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+ Job History (last 10)
+ 1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+ ```

- ### Cleaning Up
+ ### elasticrawl reset

- You'll be charged by AWS for any data stored in your S3 bucket. The destroy
- command deletes your S3 bucket and the ~/.elasticrawl/ directory.
+ Reset a crawl so it is parsed again.
+
+ ```bash
+ ~$ elasticrawl reset CC-MAIN-2014-49
+ Reset crawl? (y/n)
+ y
+ CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+ ```
+
+ ### elasticrawl destroy
+
+ Destroy deletes your S3 bucket and the ~/.elasticrawl directory.

  ```bash
  ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
  The elasticrawl init command creates the ~/elasticrawl/ directory which
  contains

- * [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
+ * [aws.yml](https://github.com/rossf7/.elasticrawl/blob/master/templates/aws.yml) -
  stores your AWS access credentials. Or you can set the environment
  variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
  * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
  stores your S3 bucket name and the config for the parse and combine jobs

- ## Managing Segments
-
- Each Common Crawl segment is parsed as a separate EMR job step. This avoids
- overloading the job tracker and means if a job fails then only data from the
- current segment is lost. However an EMR job flow can only contain 256 steps.
- So to process an entire crawl multiple parse jobs must be combined.
-
- ```bash
- ~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
- ```
-
- You can use the status command to see details of crawls and jobs.
-
- ```bash
- ~$ elasticrawl status
-
- Crawl Status
- CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
- Job History (last 10)
- 1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
- 1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
- ```
-
- You can use the reset command to parse a crawl again.
-
- ```bash
- ~$ elasticrawl reset CC-MAIN-2013-48
-
- Reset crawl? (y/n)
- y
- CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
- ```
-
- To parse the same segments multiple times.
-
- ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
- ```
-
- ## Running your own Jobs
-
- 1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
- 2. Make your changes
- 3. Compile your changes into a JAR using Maven
- 4. Upload your JAR to your own S3 bucket
- 5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
  ## TODO

  * Add support for Streaming and Pig jobs

  ## Thanks

- * Thanks to everyone at Common Crawl for making this awesome dataset available.
+ * Thanks to everyone at Common Crawl for making this awesome dataset available!
  * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
  gem which provides a nice Ruby wrapper for the EMR REST API.

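The README's configuration section above notes that aws.yml stores your AWS credentials, or that the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables can be used instead. As a rough illustration only, a lookup with that fallback could look like the sketch below; this is not the gem's actual Config class, and the `access_key_id` / `secret_access_key` YAML keys are assumptions.

```ruby
# Hypothetical sketch of the credential fallback described for aws.yml.
# Not the gem's real implementation; the YAML key names are assumptions.
require 'yaml'

def aws_credentials(config_dir = File.join(Dir.home, '.elasticrawl'))
  if ENV['AWS_ACCESS_KEY_ID'] && ENV['AWS_SECRET_ACCESS_KEY']
    { 'access_key_id'     => ENV['AWS_ACCESS_KEY_ID'],
      'secret_access_key' => ENV['AWS_SECRET_ACCESS_KEY'] }
  else
    # aws.yml lives in the config dir created by `elasticrawl init`.
    YAML.load_file(File.join(config_dir, 'aws.yml'))
  end
end
```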
data/Vagrantfile CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
  "user_installs" => [
  {
  "user" => "vagrant",
- "rubies" => ["1.9.3-p484", "2.0.0-p353", "2.1.0"],
- "global" => "1.9.3-p484",
+ "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+ "global" => "2.1.5",
  "gems" => {
- "1.9.3-p484" => [
+ "1.9.3-p551" => [
  { "name" => "bundler" }
  ],
- "2.0.0-p353" => [
+ "2.0.0-p598" => [
  { "name" => "bundler" }
  ],
- "2.1.0" => [
+ "2.1.5" => [
  { "name" => "bundler" }
  ]
  }
@@ -2,7 +2,7 @@ class CreateCrawls < ActiveRecord::Migration
  def change
  create_table :crawls do |t|
  t.string :crawl_name
- t.timestamps
+ t.timestamps(:null => false)
  end

  add_index(:crawls, :crawl_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateCrawlSegments < ActiveRecord::Migration
  t.string :segment_name
  t.string :segment_s3_uri
  t.datetime :parse_time
- t.timestamps
+ t.timestamps(:null => false)
  end

  add_index(:crawl_segments, :segment_name, :unique => true)
@@ -6,7 +6,7 @@ class CreateJobs < ActiveRecord::Migration
  t.string :job_desc
  t.integer :max_files
  t.string :job_flow_id
- t.timestamps
+ t.timestamps(:null => false)
  end

  add_index(:jobs, :job_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateJobSteps < ActiveRecord::Migration
  t.references :crawl_segment
  t.text :input_paths
  t.text :output_path
- t.timestamps
+ t.timestamps(:null => false)
  end
  end
  end
@@ -0,0 +1,5 @@
+ class AddFileCountToCrawlSegments < ActiveRecord::Migration
+ def change
+ add_column(:crawl_segments, :file_count, :integer)
+ end
+ end
data/elasticrawl.gemspec CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ['lib']

- spec.add_dependency 'activerecord', '~> 4.0.2'
- spec.add_dependency 'activesupport', '~> 4.0.2'
- spec.add_dependency 'aws-sdk', '~> 1.0'
- spec.add_dependency 'elasticity', '~> 2.7'
- spec.add_dependency 'highline', '~> 1.6.20'
- spec.add_dependency 'sqlite3', '~> 1.3.8'
- spec.add_dependency 'thor', '~> 0.18.1'
+ spec.add_dependency 'activerecord', '~> 4.2'
+ spec.add_dependency 'activesupport', '~> 4.2'
+ spec.add_dependency 'aws-sdk', '~> 1.60'
+ spec.add_dependency 'elasticity', '~> 4.0'
+ spec.add_dependency 'highline', '~> 1.6'
+ spec.add_dependency 'sqlite3', '~> 1.3'
+ spec.add_dependency 'thor', '~> 0.19'

  spec.add_development_dependency 'rake'
  spec.add_development_dependency 'bundler', '~> 1.3'
- spec.add_development_dependency 'rspec', '~> 2.14.1'
- spec.add_development_dependency 'mocha', '~> 1.0.0'
- spec.add_development_dependency 'database_cleaner', '~> 1.2.0'
- spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+ spec.add_development_dependency 'rspec', '~> 3.1'
+ spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+ spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
  end
data/lib/elasticrawl.rb CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
  require 'thor'

  module Elasticrawl
+ # S3 locations
+ COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+ COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+ SEGMENTS_PATH = 'segments'
+ WARC_PATHS = 'warc.paths.gz'
+ MAX_SEGMENTS = 256
+
  require 'elasticrawl/version'

  require 'elasticrawl/config'
@@ -13,7 +13,7 @@ module Elasticrawl
  config = Config.new
  job_flow = Elasticity::JobFlow.new(config.access_key_id,
  config.secret_access_key)
- job_flow.name = "Job Name: #{job.job_name} #{job.job_desc}"
+ job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
  job_flow.log_uri = job.log_uri

  configure_job_flow(job_flow)
@@ -5,11 +5,6 @@ module Elasticrawl
  class Crawl < ActiveRecord::Base
  has_many :crawl_segments

- COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
- COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
- SEGMENTS_PATH = '/segments/'
- MAX_SEGMENTS = 256
-
  # Returns the status of all saved crawls and the current job history.
  def self.status(show_all = false)
  status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
  end
  end

- # Creates crawl segments from their S3 paths and returns the segment count.
+ # Creates crawl segments from the warc.paths file for this crawl.
  def create_segments
- segment_paths = s3_segment_paths(self.crawl_name)
- save if segment_paths.count > 0
- segment_paths.map { |s3_path| create_segment(s3_path) }
+ file_paths = warc_paths(self.crawl_name)
+
+ segments = parse_segments(file_paths)
+ save if segments.count > 0
+
+ segments.keys.each do |segment_name|
+ file_count = segments[segment_name]
+ CrawlSegment.create_segment(self, segment_name, file_count)
+ end

- segment_paths.count
+ segments.count
  end

  # Returns the list of segments from the database.
@@ -68,8 +69,8 @@ module Elasticrawl
  # Returns next # segments to be parsed. The maximum is 256
  # as this is the maximum # of steps for an Elastic MapReduce job flow.
  def next_segments(max_segments = nil)
- max_segments = MAX_SEGMENTS if max_segments.nil?
- max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS

  self.crawl_segments.where(:parse_time => nil).limit(max_segments)
  end
@@ -85,30 +86,47 @@ module Elasticrawl
  end

  private
- # Creates a crawl segment based on its S3 path if it does not exist.
- def create_segment(s3_path)
- segment_name = s3_path.split('/').last
- segment_s3_uri = URI::Generic.build(:scheme => 's3',
- :host => COMMON_CRAWL_BUCKET,
- :path => "/#{s3_path}").to_s
-
- segment = CrawlSegment.where(:crawl_id => self.id,
- :segment_name => segment_name,
- :segment_s3_uri => segment_s3_uri).first_or_create
+ # Gets the WARC file paths from S3 for this crawl if it exists.
+ def warc_paths(crawl_name)
+ s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+ crawl_name,
+ Elasticrawl::WARC_PATHS].join('/')
+
+ s3 = AWS::S3.new
+ bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+ object = bucket.objects[s3_path]
+
+ uncompress_file(object)
  end

- # Returns a list of S3 paths for the crawl name.
- def s3_segment_paths(crawl_name)
- s3_segment_tree(crawl_name).children.collect(&:prefix)
+ # Takes in a S3 object and returns the contents as an uncompressed string.
+ def uncompress_file(s3_object)
+ result = ''
+
+ if s3_object.exists?
+ io = StringIO.new
+ io.write(s3_object.read)
+ io.rewind
+
+ gz = Zlib::GzipReader.new(io)
+ result = gz.read
+
+ gz.close
+ end
+
+ result
  end

- # Calls the S3 API and returns the tree structure for the crawl name.
- def s3_segment_tree(crawl_name)
- crawl_path = [COMMON_CRAWL_PATH, crawl_name, SEGMENTS_PATH].join
+ # Parses the segment names and file counts from the WARC file paths.
+ def parse_segments(warc_paths)
+ segments = Hash.new 0

- s3 = AWS::S3.new
- bucket = s3.buckets[COMMON_CRAWL_BUCKET]
- bucket.as_tree(:prefix => crawl_path)
+ warc_paths.split.each do |warc_path|
+ segment_name = warc_path.split('/')[4]
+ segments[segment_name] += 1 if segment_name.present?
+ end
+
+ segments
  end
  end
  end
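To make the new segment discovery above concrete, here is a small standalone sketch of the same counting logic applied to a few sample warc.paths entries. The file names are illustrative only; the path layout (segment name as the fifth component) matches what parse_segments assumes, and a plain nil check stands in for ActiveSupport's present?.

```ruby
# Standalone illustration of parse_segments: count WARC files per segment
# from the uncompressed contents of a crawl's warc.paths file.
# The sample paths below are made up but follow the expected layout.
warc_paths = [
  'common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00000.warc.gz',
  'common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00001.warc.gz',
  'common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372490.23/warc/example-00000.warc.gz'
].join("\n")

segments = Hash.new(0)

warc_paths.split.each do |warc_path|
  # Path components: common-crawl / crawl-data / <crawl name> / segments / <segment name> / ...
  segment_name = warc_path.split('/')[4]
  segments[segment_name] += 1 unless segment_name.nil?
end

p segments
# => {"1416400372202.67"=>2, "1416400372490.23"=>1}
```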