elasticrawl 1.0.0 → 1.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+ data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+ SHA512:
+ metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+ data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
data/.travis.yml CHANGED
@@ -2,4 +2,4 @@ language: ruby
  rvm:
  - 1.9.3
  - 2.0.0
- - 2.1.0
+ - 2.1.5
data/README.md CHANGED
@@ -1,42 +1,53 @@
  # Elasticrawl
 
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.
- Elasticrawl works with the latest Common Crawl data structure and file formats
- ([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
- Ships with a default configuration that launches the
- [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
- This is an implementation of the standard Hadoop Word Count example.
+ Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+ Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.
 
- ## Overview
+ | Crawl Name | Month | Web Pages
+ | -------------- |:--------:|:--------:|
+ | [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion
+ | [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion
+ | [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion
+ | [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion
 
- Common Crawl have released 2 web crawls of 2013 data. Further crawls will be released
- during 2014. Each crawl is split into multiple segments that contain 3 file types.
+ Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).
 
- * WARC - WARC files with the HTTP request and response for each fetch
- * WAT - WARC encoded files containing JSON metadata
- * WET - WARC encoded text extractions of the HTTP responses
-
- | Crawl Name | Date | Segments | Pages | Size (uncompressed) |
- | -------------- |:--------:|:--------:|:-------------:|:-------------------:|
- | CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
- | CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+ Ships with a default configuration that launches the
+ [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+ This is an implementation of the standard Hadoop Word Count example.
 
- Elasticrawl is a command line tool that automates launching Elastic MapReduce
- jobs against this data.
+ ## More Information
 
- [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
- [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+ * [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
 
  ## Installation
 
  ### Dependencies
 
- Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+ Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
  Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
  and the ruby-build plugin is recommended.
 
+ A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+ gem requires the development headers to be installed.
+
+ ```bash
+ # OS X
+ brew install sqlite3
+
+ # CentOS
+ sudo yum install sqlite-devel
+
+ # Ubuntu
+ sudo apt-get install libsqlite3-dev
+ ```
+
  ### Install elasticrawl
 
+ [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+ [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+ [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
  ```bash
  ~$ gem install elasticrawl --no-rdoc --no-ri
  ```
@@ -48,21 +59,12 @@ to your path.
  ~$ rbenv rehash
  ```
 
- ## Quick Start
-
- In this example you'll launch 2 EMR jobs against a small portion of the Nov
- 2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
- time while your EC2 spot instances are provisioned and your Hadoop cluster is
- configured.
+ ## Commands
 
- You'll need to have an [AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html)
- to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+ ### elasticrawl init
 
- ### Setup
-
- You'll need to choose an S3 bucket name and enter your AWS access key and
- secret key. The S3 bucket will be used for storing data and logs. S3 bucket
- names must be unique, using hyphens rather than underscores is recommended.
+ Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+ and will store your data and logs.
 
  ```bash
  ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
  Config complete
  ```
 
- ### Parse Job
+ ### elasticrawl parse
 
- For this example you'll parse the first 2 WET files in the first 2 segments
- of the Nov 2013 crawl.
+ Parse takes in the crawl name and an optional number of segments and files to parse.
 
  ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --max-segments 2 --max-files 2
+ ~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+ Segments
+ Segment: 1416400372202.67 Files: 150
+ Segment: 1416400372490.23 Files: 124
 
  Job configuration
- Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
+ Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
 
  Cluster configuration
  Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
- ```
 
- You can monitor the progress of your job in the Elastic MapReduce section
- of the AWS web console.
+ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+ ```
 
- ### Combine Job
+ ### elasticrawl combine
 
- The combine job will aggregate the word count results from both segments into
- a single set of files.
+ Combine takes in the results of previous parse jobs and produces a combined set of results.
 
  ```bash
- ~$ elasticrawl combine --input-jobs 1391458746774
-
+ ~$ elasticrawl combine --input-jobs 1420124830792
  Job configuration
  Combining: 2 segments
 
@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
  Core: 2 m1.medium (Spot: 0.12)
  Task: --
  Launch job? (y/n)
-
  y
- Job Name: 1391459918730 Job Flow ID: j-GTJ2M7D1TXO6
+
+ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
  ```
 
- Once the combine job is complete you can download your results from the
- S3 section of the AWS web console. Your data will be stored in
+ ### elasticrawl status
+
+ Status shows crawls and your job history.
 
- [your S3 bucket]/data/2-combine/[job name]
+ ```bash
+ ~$ elasticrawl status
+ Crawl Status
+ CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+ Job History (last 10)
+ 1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+ ```
 
- ### Cleaning Up
+ ### elasticrawl reset
 
- You'll be charged by AWS for any data stored in your S3 bucket. The destroy
- command deletes your S3 bucket and the ~/.elasticrawl/ directory.
+ Reset a crawl so it is parsed again.
+
+ ```bash
+ ~$ elasticrawl reset CC-MAIN-2014-49
+ Reset crawl? (y/n)
+ y
+ CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+ ```
+
+ ### elasticrawl destroy
+
+ Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
 
  ```bash
  ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
  The elasticrawl init command creates the ~/.elasticrawl/ directory which
  contains
 
- * [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
+ * [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
  stores your AWS access credentials. Or you can set the environment
  variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
 
@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
  * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
  stores your S3 bucket name and the config for the parse and combine jobs
 
- ## Managing Segments
-
- Each Common Crawl segment is parsed as a separate EMR job step. This avoids
- overloading the job tracker and means if a job fails then only data from the
- current segment is lost. However an EMR job flow can only contain 256 steps.
- So to process an entire crawl multiple parse jobs must be combined.
-
- ```bash
- ~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
- ```
-
- You can use the status command to see details of crawls and jobs.
-
- ```bash
- ~$ elasticrawl status
-
- Crawl Status
- CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
- Job History (last 10)
- 1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
- 1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
- ```
-
- You can use the reset command to parse a crawl again.
-
- ```bash
- ~$ elasticrawl reset CC-MAIN-2013-48
-
- Reset crawl? (y/n)
- y
- CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
- ```
-
- To parse the same segments multiple times.
-
- ```bash
- ~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
- ```
-
- ## Running your own Jobs
-
- 1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
- 2. Make your changes
- 3. Compile your changes into a JAR using Maven
- 4. Upload your JAR to your own S3 bucket
- 5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
  ## TODO
 
  * Add support for Streaming and Pig jobs
 
  ## Thanks
 
- * Thanks to everyone at Common Crawl for making this awesome dataset available.
+ * Thanks to everyone at Common Crawl for making this awesome dataset available!
  * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
  gem which provides a nice Ruby wrapper for the EMR REST API.
 
data/Vagrantfile CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
  "user_installs" => [
  {
  "user" => "vagrant",
- "rubies" => ["1.9.3-p484", "2.0.0-p353", "2.1.0"],
- "global" => "1.9.3-p484",
+ "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+ "global" => "2.1.5",
  "gems" => {
- "1.9.3-p484" => [
+ "1.9.3-p551" => [
  { "name" => "bundler" }
  ],
- "2.0.0-p353" => [
+ "2.0.0-p598" => [
  { "name" => "bundler" }
  ],
- "2.1.0" => [
+ "2.1.5" => [
  { "name" => "bundler" }
  ]
  }
@@ -2,7 +2,7 @@ class CreateCrawls < ActiveRecord::Migration
  def change
  create_table :crawls do |t|
  t.string :crawl_name
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:crawls, :crawl_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateCrawlSegments < ActiveRecord::Migration
  t.string :segment_name
  t.string :segment_s3_uri
  t.datetime :parse_time
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:crawl_segments, :segment_name, :unique => true)
@@ -6,7 +6,7 @@ class CreateJobs < ActiveRecord::Migration
  t.string :job_desc
  t.integer :max_files
  t.string :job_flow_id
- t.timestamps
+ t.timestamps(:null => false)
  end
 
  add_index(:jobs, :job_name, :unique => true)
@@ -5,7 +5,7 @@ class CreateJobSteps < ActiveRecord::Migration
  t.references :crawl_segment
  t.text :input_paths
  t.text :output_path
- t.timestamps
+ t.timestamps(:null => false)
  end
  end
  end
@@ -0,0 +1,5 @@
+ class AddFileCountToCrawlSegments < ActiveRecord::Migration
+ def change
+ add_column(:crawl_segments, :file_count, :integer)
+ end
+ end
data/elasticrawl.gemspec CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ['lib']
 
- spec.add_dependency 'activerecord', '~> 4.0.2'
- spec.add_dependency 'activesupport', '~> 4.0.2'
- spec.add_dependency 'aws-sdk', '~> 1.0'
- spec.add_dependency 'elasticity', '~> 2.7'
- spec.add_dependency 'highline', '~> 1.6.20'
- spec.add_dependency 'sqlite3', '~> 1.3.8'
- spec.add_dependency 'thor', '~> 0.18.1'
+ spec.add_dependency 'activerecord', '~> 4.2'
+ spec.add_dependency 'activesupport', '~> 4.2'
+ spec.add_dependency 'aws-sdk', '~> 1.60'
+ spec.add_dependency 'elasticity', '~> 4.0'
+ spec.add_dependency 'highline', '~> 1.6'
+ spec.add_dependency 'sqlite3', '~> 1.3'
+ spec.add_dependency 'thor', '~> 0.19'
 
  spec.add_development_dependency 'rake'
  spec.add_development_dependency 'bundler', '~> 1.3'
- spec.add_development_dependency 'rspec', '~> 2.14.1'
- spec.add_development_dependency 'mocha', '~> 1.0.0'
- spec.add_development_dependency 'database_cleaner', '~> 1.2.0'
- spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+ spec.add_development_dependency 'rspec', '~> 3.1'
+ spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+ spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
  end
data/lib/elasticrawl.rb CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
  require 'thor'
 
  module Elasticrawl
+ # S3 locations
+ COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+ COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+ SEGMENTS_PATH = 'segments'
+ WARC_PATHS = 'warc.paths.gz'
+ MAX_SEGMENTS = 256
+
  require 'elasticrawl/version'
 
  require 'elasticrawl/config'
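
The constants added here are used by the new `Crawl#warc_paths` method further down in this diff. As a rough illustration only (not part of the gem's code), this is how they compose the S3 key of a crawl's WARC path listing; the crawl name is the one used in the README examples above:

```ruby
# Illustrative sketch: shows how the new module constants combine into the
# S3 key that Crawl#warc_paths (below) looks up in the public Common Crawl bucket.
require 'elasticrawl'

crawl_name = 'CC-MAIN-2014-49' # crawl used in the README examples above

s3_key = [Elasticrawl::COMMON_CRAWL_PATH,
          crawl_name,
          Elasticrawl::WARC_PATHS].join('/')
# => "common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz"

# Full location: "s3://#{Elasticrawl::COMMON_CRAWL_BUCKET}/#{s3_key}"
```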
@@ -13,7 +13,7 @@ module Elasticrawl
  config = Config.new
  job_flow = Elasticity::JobFlow.new(config.access_key_id,
  config.secret_access_key)
- job_flow.name = "Job Name: #{job.job_name} #{job.job_desc}"
+ job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
  job_flow.log_uri = job.log_uri
 
  configure_job_flow(job_flow)
@@ -5,11 +5,6 @@ module Elasticrawl
  class Crawl < ActiveRecord::Base
  has_many :crawl_segments
 
- COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
- COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
- SEGMENTS_PATH = '/segments/'
- MAX_SEGMENTS = 256
-
  # Returns the status of all saved crawls and the current job history.
  def self.status(show_all = false)
  status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
  end
  end
 
- # Creates crawl segments from their S3 paths and returns the segment count.
+ # Creates crawl segments from the warc.paths file for this crawl.
  def create_segments
- segment_paths = s3_segment_paths(self.crawl_name)
- save if segment_paths.count > 0
- segment_paths.map { |s3_path| create_segment(s3_path) }
+ file_paths = warc_paths(self.crawl_name)
+
+ segments = parse_segments(file_paths)
+ save if segments.count > 0
+
+ segments.keys.each do |segment_name|
+ file_count = segments[segment_name]
+ CrawlSegment.create_segment(self, segment_name, file_count)
+ end
 
- segment_paths.count
+ segments.count
  end
  end
  # Returns the list of segments from the database.
@@ -68,8 +69,8 @@ module Elasticrawl
  # Returns next # segments to be parsed. The maximum is 256
  # as this is the maximum # of steps for an Elastic MapReduce job flow.
  def next_segments(max_segments = nil)
- max_segments = MAX_SEGMENTS if max_segments.nil?
- max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+ max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS
 
  self.crawl_segments.where(:parse_time => nil).limit(max_segments)
  end
@@ -85,30 +86,47 @@ module Elasticrawl
  end
 
  private
- # Creates a crawl segment based on its S3 path if it does not exist.
- def create_segment(s3_path)
- segment_name = s3_path.split('/').last
- segment_s3_uri = URI::Generic.build(:scheme => 's3',
- :host => COMMON_CRAWL_BUCKET,
- :path => "/#{s3_path}").to_s
-
- segment = CrawlSegment.where(:crawl_id => self.id,
- :segment_name => segment_name,
- :segment_s3_uri => segment_s3_uri).first_or_create
+ # Gets the WARC file paths from S3 for this crawl if it exists.
+ def warc_paths(crawl_name)
+ s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+ crawl_name,
+ Elasticrawl::WARC_PATHS].join('/')
+
+ s3 = AWS::S3.new
+ bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+ object = bucket.objects[s3_path]
+
+ uncompress_file(object)
  end
 
- # Returns a list of S3 paths for the crawl name.
- def s3_segment_paths(crawl_name)
- s3_segment_tree(crawl_name).children.collect(&:prefix)
+ # Takes in a S3 object and returns the contents as an uncompressed string.
+ def uncompress_file(s3_object)
+ result = ''
+
+ if s3_object.exists?
+ io = StringIO.new
+ io.write(s3_object.read)
+ io.rewind
+
+ gz = Zlib::GzipReader.new(io)
+ result = gz.read
+
+ gz.close
+ end
+
+ result
  end
 
- # Calls the S3 API and returns the tree structure for the crawl name.
- def s3_segment_tree(crawl_name)
- crawl_path = [COMMON_CRAWL_PATH, crawl_name, SEGMENTS_PATH].join
+ # Parses the segment names and file counts from the WARC file paths.
+ def parse_segments(warc_paths)
+ segments = Hash.new 0
 
- s3 = AWS::S3.new
- bucket = s3.buckets[COMMON_CRAWL_BUCKET]
- bucket.as_tree(:prefix => crawl_path)
+ warc_paths.split.each do |warc_path|
+ segment_name = warc_path.split('/')[4]
+ segments[segment_name] += 1 if segment_name.present?
+ end
+
+ segments
  end
  end
  end
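
For readers skimming the diff, here is a small standalone sketch (not part of the gem) of what the new `parse_segments` method does with the uncompressed warc.paths listing. The segment IDs match the README output above; the trailing file names are made up for illustration:

```ruby
# Plain-Ruby illustration of the parse_segments logic above.
# Each warc.paths line has the form:
#   common-crawl/crawl-data/<crawl>/segments/<segment>/warc/<file>.warc.gz
# so index 4 of the '/'-split path is the segment name.
warc_paths = <<-PATHS
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00000.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/example-00001.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372490.23/warc/example-00000.warc.gz
PATHS

segments = Hash.new 0

warc_paths.split.each do |warc_path|
  segment_name = warc_path.split('/')[4]
  segments[segment_name] += 1 unless segment_name.nil? # gem uses ActiveSupport's present?
end

segments
# => {"1416400372202.67"=>2, "1416400372490.23"=>1}
```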