elasticrawl 1.0.0 → 1.1.0
- checksums.yaml +7 -0
- data/.travis.yml +1 -1
- data/README.md +77 -108
- data/Vagrantfile +5 -5
- data/db/migrate/201401051536_create_crawls.rb +1 -1
- data/db/migrate/201401051855_create_crawl_segments.rb +1 -1
- data/db/migrate/201401101723_create_jobs.rb +1 -1
- data/db/migrate/201401141606_create_job_steps.rb +1 -1
- data/db/migrate/201412311554_add_file_count_to_crawl_segments.rb +5 -0
- data/elasticrawl.gemspec +10 -11
- data/lib/elasticrawl.rb +7 -0
- data/lib/elasticrawl/cluster.rb +1 -1
- data/lib/elasticrawl/crawl.rb +49 -31
- data/lib/elasticrawl/crawl_segment.rb +30 -0
- data/lib/elasticrawl/job.rb +13 -6
- data/lib/elasticrawl/job_step.rb +5 -3
- data/lib/elasticrawl/parse_job.rb +14 -0
- data/lib/elasticrawl/version.rb +1 -1
- data/spec/fixtures/warc.paths +6 -0
- data/spec/spec_helper.rb +8 -14
- data/spec/unit/cluster_spec.rb +2 -2
- data/spec/unit/combine_job_spec.rb +4 -4
- data/spec/unit/crawl_segment_spec.rb +19 -10
- data/spec/unit/crawl_spec.rb +21 -16
- data/spec/unit/job_step_spec.rb +4 -4
- data/spec/unit/parse_job_spec.rb +20 -14
- metadata +56 -101
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+  data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+SHA512:
+  metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+  data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
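RubyGems uses these digests to verify the metadata.gz and data.tar.gz archives inside the packaged .gem file on install. A minimal sketch of how such hex digests are produced with Ruby's standard Digest library; the local file path is a placeholder:

```ruby
require 'digest'

# Placeholder path: checksums.yaml records digests of the metadata.gz
# and data.tar.gz archives contained in the packaged .gem file.
data = File.binread('data.tar.gz')

puts Digest::SHA1.hexdigest(data)   # compare with the SHA1 entry
puts Digest::SHA512.hexdigest(data) # compare with the SHA512 entry
```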
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -1,42 +1,53 @@
 # Elasticrawl
 
-
-Elasticrawl
-([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
-Ships with a default configuration that launches the
-[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
-This is an implementation of the standard Hadoop Word Count example.
+Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.
 
-
+| Crawl Name | Month | Web Pages |
+| -------------- |:--------:|:--------:|
+| [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion |
+| [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion |
+| [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion |
+| [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion |
 
-Common Crawl
-during 2014. Each crawl is split into multiple segments that contain 3 file types.
+Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).
 
-
-
-
-
-| Crawl Name | Date | Segments | Pages | Size (uncompressed) |
-| -------------- |:--------:|:--------:|:-------------:|:-------------------:|
-| CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
-| CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+Ships with a default configuration that launches the
+[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+This is an implementation of the standard Hadoop Word Count example.
 
-
-jobs against this data.
+## More Information
 
-[
-[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
 
 ## Installation
 
 ### Dependencies
 
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
 Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
 and the ruby-build plugin is recommended.
 
+A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+gem requires the development headers to be installed.
+
+```bash
+# OS X
+brew install sqlite3
+
+# CentOS
+sudo yum install sqlite-devel
+
+# Ubuntu
+sudo apt-get install libsqlite3-dev
+```
+
 ### Install elasticrawl
 
+[![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
+[![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
+[![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
 ```bash
 ~$ gem install elasticrawl --no-rdoc --no-ri
 ```
@@ -48,21 +59,12 @@ to your path.
 ~$ rbenv rehash
 ```
 
-##
-
-In this example you'll launch 2 EMR jobs against a small portion of the Nov
-2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
-time while your EC2 spot instances are provisioned and your Hadoop cluster is
-configured.
+## Commands
 
-
-to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+### elasticrawl init
 
-
-
-You'll need to choose an S3 bucket name and enter your AWS access key and
-secret key. The S3 bucket will be used for storing data and logs. S3 bucket
-names must be unique, using hyphens rather than underscores is recommended.
+Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+and will store your data and logs.
 
 ```bash
 ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
 Config complete
 ```
 
-###
+### elasticrawl parse
 
-
-of the Nov 2013 crawl.
+Parse takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-
+~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+Segments
+Segment: 1416400372202.67 Files: 150
+Segment: 1416400372490.23 Files: 124
 
 Job configuration
-Crawl: CC-MAIN-
+Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
 
 Cluster configuration
 Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
-```
 
-
-
+Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+```
 
-###
+### elasticrawl combine
 
-
-a single set of files.
+Combine takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs
-
+~$ elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-
+
+Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 ```
 
-
-
+### elasticrawl status
+
+Status shows crawls and your job history.
 
-
+```bash
+~$ elasticrawl status
+Crawl Status
+CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+Job History (last 10)
+1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+```
 
-###
+### elasticrawl reset
 
-
-
+Reset a crawl so it is parsed again.
+
+```bash
+~$ elasticrawl reset CC-MAIN-2014-49
+Reset crawl? (y/n)
+y
+CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+```
+
+### elasticrawl destroy
+
+Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
 ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
 The elasticrawl init command creates the ~/.elasticrawl directory which
 contains
 
-* [aws.yml](https://github.com/rossf7
+* [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
 stores your AWS access credentials. Or you can set the environment
 variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
 
@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
-## Managing Segments
-
-Each Common Crawl segment is parsed as a separate EMR job step. This avoids
-overloading the job tracker and means if a job fails then only data from the
-current segment is lost. However an EMR job flow can only contain 256 steps.
-So to process an entire crawl multiple parse jobs must be combined.
-
-```bash
-~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
-```
-
-You can use the status command to see details of crawls and jobs.
-
-```bash
-~$ elasticrawl status
-
-Crawl Status
-CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
-Job History (last 10)
-1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
-1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
-```
-
-You can use the reset command to parse a crawl again.
-
-```bash
-~$ elasticrawl reset CC-MAIN-2013-48
-
-Reset crawl? (y/n)
-y
-CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
-```
-
-To parse the same segments multiple times.
-
-```bash
-~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
-```
-
-## Running your own Jobs
-
-1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
-2. Make your changes
-3. Compile your changes into a JAR using Maven
-4. Upload your JAR to your own S3 bucket
-5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
 ## TODO
 
 * Add support for Streaming and Pig jobs
 
 ## Thanks
 
-* Thanks to everyone at Common Crawl for making this awesome dataset available
+* Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
 
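The removed "Managing Segments" text still states the key constraint behind the combine command: each segment is parsed as a separate EMR job step and a job flow holds at most 256 steps, so a full crawl needs several parse jobs whose outputs a combine job merges. A quick sketch of that arithmetic, using the segment count from the old CC-MAIN-2013-48 example:

```ruby
# Each crawl segment becomes one EMR job step, and a job flow is
# limited to 256 steps, so larger crawls need multiple parse jobs.
MAX_STEPS     = 256
segment_count = 517 # CC-MAIN-2013-48, from the old README table

parse_jobs = (segment_count.to_f / MAX_STEPS).ceil
puts parse_jobs # => 3 parse jobs, followed by one combine job
```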
data/Vagrantfile
CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
         "user_installs" => [
           {
             "user" => "vagrant",
-            "rubies" => ["1.9.3-
-            "global" => "1.
+            "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+            "global" => "2.1.5",
             "gems" => {
-              "1.9.3-
+              "1.9.3-p551" => [
                 { "name" => "bundler" }
               ],
-              "2.0.0-
+              "2.0.0-p598" => [
                 { "name" => "bundler" }
               ],
-              "2.1.
+              "2.1.5" => [
                 { "name" => "bundler" }
               ]
             }
data/elasticrawl.gemspec
CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ['lib']
 
-  spec.add_dependency 'activerecord', '~> 4.
-  spec.add_dependency 'activesupport', '~> 4.
-  spec.add_dependency 'aws-sdk', '~> 1.
-  spec.add_dependency 'elasticity', '~>
-  spec.add_dependency 'highline', '~> 1.6
-  spec.add_dependency 'sqlite3', '~> 1.3
-  spec.add_dependency 'thor', '~> 0.
+  spec.add_dependency 'activerecord', '~> 4.2'
+  spec.add_dependency 'activesupport', '~> 4.2'
+  spec.add_dependency 'aws-sdk', '~> 1.60'
+  spec.add_dependency 'elasticity', '~> 4.0'
+  spec.add_dependency 'highline', '~> 1.6'
+  spec.add_dependency 'sqlite3', '~> 1.3'
+  spec.add_dependency 'thor', '~> 0.19'
 
   spec.add_development_dependency 'rake'
   spec.add_development_dependency 'bundler', '~> 1.3'
-  spec.add_development_dependency 'rspec', '~>
-  spec.add_development_dependency '
-  spec.add_development_dependency '
-  spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+  spec.add_development_dependency 'rspec', '~> 3.1'
+  spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+  spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
 end
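All of the bumped dependencies use the pessimistic `~>` operator, so `'~> 4.2'` accepts any 4.x release from 4.2 onwards but not 5.0. A quick check of that behaviour with the RubyGems API:

```ruby
require 'rubygems'

# '~> 4.2' is equivalent to '>= 4.2' combined with '< 5.0'.
req = Gem::Requirement.new('~> 4.2')

puts req.satisfied_by?(Gem::Version.new('4.2.0')) # => true
puts req.satisfied_by?(Gem::Version.new('4.9.1')) # => true
puts req.satisfied_by?(Gem::Version.new('5.0.0')) # => false
```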
data/lib/elasticrawl.rb
CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
 require 'thor'
 
 module Elasticrawl
+  # S3 locations
+  COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+  COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+  SEGMENTS_PATH = 'segments'
+  WARC_PATHS = 'warc.paths.gz'
+  MAX_SEGMENTS = 256
+
   require 'elasticrawl/version'
 
   require 'elasticrawl/config'
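Hoisting these constants from Crawl to the module level lets every class share them, and the path pieces now join cleanly (the old Crawl constants embedded stray slashes). For example, the warc_paths method in crawl.rb below builds the S3 listing key like this; a sketch assuming the gem is loaded, with a sample crawl name:

```ruby
require 'elasticrawl'

crawl_name = 'CC-MAIN-2014-49'

# Same join performed by Crawl#warc_paths in the diff below.
s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
           crawl_name,
           Elasticrawl::WARC_PATHS].join('/')

puts s3_path # => common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz
```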
data/lib/elasticrawl/cluster.rb
CHANGED
@@ -13,7 +13,7 @@ module Elasticrawl
       config = Config.new
       job_flow = Elasticity::JobFlow.new(config.access_key_id,
                                          config.secret_access_key)
-      job_flow.name = "Job
+      job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
       job_flow.log_uri = job.log_uri
 
       configure_job_flow(job_flow)
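The new name interpolates both the job's timestamp name and its description, so the flow is identifiable in the EMR console. A minimal sketch of the same Elasticity calls in isolation; the credential and job values are placeholders standing in for Config and Job, and the flow is only built here, not run:

```ruby
require 'elasticity'

# Placeholder values; the real code reads these from Config and Job.
access_key_id     = 'AKIA...'
secret_access_key = 'your-secret-key'
job_name          = '1420124830792'
job_desc          = 'Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment'

job_flow = Elasticity::JobFlow.new(access_key_id, secret_access_key)
job_flow.name    = "Job: #{job_name} #{job_desc}"
job_flow.log_uri = 's3://your-s3-bucket/logs/1420124830792/'
```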
data/lib/elasticrawl/crawl.rb
CHANGED
@@ -5,11 +5,6 @@ module Elasticrawl
   class Crawl < ActiveRecord::Base
     has_many :crawl_segments
 
-    COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
-    COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
-    SEGMENTS_PATH = '/segments/'
-    MAX_SEGMENTS = 256
-
     # Returns the status of all saved crawls and the current job history.
     def self.status(show_all = false)
       status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
       end
     end
 
-    # Creates crawl segments from
+    # Creates crawl segments from the warc.paths file for this crawl.
     def create_segments
-
-
-
+      file_paths = warc_paths(self.crawl_name)
+
+      segments = parse_segments(file_paths)
+      save if segments.count > 0
+
+      segments.keys.each do |segment_name|
+        file_count = segments[segment_name]
+        CrawlSegment.create_segment(self, segment_name, file_count)
+      end
 
-
+      segments.count
     end
 
     # Returns the list of segments from the database.
@@ -68,8 +69,8 @@ module Elasticrawl
     # Returns next # segments to be parsed. The maximum is 256
     # as this is the maximum # of steps for an Elastic MapReduce job flow.
     def next_segments(max_segments = nil)
-      max_segments = MAX_SEGMENTS if max_segments.nil?
-      max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS
 
       self.crawl_segments.where(:parse_time => nil).limit(max_segments)
     end
@@ -85,30 +86,47 @@ module Elasticrawl
     end
 
     private
-    #
-    def
-
-
-
-
-
-
-
-
+    # Gets the WARC file paths from S3 for this crawl if it exists.
+    def warc_paths(crawl_name)
+      s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+                 crawl_name,
+                 Elasticrawl::WARC_PATHS].join('/')
+
+      s3 = AWS::S3.new
+      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+      object = bucket.objects[s3_path]
+
+      uncompress_file(object)
     end
 
-    #
-    def
-
+    # Takes in a S3 object and returns the contents as an uncompressed string.
+    def uncompress_file(s3_object)
+      result = ''
+
+      if s3_object.exists?
+        io = StringIO.new
+        io.write(s3_object.read)
+        io.rewind
+
+        gz = Zlib::GzipReader.new(io)
+        result = gz.read
+
+        gz.close
+      end
+
+      result
     end
 
-    #
-    def
-
+    # Parses the segment names and file counts from the WARC file paths.
+    def parse_segments(warc_paths)
+      segments = Hash.new 0
 
-
-
-
+      warc_paths.split.each do |warc_path|
+        segment_name = warc_path.split('/')[4]
+        segments[segment_name] += 1 if segment_name.present?
+      end
+
+      segments
     end
   end
 end
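parse_segments relies on the fixed layout of warc.paths entries: splitting on '/' puts the segment name at index 4, and counting occurrences gives the per-segment file count that create_segments passes to CrawlSegment.create_segment. A standalone demonstration of that logic; the .warc.gz file names are invented, the segment names come from the README example above, and a plain nil? check stands in for ActiveSupport's present?:

```ruby
# Sample warc.paths content; the file names are hypothetical.
warc_paths = <<-PATHS
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/file-00000.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/file-00001.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372490.23/warc/file-00000.warc.gz
PATHS

segments = Hash.new(0)

warc_paths.split.each do |warc_path|
  segment_name = warc_path.split('/')[4]      # 5th path component
  segments[segment_name] += 1 unless segment_name.nil?
end

p segments # => {"1416400372202.67"=>2, "1416400372490.23"=>1}
```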