elasticrawl 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.travis.yml +1 -1
- data/README.md +77 -108
- data/Vagrantfile +5 -5
- data/db/migrate/201401051536_create_crawls.rb +1 -1
- data/db/migrate/201401051855_create_crawl_segments.rb +1 -1
- data/db/migrate/201401101723_create_jobs.rb +1 -1
- data/db/migrate/201401141606_create_job_steps.rb +1 -1
- data/db/migrate/201412311554_add_file_count_to_crawl_segments.rb +5 -0
- data/elasticrawl.gemspec +10 -11
- data/lib/elasticrawl.rb +7 -0
- data/lib/elasticrawl/cluster.rb +1 -1
- data/lib/elasticrawl/crawl.rb +49 -31
- data/lib/elasticrawl/crawl_segment.rb +30 -0
- data/lib/elasticrawl/job.rb +13 -6
- data/lib/elasticrawl/job_step.rb +5 -3
- data/lib/elasticrawl/parse_job.rb +14 -0
- data/lib/elasticrawl/version.rb +1 -1
- data/spec/fixtures/warc.paths +6 -0
- data/spec/spec_helper.rb +8 -14
- data/spec/unit/cluster_spec.rb +2 -2
- data/spec/unit/combine_job_spec.rb +4 -4
- data/spec/unit/crawl_segment_spec.rb +19 -10
- data/spec/unit/crawl_spec.rb +21 -16
- data/spec/unit/job_step_spec.rb +4 -4
- data/spec/unit/parse_job_spec.rb +20 -14
- metadata +56 -101
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: d30b065d6f268827c458f2da44db8ccc726c209f
+  data.tar.gz: 0fe917b0f93bf4f70b23386bedad9cc3547d9e8b
+SHA512:
+  metadata.gz: cd63bfea578623e32c03d10f4bab56591950cbff8cd19eaf38aa28db5ba875101c94f8fa9da297e418fc6460eb16771c9fc45098d7c2e8e5b16e2f380a5ab4bc
+  data.tar.gz: f6ca6c84103df5a9d299ed78d9963f57357c15719ca90e4193bb0a9dc909c552c055329b85677dddaef8af35fea2f4b044ea73e29ebfc110fe6116c8e548b3df
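These digests let consumers verify the published payload. A minimal verification sketch in Ruby, assuming `metadata.gz` and `data.tar.gz` have been extracted from a downloaded `elasticrawl-1.1.0.gem` (a `.gem` file is a tar archive):

```ruby
require 'digest'

# Recompute the digests recorded in checksums.yaml and compare by eye.
# Assumes metadata.gz and data.tar.gz sit in the working directory.
%w[metadata.gz data.tar.gz].each do |file|
  puts "#{file} SHA1:   #{Digest::SHA1.file(file).hexdigest}"
  puts "#{file} SHA512: #{Digest::SHA512.file(file).hexdigest}"
end
```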
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -1,42 +1,53 @@
 # Elasticrawl
 
-
-Elasticrawl
-([2013 data onwards](http://commoncrawl.org/new-crawl-data-available/)).
-Ships with a default configuration that launches the
-[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
-This is an implementation of the standard Hadoop Word Count example.
+Command line tool for launching Hadoop jobs using AWS EMR (Elastic MapReduce) to process Common Crawl data.
+Elasticrawl can be used with [crawl data](http://commoncrawl.org/the-data/get-started/) from April 2014 onwards.
 
-
+| Crawl Name | Month | Web Pages |
+| -------------- |:--------:|:--------:|
+| [CC-MAIN-2014-15](http://blog.commoncrawl.org/2014/07/april-2014-crawl-data-available/) | April 2014 | ~ 2.3 billion |
+| [CC-MAIN-2014-23](http://blog.commoncrawl.org/2014/08/july-2014-crawl-data-available/) | July 2014 | ~ 3.6 billion |
+| [CC-MAIN-2014-35](http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/) | August 2014 | ~ 2.8 billion |
+| [CC-MAIN-2014-49](http://blog.commoncrawl.org/2014/12/november-2014-crawl-archive-available/) | November 2014 | ~ 1.95 billion |
 
-Common Crawl
-during 2014. Each crawl is split into multiple segments that contain 3 file types.
+Common Crawl announce new crawls on their [blog](http://blog.commoncrawl.org/).
 
-
-
-
-
-| Crawl Name | Date | Segments | Pages | Size (uncompressed) |
-| -------------- |:--------:|:--------:|:-------------:|:-------------------:|
-| CC-MAIN-2013-48| Nov 2013 | 517 | ~ 2.3 billion | 148 TB |
-| CC-MAIN-2013-20| May 2013 | 316 | ~ 2.0 billion | 102 TB |
+Ships with a default configuration that launches the
+[elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples) jobs.
+This is an implementation of the standard Hadoop Word Count example.
 
-
-jobs against this data.
+## More Information
 
-[
-[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.0
+* [Blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using-elasticrawl.html) with walkthrough of running the Hadoop WordCount example on the November 2014 crawl.
 
 ## Installation
 
 ### Dependencies
 
-Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later.
+Elasticrawl is developed in Ruby and requires Ruby 1.9.3 or later (Ruby 2.1 is recommended).
 Installing using [rbenv](https://github.com/sstephenson/rbenv#installation)
 and the ruby-build plugin is recommended.
 
+A SQLite database is used to store details of crawls and jobs. Installing the sqlite3
+gem requires the development headers to be installed.
+
+```bash
+# OS X
+brew install sqlite3
+
+# CentOS
+sudo yum install sqlite-devel
+
+# Ubuntu
+sudo apt-get install libsqlite3-dev
+```
+
 ### Install elasticrawl
 
+[](http://badge.fury.io/rb/elasticrawl)
+[](https://codeclimate.com/github/rossf7/elasticrawl)
+[](https://travis-ci.org/rossf7/elasticrawl) 1.9.3, 2.0.0, 2.1.5
+
 ```bash
 ~$ gem install elasticrawl --no-rdoc --no-ri
 ```
@@ -48,21 +59,12 @@ to your path.
 ~$ rbenv rehash
 ```
 
-##
-
-In this example you'll launch 2 EMR jobs against a small portion of the Nov
-2013 crawl. Each job will take around 20 minutes to run. Most of this is setup
-time while your EC2 spot instances are provisioned and your Hadoop cluster is
-configured.
+## Commands
 
-
-to use elasticrawl. The total cost of the 2 EMR jobs will be under $1 USD.
+### elasticrawl init
 
-
-
-You'll need to choose an S3 bucket name and enter your AWS access key and
-secret key. The S3 bucket will be used for storing data and logs. S3 bucket
-names must be unique, using hyphens rather than underscores is recommended.
+Init takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created
+and will store your data and logs.
 
 ```bash
 ~$ elasticrawl init your-s3-bucket
@@ -77,38 +79,35 @@ Config dir /Users/ross/.elasticrawl created
 Config complete
 ```
 
-###
+### elasticrawl parse
 
-
-of the Nov 2013 crawl.
+Parse takes in the crawl name and an optional number of segments and files to parse.
 
 ```bash
-~$ elasticrawl parse CC-MAIN-
+~$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 3
+Segments
+Segment: 1416400372202.67 Files: 150
+Segment: 1416400372490.23 Files: 124
 
 Job configuration
-Crawl: CC-MAIN-
+Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
 
 Cluster configuration
 Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-Job Name: 1391458746774 Job Flow ID: j-2X9JVDC1UKEQ1
-```
 
-
-
+Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
+```
 
-###
+### elasticrawl combine
 
-
-a single set of files.
+Combine takes in the results of previous parse jobs and produces a combined set of results.
 
 ```bash
-~$ elasticrawl combine --input-jobs
-
+~$ elasticrawl combine --input-jobs 1420124830792
 Job configuration
 Combining: 2 segments
 
@@ -117,20 +116,38 @@ Master: 1 m1.medium (Spot: 0.12)
 Core: 2 m1.medium (Spot: 0.12)
 Task: --
 Launch job? (y/n)
-
 y
-
+
+Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
 ```
 
-
-
+### elasticrawl status
+
+Status shows crawls and your job history.
 
-
+```bash
+~$ elasticrawl status
+Crawl Status
+CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136
+
+Job History (last 10)
+1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 3 files per segment
+```
 
-###
+### elasticrawl reset
 
-
-
+Reset a crawl so it is parsed again.
+
+```bash
+~$ elasticrawl reset CC-MAIN-2014-49
+Reset crawl? (y/n)
+y
+CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136
+```
+
+### elasticrawl destroy
+
+Destroy deletes your S3 bucket and the ~/.elasticrawl directory.
 
 ```bash
 ~$ elasticrawl destroy
@@ -151,7 +168,7 @@ Config deleted
 The elasticrawl init command creates the ~/.elasticrawl/ directory which
 contains
 
-* [aws.yml](https://github.com/rossf7
+* [aws.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/aws.yml) -
 stores your AWS access credentials. Or you can set the environment
 variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
 
@@ -161,61 +178,13 @@ configures the EC2 instances that are launched to form your EMR cluster
 * [jobs.yml](https://github.com/rossf7/elasticrawl/blob/master/templates/jobs.yml) -
 stores your S3 bucket name and the config for the parse and combine jobs
 
-## Managing Segments
-
-Each Common Crawl segment is parsed as a separate EMR job step. This avoids
-overloading the job tracker and means if a job fails then only data from the
-current segment is lost. However an EMR job flow can only contain 256 steps.
-So to process an entire crawl multiple parse jobs must be combined.
-
-```bash
-~$ elasticrawl combine --input-jobs 1391430796774 1391458746774 1391498046704
-```
-
-You can use the status command to see details of crawls and jobs.
-
-```bash
-~$ elasticrawl status
-
-Crawl Status
-CC-MAIN-2013-48 Segments: to parse 517, parsed 2, total 519
-
-Job History (last 10)
-1391459918730 Launched: 2014-02-04 13:58:12 Combining: 2 segments
-1391458746774 Launched: 2014-02-04 13:55:50 Crawl: CC-MAIN-2013-48 Segments: 2 Parsing: 2 files per segment
-```
-
-You can use the reset command to parse a crawl again.
-
-```bash
-~$ elasticrawl reset CC-MAIN-2013-48
-
-Reset crawl? (y/n)
-y
-CC-MAIN-2013-48 Segments: to parse 519, parsed 0, total 519
-```
-
-To parse the same segments multiple times.
-
-```bash
-~$ elasticrawl parse CC-MAIN-2013-48 --segment-list 1386163036037 1386163035819 --max-files 2
-```
-
-## Running your own Jobs
-
-1. Fork the [elasticrawl-examples](https://github.com/rossf7/elasticrawl-examples)
-2. Make your changes
-3. Compile your changes into a JAR using Maven
-4. Upload your JAR to your own S3 bucket
-5. Edit ~/.elasticrawl/jobs.yml with your JAR and class names
-
 ## TODO
 
 * Add support for Streaming and Pig jobs
 
 ## Thanks
 
-* Thanks to everyone at Common Crawl for making this awesome dataset available
+* Thanks to everyone at Common Crawl for making this awesome dataset available!
 * Thanks to Robert Slifka for the [elasticity](https://github.com/rslifka/elasticity)
 gem which provides a nice Ruby wrapper for the EMR REST API.
 
data/Vagrantfile
CHANGED
@@ -36,16 +36,16 @@ Vagrant.configure("2") do |config|
 "user_installs" => [
   {
     "user" => "vagrant",
-    "rubies" => ["1.9.3-
-    "global" => "1.
+    "rubies" => ["1.9.3-p551", "2.0.0-p598", "2.1.5"],
+    "global" => "2.1.5",
     "gems" => {
-      "1.9.3-
+      "1.9.3-p551" => [
         { "name" => "bundler" }
       ],
-      "2.0.0-
+      "2.0.0-p598" => [
         { "name" => "bundler" }
       ],
-      "2.1.
+      "2.1.5" => [
         { "name" => "bundler" }
       ]
     }
data/elasticrawl.gemspec
CHANGED
@@ -18,18 +18,17 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ['lib']
 
-  spec.add_dependency 'activerecord', '~> 4.
-  spec.add_dependency 'activesupport', '~> 4.
-  spec.add_dependency 'aws-sdk', '~> 1.
-  spec.add_dependency 'elasticity', '~>
-  spec.add_dependency 'highline', '~> 1.6
-  spec.add_dependency 'sqlite3', '~> 1.3
-  spec.add_dependency 'thor', '~> 0.
+  spec.add_dependency 'activerecord', '~> 4.2'
+  spec.add_dependency 'activesupport', '~> 4.2'
+  spec.add_dependency 'aws-sdk', '~> 1.60'
+  spec.add_dependency 'elasticity', '~> 4.0'
+  spec.add_dependency 'highline', '~> 1.6'
+  spec.add_dependency 'sqlite3', '~> 1.3'
+  spec.add_dependency 'thor', '~> 0.19'
 
   spec.add_development_dependency 'rake'
   spec.add_development_dependency 'bundler', '~> 1.3'
-  spec.add_development_dependency 'rspec', '~>
-  spec.add_development_dependency '
-  spec.add_development_dependency '
-  spec.add_development_dependency 'shoulda-matchers', '~> 2.4.0'
+  spec.add_development_dependency 'rspec', '~> 3.1'
+  spec.add_development_dependency 'database_cleaner', '~> 1.3.0'
+  spec.add_development_dependency 'shoulda-matchers', '~> 2.7.0'
 end
data/lib/elasticrawl.rb
CHANGED
@@ -6,6 +6,13 @@ require 'highline/import'
 require 'thor'
 
 module Elasticrawl
+  # S3 locations
+  COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
+  COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+  SEGMENTS_PATH = 'segments'
+  WARC_PATHS = 'warc.paths.gz'
+  MAX_SEGMENTS = 256
+
   require 'elasticrawl/version'
 
   require 'elasticrawl/config'
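These module-level constants replace the ones previously defined on the Crawl class (removed in the crawl.rb diff below), so all classes can share them. A minimal sketch of the S3 location they compose to, using CC-MAIN-2014-49 from the README as the example crawl name; the joining logic mirrors the `warc_paths` method added further down:

```ruby
# Sketch only: compose the warc.paths location for one crawl.
crawl_name = 'CC-MAIN-2014-49'
s3_path = [Elasticrawl::COMMON_CRAWL_PATH, crawl_name, Elasticrawl::WARC_PATHS].join('/')
# => "common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz"
# i.e. s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/warc.paths.gz
```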
data/lib/elasticrawl/cluster.rb
CHANGED
@@ -13,7 +13,7 @@ module Elasticrawl
       config = Config.new
       job_flow = Elasticity::JobFlow.new(config.access_key_id,
                                          config.secret_access_key)
-      job_flow.name = "Job
+      job_flow.name = "Job: #{job.job_name} #{job.job_desc}"
       job_flow.log_uri = job.log_uri
 
       configure_job_flow(job_flow)
data/lib/elasticrawl/crawl.rb
CHANGED
@@ -5,11 +5,6 @@ module Elasticrawl
   class Crawl < ActiveRecord::Base
     has_many :crawl_segments
 
-    COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
-    COMMON_CRAWL_PATH = 'common-crawl/crawl-data/'
-    SEGMENTS_PATH = '/segments/'
-    MAX_SEGMENTS = 256
-
     # Returns the status of all saved crawls and the current job history.
     def self.status(show_all = false)
       status = ['Crawl Status']
@@ -51,13 +46,19 @@ module Elasticrawl
       end
     end
 
-    # Creates crawl segments from
+    # Creates crawl segments from the warc.paths file for this crawl.
     def create_segments
-
-
-
+      file_paths = warc_paths(self.crawl_name)
+
+      segments = parse_segments(file_paths)
+      save if segments.count > 0
+
+      segments.keys.each do |segment_name|
+        file_count = segments[segment_name]
+        CrawlSegment.create_segment(self, segment_name, file_count)
+      end
 
-
+      segments.count
     end
 
     # Returns the list of segments from the database.
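For orientation, a hypothetical call into the model above; the elasticrawl parse command normally drives this, `find_or_create_by` is standard ActiveRecord, and AWS credentials must already be configured because `warc_paths` reads from S3:

```ruby
# Hypothetical usage of Crawl#create_segments.
crawl = Crawl.find_or_create_by(:crawl_name => 'CC-MAIN-2014-49')
count = crawl.create_segments  # fetches warc.paths.gz, saves the crawl and its segments
puts "#{count} segments created"
```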
@@ -68,8 +69,8 @@ module Elasticrawl
     # Returns next # segments to be parsed. The maximum is 256
     # as this is the maximum # of steps for an Elastic MapReduce job flow.
     def next_segments(max_segments = nil)
-      max_segments = MAX_SEGMENTS if max_segments.nil?
-      max_segments = MAX_SEGMENTS if max_segments > MAX_SEGMENTS
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments.nil?
+      max_segments = Elasticrawl::MAX_SEGMENTS if max_segments > Elasticrawl::MAX_SEGMENTS
 
       self.crawl_segments.where(:parse_time => nil).limit(max_segments)
     end
@@ -85,30 +86,47 @@ module Elasticrawl
     end
 
     private
-    #
-    def
-
-
-
-
-
-
-
-
+    # Gets the WARC file paths from S3 for this crawl if it exists.
+    def warc_paths(crawl_name)
+      s3_path = [Elasticrawl::COMMON_CRAWL_PATH,
+                 crawl_name,
+                 Elasticrawl::WARC_PATHS].join('/')
+
+      s3 = AWS::S3.new
+      bucket = s3.buckets[Elasticrawl::COMMON_CRAWL_BUCKET]
+      object = bucket.objects[s3_path]
+
+      uncompress_file(object)
     end
 
-    #
-    def
-
+    # Takes in a S3 object and returns the contents as an uncompressed string.
+    def uncompress_file(s3_object)
+      result = ''
+
+      if s3_object.exists?
+        io = StringIO.new
+        io.write(s3_object.read)
+        io.rewind
+
+        gz = Zlib::GzipReader.new(io)
+        result = gz.read
+
+        gz.close
+      end
+
+      result
     end
 
-    #
-    def
-
+    # Parses the segment names and file counts from the WARC file paths.
+    def parse_segments(warc_paths)
+      segments = Hash.new 0
 
-
-
-
+      warc_paths.split.each do |warc_path|
+        segment_name = warc_path.split('/')[4]
+        segments[segment_name] += 1 if segment_name.present?
+      end
+
+      segments
     end
   end
 end