elasticrawl 1.1.5 → 1.1.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: a7a988f505c699d078fa4cc15981b77f7924e907
- data.tar.gz: db7ee7ae05ccec51b0ffde16900656c859523001
+ metadata.gz: 0d1d5521ae9a6e0762c0b057bbd828f6c15be0c4
+ data.tar.gz: df0386cf340ac6aff20bc95bffcfa0f0fb3995ff
  SHA512:
- metadata.gz: 12949f3230c3a0f7d08d4e02a9b296968616fa634072db882fe26809c58791d06ebf48d21fbc3c703ab1f858fdf12cfa9c324a6a97a7e896a5916679c7d6de06
- data.tar.gz: ab9ec066ecb469707241751087ff6d25ff89e9013fdb0a8505bca904fcaa3fed2b331ef289fa0ffb71dd970261ec7c0028ca955d94642fad7093191723d8c4e3
+ metadata.gz: 58f9f46e73d3bf03da4bfad2260cca0663430bd50187ed4927f564d15ac2b2acd377a76dd9016978eae8eb80caee3a76d610d0b524b010bbaa5a7cd953fdbbc9
+ data.tar.gz: e415324ccadc507ac37ddeaca0ae0ca71e9b6b93b008d23b2edefb985bd98928afe65ae508ad9d7ad68d532e936259bc36c6fdebce3806db2b6d39215d4dbf6e
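The entries in checksums.yaml are SHA1 and SHA512 digests of the two archives packed inside the .gem file. As a minimal sketch, the same digests could be recomputed with Ruby's standard `digest` library once the archives have been extracted locally. The helper name `digests_for` is hypothetical and is not part of elasticrawl or RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not part of elasticrawl): returns the hex digests
# that checksums.yaml records for a single archive such as data.tar.gz.
def digests_for(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end
```

Assuming the gem has been unpacked into the current directory, `digests_for('data.tar.gz')['SHA1']` can then be compared by hand against the `+ data.tar.gz` line in the SHA1 section.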
data/.travis.yml CHANGED
@@ -1,6 +1,5 @@
  language: ruby
  rvm:
- - 2.0.0-p648
- - 2.1.8
- - 2.2.4
- - 2.3.0
+ - 2.1.9
+ - 2.2.5
+ - 2.3.1
data/CHANGELOG.md CHANGED
@@ -1,3 +1,6 @@
+ ## v1.1.6 / 2016-06-26
+ * Change CommonCrawl bucket to s3://commoncrawl
+
  ## v1.1.3 / 2015-02-04
  * Upgrade Traveling Ruby to 20150204-2.1.5

data/Gemfile CHANGED
@@ -1,3 +1,3 @@
- source 'http://rubygems.org'
+ source 'https://rubygems.org'

  gemspec
data/README.md CHANGED
@@ -13,19 +13,11 @@ This [blog post](https://rossfairbanks.com/2015/01/03/parsing-common-crawl-using

  ## Installation

- Deployment packages are available for Linux and OS X, unfortunately Windows isn't supported yet. Download the package, extract it and run the elasticrawl command from the package directory.
+ * Elasticrawl needs a [Ruby installation](https://www.ruby-lang.org/en/documentation/installation/) (2.1 or higher).
+ * Install Ruby from RubyGems.

- ```bash
- # OS X https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-osx.tar.gz
- # Linux (64-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-linux-x86_64.tar.gz
- # Linux (32-bit) https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-linux-x86.tar.gz
-
- # e.g.
-
- curl -O https://d2ujrnticqzebc.cloudfront.net/elasticrawl-1.1.5-osx.tar.gz
- tar -xzf elasticrawl-1.1.5-osx.tar.gz
- cd elasticrawl-1.1.5-osx/
- ./elasticrawl --help
+ ```
+ gem install elasticrawl --no-rdoc --no-ri
  ```

  ### Troubleshooting
@@ -45,7 +37,7 @@ The init command takes in an S3 bucket name and your AWS credentials. The S3 buc
  and will store your data and logs.

  ```bash
- ~$ ./elasticrawl init your-s3-bucket
+ ~$ elasticrawl init your-s3-bucket

  Enter AWS Access Key ID: ************
  Enter AWS Secret Access Key: ************
@@ -62,7 +54,7 @@ Config complete
  The parse command takes in the crawl name and an optional number of segments and files to parse.

  ```bash
- ~$ ./elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
+ ~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
  Segments
  Segment: 1416400372202.67 Files: 150
  Segment: 1416400372490.23 Files: 124
@@ -85,7 +77,7 @@ Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
  The combine command takes in the results of previous parse jobs and produces a combined set of results.

  ```bash
- ~$ ./elasticrawl combine --input-jobs 1420124830792
+ ~$ elasticrawl combine --input-jobs 1420124830792
  Job configuration
  Combining: 2 segments

@@ -104,7 +96,7 @@ Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
  The status command shows crawls and your job history.

  ```bash
- ~$ ./elasticrawl status
+ ~$ elasticrawl status
  Crawl Status
  CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100

@@ -117,7 +109,7 @@ Job History (last 10)
  The reset comment resets a crawl so it is parsed again.

  ```bash
- ~$ ./elasticrawl reset CC-MAIN-2015-48
+ ~$ elasticrawl reset CC-MAIN-2015-48
  Reset crawl? (y/n)
  y
  CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100
@@ -128,7 +120,7 @@ y
  The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.

  ```bash
- ~$ ./elasticrawl destroy
+ ~$ elasticrawl destroy

  WARNING:
  Bucket s3://elasticrawl-test and its data will be deleted
@@ -158,14 +150,12 @@ stores your S3 bucket name and the config for the parse and combine jobs

  ## Development

- Elasticrawl is developed in Ruby and requires Ruby 2.0.0 or later (Ruby 2.2 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.
+ Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions which mean you may need to install development headers.

  [![Gem Version](https://badge.fury.io/rb/elasticrawl.png)](http://badge.fury.io/rb/elasticrawl)
  [![Code Climate](https://codeclimate.com/github/rossf7/elasticrawl.png)](https://codeclimate.com/github/rossf7/elasticrawl)
  [![Build Status](https://travis-ci.org/rossf7/elasticrawl.png?branch=master)](https://travis-ci.org/rossf7/elasticrawl) 2.0.0, 2.1.8, 2.2.4, 2.3.0

- The deployment packages are created using [Traveling Ruby](http://phusion.github.io/traveling-ruby/). The deploy packages contain a Ruby 2.2 interpreter, Gems and the compiled C extensions. The [traveling-elasticrawl](https://github.com/rossf7/traveling-elasticrawl) repository has a Rake task that automates building the deployment packages.
-
  ## TODO

  * Add support for Streaming and Pig jobs
data/elasticrawl.gemspec CHANGED
@@ -27,7 +27,7 @@ Gem::Specification.new do |spec|
  spec.add_dependency 'thor', '~> 0.19.1'

  spec.add_development_dependency 'rake', '~> 10.4.2'
- spec.add_development_dependency 'bundler', '~> 1.11.2'
+ spec.add_development_dependency 'bundler', '~> 1.12.5'
  spec.add_development_dependency 'rspec', '~> 3.4.0'
  spec.add_development_dependency 'database_cleaner', '~> 1.5.1'
  spec.add_development_dependency 'shoulda-matchers', '~> 3.0.1'
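The `~>` operator in these gemspec lines is the RubyGems pessimistic version constraint: `'~> 1.12.5'` accepts 1.12.5 and later patch releases but not 1.13.0. As a small illustration (the candidate version list is made up for the example), the constraint can be checked with `Gem::Requirement`:

```ruby
require 'rubygems'

# The constraint from the updated bundler development dependency.
requirement = Gem::Requirement.new('~> 1.12.5')

# Illustrative candidate versions, chosen only to show the boundaries.
candidates = %w[1.11.2 1.12.5 1.12.9 1.13.0]

# Keep only the versions that satisfy the pessimistic constraint.
allowed = candidates.select do |v|
  requirement.satisfied_by?(Gem::Version.new(v))
end

puts allowed.inspect
```

This suggests why the bump from `~> 1.11.2` to `~> 1.12.5` matters: a Bundler 1.12.x install would not have satisfied the old constraint.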
data/lib/elasticrawl.rb CHANGED
@@ -7,8 +7,8 @@ require 'thor'

  module Elasticrawl
  # S3 locations
- COMMON_CRAWL_BUCKET = 'aws-publicdatasets'
- COMMON_CRAWL_PATH = 'common-crawl/crawl-data'
+ COMMON_CRAWL_BUCKET = 'commoncrawl'
+ COMMON_CRAWL_PATH = 'crawl-data'
  SEGMENTS_PATH = 'segments'
  WARC_PATHS = 'warc.paths.gz'
  MAX_SEGMENTS = 256
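To see how the two changed constants affect the resulting locations, here is a minimal sketch that joins them into a segment URI of the shape the specs in this diff expect. The module name `ElasticrawlSketch` and the helper `segment_s3_uri` are hypothetical stand-ins; the gem's actual implementation is not shown in this diff and may differ.

```ruby
# Hypothetical stand-in module; constant values are copied from the diff.
module ElasticrawlSketch
  COMMON_CRAWL_BUCKET = 'commoncrawl'
  COMMON_CRAWL_PATH = 'crawl-data'
  SEGMENTS_PATH = 'segments'

  # Assumed helper: builds the S3 URI for one crawl segment by joining
  # bucket, crawl path, crawl name, segments path and segment name.
  def self.segment_s3_uri(crawl_name, segment_name)
    "s3://#{COMMON_CRAWL_BUCKET}/#{COMMON_CRAWL_PATH}/" \
      "#{crawl_name}/#{SEGMENTS_PATH}/#{segment_name}/"
  end
end

puts ElasticrawlSketch.segment_s3_uri('CC-MAIN-2014-49', '1416400372202.67')
```

With the old values (`aws-publicdatasets` and `common-crawl/crawl-data`), the same join would produce the `s3://aws-publicdatasets/common-crawl/crawl-data/...` form that the spec changes remove.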
@@ -1,3 +1,3 @@
  module Elasticrawl
- VERSION = '1.1.5'
+ VERSION = '1.1.6'
  end
@@ -22,7 +22,7 @@ describe Elasticrawl::CrawlSegment, type: :model do

  it 'should have an s3 uri' do
  expect(subject.segment_s3_uri).to eq \
- "s3://aws-publicdatasets/common-crawl/crawl-data/#{crawl.crawl_name}/segments/#{segment_name}/"
+ "s3://commoncrawl/crawl-data/#{crawl.crawl_name}/segments/#{segment_name}/"
  end

  it 'should have a file count' do
@@ -35,7 +35,7 @@ describe Elasticrawl::Crawl, type: :model do

  it 'should create segment s3 uris' do
  expect(subject.crawl_segments[0].segment_s3_uri).to eq \
- 's3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/'
+ 's3://commoncrawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/'
  end

  it 'should set file counts' do
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: elasticrawl
  version: !ruby/object:Gem::Version
- version: 1.1.5
+ version: 1.1.6
  platform: ruby
  authors:
  - Ross Fairbanks
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-01-05 00:00:00.000000000 Z
+ date: 2016-06-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activerecord
@@ -128,14 +128,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 1.11.2
+ version: 1.12.5
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 1.11.2
+ version: 1.12.5
  - !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
@@ -251,7 +251,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.4.5.1
+ rubygems_version: 2.5.1
  signing_key:
  specification_version: 4
  summary: Launch AWS Elastic MapReduce jobs that process Common Crawl data.