pupa 0.1.10 → 0.1.11
- checksums.yaml +4 -4
- data/.travis.yml +0 -3
- data/README.md +5 -20
- data/lib/pupa/runner.rb +2 -2
- data/lib/pupa/version.rb +1 -1
- data/pupa.gemspec +1 -1
- data/spec/processor/connection_adapters/mongodb_adapter_spec.rb +1 -5
- data/spec/processor/connection_adapters/postgresql_adapter_spec.rb +1 -5
- data/spec/processor_spec.rb +1 -5
- data/spec/spec_helper.rb +0 -12
- metadata +4 -8
- data/lib/pupa/refinements/opencivicdata.rb +0 -42
- data/spec/refinements/opencivicdata_spec.rb +0 -35
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 08051be1ed0ea215d0e1c258d9b01bd189215dc2
+  data.tar.gz: dd163a78b33959069adbb5d971781a17a582a15b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 597d6a420ae3c8ba04336b9ed38f4073925e90720d51895fbd83a9c77232f8c588834f66d5106fd9906c402bba752f4805c1014919812427b119aa870ead1cda
+  data.tar.gz: 98e65c5af6ea3f1a63b54c7c0c2091d0664660705d19e77bf22b2d5e7dfa911d56155e8cb6fa69ab11c22191e74444f1e60981f3523239e6a8873bc0d3c92efd
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -6,13 +6,13 @@
 [![Coverage Status](https://coveralls.io/repos/opennorth/pupa-ruby/badge.png?branch=master)](https://coveralls.io/r/opennorth/pupa-ruby)
 [![Code Climate](https://codeclimate.com/github/opennorth/pupa-ruby.png)](https://codeclimate.com/github/opennorth/pupa-ruby)

-Pupa.rb is a Ruby 2.x fork of
+Pupa.rb is a Ruby 2.x fork of Python [Pupa](https://github.com/opencivicdata/pupa). It implements an Extract, Transform and Load (ETL) process to scrape data from online sources, transform it, and write it to a database.

 gem install pupa

 ## Usage

-You can scrape any sort of data with Pupa.rb using your own models. You can also use Pupa.rb to scrape people, organizations, memberships and posts according to the [Popolo](http://www.popoloproject.com/) open government data specification. This gem is up-to-date with Popolo's 2014-
+You can scrape any sort of data with Pupa.rb using your own models. You can also use Pupa.rb to scrape people, organizations, memberships and posts according to the [Popolo](http://www.popoloproject.com/) open government data specification. This gem is up-to-date with Popolo's 2014-10-28 version.

 The [cat.rb](http://opennorth.github.io/pupa-ruby/docs/cat.html) example shows you how to:

@@ -52,14 +52,6 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm

 JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.

-### Automatic response decompression
-
-Until [Faraday Middleware](https://github.com/lostisland/faraday_middleware) releases its next version (> 0.9.1), you must use the gem from its git repository to automatically decompress responses:
-
-```ruby
-gem 'faraday_middleware', git: 'https://github.com/lostisland/faraday_middleware.git'
-```
-
 ## Performance

 Pupa.rb offers several ways to significantly improve performance. [Read the documentation.](https://github.com/opennorth/pupa-ruby/blob/master/PERFORMANCE.md#readme)
@@ -90,22 +82,15 @@ In short, Pupa.rb lets you spend more time on the tasks that are unique to your

 * Logging, to make debugging and monitoring a scraper easier
 * [Automatic response parsing](#automatic-response-parsing) of JSON, XML and HTML
+* Automatic response decompression
 * [Option parsing](http://opennorth.github.io/pupa-ruby/docs/legislator.html#section-9), to control your scraper from the command-line
 * [Object validation](http://opennorth.github.io/pupa-ruby/docs/cat.html#section-4), using [JSON Schema](http://json-schema.org/)

 Pupa.rb is extensible, so that you can add your own models, parsers, helpers, actions, etc. It also offers several ways to [improve your scraper's performance](#performance).

-## [
-
-Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://www.popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
-
-To save objects to MongoDB with unnamespaced types like Sunlight Labs' Pupa – in order to benefit from other tools in the [OpenCivicData](http://opencivicdata.org/) stack – add this line to the top of your script:
-
-```ruby
-require 'pupa/refinements/opencivicdata'
-```
+## Python [Pupa](https://github.com/opencivicdata/pupa) differences

-
+Both Pupa.rb and Python [Pupa](https://github.com/opencivicdata/pupa) implement models from the [Popolo](http://www.popoloproject.com/) open government data specifications, but Pupa.rb also lets you use your own classes. Pupa.rb stores data in either MongoDB (default) or PostgreSQL (experimental); Python Pupa stores data in PostgreSQL. The PostgreSQL schema of Pupa.rb and Python Pupa differ.

 ## Testing

data/lib/pupa/runner.rb
CHANGED
@@ -13,9 +13,9 @@ module Pupa
 @options = OpenStruct.new({
   actions: [],
   tasks: [],
-  output_dir: File.expand_path('
+  output_dir: File.expand_path('_data', Dir.pwd),
   pipelined: false,
-  cache_dir: File.expand_path('
+  cache_dir: File.expand_path('_cache', Dir.pwd),
   expires_in: 86400, # 1 day
   value_max_bytes: 1048576, # 1 MB
   memcached_username: nil,
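Note on this hunk: the new defaults resolve `_data` and `_cache` against the directory the scraper is run from. A minimal sketch of how `File.expand_path` combines a relative name with a base directory follows. The `/tmp/project` base is a made-up path used only for illustration, and the resulting paths assume a Unix-style filesystem.

```ruby
# Hypothetical base directory standing in for Dir.pwd.
base = '/tmp/project'

# A relative name is joined to the base and returned as an absolute path.
output_dir = File.expand_path('_data', base)
cache_dir = File.expand_path('_cache', base)

puts output_dir
puts cache_dir

# Without a second argument, File.expand_path uses the current working
# directory, so passing Dir.pwd explicitly gives the same result.
same_as_default = File.expand_path('_data') == File.expand_path('_data', Dir.pwd)
puts same_as_default
```

In practice this means output and cache files land next to wherever the scraper is invoked, not next to the gem's own files.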
data/lib/pupa/version.rb
CHANGED
data/pupa.gemspec
CHANGED
@@ -16,7 +16,7 @@ Gem::Specification.new do |s|
   s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
   s.require_paths = ["lib"]

-  s.add_runtime_dependency('activesupport', '~> 4.0
+  s.add_runtime_dependency('activesupport', '~> 4.0')
   s.add_runtime_dependency('colored', '~> 1.2')
   s.add_runtime_dependency('faraday_middleware', '~> 0.9.0')
   s.add_runtime_dependency('json-schema', '~> 2.1.3')
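Note on this hunk: the pessimistic operator `~>` behaves differently depending on how many version segments are given. The sketch below uses RubyGems' own `Gem::Requirement` to compare the two constraint shapes that appear in this gemspec, `~> 4.0` and `~> 0.9.0`. The candidate version numbers are arbitrary examples, not versions taken from the diff.

```ruby
require 'rubygems' # normally already loaded; explicit for clarity

# Two-segment constraint: any 4.x release, but not 5.0.
two_segments = Gem::Requirement.new('~> 4.0')
# Three-segment constraint: any 0.9.x release, but not 0.10.
three_segments = Gem::Requirement.new('~> 0.9.0')

# Example candidates (illustrative only).
%w[3.2.21 4.0.0 4.2.0 5.0.0].each do |v|
  puts "~> 4.0 allows #{v}: #{two_segments.satisfied_by?(Gem::Version.new(v))}"
end

%w[0.9.0 0.9.1 0.10.0].each do |v|
  puts "~> 0.9.0 allows #{v}: #{three_segments.satisfied_by?(Gem::Version.new(v))}"
end
```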
data/spec/processor/connection_adapters/mongodb_adapter_spec.rb
CHANGED
@@ -2,11 +2,7 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

 describe Pupa::Processor::Connection::MongoDBAdapter do
   def _type
-
-      'person'
-    else
-      'pupa/person'
-    end
+    'pupa/person'
   end

   def connection
data/spec/processor/connection_adapters/postgresql_adapter_spec.rb
CHANGED
@@ -2,11 +2,7 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

 describe Pupa::Processor::Connection::PostgreSQLAdapter do
   def _type
-
-      'person'
-    else
-      'pupa/person'
-    end
+    'pupa/person'
   end

   def connection
data/spec/processor_spec.rb
CHANGED
data/spec/spec_helper.rb
CHANGED
@@ -8,15 +8,3 @@ require 'nokogiri'
 require 'redis-store'
 require 'rspec'
 require File.dirname(__FILE__) + '/../lib/pupa'
-
-def testing_python_compatibility?
-  ENV['MODE'] == 'compat'
-end
-
-if testing_python_compatibility?
-  require File.dirname(__FILE__) + '/../lib/pupa/refinements/opencivicdata'
-end
-
-RSpec.configure do |c|
-  c.filter_run_excluding :testing_python_compatibility => true unless testing_python_compatibility?
-end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.11
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 4.0
+        version: '4.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 4.0
+        version: '4.0'
 - !ruby/object:Gem::Dependency
   name: colored
   requirement: !ruby/object:Gem::Requirement
@@ -331,7 +331,6 @@ files:
 - lib/pupa/refinements/faraday.rb
 - lib/pupa/refinements/faraday_middleware.rb
 - lib/pupa/refinements/json-schema.rb
-- lib/pupa/refinements/opencivicdata.rb
 - lib/pupa/runner.rb
 - lib/pupa/version.rb
 - pupa.gemspec
@@ -385,7 +384,6 @@ files:
 - spec/processor/yielder_spec.rb
 - spec/processor_spec.rb
 - spec/refinements/json-schema_spec.rb
-- spec/refinements/opencivicdata_spec.rb
 - spec/runner_spec.rb
 - spec/spec_helper.rb
 homepage: http://github.com/opennorth/pupa-ruby
@@ -449,7 +447,5 @@ test_files:
 - spec/processor/yielder_spec.rb
 - spec/processor_spec.rb
 - spec/refinements/json-schema_spec.rb
-- spec/refinements/opencivicdata_spec.rb
 - spec/runner_spec.rb
 - spec/spec_helper.rb
-has_rdoc:
data/lib/pupa/refinements/opencivicdata.rb
DELETED
@@ -1,42 +0,0 @@
-# @see https://github.com/opennorth/pupa-ruby#opencivicdata-compatibility
-
-module Pupa::Model
-  # This unfortunately won't cause the behavior of any model that has already
-  # included `Pupa::Model` to change.
-  class << self
-    def append_features(base)
-      if base.instance_variable_defined?("@_dependencies")
-        base.instance_variable_get("@_dependencies") << self
-        return false
-      else
-        return false if base < self
-        @_dependencies.each { |dep| base.send(:include, dep) }
-        super
-        base.extend const_get("ClassMethods") if const_defined?("ClassMethods")
-        base.class_eval(&@_included_block) if instance_variable_defined?("@_included_block")
-        base.class_eval do # XXX
-          set_callback(:save, :before) do |object|
-            object._type = object._type.camelize.demodulize.underscore
-          end
-        end
-      end
-    end
-  end
-end
-
-# `set_callback` is called by `class_eval` in `ActiveSupport::Concern`. Without
-# monkey-patching `ActiveSupport::Concern`, we can either iterate `ObjectSpace`,
-# implement something like ActiveSupport's `DescendantsTracker` for inclusion
-# instead of inheritance, or go back to `Pupa::Model` being a superclass instead
-# of a mixin to take advantage of `DescendantsTracker` itself.
-#
-# Instead of adding a callback, we can override `to_h` when `persist` is `true`.
-ObjectSpace.each_object(Class) do |base|
-  if base != Sequel::Model && base.include?(Pupa::Model) # Sequel::Model will error on #include?
-    base.class_eval do
-      set_callback(:save, :before) do |object|
-        object._type = object._type.camelize.demodulize.underscore
-      end
-    end
-  end
-end
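Note on this deleted file: the removed callback rewrote `_type` with ActiveSupport's `camelize.demodulize.underscore`, which turns a namespaced type such as `pupa/person` into `person`. Below is a dependency-free sketch of that effect for simple lowercase types. `demodulized_type` is a hypothetical helper, not part of Pupa.rb, and it only approximates the ActiveSupport chain by keeping the last `/`-separated segment.

```ruby
# Hypothetical helper (not part of Pupa.rb): approximates
# `type.camelize.demodulize.underscore` for simple lowercase types
# by keeping only the segment after the last "/".
def demodulized_type(type)
  type.split('/').last
end

puts demodulized_type('pupa/person') # namespaced Pupa.rb type
puts demodulized_type('music/band')  # type of a user-defined model
puts demodulized_type('person')      # already unnamespaced, left unchanged
```

With the refinement gone in 0.1.11, objects keep their namespaced `_type` values, as the updated adapter specs show.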
data/spec/refinements/opencivicdata_spec.rb
DELETED
@@ -1,35 +0,0 @@
-require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
-
-describe Pupa::Refinements, testing_python_compatibility: true do
-  module Music
-    class Band
-      include Pupa::Model
-
-      def save
-        run_callbacks(:save) do
-        end
-      end
-    end
-  end
-
-  module Pupa
-    class Committee < Organization
-      def save
-        run_callbacks(:save) do
-        end
-      end
-    end
-  end
-
-  it 'should demodulize the type of new models' do
-    object = Music::Band.new
-    object.save
-    object._type.should == 'band'
-  end
-
-  it 'should demodulize the type of existing models' do
-    object = Pupa::Committee.new
-    object.save
-    object._type.should == 'committee'
-  end
-end