pupa 0.1.10 → 0.1.11
- checksums.yaml +4 -4
- data/.travis.yml +0 -3
- data/README.md +5 -20
- data/lib/pupa/runner.rb +2 -2
- data/lib/pupa/version.rb +1 -1
- data/pupa.gemspec +1 -1
- data/spec/processor/connection_adapters/mongodb_adapter_spec.rb +1 -5
- data/spec/processor/connection_adapters/postgresql_adapter_spec.rb +1 -5
- data/spec/processor_spec.rb +1 -5
- data/spec/spec_helper.rb +0 -12
- metadata +4 -8
- data/lib/pupa/refinements/opencivicdata.rb +0 -42
- data/spec/refinements/opencivicdata_spec.rb +0 -35
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 08051be1ed0ea215d0e1c258d9b01bd189215dc2
+  data.tar.gz: dd163a78b33959069adbb5d971781a17a582a15b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 597d6a420ae3c8ba04336b9ed38f4073925e90720d51895fbd83a9c77232f8c588834f66d5106fd9906c402bba752f4805c1014919812427b119aa870ead1cda
+  data.tar.gz: 98e65c5af6ea3f1a63b54c7c0c2091d0664660705d19e77bf22b2d5e7dfa911d56155e8cb6fa69ab11c22191e74444f1e60981f3523239e6a8873bc0d3c92efd
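For context, checksums.yaml records SHA1 and SHA512 digests for the two archive members of the gem (metadata.gz and data.tar.gz). The sketch below shows how such digests could be recomputed with Ruby's standard Digest library. The helper name `digests_for` and the throwaway file are hypothetical and only illustrate the idea; this is not RubyGems' own verification code.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: returns the two digest kinds that checksums.yaml
# records for a file such as metadata.gz or data.tar.gz.
def digests_for(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest,
  }
end

# Demonstration on a throwaway file rather than a real gem member.
Tempfile.create('member') do |f|
  f.write('example bytes')
  f.flush
  digests = digests_for(f.path)
  puts digests['SHA1'].length   # 40 hex characters
  puts digests['SHA512'].length # 128 hex characters
end
```

Comparing the returned strings against the `+` lines above would confirm that an unpacked member matches the published 0.1.11 checksums.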
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -6,13 +6,13 @@
 [](https://coveralls.io/r/opennorth/pupa-ruby)
 [](https://codeclimate.com/github/opennorth/pupa-ruby)
 
-Pupa.rb is a Ruby 2.x fork of
+Pupa.rb is a Ruby 2.x fork of Python [Pupa](https://github.com/opencivicdata/pupa). It implements an Extract, Transform and Load (ETL) process to scrape data from online sources, transform it, and write it to a database.
 
 gem install pupa
 
 ## Usage
 
-You can scrape any sort of data with Pupa.rb using your own models. You can also use Pupa.rb to scrape people, organizations, memberships and posts according to the [Popolo](http://www.popoloproject.com/) open government data specification. This gem is up-to-date with Popolo's 2014-
+You can scrape any sort of data with Pupa.rb using your own models. You can also use Pupa.rb to scrape people, organizations, memberships and posts according to the [Popolo](http://www.popoloproject.com/) open government data specification. This gem is up-to-date with Popolo's 2014-10-28 version.
 
 The [cat.rb](http://opennorth.github.io/pupa-ruby/docs/cat.html) example shows you how to:
 
@@ -52,14 +52,6 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
 
 JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
 
-### Automatic response decompression
-
-Until [Faraday Middleware](https://github.com/lostisland/faraday_middleware) releases its next version (> 0.9.1), you must use the gem from its git repository to automatically decompress responses:
-
-```ruby
-gem 'faraday_middleware', git: 'https://github.com/lostisland/faraday_middleware.git'
-```
-
 ## Performance
 
 Pupa.rb offers several ways to significantly improve performance. [Read the documentation.](https://github.com/opennorth/pupa-ruby/blob/master/PERFORMANCE.md#readme)
@@ -90,22 +82,15 @@ In short, Pupa.rb lets you spend more time on the tasks that are unique to your
 
 * Logging, to make debugging and monitoring a scraper easier
 * [Automatic response parsing](#automatic-response-parsing) of JSON, XML and HTML
+* Automatic response decompression
 * [Option parsing](http://opennorth.github.io/pupa-ruby/docs/legislator.html#section-9), to control your scraper from the command-line
 * [Object validation](http://opennorth.github.io/pupa-ruby/docs/cat.html#section-4), using [JSON Schema](http://json-schema.org/)
 
 Pupa.rb is extensible, so that you can add your own models, parsers, helpers, actions, etc. It also offers several ways to [improve your scraper's performance](#performance).
 
-## [
-
-Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://www.popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
-
-To save objects to MongoDB with unnamespaced types like Sunlight Labs' Pupa – in order to benefit from other tools in the [OpenCivicData](http://opencivicdata.org/) stack – add this line to the top of your script:
-
-```ruby
-require 'pupa/refinements/opencivicdata'
-```
+## Python [Pupa](https://github.com/opencivicdata/pupa) differences
 
-
+Both Pupa.rb and Python [Pupa](https://github.com/opencivicdata/pupa) implement models from the [Popolo](http://www.popoloproject.com/) open government data specifications, but Pupa.rb also lets you use your own classes. Pupa.rb stores data in either MongoDB (default) or PostgreSQL (experimental); Python Pupa stores data in PostgreSQL. The PostgreSQL schema of Pupa.rb and Python Pupa differ.
 
 ## Testing
 
data/lib/pupa/runner.rb
CHANGED
@@ -13,9 +13,9 @@ module Pupa
       @options = OpenStruct.new({
         actions: [],
         tasks: [],
-        output_dir: File.expand_path('
+        output_dir: File.expand_path('_data', Dir.pwd),
         pipelined: false,
-        cache_dir: File.expand_path('
+        cache_dir: File.expand_path('_cache', Dir.pwd),
         expires_in: 86400, # 1 day
         value_max_bytes: 1048576, # 1 MB
         memcached_username: nil,
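The added lines resolve both directories against the working directory of the process. The previous values are truncated in this view, so only the new behavior is sketched here. The helper `default_dirs` and the path `/srv/scraper` are hypothetical; in the runner itself the base is `Dir.pwd`.

```ruby
# Hypothetical helper mirroring the two defaults shown in the hunk above.
# In runner.rb the base directory is Dir.pwd; it is a parameter here so the
# result is predictable.
def default_dirs(base = Dir.pwd)
  {
    output_dir: File.expand_path('_data', base),
    cache_dir:  File.expand_path('_cache', base),
  }
end

dirs = default_dirs('/srv/scraper')
puts dirs[:output_dir] # /srv/scraper/_data
puts dirs[:cache_dir]  # /srv/scraper/_cache
```

A scraper started from a different directory would therefore write its output and cache under that directory unless the options are overridden.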
data/lib/pupa/version.rb
CHANGED
data/pupa.gemspec
CHANGED
@@ -16,7 +16,7 @@ Gem::Specification.new do |s|
   s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
   s.require_paths = ["lib"]
 
-  s.add_runtime_dependency('activesupport', '~> 4.0
+  s.add_runtime_dependency('activesupport', '~> 4.0')
   s.add_runtime_dependency('colored', '~> 1.2')
   s.add_runtime_dependency('faraday_middleware', '~> 0.9.0')
   s.add_runtime_dependency('json-schema', '~> 2.1.3')
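The new constraint `'~> 4.0'` is a pessimistic version requirement. The removed line is truncated in this view, so the earlier constraint cannot be compared here. As a small sketch, RubyGems' own `Gem::Requirement` can show which activesupport versions the new constraint admits; the version numbers below are arbitrary samples.

```ruby
require 'rubygems'

# The constraint from the added gemspec line.
requirement = Gem::Requirement.new('~> 4.0')

# Sample versions chosen only for illustration.
%w[4.0 4.0.13 4.2.1 5.0].each do |v|
  puts "#{v}: #{requirement.satisfied_by?(Gem::Version.new(v))}"
end
# 4.0: true
# 4.0.13: true
# 4.2.1: true
# 5.0: false
```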
data/spec/processor/connection_adapters/mongodb_adapter_spec.rb
CHANGED
@@ -2,11 +2,7 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
 
 describe Pupa::Processor::Connection::MongoDBAdapter do
   def _type
-
-      'person'
-    else
-      'pupa/person'
-    end
+    'pupa/person'
   end
 
   def connection
data/spec/processor/connection_adapters/postgresql_adapter_spec.rb
CHANGED
@@ -2,11 +2,7 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
 
 describe Pupa::Processor::Connection::PostgreSQLAdapter do
   def _type
-
-      'person'
-    else
-      'pupa/person'
-    end
+    'pupa/person'
   end
 
   def connection
data/spec/processor_spec.rb
CHANGED
data/spec/spec_helper.rb
CHANGED
@@ -8,15 +8,3 @@ require 'nokogiri'
 require 'redis-store'
 require 'rspec'
 require File.dirname(__FILE__) + '/../lib/pupa'
-
-def testing_python_compatibility?
-  ENV['MODE'] == 'compat'
-end
-
-if testing_python_compatibility?
-  require File.dirname(__FILE__) + '/../lib/pupa/refinements/opencivicdata'
-end
-
-RSpec.configure do |c|
-  c.filter_run_excluding :testing_python_compatibility => true unless testing_python_compatibility?
-end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.11
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 4.0
+        version: '4.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 4.0
+        version: '4.0'
 - !ruby/object:Gem::Dependency
   name: colored
   requirement: !ruby/object:Gem::Requirement
@@ -331,7 +331,6 @@ files:
 - lib/pupa/refinements/faraday.rb
 - lib/pupa/refinements/faraday_middleware.rb
 - lib/pupa/refinements/json-schema.rb
-- lib/pupa/refinements/opencivicdata.rb
 - lib/pupa/runner.rb
 - lib/pupa/version.rb
 - pupa.gemspec
@@ -385,7 +384,6 @@ files:
 - spec/processor/yielder_spec.rb
 - spec/processor_spec.rb
 - spec/refinements/json-schema_spec.rb
-- spec/refinements/opencivicdata_spec.rb
 - spec/runner_spec.rb
 - spec/spec_helper.rb
 homepage: http://github.com/opennorth/pupa-ruby
@@ -449,7 +447,5 @@ test_files:
 - spec/processor/yielder_spec.rb
 - spec/processor_spec.rb
 - spec/refinements/json-schema_spec.rb
-- spec/refinements/opencivicdata_spec.rb
 - spec/runner_spec.rb
 - spec/spec_helper.rb
-has_rdoc:
data/lib/pupa/refinements/opencivicdata.rb
REMOVED
@@ -1,42 +0,0 @@
-# @see https://github.com/opennorth/pupa-ruby#opencivicdata-compatibility
-
-module Pupa::Model
-  # This unfortunately won't cause the behavior of any model that has already
-  # included `Pupa::Model` to change.
-  class << self
-    def append_features(base)
-      if base.instance_variable_defined?("@_dependencies")
-        base.instance_variable_get("@_dependencies") << self
-        return false
-      else
-        return false if base < self
-        @_dependencies.each { |dep| base.send(:include, dep) }
-        super
-        base.extend const_get("ClassMethods") if const_defined?("ClassMethods")
-        base.class_eval(&@_included_block) if instance_variable_defined?("@_included_block")
-        base.class_eval do # XXX
-          set_callback(:save, :before) do |object|
-            object._type = object._type.camelize.demodulize.underscore
-          end
-        end
-      end
-    end
-  end
-end
-
-# `set_callback` is called by `class_eval` in `ActiveSupport::Concern`. Without
-# monkey-patching `ActiveSupport::Concern`, we can either iterate `ObjectSpace`,
-# implement something like ActiveSupport's `DescendantsTracker` for inclusion
-# instead of inheritance, or go back to `Pupa::Model` being a superclass instead
-# of a mixin to take advantage of `DescendantsTracker` itself.
-#
-# Instead of adding a callback, we can override `to_h` when `persist` is `true`.
-ObjectSpace.each_object(Class) do |base|
-  if base != Sequel::Model && base.include?(Pupa::Model) # Sequel::Model will error on #include?
-    base.class_eval do
-      set_callback(:save, :before) do |object|
-        object._type = object._type.camelize.demodulize.underscore
-      end
-    end
-  end
-end
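The removed callback rewrote `_type` with ActiveSupport's `camelize.demodulize.underscore` chain, turning a namespaced type such as `pupa/person` into `person`. The following is a dependency-free stand-in for that effect, not the removed code itself: `demodulized_type` is a hypothetical helper, and it assumes simple, already-underscored types separated by `/` (the `music/band` input is an assumed `_type` for the `Music::Band` class used in the removed spec).

```ruby
# Hypothetical, dependency-free approximation of
#   type.camelize.demodulize.underscore
# for simple slash-separated, already-underscored type strings.
def demodulized_type(type)
  type.split('/').last
end

puts demodulized_type('pupa/person') # person
puts demodulized_type('music/band')  # band
puts demodulized_type('person')      # person (already unnamespaced)
```

With this file gone in 0.1.11, objects keep their namespaced `_type`, which matches the updated adapter specs that now expect `'pupa/person'` unconditionally.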
data/spec/refinements/opencivicdata_spec.rb
REMOVED
@@ -1,35 +0,0 @@
-require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
-
-describe Pupa::Refinements, testing_python_compatibility: true do
-  module Music
-    class Band
-      include Pupa::Model
-
-      def save
-        run_callbacks(:save) do
-        end
-      end
-    end
-  end
-
-  module Pupa
-    class Committee < Organization
-      def save
-        run_callbacks(:save) do
-        end
-      end
-    end
-  end
-
-  it 'should demodulize the type of new models' do
-    object = Music::Band.new
-    object.save
-    object._type.should == 'band'
-  end
-
-  it 'should demodulize the type of existing models' do
-    object = Pupa::Committee.new
-    object.save
-    object._type.should == 'committee'
-  end
-end