extraloop-redis-storage 0.0.6 → 0.0.7

data/History.txt CHANGED
@@ -1,3 +1,7 @@
+ == 0.0.7 / 2012-11-03
+ * Datasets can now be pushed to Google Fusion tables
+ * Added support for YAML export
+
  == 0.0.6 / 2012-26-02
  
  * Added CSV export
data/README.rdoc CHANGED
@@ -4,8 +4,8 @@
  
  Persistence layer for the {ExtraLoop}[https://github.com/afiore/extraloop] data extraction toolkit.
  This module is implemented as a wrapper around {Ohm}[http://ohm.keyvalue.org], an object-hash mapping library which
- makes easy storing structured data into Redis. It comes with a convinent command line tool, which allows to
- list, filter, delete, and export harvested datasets.
+ makes it easy to store structured data in Redis. It includes a convenient command line tool that can
+ list, filter, and delete harvested datasets, and export them to local files or remote data stores (e.g. Google Fusion Tables).
  
  == Installation
  
@@ -33,46 +33,49 @@ with the +set_storage+ method: a helper method that allows to specify how the sc
    .run()
  
  At each scraper run, the ExtraLoop storage module internally instantiates a
- session (see <code>ExtraLoop::Storage::ScrapingSession</code>) and link the extracted records to it.
- The +AmazonReview+ instances extracted and stored in the example above, can in fact be fetched by calling
- Ohm's +find+ with the session id as argument.
+ session (see <code>ExtraLoop::Storage::ScrapingSession</code>) and associates the extracted records with it.
+ The `AmazonReview` records just created can now be accessed by calling the `#records` method on the scraper session object.
  
-   reviews = AmazonReview.find :session_id => scraper.session
+   reviews = scraper.session.records
  
- The same set of reviews can alternatively be retrieved by calling the +record+ method on the scraping
- session instance:
+ === #set_storage
  
-   reviews = scraper.session.records AmazonReview
+ The +set_storage+ method accepts the following arguments:
  
-
- === The #set_storage method
-
- The +set_storage+ method can be called with the following arguments:
-
- * _model_ A Ruby constant specifying the model to be used for storing the extracted data .
+ * _model_ A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist and the storage module dynamically generates one by subclassing <code>ExtraLoop::Storage::Record</code>.
  * _session_title_ A human readable title for the extracted dataset (optional).
  
  == Command line interface
  
- Once installed, the gem will also add to your system path the +extraloop+ executable, a command line interface to the datasets harvested through extraloop.
+ Once installed, the gem will also add to your system path the +extraloop+ executable: a command line interface to the datasets harvested through ExtraLoop.
  A list of datasets can be obtained by running:
  
-   extraloop datastore list:
+   extraloop datastore list
  
  This will generate a table like the following one:
- <code>
+
    id | title                              | model           | records
    --------------------------------------------------------------------
    48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
    49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0
    51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
    52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10
- </code>
  
- Datasets can by removed using the +delete+ subcommand:
+
+ Datasets can be removed using the +delete+ subcommand:
  
    extraloop datastore delete [id]
  
  Where +id+ is either a single scraping session id, or a session id range (e.g. 48..52).
- Finally, the +export+ subcommand allows to export one or several datasets into a JSON or CSV documents.
- Please refer to the executable inline help (<code>extraloop datastore help [command]</code>) for more usage information.
+
+ From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
+
+   extraloop datastore export 51..52 -f csv
+
+ Similarly, stored datasets can be uploaded to a remote datastore:
+
+   extraloop datastore push 51..48 fusion_tables -a google_username:password
+
+ While Google's Fusion Tables is currently the only one implemented, support for other remote datastores (e.g.
+ {CouchDB}[http://couchdb.apache.org/], {CartoDB}[http://cartodb.com], and {CKAN Webstore}[http://wiki.ckan.org/Webstore]) will be added soon.
+
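The +push+ example above passes credentials as a single <code>user:password</code> string, and the executable can also read them from a configuration file. The following is only a minimal sketch of how such a value might be resolved: the helper name +resolve_credentials+ and the configuration layout (a +datastore+ key mapping a store name to <code>user:password</code>) are assumptions made for illustration, not the gem's actual behaviour.

```ruby
require 'yaml'

# Hypothetical helper (not part of the gem): resolves remote-store credentials,
# preferring the command line value and falling back to a parsed config hash.
# The config layout used here is an assumption made for illustration.
def resolve_credentials(options, config, store_type)
  datastore = config['datastore'] || {}
  raw = options['credentials'] || datastore[store_type.to_s]
  return nil if raw.nil?

  # Split on the first colon only, so that passwords may contain colons.
  raw.split(':', 2)
end

# Stand-in for the contents of a YAML configuration file.
config = YAML.load(<<~YML)
  datastore:
    fusion_tables: "andrea:secret"
YML

from_cli    = resolve_credentials({ 'credentials' => 'bob:pa:ss' }, config, :fusion_tables)
from_config = resolve_credentials({}, config, :fusion_tables)
```

Returning +nil+ when nothing is found lets the caller decide how to report the problem, instead of relying on a rescued +NoMethodError+.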
data/bin/extraloop CHANGED
@@ -2,7 +2,6 @@
  $: << File.realpath(File.dirname(File.dirname(__FILE__))) + "/lib/extraloop"
  
  require 'thor'
- require 'pry'
  require 'fileutils'
  require 'thor/group'
  require 'redis-storage'
@@ -11,6 +10,15 @@ class DataStoreCommand < Thor
  
    ExtraLoop::Storage::autoload_models
  
+   class << self
+     def parse_config
+       config_file = File.join(Etc.getpwuid.dir, '.extraloop.yml')
+       File.exist?(config_file) && YAML::load_file(config_file) or {}
+     end
+   end
+
+   @@config = parse_config
+
    @@sessions = ExtraLoop::Storage::ScrapingSession.all
    @@redis = Ohm.redis
  
@@ -18,16 +26,16 @@ class DataStoreCommand < Thor
      "d" => :delete,
      "e" => :export
  
-   desc "list [sessions]", "List harvested datasets filtering by session id range (e.g '25..50')"
+   desc "list [sessions]", "Lists harvested datasets filtering by session id range (e.g '25..50')"
    def list(sessions=nil)
      data = (filter sessions).map { |session| [ session.id, session.title, session.model && session.model.name, session.model && session.records.size ]}
      $stdout.puts tabularize(%w[id title model records], data)
    end
  
-   desc "delete [sessions]", "Remove datasets by session id or session id range"
+   desc "delete [sessions]", "Removes datasets by session id or session id range"
    def delete(sessions)
      deleted = 0
-     (filter sessions).each { |session| (session.delete && session.records.each(&:delete) ) && deleted += 1 }
+     (filter sessions).each { |session| session.delete && session.records.each(&:delete) && deleted += 1 }
      $stderr.puts "\n => #{deleted > 0 && deleted or 'No' } record#{'s' if deleted > 1} deleted \n\n"
      list
    end
@@ -44,15 +52,33 @@ class DataStoreCommand < Thor
      format = options[:format]
      dir = options[:directory]
      exception = DataStoreCommand::Exceptions::FormatNotImplemented.new "Format not supported #{format}"
-     raise exception unless %w[json csv].include? format
+     raise exception unless %w[json csv yaml].include? format
      FileUtils.mkdir(dir) unless File.exists? dir
  
-     (filter sessions).each do |session|
+     filter(sessions).each do |session|
        filename, data = *[ "#{session.id}_#{session.title.gsub(/\s/,"_")}", session.send("to_#{format}")]
        File.open("#{dir}/#{filename}.#{format}", "w") { |f| f.write data }
      end
    end
  
+   desc "push [sessions] [remote_store]", "Uploads one or several datasets to a remote data store"
+   method_option :schema, :type => 'hash', :aliases => "-s"
+   method_option :credentials, :type => 'string', :aliases => "-a"
+
+   def push(sessions, store_type=:fusion_tables)
+
+     filter(sessions).each do |session|
+       store_type = store_type.to_sym
+       begin
+         credentials = options.fetch('credentials', @@config[:datastore] && @@config[:datastore][:credentials] && @@config[:datastore][store_type]).split(':')
+       rescue NoMethodError
+         abort "Cannot find credentials for remote datastore.\nPlease specify them using the --credentials switch (e.g. 'andrea:mypassword')"
+       end
+       datastore = ExtraLoop::Storage::RemoteStore::get_transport(store_type, credentials)
+       datastore.push session
+     end
+   end
+
    # override default banner
    def self.banner(task, namespace = true, subcommand = false)
      "datastore#{task.formatted_usage(self, true, subcommand).gsub(/data_store_command/,'')}"
@@ -65,7 +91,6 @@ class DataStoreCommand < Thor
        exception = DataStoreCommand::Exceptions::FileNotFound.new "cannot find #{path}"
        raise exception unless File.exists?(path)
        (File.directory? path) && Dir["#{path}/*.rb"] or path
-
      end.flatten
  
      files.each { |file| require "./#{file}" }
@@ -79,7 +104,6 @@ class DataStoreCommand < Thor
      else
        @@sessions
      end
-
    end
  
    def tabularize(headers, data)
@@ -100,6 +124,7 @@ end
  
  class DataStoreCommand::Exceptions
    class FormatNotImplemented < StandardError; end
+   class UnknownDatastore < StandardError; end
    class FileNotFound < StandardError; end
  end
  
@@ -1,5 +1,6 @@
  require "rubygems"
  require "date"
+ require "pry"
  require "extraloop"
  require "../lib/extraloop/redis-storage.rb"
  require "./lib/models/amazon_review.rb"
@@ -10,4 +10,7 @@ class ExtraLoop::Storage::Model < Ohm::Model
        memo.merge(attribute => send(attribute))
      })
    end
+   def to_yaml
+     to_hash.to_yaml
+   end
  end
@@ -22,6 +22,10 @@ class ExtraLoop::Storage::Record < Ohm::Model
      })
    end
  
+   def to_yaml
+     to_hash.to_yaml
+   end
+
    def validate
      assert_present :session
    end
@@ -0,0 +1,41 @@
+ class ExtraLoop::Storage::FusionTables
+   @@connection = nil
+
+   def initialize(credentials, options={})
+     @options = options
+     @credentials = credentials
+     @api = connect
+   end
+
+   def push(session)
+     dataset = session.to_hash
+     records = dataset[:records]
+     title = dataset[:title].gsub(/\sDataset/,'')
+     schema = make_schema(records.first)
+
+     table = @api.create_table("Dataset #{title}", schema)
+     table.insert records
+   end
+
+   private
+   def make_schema(record)
+     defaults = {
+       'session_id' => 'number'
+     }
+
+     schema = defaults.merge(@options.fetch :schema, {})
+
+     record.keys.
+       reject { |key| schema.keys.include?(key) }.
+       map { |key| {:name => key.to_s, :type => 'string'} }.
+       concat(schema.map { |field, type| {:name => field.to_s, :type => type }})
+   end
+
+   def connect
+     return @@connection if @@connection
+
+     @@connection = GData::Client::FusionTables.new
+     @@connection.clientlogin(*@credentials)
+     @@connection
+   end
+ end
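The schema-building step in +make_schema+ can be tried in isolation. The sketch below is a standalone approximation rather than the gem's code: the name +build_schema+ is hypothetical, and normalising every key with +to_s+ before comparing is an assumption added here so that symbol and string keys are treated alike.

```ruby
# Sketch of the schema-building step: every record attribute becomes a
# 'string' column unless an explicit type is known for it.
# Key normalisation via to_s is an assumption added for this example.
def build_schema(record, overrides = {})
  typed = { 'session_id' => 'number' }
  overrides.each { |field, type| typed[field.to_s] = type }

  # Attributes without an explicit type fall back to 'string'.
  untyped = record.keys.map(&:to_s).reject { |key| typed.key?(key) }

  untyped.map { |key| { :name => key, :type => 'string' } } +
    typed.map { |field, type| { :name => field, :type => type } }
end

record = { :title => 'A review', :rating => '4', :session_id => 52 }
schema = build_schema(record, { :rating => 'number' })
```

Here +title+ is left as a string column, while +session_id+ and +rating+ are emitted once each with their explicit types.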
@@ -0,0 +1,13 @@
+ # Base class for pushing Extraloop datasets from the local Redis
+ # store to remote ones (e.g. Google Fusion tables, Buzzdata, Cartodb)
+
+ $: << path = File.dirname(__FILE__) + '/remote_store'
+ Dir["#{path}/*.rb"].each { |store_adapter| require store_adapter }
+
+
+ class ExtraLoop::Storage::RemoteStore
+   def self.get_transport(datastore, credentials, options={})
+     classname = datastore.to_s.gsub(/^.|_./) { |chars| chars.split("").last.upcase }
+     ExtraLoop::Storage.const_get(classname).new(credentials, options) if ExtraLoop::Storage.const_defined?(classname)
+   end
+ end
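The substitution used by +get_transport+ turns a snake_case store name into a class name. A minimal sketch of that lookup follows; +store_class_name+, +lookup_store+ and the +DemoStores+ module are hypothetical names introduced for this example only.

```ruby
# Converts a snake_case store name into a CamelCase class name using the
# same substitution as get_transport: the first character, and every
# character following an underscore, is upcased and the underscore dropped.
def store_class_name(datastore)
  datastore.to_s.gsub(/^.|_./) { |chars| chars[-1].upcase }
end

# Hypothetical namespace standing in for ExtraLoop::Storage.
module DemoStores
  class FusionTables; end
end

# Returns the adapter class, or nil when no matching constant exists.
def lookup_store(namespace, datastore)
  name = store_class_name(datastore)
  namespace.const_get(name) if namespace.const_defined?(name)
end

adapter = lookup_store(DemoStores, :fusion_tables)
missing = lookup_store(DemoStores, :couch_db)
```

As in the hunk above, an unknown store name yields +nil+ rather than an exception, so a caller would need to check the result before calling +push+ on it.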
@@ -1,7 +1,5 @@
  class ExtraLoop::Storage::ScrapingSession < Ohm::Model
  
-   BOM = "\377\376" #Byte Order Mark
-
    include Ohm::Boundaries
    include Ohm::Timestamping
    include Ohm::Callbacks
@@ -9,22 +7,20 @@ class ExtraLoop::Storage::ScrapingSession < Ohm::Model
    attribute :title
    reference :model, ExtraLoop::Storage::Model
  
-
-
    def records(params={})
      klass = if Object.const_defined?(model.name)
-         Object.const_get(model.name)
-       else
-         dynamic_class = Class.new(ExtraLoop::Storage::Record) do
-           # override default to_hash so that it will return the Redis hash
-           # internally stored by Ohm
-           def to_hash
-             Ohm.redis.hgetall self.key
-           end
+       Object.const_get(model.name)
+     else
+       dynamic_class = Class.new(ExtraLoop::Storage::Record) do
+         # override default to_hash so that it will return the Redis hash
+         # internally stored by Ohm
+         def to_hash
+           Ohm.redis.hgetall self.key
          end
+       end
  
-         Object.const_set(model.name, dynamic_class)
-         dynamic_class
+       Object.const_set(model.name, dynamic_class)
+       dynamic_class
      end
  
      # set a session index, so that Ohm finder will work
@@ -56,4 +52,8 @@ class ExtraLoop::Storage::ScrapingSession < Ohm::Model
      data = [header].concat _records.map(&:values)
      output = data.map { |cells| CSV.generate_line cells }.join
    end
+
+   def to_yaml
+     to_hash.to_yaml
+   end
  end
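The three +to_yaml+ methods added in this release all delegate to +to_hash+. A small sketch of that pattern, using a plain Ruby class (+DemoRecord+, a hypothetical stand-in for the Ohm-backed models) and string keys so that the result can be read back with <code>YAML.load</code>:

```ruby
require 'yaml'

# Stand-in for a stored record (hypothetical; the real classes are Ohm models).
class DemoRecord
  def initialize(attributes)
    @attributes = attributes
  end

  def to_hash
    @attributes
  end

  # Same delegation as in the diff: serialise whatever to_hash returns.
  def to_yaml
    to_hash.to_yaml
  end
end

record = DemoRecord.new({ 'title' => 'Great book', 'rating' => 5 })
yaml = record.to_yaml
round_trip = YAML.load(yaml)
```

Because the export command calls <code>session.send("to_#{format}")</code>, defining +to_yaml+ is all a model needs in order to support the new +yaml+ format.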
@@ -1,10 +1,19 @@
+ require "rubygems"
  require "json"
+ require "yaml"
  require "rubygems"
  require "redis"
  require "ohm"
  require "ohm/contrib"
  require "extraloop"
  
+ begin
+   gem "fusion_tables", "~> 0.3.1"
+   require "fusion_tables"
+ rescue Gem::LoadError
+ end
+
+
  base_path = File.realpath(File.dirname(__FILE__))
  $: << "#{base_path}"
  
@@ -12,7 +21,7 @@ require "scraper_base"
  
  module ExtraLoop
    module Storage
-     VERSION ||= "0.0.1"
+     VERSION ||= "0.0.7"
  
      def self.connect(*args)
        Ohm.connect(*args)
@@ -26,10 +35,13 @@ module ExtraLoop
      end
  
      autoload :CSV, 'csv'
-     autoload :Iconv, 'iconv'
-     ExtraLoop::Storage.autoload :Record, "#{base_path}/redis-storage/record.rb"
-     ExtraLoop::Storage.autoload :ScrapingSession, "#{base_path}/redis-storage/scraping_session.rb"
-     ExtraLoop::Storage.autoload :Model, "#{base_path}/redis-storage/model.rb"
-     ExtraLoop::Storage.autoload :DatasetFactory, "#{base_path}/redis-storage/dataset_factory.rb"
+     autoload :Etc, 'etc'
+
+     base_path << "/redis-storage"
  
+     ExtraLoop::Storage.autoload :Record, "#{base_path}/record.rb"
+     ExtraLoop::Storage.autoload :ScrapingSession, "#{base_path}/scraping_session.rb"
+     ExtraLoop::Storage.autoload :Model, "#{base_path}/model.rb"
+     ExtraLoop::Storage.autoload :DatasetFactory, "#{base_path}/dataset_factory.rb"
+     ExtraLoop::Storage.autoload :RemoteStore, "#{base_path}/remote_store.rb"
  
@@ -1,11 +1,13 @@
  class ExtraLoop::ScraperBase
    attr_reader :session
  
-   def set_storage(model, title=nil)
+
+   def set_storage(model, title=nil, options={})
      collection_name = "#{Time.now.to_i} #{model.to_s} Dataset"
      title ||= collection_name
  
      @model = model_klass = model.respond_to?(:new) && model || ExtraLoop::Storage::DatasetFactory.new(model.to_sym, @extractor_args.map(&:first)).get_class
+
      log_session! title
  
      on :data do |results|
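+set_storage+ accepts either an existing class or a symbol, and generates a model in the latter case. The sketch below illustrates that branching with plain Ruby only; +resolve_model+ and +DemoBase+ are hypothetical names, and the real gem delegates class generation to <code>ExtraLoop::Storage::DatasetFactory</code>, whose internals are not shown in this diff.

```ruby
# Hypothetical base class standing in for ExtraLoop::Storage::Record.
class DemoBase; end

# Returns the model unchanged when it is already a class; otherwise looks up
# or defines a top-level constant named after the symbol.
def resolve_model(model, base = DemoBase)
  return model if model.respond_to?(:new)

  name = model.to_s
  return Object.const_get(name) if Object.const_defined?(name)

  # Generate an empty subclass on the fly and register it under that name.
  Object.const_set(name, Class.new(base))
end

review_class = resolve_model(:DemoAmazonReview)
same_class   = resolve_model(:DemoAmazonReview)
passthrough  = resolve_model(String)
```

Registering the generated class as a constant means a later lookup by name, such as the one in <code>ScrapingSession#records</code>, finds the same class instead of creating a new one.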
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop-redis-storage
  version: !ruby/object:Gem::Version
-   version: 0.0.6
+   version: 0.0.7
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-26 00:00:00.000000000Z
+ date: 2012-03-11 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: extraloop
-   requirement: &10773840 !ruby/object:Gem::Requirement
+   requirement: &14542580 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -21,10 +21,10 @@ dependencies:
          version: 0.0.3
    type: :runtime
    prerelease: false
-   version_requirements: *10773840
+   version_requirements: *14542580
  - !ruby/object:Gem::Dependency
    name: ohm
-   requirement: &10773380 !ruby/object:Gem::Requirement
+   requirement: &14542000 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -32,10 +32,10 @@ dependencies:
          version: 0.1.3
    type: :runtime
    prerelease: false
-   version_requirements: *10773380
+   version_requirements: *14542000
  - !ruby/object:Gem::Dependency
    name: ohm-contrib
-   requirement: &10772860 !ruby/object:Gem::Requirement
+   requirement: &14507840 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -43,10 +43,10 @@ dependencies:
          version: 0.1.2
    type: :runtime
    prerelease: false
-   version_requirements: *10772860
+   version_requirements: *14507840
  - !ruby/object:Gem::Dependency
    name: thor
-   requirement: &10772260 !ruby/object:Gem::Requirement
+   requirement: &14507360 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - =
@@ -54,10 +54,21 @@ dependencies:
          version: 0.14.6
    type: :runtime
    prerelease: false
-   version_requirements: *10772260
+   version_requirements: *14507360
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: &14506900 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: *14506900
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &10771600 !ruby/object:Gem::Requirement
+   requirement: &14506220 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -65,10 +76,10 @@ dependencies:
          version: 2.7.0
    type: :development
    prerelease: false
-   version_requirements: *10771600
+   version_requirements: *14506220
  - !ruby/object:Gem::Dependency
    name: rr
-   requirement: &10771000 !ruby/object:Gem::Requirement
+   requirement: &14505600 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -76,10 +87,10 @@ dependencies:
          version: 1.0.4
    type: :development
    prerelease: false
-   version_requirements: *10771000
+   version_requirements: *14505600
  - !ruby/object:Gem::Dependency
    name: pry
-   requirement: &10770520 !ruby/object:Gem::Requirement
+   requirement: &14505020 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -87,8 +98,10 @@ dependencies:
          version: 0.9.7.4
    type: :development
    prerelease: false
-   version_requirements: *10770520
- description: Redis+Ohm based storage for data sets extracted using the ExtraLoop toolkit.
+   version_requirements: *14505020
+ description: Redis-based Persistence layer for the ExtraLoop data extraction toolkit.
+   Includes a convenient command line tool allowing to list, filter, delete, and export
+   harvested datasets
  email: andrea.giulio.fiore@googlemail.com
  executables:
  - extraloop
@@ -105,6 +118,8 @@ files:
  - lib/extraloop/redis-storage/dataset_factory.rb
  - lib/extraloop/redis-storage/model.rb
  - lib/extraloop/redis-storage/record.rb
+ - lib/extraloop/redis-storage/remote_store.rb
+ - lib/extraloop/redis-storage/remote_store/fusion_tables.rb
  - lib/extraloop/redis-storage/scraping_session.rb
  - lib/extraloop/scraper_base.rb
  - spec/dataset_factory_spec.rb
@@ -125,6 +140,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
+       segments:
+       - 0
+       hash: 1448249409185434738
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -138,4 +156,3 @@ signing_key:
  specification_version: 2
  summary: Redis storage for Extraloop.
  test_files: []
- has_rdoc: