extraloop-redis-storage 0.0.6 → 0.0.7

data/History.txt CHANGED
@@ -1,3 +1,7 @@
+ == 0.0.7 / 2012-03-11
+ * Datasets can now be pushed to Google Fusion tables
+ * Added support for YAML export
+
  == 0.0.6 / 2012-02-26
 
  * Added CSV export
data/README.rdoc CHANGED
@@ -4,8 +4,8 @@
4
4
 
5
5
  Persistence layer for the {ExtraLoop}[https://github.com/afiore/extraloop] data extraction toolkit.
6
6
  This module is implemented as a wrapper around {Ohm}[http://ohm.keyvalue.org], an object-hash mapping library which
7
- makes easy storing structured data into Redis. It comes with a convinent command line tool, which allows to
8
- list, filter, delete, and export harvested datasets.
7
+ makes it easy to store structured data in Redis. It includes a convenient command line tool that lets you
8
+ list, filter, and delete harvested datasets, as well as export them to local files or remote data stores (e.g. Google Fusion Tables).
9
9
 
10
10
  == Installation
11
11
 
@@ -33,46 +33,49 @@ with the +set_storage+ method: a helper method that allows to specify how the sc
33
33
  .run()
34
34
 
35
35
  At each scraper run, the ExtraLoop storage module internally instantiates a
36
- session (see <code>ExtraLoop::Storage::ScrapingSession</code>) and link the extracted records to it.
37
- The +AmazonReview+ instances extracted and stored in the example above, can in fact be fetched by calling
38
- Ohm's +find+ with the session id as argument.
36
+ session (see <code>ExtraLoop::Storage::ScrapingSession</code>) and associates the extracted records with it.
37
+ The +AmazonReview+ records just created can now be accessed by calling the +records+ method on the scraper session object.
39
38
 
40
- reviews = AmazonReview.find :session_id => scraper.session
39
+ reviews = scraper.session.records
41
40
 
42
- The same set of reviews can alternatively be retrieved by calling the +record+ method on the scraping
43
- session instance:
41
+ === #set_storage
44
42
 
45
- reviews = scraper.session.records AmazonReview
43
+ The +set_storage+ method accepts the following arguments:
46
44
 
47
-
48
- === The #set_storage method
49
-
50
- The +set_storage+ method can be called with the following arguments:
51
-
52
- * _model_ A Ruby constant specifying the model to be used for storing the extracted data .
45
+ * _model_ A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist and the storage module dynamically generates one by subclassing <code>ExtraLoop::Storage::Record</code>.
53
46
  * _session_title_ A human readable title for the extracted dataset (optional).
54
47
 
55
48
  == Command line interface
56
49
 
57
- Once installed, the gem will also add to your system path the +extraloop+ executable, a command line interface to the datasets harvested through extraloop.
50
+ Once installed, the gem will also add to your system path the +extraloop+ executable: a command line interface to the datasets harvested through ExtraLoop.
58
51
  A list of datasets can be obtained by running:
59
52
 
60
- extraloop datastore list:
53
+ extraloop datastore list
61
54
 
62
55
  This will generate a table like the following one:
63
- <code>
56
+
64
57
  id | title | model | records
65
58
  --------------------------------------------------------------------
66
59
  48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
67
60
  49 | 1330106948 AmazonReview Dataset | AmazonReview | 0
68
61
  51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
69
62
  52 | 1330111630 AmazonReview Dataset | AmazonReview | 10
70
- </code>
71
63
 
72
- Datasets can by removed using the +delete+ subcommand:
64
+
65
+ Datasets can be removed using the +delete+ subcommand:
73
66
 
74
67
  extraloop datastore delete [id]
75
68
 
76
69
  Where +id+ is either a single scraping session id, or a session id range (e.g. 48..52).
77
- Finally, the +export+ subcommand allows to export one or several datasets into a JSON or CSV documents.
78
- Please refer to the executable inline help (<code>extraloop datastore help [command]</code>) for more usage information.
70
+
71
+ From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
72
+
73
+ extraloop datastore export 51..52 -f csv
74
+
75
+ Similarly, stored datasets can be uploaded to a remote datastore:
76
+
77
+ extraloop datastore push 48..51 fusion_tables -a google_username:password
78
+
79
+ Google Fusion Tables is currently the only remote datastore implemented; support for others (e.g.
80
+ {CouchDB}[http://couchdb.apache.org/], {CartoDB}[http://cartodb.com], and {CKAN Webstore}[http://wiki.ckan.org/Webstore]) will be added soon.
81
+
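The README text in this hunk says that passing a symbol to +set_storage+ makes the storage module generate a model by subclassing <code>ExtraLoop::Storage::Record</code>. The sketch below illustrates the general Ruby technique involved (<code>Class.new</code> plus <code>const_set</code>). It is not the gem's +DatasetFactory+: +BaseRecord+, +Models+ and +build_model+ are hypothetical names, and a plain Ruby class stands in for the Ohm model so the example runs without Redis.

```ruby
# Hypothetical stand-in for ExtraLoop::Storage::Record (assumption: the real
# class is an Ohm::Model; a plain class keeps this sketch dependency-free).
class BaseRecord
  # Per-class list of declared attribute names.
  def self.fields
    @fields ||= []
  end

  # Declares an attribute and generates its reader and writer.
  def self.attribute(name)
    fields << name
    attr_accessor name
  end
end

# Namespace that receives the generated classes (illustrative only).
module Models; end

# Returns the class named after `name`, generating it on first use.
def build_model(name, attributes)
  const = name.to_s
  return Models.const_get(const, false) if Models.const_defined?(const, false)

  klass = Class.new(BaseRecord) do
    attributes.each { |attr| attribute attr }
  end
  Models.const_set(const, klass)
end

model = build_model(:AmazonReview, [:title, :rank])
review = model.new
review.title = "Great product"

puts model.name
puts model.fields.inspect
```

Calling +build_model+ a second time with the same symbol returns the class created the first time, which mirrors the "use the model if it exists, otherwise generate one" behaviour described above.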
data/bin/extraloop CHANGED
@@ -2,7 +2,6 @@
2
2
  $: << File.realpath(File.dirname(File.dirname(__FILE__))) + "/lib/extraloop"
3
3
 
4
4
  require 'thor'
5
- require 'pry'
6
5
  require 'fileutils'
7
6
  require 'thor/group'
8
7
  require 'redis-storage'
@@ -11,6 +10,15 @@ class DataStoreCommand < Thor
11
10
 
12
11
  ExtraLoop::Storage::autoload_models
13
12
 
13
+ class << self
14
+ def parse_config
15
+ config_file = File.join(Etc.getpwuid.dir, '.extraloop.yml')
16
+ File.exist?(config_file) && YAML::load_file(config_file) or {}
17
+ end
18
+ end
19
+
20
+ @@config = parse_config
21
+
14
22
  @@sessions = ExtraLoop::Storage::ScrapingSession.all
15
23
  @@redis = Ohm.redis
16
24
 
@@ -18,16 +26,16 @@ class DataStoreCommand < Thor
18
26
  "d" => :delete,
19
27
  "e" => :export
20
28
 
21
- desc "list [sessions]", "List harvested datasets filtering by session id range (e.g '25..50')"
29
+ desc "list [sessions]", "Lists harvested datasets filtering by session id range (e.g '25..50')"
22
30
  def list(sessions=nil)
23
31
  data = (filter sessions).map { |session| [ session.id, session.title, session.model && session.model.name, session.model && session.records.size ]}
24
32
  $stdout.puts tabularize(%w[id title model records], data)
25
33
  end
26
34
 
27
- desc "delete [sessions]", "Remove datasets by session id or session id range"
35
+ desc "delete [sessions]", "Removes datasets by session id or session id range"
28
36
  def delete(sessions)
29
37
  deleted = 0
30
- (filter sessions).each { |session| (session.delete && session.records.each(&:delete) ) && deleted += 1 }
38
+ (filter sessions).each { |session| session.delete && session.records.each(&:delete) && deleted += 1 }
31
39
  $stderr.puts "\n => #{deleted > 0 && deleted or 'No' } record#{'s' if deleted > 1} deleted \n\n"
32
40
  list
33
41
  end
@@ -44,15 +52,33 @@ class DataStoreCommand < Thor
44
52
  format = options[:format]
45
53
  dir = options[:directory]
46
54
  exception = DataStoreCommand::Exceptions::FormatNotImplemented.new "Format not supported #{format}"
47
- raise exception unless %w[json csv].include? format
55
+ raise exception unless %w[json csv yaml].include? format
48
56
  FileUtils.mkdir(dir) unless File.exists? dir
49
57
 
50
- (filter sessions).each do |session|
58
+ filter(sessions).each do |session|
51
59
  filename, data = *[ "#{session.id}_#{session.title.gsub(/\s/,"_")}", session.send("to_#{format}")]
52
60
  File.open("#{dir}/#{filename}.#{format}", "w") { |f| f.write data }
53
61
  end
54
62
  end
55
63
 
64
+ desc "push [sessions] [remote_store]", "Uploads one or several datasets to a remote data store"
65
+ method_option :schema, :type => 'hash', :aliases => "-s"
66
+ method_option :credentials, :type => 'string', :aliases => "-a"
67
+
68
+ def push(sessions, store_type=:fusion_tables)
69
+
70
+ filter(sessions).each do |session|
71
+ store_type = store_type.to_sym
72
+ begin
73
+ credentials = options.fetch('credentials', @@config[:datastore] && @@config[:datastore][:credentials] && @@config[:datastore][store_type]).split(':')
74
+ rescue NoMethodError
75
+ abort "Cannot find credentials for remote datastore.\nPlease specify them using the --credentials switch (e.g. 'andrea:mypassword')"
76
+ end
77
+ datastore = ExtraLoop::Storage::RemoteStore::get_transport(store_type, credentials)
78
+ datastore.push session
79
+ end
80
+ end
81
+
56
82
  # override default banner
57
83
  def self.banner(task, namespace = true, subcommand = false)
58
84
  "datastore#{task.formatted_usage(self, true, subcommand).gsub(/data_store_command/,'')}"
@@ -65,7 +91,6 @@ class DataStoreCommand < Thor
65
91
  exception = DataStoreCommand::Exceptions::FileNotFound.new "cannot find #{path}"
66
92
  raise exception unless File.exists?(path)
67
93
  (File.directory? path) && Dir["#{path}/*.rb"] or path
68
-
69
94
  end.flatten
70
95
 
71
96
  files.each { |file| require "./#{file}" }
@@ -79,7 +104,6 @@ class DataStoreCommand < Thor
79
104
  else
80
105
  @@sessions
81
106
  end
82
-
83
107
  end
84
108
 
85
109
  def tabularize(headers, data)
@@ -100,6 +124,7 @@ end
100
124
 
101
125
  class DataStoreCommand::Exceptions
102
126
  class FormatNotImplemented < StandardError; end
127
+ class UnknownDatastore < StandardError; end
103
128
  class FileNotFound < StandardError; end
104
129
  end
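The new +push+ command takes credentials from the <code>--credentials</code> option and otherwise falls back to the parsed <code>~/.extraloop.yml</code>. The following is a minimal sketch of that lookup order, not the code from this diff: +resolve_credentials+ is a hypothetical helper, and the nested layout of the configuration file (string keys, one entry per store type) is an assumption, since the diff does not show a sample file.

```ruby
require "yaml"

# Resolves "user:password" credentials for a remote store.
# `cli_options` mimics Thor's options hash; `config` mimics the parsed
# configuration file. The key layout used here is an assumption.
def resolve_credentials(cli_options, config, store_type)
  raw = cli_options["credentials"] ||
        config.dig("datastore", store_type.to_s, "credentials")
  return nil if raw.nil?

  # Split on the first colon only, so that passwords may contain colons.
  raw.split(":", 2)
end

config = YAML.load(<<~YML)
  datastore:
    fusion_tables:
      credentials: "andrea:my:password"
YML

from_file = resolve_credentials({}, config, :fusion_tables)
from_cli  = resolve_credentials({ "credentials" => "bob:secret" }, config, :fusion_tables)
missing   = resolve_credentials({}, {}, :fusion_tables)

p from_file
p from_cli
p missing
```

Returning +nil+ when nothing is found lets the caller print a usage hint explicitly instead of relying on a rescued +NoMethodError+.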
105
130
 
@@ -1,5 +1,6 @@
1
1
  require "rubygems"
2
2
  require "date"
3
+ require "pry"
3
4
  require "extraloop"
4
5
  require "../lib/extraloop/redis-storage.rb"
5
6
  require "./lib/models/amazon_review.rb"
@@ -10,4 +10,7 @@ class ExtraLoop::Storage::Model < Ohm::Model
10
10
  memo.merge(attribute => send(attribute))
11
11
  })
12
12
  end
13
+ def to_yaml
14
+ to_hash.to_yaml
15
+ end
13
16
  end
@@ -22,6 +22,10 @@ class ExtraLoop::Storage::Record < Ohm::Model
22
22
  })
23
23
  end
24
24
 
25
+ def to_yaml
26
+ to_hash.to_yaml
27
+ end
28
+
25
29
  def validate
26
30
  assert_present :session
27
31
  end
@@ -0,0 +1,41 @@
+ class ExtraLoop::Storage::FusionTables
+   @@connection = nil
+
+   def initialize(credentials, options={})
+     @options = options
+     @credentials = credentials
+     @api = connect
+   end
+
+   def push(session)
+     dataset = session.to_hash
+     records = dataset[:records]
+     title = dataset[:title].gsub(/\sDataset/,'')
+     schema = make_schema(records.first)
+
+     table = @api.create_table("Dataset #{title}", schema)
+     table.insert records
+   end
+
+   private
+   def make_schema(record)
+     defaults = {
+       'session_id' => 'number'
+     }
+
+     schema = defaults.merge(@options.fetch :schema, {})
+
+     record.keys.
+       reject { |key| schema.keys.include?(key) }.
+       map { |key| {:name => key.to_s, :type => 'string'} }.
+       concat(schema.map { |field, type| {:name => field.to_s, :type => type }})
+   end
+
+   def connect
+     return @@connection if @@connection
+
+     @@connection = GData::Client::FusionTables.new
+     @@connection.clientlogin(*@credentials)
+     @@connection
+   end
+ end
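The +make_schema+ step in this file can be tried in isolation. The sketch below is a standalone rewrite of the same logic under two assumptions: records are hashes with string keys, and type overrides are passed as a plain hash rather than through <code>@options</code>. Every field without an explicit type becomes a +string+ column, while +session_id+ defaults to +number+.

```ruby
# Standalone version of the schema-building step (assumption: records are
# hashes with string keys; overrides map field names to column types).
def make_schema(record, overrides = {})
  typed = { "session_id" => "number" }.merge(overrides)

  # Fields without an explicit type fall back to 'string'.
  untyped = record.keys
                  .reject { |key| typed.key?(key) }
                  .map { |key| { :name => key.to_s, :type => "string" } }

  # Explicitly typed fields are appended after the untyped ones.
  untyped + typed.map { |field, type| { :name => field.to_s, :type => type } }
end

record = { "session_id" => "52", "title" => "Great product", "rank" => "5" }
schema = make_schema(record, "rank" => "number")

schema.each { |column| puts "#{column[:name]}: #{column[:type]}" }
```

As in the original method, a typed field is emitted even when the sample record does not contain it, so overrides should only name fields that actually exist in the dataset.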
@@ -0,0 +1,13 @@
+ # Base class for pushing Extraloop datasets from the local Redis
+ # store to remote ones (e.g. Google Fusion tables, Buzzdata, Cartodb)
+
+ $: << path = File.dirname(__FILE__) + '/remote_store'
+ Dir["#{path}/*.rb"].each { |store_adapter| require store_adapter }
+
+
+ class ExtraLoop::Storage::RemoteStore
+   def self.get_transport(datastore, credentials, options={})
+     classname = datastore.to_s.gsub(/^.|_./) { |chars| chars.split("").last.upcase }
+     ExtraLoop::Storage.const_get(classname).new(credentials, options) if ExtraLoop::Storage.const_defined?(classname)
+   end
+ end
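+get_transport+ maps a store name such as +fusion_tables+ to a class name with a single regular expression and then looks that class up by name. The sketch below reproduces the conversion and the lookup against a hypothetical +Adapters+ namespace (standing in for <code>ExtraLoop::Storage</code>); +store_class_name+ and +get_adapter+ are illustrative names, not part of the gem.

```ruby
# Converts a snake_case store name into a CamelCase class name, using the
# same regular expression as get_transport: the first character and every
# "_x" pair are replaced by the upper-cased last character of the match.
def store_class_name(datastore)
  datastore.to_s.gsub(/^.|_./) { |chars| chars[-1].upcase }
end

# Hypothetical namespace standing in for ExtraLoop::Storage.
module Adapters
  class FusionTables
    def initialize(credentials)
      @credentials = credentials
    end
  end
end

# Returns an adapter instance, or nil when no matching class is defined.
def get_adapter(datastore, credentials)
  name = store_class_name(datastore)
  return nil unless Adapters.const_defined?(name, false)

  Adapters.const_get(name, false).new(credentials)
end

known   = get_adapter(:fusion_tables, %w[user secret])
unknown = get_adapter(:carto_db, %w[user secret])

puts store_class_name(:fusion_tables)
p known.class
p unknown
```

Because an unknown store yields +nil+, a caller that immediately invokes +push+ on the result would raise +NoMethodError+; checking for +nil+ and raising a dedicated error (such as the +UnknownDatastore+ exception added in this release) is one way to surface the problem more clearly.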
@@ -1,7 +1,5 @@
1
1
  class ExtraLoop::Storage::ScrapingSession < Ohm::Model
2
2
 
3
- BOM = "\377\376" #Byte Order Mark
4
-
5
3
  include Ohm::Boundaries
6
4
  include Ohm::Timestamping
7
5
  include Ohm::Callbacks
@@ -9,22 +7,20 @@ class ExtraLoop::Storage::ScrapingSession < Ohm::Model
9
7
  attribute :title
10
8
  reference :model, ExtraLoop::Storage::Model
11
9
 
12
-
13
-
14
10
  def records(params={})
15
11
  klass = if Object.const_defined?(model.name)
16
- Object.const_get(model.name)
17
- else
18
- dynamic_class = Class.new(ExtraLoop::Storage::Record) do
19
- # override default to_hash so that it will return the Redis hash
20
- # internally stored by Ohm
21
- def to_hash
22
- Ohm.redis.hgetall self.key
23
- end
12
+ Object.const_get(model.name)
13
+ else
14
+ dynamic_class = Class.new(ExtraLoop::Storage::Record) do
15
+ # override default to_hash so that it will return the Redis hash
16
+ # internally stored by Ohm
17
+ def to_hash
18
+ Ohm.redis.hgetall self.key
24
19
  end
20
+ end
25
21
 
26
- Object.const_set(model.name, dynamic_class)
27
- dynamic_class
22
+ Object.const_set(model.name, dynamic_class)
23
+ dynamic_class
28
24
  end
29
25
 
30
26
  # set a session index, so that Ohm finder will work
@@ -56,4 +52,8 @@ class ExtraLoop::Storage::ScrapingSession < Ohm::Model
56
52
  data = [header].concat _records.map(&:values)
57
53
  output = data.map { |cells| CSV.generate_line cells }.join
58
54
  end
55
+
56
+ def to_yaml
57
+ to_hash.to_yaml
58
+ end
59
59
  end
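All three +to_yaml+ methods added in this release delegate to <code>to_hash.to_yaml</code>. The sketch below shows that pattern on a hypothetical +FakeSession+ class. The shape of the hash (string keys, a +records+ array) is an assumption made for illustration and may differ from what <code>ScrapingSession#to_hash</code> actually returns.

```ruby
require "yaml"

# Hypothetical stand-in for a scraping session: any object exposing
# to_hash can be serialised the same way.
class FakeSession
  def initialize(id, title, records)
    @id = id
    @title = title
    @records = records
  end

  def to_hash
    { "id" => @id, "title" => @title, "records" => @records }
  end

  # Same delegation as in the diff: serialise the hash, not the object.
  def to_yaml
    to_hash.to_yaml
  end
end

session = FakeSession.new(52, "1330111630 AmazonReview Dataset",
                          [{ "title" => "Great product", "rank" => "5" }])
yaml = session.to_yaml
puts yaml
```

Serialising the hash rather than the object keeps Ruby-specific type tags out of the output, so the exported file can be read by non-Ruby YAML parsers.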
@@ -1,10 +1,19 @@
1
+ require "rubygems"
1
2
  require "json"
3
+ require "yaml"
2
4
  require "rubygems"
3
5
  require "redis"
4
6
  require "ohm"
5
7
  require "ohm/contrib"
6
8
  require "extraloop"
7
9
 
10
+ begin
11
+ gem "fusion_tables", "~> 0.3.1"
12
+ require "fusion_tables"
13
+ rescue Gem::LoadError
14
+ end
15
+
16
+
8
17
  base_path = File.realpath(File.dirname(__FILE__))
9
18
  $: << "#{base_path}"
10
19
 
@@ -12,7 +21,7 @@ require "scraper_base"
12
21
 
13
22
  module ExtraLoop
14
23
  module Storage
15
- VERSION ||= "0.0.1"
24
+ VERSION ||= "0.0.7"
16
25
 
17
26
  def self.connect(*args)
18
27
  Ohm.connect(*args)
@@ -26,10 +35,13 @@ module ExtraLoop
26
35
  end
27
36
 
28
37
  autoload :CSV, 'csv'
29
- autoload :Iconv, 'iconv'
30
- ExtraLoop::Storage.autoload :Record, "#{base_path}/redis-storage/record.rb"
31
- ExtraLoop::Storage.autoload :ScrapingSession, "#{base_path}/redis-storage/scraping_session.rb"
32
- ExtraLoop::Storage.autoload :Model, "#{base_path}/redis-storage/model.rb"
33
- ExtraLoop::Storage.autoload :DatasetFactory, "#{base_path}/redis-storage/dataset_factory.rb"
38
+ autoload :Etc, 'etc'
39
+
40
+ base_path << "/redis-storage"
34
41
 
42
+ ExtraLoop::Storage.autoload :Record, "#{base_path}/record.rb"
43
+ ExtraLoop::Storage.autoload :ScrapingSession, "#{base_path}/scraping_session.rb"
44
+ ExtraLoop::Storage.autoload :Model, "#{base_path}/model.rb"
45
+ ExtraLoop::Storage.autoload :DatasetFactory, "#{base_path}/dataset_factory.rb"
46
+ ExtraLoop::Storage.autoload :RemoteStore, "#{base_path}/remote_store.rb"
35
47
 
@@ -1,11 +1,13 @@
1
1
  class ExtraLoop::ScraperBase
2
2
  attr_reader :session
3
3
 
4
- def set_storage(model, title=nil)
4
+
5
+ def set_storage(model, title=nil, options={})
5
6
  collection_name = "#{Time.now.to_i} #{model.to_s} Dataset"
6
7
  title ||= collection_name
7
8
 
8
9
  @model = model_klass = model.respond_to?(:new) && model || ExtraLoop::Storage::DatasetFactory.new(model.to_sym, @extractor_args.map(&:first)).get_class
10
+
9
11
  log_session! title
10
12
 
11
13
  on :data do |results|
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: extraloop-redis-storage
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.6
4
+ version: 0.0.7
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-02-26 00:00:00.000000000Z
12
+ date: 2012-03-11 00:00:00.000000000Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: extraloop
16
- requirement: &10773840 !ruby/object:Gem::Requirement
16
+ requirement: &14542580 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ~>
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: 0.0.3
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *10773840
24
+ version_requirements: *14542580
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: ohm
27
- requirement: &10773380 !ruby/object:Gem::Requirement
27
+ requirement: &14542000 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ~>
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: 0.1.3
33
33
  type: :runtime
34
34
  prerelease: false
35
- version_requirements: *10773380
35
+ version_requirements: *14542000
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: ohm-contrib
38
- requirement: &10772860 !ruby/object:Gem::Requirement
38
+ requirement: &14507840 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: 0.1.2
44
44
  type: :runtime
45
45
  prerelease: false
46
- version_requirements: *10772860
46
+ version_requirements: *14507840
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: thor
49
- requirement: &10772260 !ruby/object:Gem::Requirement
49
+ requirement: &14507360 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - =
@@ -54,10 +54,21 @@ dependencies:
54
54
  version: 0.14.6
55
55
  type: :runtime
56
56
  prerelease: false
57
- version_requirements: *10772260
57
+ version_requirements: *14507360
58
+ - !ruby/object:Gem::Dependency
59
+ name: rake
60
+ requirement: &14506900 !ruby/object:Gem::Requirement
61
+ none: false
62
+ requirements:
63
+ - - ! '>='
64
+ - !ruby/object:Gem::Version
65
+ version: '0'
66
+ type: :development
67
+ prerelease: false
68
+ version_requirements: *14506900
58
69
  - !ruby/object:Gem::Dependency
59
70
  name: rspec
60
- requirement: &10771600 !ruby/object:Gem::Requirement
71
+ requirement: &14506220 !ruby/object:Gem::Requirement
61
72
  none: false
62
73
  requirements:
63
74
  - - ~>
@@ -65,10 +76,10 @@ dependencies:
65
76
  version: 2.7.0
66
77
  type: :development
67
78
  prerelease: false
68
- version_requirements: *10771600
79
+ version_requirements: *14506220
69
80
  - !ruby/object:Gem::Dependency
70
81
  name: rr
71
- requirement: &10771000 !ruby/object:Gem::Requirement
82
+ requirement: &14505600 !ruby/object:Gem::Requirement
72
83
  none: false
73
84
  requirements:
74
85
  - - ~>
@@ -76,10 +87,10 @@ dependencies:
76
87
  version: 1.0.4
77
88
  type: :development
78
89
  prerelease: false
79
- version_requirements: *10771000
90
+ version_requirements: *14505600
80
91
  - !ruby/object:Gem::Dependency
81
92
  name: pry
82
- requirement: &10770520 !ruby/object:Gem::Requirement
93
+ requirement: &14505020 !ruby/object:Gem::Requirement
83
94
  none: false
84
95
  requirements:
85
96
  - - ~>
@@ -87,8 +98,10 @@ dependencies:
87
98
  version: 0.9.7.4
88
99
  type: :development
89
100
  prerelease: false
90
- version_requirements: *10770520
91
- description: Redis+Ohm based storage for data sets extracted using the ExtraLoop toolkit.
101
+ version_requirements: *14505020
102
+ description: Redis-based Persistence layer for the ExtraLoop data extraction toolkit.
103
+ Includes a convenient command line tool for listing, filtering, deleting, and exporting
104
+ harvested datasets
92
105
  email: andrea.giulio.fiore@googlemail.com
93
106
  executables:
94
107
  - extraloop
@@ -105,6 +118,8 @@ files:
105
118
  - lib/extraloop/redis-storage/dataset_factory.rb
106
119
  - lib/extraloop/redis-storage/model.rb
107
120
  - lib/extraloop/redis-storage/record.rb
121
+ - lib/extraloop/redis-storage/remote_store.rb
122
+ - lib/extraloop/redis-storage/remote_store/fusion_tables.rb
108
123
  - lib/extraloop/redis-storage/scraping_session.rb
109
124
  - lib/extraloop/scraper_base.rb
110
125
  - spec/dataset_factory_spec.rb
@@ -125,6 +140,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
125
140
  - - ! '>='
126
141
  - !ruby/object:Gem::Version
127
142
  version: '0'
143
+ segments:
144
+ - 0
145
+ hash: 1448249409185434738
128
146
  required_rubygems_version: !ruby/object:Gem::Requirement
129
147
  none: false
130
148
  requirements:
@@ -138,4 +156,3 @@ signing_key:
138
156
  specification_version: 2
139
157
  summary: Redis storage for Extraloop.
140
158
  test_files: []
141
- has_rdoc: