data_miner 2.0.1 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: data_miner
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.1
4
+ version: 2.0.2
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -11,7 +11,7 @@ authors:
11
11
  autorequire:
12
12
  bindir: bin
13
13
  cert_chain: []
14
- date: 2012-04-18 00:00:00.000000000 Z
14
+ date: 2012-05-04 00:00:00.000000000 Z
15
15
  dependencies:
16
16
  - !ruby/object:Gem::Dependency
17
17
  name: remote_table
@@ -141,7 +141,8 @@ dependencies:
141
141
  - - ! '>='
142
142
  - !ruby/object:Gem::Version
143
143
  version: 0.5.1
144
- description: Mine remote data into your ActiveRecord models. You can also convert
144
+ description: Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import
145
+ XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. You can also convert
145
146
  units.
146
147
  email:
147
148
  - seamus@abshere.net
@@ -153,11 +154,11 @@ files:
153
154
  - CHANGELOG
154
155
  - Gemfile
155
156
  - LICENSE
156
- - README.rdoc
157
+ - README.markdown
157
158
  - Rakefile
158
159
  - data_miner.gemspec
159
160
  - lib/data_miner.rb
160
- - lib/data_miner/active_record_extensions.rb
161
+ - lib/data_miner/active_record_class_methods.rb
161
162
  - lib/data_miner/attribute.rb
162
163
  - lib/data_miner/dictionary.rb
163
164
  - lib/data_miner/run.rb
@@ -200,7 +201,8 @@ rubyforge_project: data_miner
200
201
  rubygems_version: 1.8.21
201
202
  signing_key:
202
203
  specification_version: 3
203
- summary: Mine remote data into your ActiveRecord models.
204
+ summary: Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import
205
+ XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.
204
206
  test_files:
205
207
  - test/helper.rb
206
208
  - test/support/breeds.xls
@@ -1,289 +0,0 @@
1
- =data_miner
2
-
3
- Programmatically import useful data into your ActiveRecord models.
4
-
5
- (see http://wiki.github.com/seamusabshere/data_miner for more examples)
6
-
7
- ==Quick start
8
-
9
- You define <tt>data_miner</tt> blocks in your ActiveRecord models. For example, in <tt>app/models/country.rb</tt>:
10
-
11
- class Country < ActiveRecord::Base
12
- self.primary_key = :iso_3166_code
13
-
14
- data_miner do
15
- import 'the official ISO country list',
16
- :url => 'http://www.iso.org/iso/list-en1-semic-3.txt',
17
- :skip => 2,
18
- :headers => false,
19
- :delimiter => ';',
20
- :encoding => 'ISO-8859-1' do
21
- key :iso_3166_code, :field_number => 1
22
- store :name, :field_number => 0
23
- end
24
- end
25
- end
26
-
27
- Now you can run:
28
-
29
- irb(main):001:0> Country.run_data_miner!
30
- => nil
31
-
32
- == Creating tables from scratch (changed in 1.2)
33
-
34
- We recommend using the <tt>mini_record-compat</tt> gem (https://github.com/seamusabshere/mini_record)
35
-
36
- This replaces the <tt>schema</tt> method that was available before. It didn't make sense for <tt>data_miner</tt> to provide this natively.
37
-
38
- class Car < ActiveRecord::Base
39
- # the mini_record way
40
- col :make
41
- col :model
42
-
43
- data_miner do
44
- # DEPRECATED - see above
45
- # schema do
46
- # string :make
47
- # string :model
48
- # end
49
-
50
- # the mini_record way
51
- process :auto_upgrade!
52
-
53
- # [... other data mining steps]
54
- end
55
- end
56
-
57
- ==Advanced usage
58
-
59
- This is how we linked together (http://data.brighterplanet.com/aircraft) the FAA's list of aircraft with the US Department of Transportations list of aircraft:
60
-
61
- class Aircraft < ActiveRecord::Base
62
- # Tell ActiveRecord that we want to use a string primary key.
63
- # This makes it easier to repeatedly truncate and re-import this
64
- # table without breaking associations.
65
- self.primary_key = :icao_code
66
-
67
- # Use the mini_record-compat gem to define the database schema in-line.
68
- # It will destructively and automatically add/remove columns.
69
- # This is "OK" because you can always just re-run the import script to get the data back.
70
- # PS. If you're using DataMapper, you don't need this
71
- col :icao_code
72
- col :manufacturer_name
73
- col :name
74
- col :bts_name
75
- col :bts_aircraft_type_code
76
- col :brighter_planet_aircraft_class_code
77
- col :fuel_use_aircraft_name
78
- col :m3, :type => :float
79
- col :m3_units
80
- col :m2, :type => :float
81
- col :m2_units
82
- col :m1, :type => :float
83
- col :m1_units
84
- col :endpoint_fuel, :type => :float
85
- col :endpoint_fuel_units
86
- col :seats, :type => :float
87
- col :distance, :type => :float
88
- col :distance_units
89
- col :load_factor, :type => :float
90
- col :freight_share, :type => :float
91
- col :payload, :type => :float
92
- col :weighting, :type => :float
93
- col :bts_aircraft_type_code, :type => :index
94
-
95
- # A dictionary between BTS aircraft type codes and ICAO aircraft
96
- # codes that uses string similarity instead of exact matching.
97
- # This is preferable to typing everything out.
98
- def self.bts_name_dictionary
99
- # Sorry for documenting the LooseTightDictionary gem here, but it's useful
100
- @_bts_dictionary ||= LooseTightDictionary.new(
101
- # The first argument is the source... the possible matches. Most Enumerables will do.
102
- RemoteTable.new(:url => 'http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_AIRCRAFT_TYPE', :select => lambda { |record| record['Code'].to_i.between?(1, 998) }),
103
- # Tightenings optionally pull out what is important on both sides of a potential match
104
- :tightenings => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=0&output=csv', :headers => false),
105
- # Identities optionally require a particular capture from both sides of a match to be equal
106
- :identities => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=3&output=csv', :headers => false),
107
- # Blockings restrict comparisons to a subset where everything matches the blocking
108
- :blockings => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=4&output=csv', :headers => false),
109
- # This means that lookups that don't match a blocking won't be compared to possible matches that **do** match a blocking.
110
- # This is useful because we say /boeing/ and only boeings are matched against other boeings.
111
- :blocking_only => true,
112
- # Tell the dictionary how read things from the source.
113
- :right_reader => lambda { |record| record['Description'] }
114
- )
115
- end
116
-
117
- # A dictionary between what appear to be ICAO aircraft names and
118
- # objects of this class itself.
119
- # Warning: self-referential (it calls Aircraft.all) so it should be run after the first DataMiner step.
120
- def self.icao_name_dictionary
121
- @_icao_dictionary ||= LooseTightDictionary.new Aircraft.all,
122
- :tightenings => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=0&output=csv', :headers => false),
123
- :identities => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=3&output=csv', :headers => false),
124
- :blockings => RemoteTable.new(:url => 'http://spreadsheets.google.com/pub?key=tiS_6CCDDM_drNphpYwE_iw&single=true&gid=4&output=csv', :headers => false),
125
- :right_reader => lambda { |record| record.manufacturer_name.to_s + ' ' + record.name.to_s }
126
- end
127
-
128
- # This responds to the "Matcher" interface as defined by DataMiner.
129
- # In other words, it takes Matcher#match(*args) and returns something.
130
- class BtsMatcher
131
- attr_reader :wants
132
- def initialize(wants)
133
- @wants = wants
134
- end
135
- def match(raw_faa_icao_record)
136
- @_match ||= Hash.new
137
- return @_match[raw_faa_icao_record] if @_match.has_key?(raw_faa_icao_record)
138
- faa_icao_record = [ raw_faa_icao_record['Manufacturer'] + ' ' + raw_faa_icao_record['Model'] ]
139
- bts_record = Aircraft.bts_name_dictionary.left_to_right faa_icao_record
140
- retval = case wants
141
- when :bts_aircraft_type_code
142
- bts_record['Code']
143
- when :bts_name
144
- bts_record['Description']
145
- end if bts_record
146
- @_match[raw_faa_icao_record] = retval
147
- end
148
- end
149
-
150
- # Another class that implements the "Matcher" interface as expected by DataMiner.
151
- class FuelUseMatcher
152
- def match(raw_fuel_use_record)
153
- @_match ||= Hash.new
154
- return @_match[raw_fuel_use_record] if @_match.has_key?(raw_fuel_use_record)
155
- # First try assuming we have an ICAO code
156
- aircraft_record = if raw_fuel_use_record['ICAO'] =~ /\A[0-9A-Z]+\z/
157
- Aircraft.find_by_icao_code raw_fuel_use_record['ICAO']
158
- end
159
- # No luck? then try a fuzzy match
160
- aircraft_record ||= if raw_fuel_use_record['Aircraft Name'].present?
161
- Aircraft.icao_name_dictionary.left_to_right [ raw_fuel_use_record['Aircraft Name'] ]
162
- end
163
- if aircraft_record
164
- @_match[raw_fuel_use_record] = aircraft_record.icao_code
165
- else
166
- # While we're developing the dictionary, we want it to blow up until we have 100% matchability
167
- raise "Didn't find a match for #{raw_fuel_use_record['Aircraft Name']} (#{raw_fuel_use_record['ICAO']}), which we found in the fuel use spreadsheet"
168
- end
169
- end
170
- end
171
-
172
- # This responds to the "Responder" interface as expected by Errata.
173
- # Basically it lets you say "Is a DC plane" in the errata file and
174
- # have it map to a Ruby method.
175
- class Guru
176
- def is_a_dc_plane?(row)
177
- row['Designator'] =~ /^DC\d/i
178
- end
179
- def is_a_g159?(row)
180
- row['Designator'] =~ /^G159$/
181
- end
182
- def is_a_galx?(row)
183
- row['Designator'] =~ /^GALX$/
184
- end
185
- def method_missing(method_id, *args, &block)
186
- if method_id.to_s =~ /\Ais_n?o?t?_?attributed_to_([^\?]+)/
187
- manufacturer_name = $1
188
- manufacturer_regexp = Regexp.new(manufacturer_name.gsub('_', ' ?'), Regexp::IGNORECASE)
189
- matches = manufacturer_regexp.match(args.first['Manufacturer']) # row['Manufacturer'] =~ /mcdonnell douglas/i
190
- method_id.to_s.include?('not_attributed') ? matches.nil? : !matches.nil?
191
- else
192
- super
193
- end
194
- end
195
- end
196
-
197
- data_miner do
198
- # In our app, we defined DataMiner::Run.allowed? to return false if a run
199
- # has taken place in the last hour (among other things).
200
- # By raising DataMiner::Skip, we skip this run but call it a success.
201
- process "Don't re-import too often" do
202
- raise DataMiner::Skip unless DataMiner::Run.allowed? Aircraft
203
- end
204
-
205
- # The FAA publishes a document to help people identify aircraft by different names.
206
- ('A'..'Z').each do |letter|
207
- import( "ICAO aircraft codes starting with the letter #{letter} used by the FAA",
208
- # The master URL of the source file (one for every letter)
209
- :url => "http://www.faa.gov/air_traffic/publications/atpubs/CNT/5-2-#{letter}.htm",
210
- # The RFC-style errata... note that it will use the Guru class we defined above. See the Errata gem for more details.
211
- :errata => Errata.new(:url => 'http://spreadsheets.google.com/pub?key=tObVAGyqOkCBtGid0tJUZrw', :responder => Aircraft::Guru.new),
212
- # If it's not UTF-8, you should say what it is so that we can iconv it!
213
- :encoding => 'windows-1252',
214
- # Nokogiri is being used to grab each row starting from the second
215
- :row_xpath => '//table/tr[2]/td/table/tr',
216
- # ditto... XPath for Nokogiri
217
- :column_xpath => 'td' ) do
218
- # The code that they use is in fact the ICAO code!
219
- key 'icao_code', :field_name => 'Designator'
220
- # We get this for free
221
- store 'manufacturer_name', :field_name => 'Manufacturer'
222
- # ditto
223
- store 'name', :field_name => 'Model'
224
- # Use the loose-tight dictionary.
225
- # It gets the entire input row to play with before deciding on an output.
226
- store 'bts_aircraft_type_code', :matcher => Aircraft::BtsMatcher.new(:bts_aircraft_type_code)
227
- store 'bts_name', :matcher => Aircraft::BtsMatcher.new(:bts_name)
228
- end
229
- end
230
-
231
- # Pull in some data that might only be important to Brighter Planet
232
- import "Brighter Planet's aircraft class codes",
233
- :url => 'http://static.brighterplanet.com/science/data/transport/air/bts_aircraft_type/bts_aircraft_types-brighter_planet_aircraft_classes.csv' do
234
- key 'bts_aircraft_type_code', :field_name => 'bts_aircraft_type'
235
- store 'brighter_planet_aircraft_class_code'
236
- end
237
-
238
- # Pull in fuel use equation (y = m3*x^3 + m2*x^2 + m1*x + endpoint_fuel).
239
- # This data comes from the EEA.
240
- import "pre-calculated fuel use equation coefficients",
241
- :url => 'http://static.brighterplanet.com/science/data/transport/air/fuel_use/aircraft_fuel_use_formulae.ods',
242
- :select => lambda { |row| row['ICAO'].present? or row['Aircraft Name'].present? } do
243
- # We want to key on ICAO code, but since it's sometimes missing, use the loose-tight dictionary we defined above.
244
- key 'icao_code', :matcher => Aircraft::FuelUseMatcher.new
245
- # Keep the name for sanity checking. Yes, we have 3 different "name" fields... they should all refer to the same aircraft.
246
- store 'fuel_use_aircraft_name', :field_name => 'Aircraft Name'
247
- store 'm3'
248
- store 'm2'
249
- store 'm1'
250
- store 'endpoint_fuel', :field_name => 'b'
251
- end
252
-
253
- # Use arel and the weighted_average gem to do some crazy averaging.
254
- # This assumes that you're dealing with the BTS T-100 flight segment data.
255
- # See http://data.brighterplanet.com/flight_segments for a pre-sanitized version.
256
- process "Derive some average flight characteristics from flight segments" do
257
- FlightSegment.run_data_miner!
258
- aircraft = Aircraft.arel_table
259
- segments = FlightSegment.arel_table
260
-
261
- conditional_relation = aircraft[:bts_aircraft_type_code].eq(segments[:bts_aircraft_type_code])
262
- update_all "seats = (#{FlightSegment.weighted_average_relation(:seats, :weighted_by => :passengers ).where(conditional_relation).to_sql})"
263
- update_all "distance = (#{FlightSegment.weighted_average_relation(:distance, :weighted_by => :passengers ).where(conditional_relation).to_sql})"
264
- update_all "load_factor = (#{FlightSegment.weighted_average_relation(:load_factor, :weighted_by => :passengers ).where(conditional_relation).to_sql})"
265
- update_all "freight_share = (#{FlightSegment.weighted_average_relation(:freight_share, :weighted_by => :passengers ).where(conditional_relation).to_sql})"
266
- update_all "payload = (#{FlightSegment.weighted_average_relation(:payload, :weighted_by => :passengers, :disaggregate_by => :departures_performed).where(conditional_relation).to_sql})"
267
-
268
- update_all "weighting = (#{segments.project(segments[:passengers].sum).where(aircraft[:bts_aircraft_type_code].eq(segments[:bts_aircraft_type_code])).to_sql})"
269
- end
270
-
271
- # And finally re-run the import of resources that depend on this model.
272
- # Don't worry about calling Aircraft.run_data_miner! at the top of AircraftManufacturer's data_miner block;
273
- # that's the right way to do dependencies. It won't get called twice in the same run.
274
- [ AircraftManufacturer ].each do |synthetic_resource|
275
- process "Synthesize #{synthetic_resource}" do
276
- synthetic_resource.run_data_miner!
277
- end
278
- end
279
- end
280
- end
281
-
282
- ==Authors
283
-
284
- * Seamus Abshere <seamus@abshere.net>
285
- * Andy Rossmeissl <andy@rossmeissl.net>
286
-
287
- ==Copyright
288
-
289
- Copyright (c) 2010 Brighter Planet. See LICENSE for details.
@@ -1,38 +0,0 @@
1
- require 'active_record'
2
- require 'lock_method'
3
-
4
- class DataMiner
5
- module ActiveRecordExtensions
6
- MUTEX = ::Mutex.new
7
-
8
- def data_miner_script
9
- @data_miner_script || MUTEX.synchronize do
10
- @data_miner_script ||= DataMiner::Script.new(self)
11
- end
12
- end
13
-
14
- def data_miner_runs
15
- DataMiner::Run.scoped :conditions => { :model_name => name }
16
- end
17
-
18
- def run_data_miner!
19
- data_miner_script.perform
20
- end
21
-
22
- def run_data_miner_on_parent_associations!
23
- reflect_on_all_associations(:belongs_to).reject do |assoc|
24
- assoc.options[:polymorphic]
25
- end.each do |non_polymorphic_belongs_to_assoc|
26
- non_polymorphic_belongs_to_assoc.klass.run_data_miner!
27
- end
28
- end
29
-
30
- def data_miner(options = {}, &blk)
31
- DataMiner.model_names.add name
32
- unless options[:append]
33
- @data_miner_script = nil
34
- end
35
- data_miner_script.append_block blk
36
- end
37
- end
38
- end