remote_table 1.4.0 → 2.0.0

data/CHANGELOG CHANGED
@@ -1,3 +1,20 @@
1
+ 2.0.0 / 2012-05-08
2
+
3
+ * Breaking changes
4
+
5
+ * New names for options... (not really breaking, these deprecated options are still accepted)
6
+ :errata -> :errata_settings
7
+ :transform -> :transform_settings
8
+ :select -> :pre_select (to avoid conflict with Enumerable#select)
9
+ :reject -> :pre_reject
10
+ :encoding -> :internal_encoding
11
+
12
+ * Enhancements
13
+
14
+ * Every option is documented
15
+ * Refactored to simplify and DRY
16
+ * Thread safe
17
+
1
18
  1.4.0 / 2012-04-12
2
19
 
3
20
  * Enhancements
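
The option renames listed under 2.0.0 amount to a small normalization step from old names to new ones. The sketch below is not the gem's implementation: `RENAMED_OPTIONS` and `normalize_settings` are hypothetical names introduced here, and only the old/new name pairs are taken from the changelog.

```ruby
# Hypothetical helper, not part of remote_table: maps the deprecated
# option names from the 2.0.0 changelog onto their new names.
RENAMED_OPTIONS = {
  :errata    => :errata_settings,
  :transform => :transform_settings,
  :select    => :pre_select,
  :reject    => :pre_reject,
  :encoding  => :internal_encoding
}

def normalize_settings(settings)
  settings.each_with_object({}) do |(name, value), memo|
    new_name = RENAMED_OPTIONS.fetch(name, name)
    # If both spellings are given, the value passed under the new name wins.
    memo[new_name] = value unless memo.key?(new_name) && name != new_name
  end
end

old_style = { :select => proc { |row| row['volume'].to_i > 0 }, :encoding => 'ISO-8859-1' }
new_style = normalize_settings(old_style)
p new_style.keys
```

Letting the new name win when both spellings are present is a design choice of this sketch, not documented behaviour of the gem.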
data/README.markdown CHANGED
@@ -1,25 +1,38 @@
1
1
  # remote_table
2
2
 
3
- Open local or remote XLSX, XLS, ODS, CSV and fixed-width files.
3
+ Open Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.
4
4
 
5
- ## Production usage
5
+ Tested on MRI 1.8, MRI 1.9, and JRuby 1.6.7+. Thread-safe.
6
6
 
7
- Used by [the Brighter Planet Reference Data web service](http://data.brighterplanet.com), the [`data_miner` gem](https://github.com/seamusabshere/data_miner), and the [`earth` gem](https://github.com/brighterplanet/earth).
7
+ ## Real-world usage
8
+
9
+ <p><a href="http://brighterplanet.com"><img src="https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png" alt="Brighter Planet logo"/></a></p>
10
+
11
+ We use `remote_table` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at
12
+
13
+ * [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)
14
+ * [Brighter Planet's reference data web service](http://data.brighterplanet.com)
15
+
16
+ It's also a big part of
17
+
18
+ * the [`data_miner`](https://github.com/seamusabshere/data_miner) library
19
+ * the [`earth`](https://github.com/brighterplanet/earth) library
8
20
 
9
21
  ## Example
10
22
 
11
- $ irb
12
- 1.9.3-p0 :001 > require 'remote_table'
23
+ >> require 'remote_table'
24
+ remote_table.rb:8:in `<top (required)>': iconv will be deprecated in the future, use String#encode instead.
25
+ [remote_table] Apologies - using iconv because Ruby 1.9.x's String#encode doesn't have transliteration tables (yet)
13
26
  => true
14
- 1.9.3-p0 :002 > t = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/98guide6.zip', :filename => '98guide6.csv'
15
- => #<RemoteTable:0x00000100851d98 @options={:filename=>"98guide6.csv"}, @url="http://www.fueleconomy.gov/FEG/epadata/98guide6.zip">
16
- 1.9.3-p0 :003 > t.rows.length
27
+ >> t = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/98guide6.zip', :filename => '98guide6.csv'
28
+ => #<RemoteTable:0x00000101b87390 @download_count_mutex=#<Mutex:0x00000101b87228>, @iconv_mutex=#<Mutex:0x00000101b87200>, @extend_bang_mutex=#<Mutex:0x00000101b871d8>, @errata_mutex=#<Mutex:0x00000101b871b0>, @cache=[], @download_count=0, @url="http://www.fueleconomy.gov/FEG/epadata/98guide6.zip", @format=nil, @headers=:first_row, @compression=:zip, @packing=nil, @streaming=false, @warn_on_multiple_downloads=true, @delimiter=",", @sheet=nil, @keep_blank_rows=false, @form_data=nil, @skip=0, @internal_encoding="UTF-8", @row_xpath=nil, @column_xpath=nil, @row_css=nil, @column_css=nil, @glob=nil, @filename="98guide6.csv", @transform_settings=nil, @cut=nil, @crop=nil, @schema=nil, @schema_name=nil, @pre_select=nil, @pre_reject=nil, @errata_settings=nil, @other_options={}, @transformer=#<RemoteTable::Transformer:0x00000101b8c2f0 @t=#<RemoteTable:0x00000101b87390 ...>, @legacy_transformer_mutex=#<Mutex:0x00000101b8c2a0>>, @local_copy=#<RemoteTable::LocalCopy:0x00000101b8bf58 @t=#<RemoteTable:0x00000101b87390 ...>, @encoded_io_mutex=#<Mutex:0x00000101b8be18>, @generate_mutex=#<Mutex:0x00000101b8bdc8>>>
29
+ >> t.rows.length
17
30
  => 806
18
- 1.9.3-p0 :004 > t.rows.first.length
31
+ >> t.rows.first.length
19
32
  => 26
20
- 1.9.3-p0 :005 > require 'pp'
33
+ >> require 'pp'
21
34
  => true
22
- 1.9.3-p0 :006 > pp t[23]
35
+ >> pp t[23]
23
36
  {"Class"=>"TWO SEATERS",
24
37
  "Manufacturer"=>"PORSCHE",
25
38
  "carline name"=>"BOXSTER",
@@ -47,7 +60,19 @@ Used by [the Brighter Planet Reference Data web service](http://data.brighterpla
47
60
  "eng dscr"=>"",
48
61
  "trans dscr"=>""}
49
62
 
50
- You get an <code>Array</code> of <code>Hash</code>es with **string keys**. If you set <code>:headers => false</code>, then you get an <code>Array</code> of <code>Array</code>s.
63
+ ## Columns and rows
64
+
65
+ * If there are headers, you get an <code>Array</code> of <code>Hash</code>es with **string keys**.
66
+ * If you set <code>:headers => false</code>, then you get an <code>Array</code> of <code>Array</code>s.
67
+
68
+ ## Row keys
69
+
70
+ Row keys are **strings**. Row keys are NOT symbolized.
71
+
72
+ row['foobar'] # correct
73
+ row[:foobar] # incorrect
74
+
75
+ You can call <code>symbolize_keys</code> yourself, but we don't do it automatically to avoid creating loads of garbage symbols.
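
A minimal sketch of the row-key behaviour described in the "Row keys" section, using a plain Ruby `Hash` in place of a real row. The sample data and the `symbolized` variable are made up for illustration, and ActiveSupport's `symbolize_keys` is replaced by a plain-Ruby equivalent so the snippet runs on its own.

```ruby
# A stand-in for one row as remote_table would yield it (string keys).
row = { 'Manufacturer' => 'PORSCHE', 'carline name' => 'BOXSTER' }

p row['Manufacturer']   # string key: finds the value
p row[:Manufacturer]    # symbol key: nil, because keys are not symbolized

# Plain-Ruby equivalent of ActiveSupport's symbolize_keys, applied on demand.
symbolized = row.each_with_object({}) { |(key, value), memo| memo[key.to_sym] = value }
p symbolized[:Manufacturer]
```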
51
76
 
52
77
  ## Supported formats
53
78
 
@@ -59,7 +84,7 @@ You get an <code>Array</code> of <code>Hash</code>es with **string keys**. If yo
59
84
  </tr>
60
85
  <tr>
61
86
  <td>Delimited (CSV, TSV, etc.)</td>
62
- <td>All <code>RemoteTable::Format::Delimited::FASTERCSV_OPTIONS</code>, for example <code>:col_sep</code>, are passed directly to fastercsv.</td>
87
+ <td>All <code>RemoteTable::Delimited::PASSTHROUGH_CSV_SETTINGS</code>, for example <code>:col_sep</code>, are passed directly to fastercsv.</td>
63
88
  <td>
64
89
  <a href="http://fastercsv.rubyforge.org/">fastercsv</a> (1.8);
65
90
  <a href="http://www.ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/index.html">stdlib</a> (1.9)
@@ -147,12 +172,12 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
147
172
  :row_xpath => '//table/tr[2]/td/table/tr',
148
173
  :column_xpath => 'td',
149
174
  :errata => { RemoteTable.new('https://spreadsheets.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdGVBRnhkRGhSaVptSDJ5bXJGbkpUSWc&output=csv', :responder => Aircraft::Guru.new },
150
- :select => lambda { |record| manufacturer_whitelist? record['Manufacturer'] })
175
+ :select => proc { |record| manufacturer_whitelist? record['Manufacturer'] })
151
176
 
152
177
  # OpenFlights.org airports database
153
178
  RemoteTable.new('https://openflights.svn.sourceforge.net/svnroot/openflights/openflights/data/airports.dat',
154
179
  :headers => %w{ id name city country_name iata_code icao_code latitude longitude altitude timezone daylight_savings },
155
- :select => lambda { |record| record['iata_code'].present? },
180
+ :select => proc { |record| record['iata_code'].present? },
156
181
  :errata => { RemoteTable.new('https://spreadsheets.google.com/pub?key=0AoQJbWqPrREqdFc2UzhQYU5PWEQ0N21yWFZGNmc2a3c&gid=0&output=csv', :responder => Airport::Guru.new }) # see https://github.com/brighterplanet/earth/blob/master/lib/earth/air/aircraft/data_miner.rb
157
182
 
158
183
  # T100 flight segment data for #{month.strftime('%B %Y')}
@@ -162,7 +187,7 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
162
187
  :compression => :zip,
163
188
  :glob => '/*.csv',
164
189
  :errata => { RemoteTable.new('https://spreadsheets.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdGxpYU1qWFR3d0syTVMyQVVOaDd0V3c&output=csv', :responder => FlightSegment::Guru.new },
165
- :select => lambda { |record| record['DEPARTURES_PERFORMED'].to_i > 0 })
190
+ :select => proc { |record| record['DEPARTURES_PERFORMED'].to_i > 0 })
166
191
 
167
192
  # 1995 Fuel Economy Guide
168
193
  # for definition of `:fuel_economy_guide_b` and `AutomobileMakeModelYearVariant::ParserB` see https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb
@@ -171,7 +196,7 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
171
196
  :format => :fixed_width,
172
197
  :cut => '13-',
173
198
  :schema_name => :fuel_economy_guide_b,
174
- :select => lambda { |row| row['model'].present? and (row['suppress_code'].blank? or row['suppress_code'].to_f == 0) and row['state_code'] == 'F' },
199
+ :select => proc { |row| row['model'].present? and (row['suppress_code'].blank? or row['suppress_code'].to_f == 0) and row['state_code'] == 'F' },
175
200
  :transform => { :class => AutomobileMakeModelYearVariant::ParserB, :year => 1995 },
176
201
  :errata => { :url => "https://docs.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdDkxTElWRVlvUXB3Uy04SDhSYWkzakE&output=csv", :responder => AutomobileMakeModelYearVariant::Guru.new })
177
202
 
@@ -181,30 +206,30 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
181
206
  :filename => '98guide6.csv',
182
207
  :transform => { :class => AutomobileMakeModelYearVariant::ParserC, :year => 1998 },
183
208
  :errata => { :url => "https://docs.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdDkxTElWRVlvUXB3Uy04SDhSYWkzakE&output=csv", :responder => AutomobileMakeModelYearVariant::Guru.new },
184
- :select => lambda { |row| row['model'].present? })
209
+ :select => proc { |row| row['model'].present? })
185
210
 
186
211
  # annual corporate average fuel economy data for domestic and imported vehicle fleets from the NHTSA
187
212
  RemoteTable.new('https://spreadsheets.google.com/pub?key=0AoQJbWqPrREqdEdXWXB6dkVLWkowLXhYSFVUT01sS2c&hl=en&gid=0&output=csv',
188
213
  :errata => { 'url' => 'http://static.brighterplanet.com/science/data/transport/automobiles/make_fleet_years/errata.csv' },
189
- :select => lambda { |row| row['volume'].to_i > 0 })
214
+ :select => proc { |row| row['volume'].to_i > 0 })
190
215
 
191
216
  # total vehicle miles travelled by gasoline passenger cars from the 2010 EPA GHG Inventory
192
217
  RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
193
218
  :filename => 'Annex Tables/Annex 3/Table A-87.csv',
194
219
  :skip => 1,
195
- :select => lambda { |row| row['Year'].to_i.to_s == row['Year'] })
220
+ :select => proc { |row| row['Year'].to_i.to_s == row['Year'] })
196
221
 
197
222
  # total vehicle miles travelled from the 2010 EPA GHG Inventory
198
223
  RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
199
224
  :filename => 'Annex Tables/Annex 3/Table A-87.csv',
200
225
  :skip => 1,
201
- :select => lambda { |row| row['Year'].to_i.to_s == row['Year'] })
226
+ :select => proc { |row| row['Year'].to_i.to_s == row['Year'] })
202
227
 
203
228
  # total travel distribution from the 2010 EPA GHG Inventory
204
229
  RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
205
230
  :filename => 'Annex Tables/Annex 3/Table A-93.csv',
206
231
  :skip => 1,
207
- :select => lambda { |row| row['Vehicle Age'].to_i.to_s == row['Vehicle Age'] })
232
+ :select => proc { |row| row['Vehicle Age'].to_i.to_s == row['Vehicle Age'] })
208
233
 
209
234
  # building characteristics from the 2003 EIA Commercial Building Energy Consumption Survey
210
235
  RemoteTable.new('http://www.eia.gov/emeu/cbecs/cbecs2003/public_use_2003/data/FILE02.csv',
@@ -215,7 +240,7 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
215
240
  # for definition of `CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER` see https://github.com/brighterplanet/earth/blob/master/lib/earth/industry/cbecs_energy_intensity/data_miner.rb
216
241
  RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
217
242
  :headers => false,
218
- :select => ::Proc.new { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
243
+ :select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
219
244
  :crop => (21..37))
220
245
 
221
246
  # U.S. Census 2002 NAICS code list
@@ -238,13 +263,13 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
238
263
  RemoteTable.new('http://www.census.gov/popest/about/geo/state_geocodes_v2009.txt',
239
264
  :skip => 6,
240
265
  :headers => %w{ Region Division FIPS Name },
241
- :select => ::Proc.new { |row| row['Division'].to_i > 0 and row['FIPS'].to_i == 0 })
266
+ :select => proc { |row| row['Division'].to_i > 0 and row['FIPS'].to_i == 0 })
242
267
 
243
268
  # state census divisions from the U.S. Census
244
269
  RemoteTable.new('http://www.census.gov/popest/about/geo/state_geocodes_v2009.txt',
245
270
  :skip => 8,
246
271
  :headers => ['Region', 'Division', 'State FIPS', 'Name'],
247
- :select => ::Proc.new { |row| row['State FIPS'].to_i > 0 })
272
+ :select => proc { |row| row['State FIPS'].to_i > 0 })
248
273
 
249
274
  # OpenGeoCode.org's Country Codes to Country Names list
250
275
  RemoteTable.new('http://opengeocode.org/download/countrynames.txt',
@@ -267,19 +292,19 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
267
292
  RemoteTable.new('http://www.epa.gov/cleanenergy/documents/egridzips/eGRID2010V1_1_STIE_USGC.xls',
268
293
  :sheet => 'STIE07',
269
294
  :skip => 4,
270
- :select => lambda { |row| row['eGRID2010 year 2007 file state sequence number'].to_i.between?(1, 51) })
295
+ :select => proc { |row| row['eGRID2010 year 2007 file state sequence number'].to_i.between?(1, 51) })
271
296
 
272
297
  # eGRID 2010 subregions and electricity emission factors
273
298
  RemoteTable.new('http://www.epa.gov/cleanenergy/documents/egridzips/eGRID2010_Version1-1_xls_only.zip',
274
299
  :filename => 'eGRID2010V1_1_year07_AGGREGATION.xls',
275
300
  :sheet => 'SRL07',
276
301
  :skip => 4,
277
- :select => lambda { |row| row['SEQSRL07'].to_i.between?(1, 26) })
302
+ :select => proc { |row| row['SEQSRL07'].to_i.between?(1, 26) })
278
303
 
279
304
  # U.S. Census State ANSI Code file
280
305
  RemoteTable.new('http://www.census.gov/geo/www/ansi/state.txt',
281
306
  :delimiter => '|',
282
- :select => lambda { |record| record['STATE'].to_i < 60 })
307
+ :select => proc { |record| record['STATE'].to_i < 60 })
283
308
 
284
309
  # Mapping Hacks zipcode database
285
310
  RemoteTable.new('http://mappinghacks.com/data/zipcode.zip',
@@ -295,18 +320,18 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
295
320
  # Brighter Planet's list of cat and dog breeds, genders, and weights
296
321
  RemoteTable.new('http://static.brighterplanet.com/science/data/consumables/pets/breed_genders.csv',
297
322
  :encoding => 'ISO-8859-1',
298
- :select => lambda { |row| row['gender'].present? })
323
+ :select => proc { |row| row['gender'].present? })
299
324
 
300
325
  # residential electricity prices from the EIA
301
326
  RemoteTable.new('http://www.eia.doe.gov/cneaf/electricity/page/sales_revenue.xls',
302
- :select => lambda { |row| row['Year'].to_s.first(4).to_i > 1989 })
327
+ :select => proc { |row| row['Year'].to_s.first(4).to_i > 1989 })
303
328
 
304
329
  # residential natural gas prices from the EIA
305
330
  # for definition of `NaturalGasParser` see https://github.com/brighterplanet/earth/blob/master/lib/earth/residence/residence_fuel_price/data_miner.rb
306
331
  RemoteTable.new('http://tonto.eia.doe.gov/dnav/ng/xls/ng_pri_sum_a_EPG0_FWA_DMcf_a.xls',
307
332
  :sheet => 'Data 1',
308
333
  :skip => 2,
309
- :select => lambda { |row| row['year'].to_i > 1989 },
334
+ :select => proc { |row| row['year'].to_i > 1989 },
310
335
  :transform => { :class => NaturalGasParser })
311
336
 
312
337
  # 2005 EIA Residential Energy Consumption Survey microdata
@@ -375,7 +400,7 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
375
400
  :format => :fixed_width,
376
401
  :crop => 21..26, # inclusive
377
402
  :cut => '2-',
378
- :select => lambda { |row| /\A[A-Z]/.match row['code'] },
403
+ :select => proc { |row| /\A[A-Z]/.match row['code'] },
379
404
  :schema => [[ 'code', 2, { :type => :string } ],
380
405
  [ 'spacer', 2 ],
381
406
  [ 'name', 52, { :type => :string } ]]
@@ -420,14 +445,11 @@ Everything is forced into UTF-8. You can improve the quality of the conversion b
420
445
 
421
446
  ## Requirements
422
447
 
423
- * MRI (not JRuby)
424
- * Unix tools like curl, iconv, perl, cat, cut, tail, etc. accessible from `ENV['PATH']`
425
-
426
- As this library matures, that requirement should go away.
448
+ * Unix tools like curl, iconv, perl, cat, cut, tail, etc. accessible from your `$PATH`
427
449
 
428
450
  ## Wishlist
429
451
 
430
- * JRuby and Win32 compat
452
+ * Win32 compat
431
453
  * The new "custom parser" syntax (aka transformer) hasn't been defined yet... only the old-style syntax is available
432
454
 
433
455
  ## Authors
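
Several of the README examples pass a proc that keeps only rows whose column looks like a plain integer (`row['Year'].to_i.to_s == row['Year']`). The sketch below applies the same test to hand-written rows, with no download involved; the sample rows and the `integer_like` name are invented for illustration.

```ruby
# Invented sample rows standing in for a parsed table with notes mixed in.
rows = [
  { 'Year' => '1990' },
  { 'Year' => 'Source: EPA' },
  { 'Year' => '1991' },
  { 'Year' => '' },
  { 'Year' => '1992a' }
]

# Same test as in the README: the string must survive an Integer round trip.
integer_like = proc { |row| row['Year'].to_i.to_s == row['Year'] }

kept = rows.select(&integer_like)
p kept.map { |row| row['Year'] }
```

Values with leading zeros or surrounding whitespace, such as `'007'` or `' 1990'`, also fail this round-trip test, which may or may not be what a given data source needs.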
data/lib/remote_table.rb CHANGED
@@ -3,6 +3,14 @@ if ::RUBY_VERSION < '1.9' and $KCODE != 'UTF8'
3
3
  $KCODE = 'UTF8'
4
4
  end
5
5
 
6
+ require 'thread'
7
+
8
+ require 'iconv'
9
+ if RUBY_VERSION >= '1.9'
10
+ # for an excellent explanation see http://blog.segment7.net/2010/12/17/from-iconv-iconv-to-string-encode
11
+ Kernel.warn "[remote_table] Apologies - using iconv because Ruby 1.9.x's String#encode doesn't have transliteration tables (yet)"
12
+ end
13
+
6
14
  require 'active_support'
7
15
  require 'active_support/version'
8
16
  if ::ActiveSupport::VERSION::MAJOR >= 3
@@ -10,78 +18,432 @@ if ::ActiveSupport::VERSION::MAJOR >= 3
10
18
  end
11
19
  require 'hash_digest'
12
20
 
13
- require 'remote_table/format'
14
- require 'remote_table/config'
15
- require 'remote_table/local_file'
21
+ require 'remote_table/local_copy'
16
22
  require 'remote_table/transformer'
17
23
 
24
+ require 'remote_table/plaintext'
25
+ require 'remote_table/processed_by_roo'
26
+ require 'remote_table/processed_by_nokogiri'
27
+ require 'remote_table/xls'
28
+ require 'remote_table/xlsx'
29
+ require 'remote_table/delimited'
30
+ require 'remote_table/ods'
31
+ require 'remote_table/fixed_width'
32
+ require 'remote_table/html'
33
+ require 'remote_table/xml'
34
+ require 'remote_table/yaml'
35
+
18
36
  class Hash
37
+ # Added by remote_table to store a hash (think checksum) of the data with which a particular Hash is initialized.
38
+ # @return [String]
19
39
  attr_accessor :row_hash
20
40
  end
21
41
 
22
42
  class Array
43
+ # Added by remote_table to store a hash (think checksum) of the data with which a particular Array is initialized.
44
+ # @return [String]
23
45
  attr_accessor :row_hash
24
46
  end
25
47
 
26
- class RemoteTable
27
- # Legacy
48
+ # Open Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.
49
+ class RemoteTable
50
+ class << self
51
+ # Guess compression based on URL. Used internally.
52
+ # @return [Symbol,nil]
53
+ def guess_compression(url)
54
+ extname = ::File.extname(::URI.parse(url).path).downcase
55
+ case extname
56
+ when /gz/, /gunzip/
57
+ :gz
58
+ when /zip/
59
+ :zip
60
+ when /bz2/, /bunzip2/
61
+ :bz2
62
+ when /exe/
63
+ :exe
64
+ end
65
+ end
66
+
67
+ # Guess packing from URL. Used internally.
68
+ # @return [Symbol,nil]
69
+ def guess_packing(url)
70
+ basename = ::File.basename(::URI.parse(url).path).downcase
71
+ if basename.include?('.tar') or basename.include?('.tgz')
72
+ :tar
73
+ end
74
+ end
75
+
76
+ # Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can't be called until a file is downloaded.
77
+ # @return [Symbol,nil]
78
+ def guess_format(basename)
79
+ case basename.to_s.downcase
80
+ when /ods/, /open_?office/
81
+ :ods
82
+ when /xlsx/, /excelx/
83
+ :xlsx
84
+ when /xls/, /excel/
85
+ :xls
86
+ when /csv/, /tsv/, /delimited/
87
+ # note that there is no RemoteTable::Csv class - it's normalized to :delimited
88
+ :delimited
89
+ when /fixed_?width/
90
+ :fixed_width
91
+ when /htm/
92
+ :html
93
+ when /xml/
94
+ :xml
95
+ when /yaml/, /yml/
96
+ :yaml
97
+ end
98
+ end
99
+
100
+ # Given a Google Docs spreadsheet URL, make sure it uses CSV output.
101
+ # @return [String]
102
+ def google_spreadsheet_csv_url(url)
103
+ uri = ::URI.parse url
104
+ params = uri.query.split('&')
105
+ params.delete_if { |param| param.start_with?('output=') }
106
+ params << 'output=csv'
107
+ uri.query = params.join('&')
108
+ uri.to_s
109
+ end
110
+ end
111
+
112
+ # @private
113
+ # Here to support legacy code.
28
114
  class Transform
29
115
  def self.row_hash(row)
30
116
  ::HashDigest.hexdigest row
31
117
  end
32
118
  end
33
119
 
120
+ EXTERNAL_ENCODING = 'UTF-8'
121
+ EXTERNAL_ENCODING_ICONV = 'UTF-8//TRANSLIT'
122
+ GOOGLE_DOCS_SPREADSHEET = [
123
+ /docs.google.com/i,
124
+ /spreadsheets.google.com/i
125
+ ]
126
+ VALID = {
127
+ :compression => [:gz, :zip, :bz2, :exe],
128
+ :packing => [:tar],
129
+ :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv],
130
+ }
131
+ DEFAULT = {
132
+ :streaming => false,
133
+ :warn_on_multiple_downloads => true,
134
+ :headers => :first_row,
135
+ :keep_blank_rows => false,
136
+ :skip => 0,
137
+ :internal_encoding => 'UTF-8',
138
+ :delimiter => ','
139
+ }
140
+ OLD_SETTING_NAMES = {
141
+ :internal_encoding => [:encoding],
142
+ :transform_settings => [:transform],
143
+ :pre_select => [:select],
144
+ :pre_reject => [:reject],
145
+ :errata_settings => [:errata],
146
+ }
147
+
34
148
  include ::Enumerable
35
149
 
150
+ # The URL of the local or remote file.
151
+ #
152
+ # * Local: "file:///Users/myuser/Desktop/holidays.csv"
153
+ # * Remote: "http://data.brighterplanet.com/countries.csv"
154
+ #
155
+ # @return [String]
36
156
  attr_reader :url
37
- attr_reader :config
157
+
158
+ # @private
159
+ # A cache of rows, created unless +:streaming+ is enabled.
160
+ # @return [Array<Hash,Array>]
161
+ attr_reader :cache
162
+
163
+ # @private
164
+ # How many times this file has been downloaded. RemoteTable will emit a warning if you download it more than once.
165
+ # @return [Integer]
166
+ attr_reader :download_count
167
+
168
+ # @private
169
+ # Used internally to access the transformer (aka parser).
170
+ attr_reader :transformer
171
+
172
+ # @private
173
+ # Used internally to access to a downloaded copy of the file.
174
+ # @return [RemoteTable::LocalCopy]
175
+ attr_reader :local_copy
176
+
177
+ # Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.
178
+ # @return [true,false]
179
+ attr_reader :streaming
180
+
181
+ # Whether to warn the user on multiple downloads. Defaults to true.
182
+ # @return [true,false]
183
+ attr_reader :warn_on_multiple_downloads
184
+
185
+ # Headers specified by the user: +:first_row+ (the default), +false+, or a list of headers.
186
+ # @return [:first_row,false,Array<String>]
187
+ attr_reader :headers
188
+
189
+ # The sheet specified by the user as a number or a string.
190
+ # @return[String,Integer]
191
+ attr_reader :sheet
192
+
193
+ # Whether to keep blank rows. Default is false.
194
+ # @return [true,false]
195
+ attr_reader :keep_blank_rows
196
+
197
+ # Form data to POST in the download request. It should probably be in +application/x-www-form-urlencoded+.
198
+ # @return [String]
199
+ attr_reader :form_data
200
+
201
+ # How many rows to skip at the beginning of the file or table. Default is 0.
202
+ # @return [Integer]
203
+ attr_reader :skip
204
+
205
+ # The original encoding of the source file. Default is UTF-8. Previously passed as +:encoding+.
206
+ # @return [String]
207
+ attr_reader :internal_encoding
208
+
209
+ # The delimiter, a.k.a. column separator. Passed to Ruby CSV as +:col_sep+. Default is ','.
210
+ # @return [String]
211
+ attr_reader :delimiter
212
+
213
+ # The XPath used to find rows in HTML or XML.
214
+ # @return [String]
215
+ attr_reader :row_xpath
216
+
217
+ # The XPath used to find columns in HTML or XML.
218
+ # @return [String]
219
+ attr_reader :column_xpath
220
+
221
+ # The CSS selector used to find rows in HTML or XML.
222
+ # @return [String]
223
+ attr_reader :row_css
224
+
225
+ # The CSS selector used to find columns in HTML or XML.
226
+ # @return [String]
227
+ attr_reader :column_css
228
+
229
+ # The format of the source file. Can be +:xlsx+, +:xls+, +:delimited+, +:ods+, +:fixed_width+, +:html+, +:xml+, +:yaml+.
230
+ # @return [Symbol]
231
+ attr_reader :format
232
+
233
+ # The compression type. Guessed from URL if not provided. +:gz+, +:zip+, +:bz2+, and +:exe+ (treated as +:zip+) are supported.
234
+ # @return [Symbol]
235
+ attr_reader :compression
236
+
237
+ # The packing type. Guessed from URL if not provided. Only +:tar+ is supported.
238
+ # @return [Symbol]
239
+ attr_reader :packing
240
+
241
+ # The glob used to pick a file out of an archive.
242
+ #
243
+ # @return [String]
244
+ #
245
+ # @example Pick out the only CSV in a ZIP file
246
+ # RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'
247
+ attr_reader :glob
248
+
249
+ # The filename, which can be used to pick a file out of an archive.
250
+ #
251
+ # @return [String]
252
+ #
253
+ # @example Specify the filename to get out of a ZIP file
254
+ # RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'
255
+ attr_reader :filename
256
+
257
+ # Pick specific columns out of a plaintext file using an argument to the UNIX [+cut+ utility](http://en.wikipedia.org/wiki/Cut_%28Unix%29).
258
+ #
259
+ # @return [String]
260
+ #
261
+ # @example Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ
262
+ # # $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
263
+ # # ALMOST
264
+ # RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'
265
+ attr_reader :cut
266
+
267
+ # Use a range of rows in a plaintext file.
268
+ #
269
+ # @return [Range]
270
+ #
271
+ # @example Only take rows 21 through 37
272
+ # RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
273
+ # :headers => false,
274
+ # :select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
275
+ # :crop => (21..37))
276
+ attr_reader :crop
277
+
278
+ # The fixed-width schema, given as a multi-dimensional array.
279
+ #
280
+ # @return [Array<Array{String,Integer,Hash}>]
281
+ #
282
+ # @example From the tests
283
+ # RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt',
284
+ # :format => :fixed_width,
285
+ # :skip => 1,
286
+ # :schema => [[ 'header4', 10, { :type => :string } ],
287
+ # [ 'spacer', 1 ],
288
+ # [ 'header5', 10, { :type => :string } ],
289
+ # [ 'spacer', 12 ],
290
+ # [ 'header6', 10, { :type => :string } ]])
291
+ attr_reader :schema
38
292
 
39
- # Create a new RemoteTable.
293
+ # The name of a fixed-width schema that has already been defined elsewhere, so that it can be re-used.
294
+ # @return [String,Symbol]
295
+ attr_reader :schema_name
296
+
297
+ # A proc that decides whether to include a row. Previously passed as +:select+.
298
+ # @return [Proc]
299
+ attr_reader :pre_select
300
+
301
+ # A proc that decides whether to include a row. Previously passed as +:reject+.
302
+ # @return [Proc]
303
+ attr_reader :pre_reject
304
+
305
+ # Settings to create a transformer.
306
+ # @return [Hash]
307
+ attr_reader :transform_settings
308
+
309
+ # A hash of settings to initialize an Errata instance to be used on every row. Previously passed as +:errata+.
310
+ #
311
+ # See the Errata library at https://github.com/seamusabshere/errata
312
+ #
313
+ # @return [Hash]
314
+ attr_reader :errata_settings
315
+
316
+ # The format of the source file. Can be specified as: :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml
317
+ #
318
+ # Note: treats all +docs.google.com+ and +spreadsheets.google.com+ URLs as +:delimited+.
40
319
  #
41
- # RemoteTable.new(url, options = {})
320
+ # Default: guessed from file extension (which is usually the same as the URL, but sometimes not if you pick out a specific file from an archive)
42
321
  #
43
- # New syntax:
44
- # RemoteTable.new('www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx', :foo => 'bar')
45
- # Old syntax:
46
- # RemoteTable.new(:url => 'www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx', :foo => 'bar')
322
+ # @return [Hash]
323
+ attr_reader :format
324
+
325
+ # Options passed by the user that may be passed through to the underlying parsing library.
326
+ # @return [Hash]
327
+ attr_reader :other_options
328
+
329
+ # Create a new RemoteTable, which is an Enumerable.
330
+ #
331
+ # Does not immediately download/parse... it's lazy-loading.
332
+ #
333
+ # @overload initialize(settings)
334
+ # @param [Hash] settings Settings including +:url+.
47
335
  #
48
- # See the <tt>Config</tt> object for the sorts of options you can pass.
336
+ # @overload initialize(url, settings)
337
+ # @param [String] url The URL to the local or remote file.
338
+ # @param [Hash] settings Settings.
339
+ #
340
+ # @example Open an XLSX
341
+ # RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')
342
+ #
343
+ # @example Open a CSV inside a ZIP file
344
+ # RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
345
+ # :filename => 'Annex Tables/Annex 3/Table A-93.csv',
346
+ # :skip => 1,
347
+ # :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }
49
348
  def initialize(*args)
50
- options = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}
51
-
349
+ @download_count_mutex = ::Mutex.new
350
+ @iconv_mutex = ::Mutex.new
351
+ @extend_bang_mutex = ::Mutex.new
352
+ @errata_mutex = ::Mutex.new
353
+
354
+ @cache = []
355
+ @download_count = 0
356
+
357
+ settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}
358
+
52
359
  @url = if args.first.is_a? ::String
53
- args.first.dup
360
+ args.first
54
361
  else
55
- options[:url].dup
362
+ grab settings, :url
363
+ end
364
+ @format = RemoteTable.guess_format grab(settings, :format)
365
+ if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
366
+ @url = RemoteTable.google_spreadsheet_csv_url url
367
+ @format = :delimited
368
+ end
369
+
370
+ @headers = grab settings, :headers
371
+ if headers.is_a?(::Array) and headers.any?(&:blank?)
372
+ raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
56
373
  end
57
- @config = Config.new self, options
374
+
375
+ @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
376
+ @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
377
+
378
+ @streaming = grab settings, :streaming
379
+ @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
380
+ @delimiter = grab settings, :delimiter
381
+ @sheet = grab settings, :sheet
382
+ @keep_blank_rows = grab settings, :keep_blank_rows
383
+ @form_data = grab settings, :form_data
384
+ @skip = grab settings, :skip
385
+ @internal_encoding = grab settings, :internal_encoding
386
+ @row_xpath = grab settings, :row_xpath
387
+ @column_xpath = grab settings, :column_xpath
388
+ @row_css = grab settings, :row_css
389
+ @column_css = grab settings, :column_css
390
+ @glob = grab settings, :glob
391
+ @filename = grab settings, :filename
392
+ @transform_settings = grab settings, :transform_settings
393
+ @cut = grab settings, :cut
394
+ @crop = grab settings, :crop
395
+ @schema = grab settings, :schema
396
+ @schema_name = grab settings, :schema_name
397
+ @pre_select = grab settings, :pre_select
398
+ @pre_reject = grab settings, :pre_reject
399
+ @errata_settings = grab settings, :errata_settings
400
+
401
+ @other_options = settings
402
+
403
+ @transformer = Transformer.new self
404
+ @local_copy = LocalCopy.new self
58
405
  end
59
-
60
- # not thread safe
61
- def each(&blk)
406
+
407
+ # Yield each row.
408
+ #
409
+ # @return [nil]
410
+ #
411
+ # @yield [Hash,Array] A hash or an array depending on whether the RemoteTable has named headers (column names).
412
+ def each
413
+ extend!
62
414
  if fully_cached?
63
- cache.each(&blk)
415
+ cache.each do |row|
416
+ yield row
417
+ end
64
418
  else
65
419
  mark_download!
66
- retval = format.each do |row|
420
+ memo = _each do |row|
67
421
  transformer.transform(row).each do |virtual_row|
68
422
  virtual_row.row_hash = ::HashDigest.hexdigest row
69
- if config.errata
70
- next if config.errata.rejects? virtual_row
71
- config.errata.correct! virtual_row
423
+ if errata
424
+ next if errata.rejects? virtual_row
425
+ errata.correct! virtual_row
426
+ end
427
+ next if pre_select and !pre_select.call(virtual_row)
428
+ next if pre_reject and pre_reject.call(virtual_row)
429
+ unless streaming
430
+ cache.push virtual_row
72
431
  end
73
- next if config.select and !config.select.call(virtual_row)
74
- next if config.reject and config.reject.call(virtual_row)
75
- cache.push virtual_row unless config.streaming
76
432
  yield virtual_row
77
433
  end
78
434
  end
79
- fully_cached! unless config.streaming
80
- retval
435
+ unless streaming
436
+ fully_cached!
437
+ end
438
+ memo
81
439
  end
440
+ nil
82
441
  end
442
+
443
+ # @deprecated
83
444
  alias :each_row :each
84
445
 
446
+ # @return [Array<Hash,Array>] All rows.
85
447
  def to_a
86
448
  if fully_cached?
87
449
  cache.dup
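Note on the hunk above: the new `each` replays rows from the cache once a full pass has completed, and skips caching entirely when `streaming` is set. A minimal sketch of that pattern follows, assuming a hypothetical `CachedRows` class that is not part of remote_table. Unlike the diff, which pushes rows straight into `cache`, this sketch collects rows in a local array and publishes it only after a complete pass, so an interrupted iteration leaves no partial cache behind.

```ruby
# Hypothetical stand-in for a table; this is NOT RemoteTable itself.
class CachedRows
  include Enumerable

  # Number of times the underlying source was actually read.
  attr_reader :read_count

  # source:    any object that responds to #each
  # streaming: when true, rows are never cached and every pass re-reads the source
  def initialize(source, streaming = false)
    @source = source
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @read_count = 0
  end

  # Yield each row; always returns nil, like the documented `each` in the diff.
  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @read_count += 1
      rows = []
      @source.each do |row|
        rows << row unless @streaming
        yield row
      end
      unless @streaming
        # Publish the cache only after a complete pass.
        @cache = rows
        @fully_cached = true
      end
    end
    nil
  end
end
```

With the default (non-streaming) mode, two calls to `to_a` read the source once; with `streaming` set, each call reads it again.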
@@ -89,9 +451,13 @@ class RemoteTable
89
451
  map { |row| row }
90
452
  end
91
453
  end
454
+
455
+ # @deprecated
92
456
  alias :rows :to_a
93
457
 
94
- # Get a row by row number
458
+ # Get a row by row number. Zero-based.
459
+ #
460
+ # @return [Hash,Array]
95
461
  def [](row_number)
96
462
  if fully_cached?
97
463
  cache[row_number]
@@ -100,35 +466,37 @@ class RemoteTable
100
466
  end
101
467
  end
102
468
 
103
- # clear the row cache to save memory
469
+ # Clear the row cache in case it helps your GC.
470
+ #
471
+ # @return [nil]
104
472
  def free
473
+ @fully_cached = false
474
+ @errata = nil
105
475
  cache.clear
106
476
  nil
107
477
  end
108
-
109
- # Used internally to access to a downloaded copy of the file
110
- def local_file
111
- @local_file ||= LocalFile.new self
112
- end
113
-
114
- # Used internally to access to the driver that reads the format
115
- def format
116
- @format ||= config.format.new self
117
- end
118
-
119
- # Used internally to access the transformer (aka parser).
120
- def transformer
121
- @transformer ||= Transformer.new self
478
+
479
+ # @private
480
+ def errata
481
+ @errata || @errata_mutex.synchronize do
482
+ @errata ||= begin
483
+ if defined?(::Errata) and errata_settings.is_a?(::Errata)
484
+ ::Kernel.warn %{[remote_table] Passing :errata_settings as an Errata object is deprecated. Please pass a Hash of settings instead.}
485
+ errata_settings
486
+ elsif errata_settings.is_a?(::Hash)
487
+ ::Errata.new errata_settings
488
+ end
489
+ end
490
+ end
122
491
  end
123
-
124
- attr_reader :download_count
125
-
492
+
126
493
  private
127
494
 
128
495
  def mark_download!
129
- @download_count ||= 0
130
- @download_count += 1
131
- if config.warn_on_multiple_downloads and download_count > 1
496
+ @download_count_mutex.synchronize do
497
+ @download_count += 1
498
+ end
499
+ if warn_on_multiple_downloads and download_count > 1
132
500
  ::Kernel.warn "[remote_table] #{url} has been downloaded #{download_count} times."
133
501
  end
134
502
  end
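Note on the hunk above: `errata` uses a check, lock, check-again shape (`@errata || @errata_mutex.synchronize { @errata ||= ... }`), and the mutexes themselves are created eagerly in `initialize`. A minimal sketch of that shape, assuming a hypothetical `LazyResource` class and `build` helper that are not part of remote_table:

```ruby
require 'thread'

# Hypothetical example class; names are illustrative only.
class LazyResource
  # How many times the expensive build step ran.
  attr_reader :build_count

  def initialize
    # Create the mutex up front so the lock itself is never lazily initialized.
    @mutex = Mutex.new
    @build_count = 0
  end

  def resource
    # Fast path: skip the lock once the value exists.
    @resource || @mutex.synchronize do
      # Re-check under the lock: another thread may have built it while we waited.
      @resource ||= build
    end
  end

  private

  # Stand-in for an expensive construction step.
  def build
    @build_count += 1
    Object.new
  end
end
```

One caveat of this shape: if the built value is legitimately `nil` or `false`, the fast path never succeeds and every call takes the lock and re-runs the build expression.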
@@ -140,8 +508,62 @@ class RemoteTable
140
508
  def fully_cached?
141
509
  !!@fully_cached
142
510
  end
143
-
144
- def cache
145
- @cache ||= []
511
+
512
+ def iconv
513
+ @iconv || @iconv_mutex.synchronize do
514
+ @iconv ||= ::Iconv.new(EXTERNAL_ENCODING_ICONV, internal_encoding)
515
+ end
516
+ end
517
+
518
+ def transliterate_to_utf8(str)
519
+ if str.is_a?(::String)
520
+ [ iconv.iconv(str), iconv.iconv(nil) ].join
521
+ end
522
+ end
523
+
524
+ def assume_utf8(str)
525
+ if str.is_a?(::String) and ::RUBY_VERSION >= '1.9'
526
+ str.encode! EXTERNAL_ENCODING
527
+ else
528
+ str
529
+ end
530
+ end
531
+
532
+ def grab(settings, k)
533
+ user_specified = false
534
+ memo = nil
535
+ if (old_names = OLD_SETTING_NAMES[k]) and old_names.any? { |old_k| settings.has_key?(old_k) }
536
+ user_specified = true
537
+ memo = old_names.map { |old_k| settings.delete(old_k) }.compact.first
538
+ end
539
+ if settings.has_key?(k)
540
+ user_specified = true
541
+ memo = settings.delete k
542
+ end
543
+ if not user_specified and DEFAULT.has_key?(k)
544
+ memo = DEFAULT[k]
545
+ end
546
+ if memo and (valid = VALID[k]) and not valid.include?(memo.to_sym)
547
+ raise ::ArgumentError, %{[remote_table] #{k.inspect} => #{memo.inspect} is not a valid setting. Valid settings are #{valid.inspect}.}
548
+ end
549
+ memo
550
+ end
551
+
552
+ def extend!
553
+ return if @extend_bang
554
+ @extend_bang_mutex.synchronize do
555
+ return if @extend_bang
556
+ @extend_bang = true
557
+ format_module = if format
558
+ RemoteTable.const_get format.to_s.camelcase
559
+ elsif format = RemoteTable.guess_format(local_copy.path)
560
+ @format = format
561
+ RemoteTable.const_get format.to_s.camelcase
562
+ else
563
+ Delimited
564
+ end
565
+ extend format_module
566
+ after_extend if respond_to?(:after_extend)
567
+ end
146
568
  end
147
569
  end
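Note on the hunk above: `grab` resolves one setting at a time. It checks deprecated names first, lets the new name take precedence, falls back to a default only when the user specified nothing, validates against an allowed list, and deletes consumed keys so that whatever remains can be kept as `@other_options`. A minimal sketch of that flow, assuming a hypothetical `SettingsSketch` module; the table contents below are illustrative, since the real `OLD_SETTING_NAMES`, `DEFAULT` and `VALID` constants are not shown in this part of the diff:

```ruby
# Hypothetical module; the lookup tables are assumptions for illustration.
module SettingsSketch
  # new name => deprecated names (these two renames are listed in the CHANGELOG)
  OLD_NAMES = { :pre_select => [:select], :internal_encoding => [:encoding] }
  # Assumed defaults, not taken from the gem.
  DEFAULTS = { :streaming => false, :internal_encoding => 'UTF-8' }
  # Assumed allowed values, not taken from the gem.
  VALID = { :compression => [:gz, :bz2, :zip] }

  # Remove `key` (or a deprecated alias) from `settings` and return its value.
  def self.grab(settings, key)
    user_specified = false
    value = nil
    old_names = OLD_NAMES.fetch(key, [])
    if old_names.any? { |old| settings.key?(old) }
      user_specified = true
      value = old_names.map { |old| settings.delete(old) }.compact.first
    end
    if settings.key?(key)
      # The new name wins over any deprecated alias.
      user_specified = true
      value = settings.delete(key)
    end
    # Defaults apply only when the user passed nothing at all for this key.
    value = DEFAULTS[key] if !user_specified && DEFAULTS.key?(key)
    allowed = VALID[key]
    if value && allowed && !allowed.include?(value.to_sym)
      raise ArgumentError, "#{key.inspect} => #{value.inspect} is not valid; expected one of #{allowed.inspect}"
    end
    value
  end
end
```

Because an explicitly passed `nil` counts as user-specified, it suppresses the default, which matches the `user_specified` flag in the diff.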