olek-libcraigscrape 1.0.3 → 1.1.0

data/CHANGELOG CHANGED
@@ -1,34 +1,40 @@
  == Change Log

+ === Release 1.1
+ - ruby 1.9.3 support
+ - migrated from rails 2 gems to rails 3
+ - fixed some new parsing bugs introduced by craigslist template changes
+ - Replaced Net:Http with typhoeus
+
  === Release 1.0
  - Replaced hpricot dependency with Nokogiri. Nokogiri should be faster and more reliable. Whoo-hoo!

  === Release 0.9.1
  - Added support for posting_has_expired? and expired post recognition
- - Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
+ - Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering

  === Release 0.9 (Oct 01, 2010)
  - Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
  - Added gem version specifiers to the Gem spec and to the require statements
  - Moved repo to github
- - Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
+ - Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
  - Took all those extra package-building tasts out of the Rakefile since this is 2010 and we only party with gemfiles
  - Ruby 1.9 compatibility adjustments

  === Release 0.8.4 (Sep 6, 2010)
  - Someone found a way to screw up hpricot's to_s method (posting1938291834-090610.html) and fixed by added html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. Its a better way to go anyways, and re-wrote a couple incidentals to use the html_source method...
- - Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
+ - Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior

  === Release 0.8.3 (August 2, 2010)
  - Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you're soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
- - Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
+ - Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10

  === Release 0.8.2 (April 17, 2010)
  - Found another odd parsing bug. Scrape sample is in 'listing_samples/mia_search_kitten.3.15.10.html', Adjusted CraigScrape::Listings::HEADER_DATE to fix.
  - Craigslist started added <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710

  === Release 0.8.1 (Feb 10, 2010)
- - Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
+ - Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
  - Switched to require "active_support" per the deprecation notices
  - Little adjustments to fix the rdoc readibility

@@ -83,7 +89,7 @@
  - Adjusted the examples in the readme, added a "require 'rubygems'" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
  - Restructured some of the parsing to be less leinient when scraped values aren't matching their regexp's in the PostSummary
  - It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that wont take no for an answer, unless we get a defineable number of them in a row
- - Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
+ - Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
  - Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there's an error, the user wont miss any results on a re-run
  - Added a FetchError for http requests that don't return 200 or redirect...
  - Adjusted craigwatch to use scrape_until instead of scrape_since, this new approach cuts down on the url fetching by assuming that if we come across something we've already tracked, we dont need to keep going any further. NOTE: We still can't use a 'last_scraped_url' on the TrackedSearch model b/c sometimes posts get deleted.
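
Release 1.1 above notes that Net::HTTP fetching was replaced with Typhoeus. The sketch below only illustrates the Typhoeus request/Hydra API in general terms; it is not the gem's actual fetch code, and the URL is simply the laptop-search example from the README:

    require 'typhoeus'

    # Build a request, queue it on a Hydra, and run it. The response object
    # exposes roughly what the old Net::HTTP code needed: code, headers, body.
    request = Typhoeus::Request.new 'http://miami.craigslist.org/search/sss?query=laptop'
    hydra   = Typhoeus::Hydra.new
    hydra.queue request
    hydra.run

    response = request.response
    puts response.code
    puts response.body.length
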
data/COPYING.LESSER CHANGED
@@ -1,4 +1,4 @@
- GNU LESSER GENERAL PUBLIC LICENSE
+ GNU LESSER GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007

  Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
data/README CHANGED
@@ -25,10 +25,10 @@ On the 'miami.craigslist.org' site, using the query "search/sss?query=apple"
  require 'libcraigscrape'
  require 'date'
  require 'pp'
-
+
  miami_cl = CraigScrape.new 'us/fl/miami'
  miami_cl.posts_since(Time.parse('Sep 10'), 'search/sss?query=apple').each do |post|
- pp post
+ pp post
  end

  === Scrape Last 225 Craigslist Listings
@@ -38,26 +38,26 @@ On the 'miami.craigslist.org' under the 'apa' category
  require 'rubygems'
  require 'libcraigscrape'
  require 'pp'
-
+
  i=1
  CraigScrape.new('us/fl/miami').each_post('apa') do |post|
  break if i > 225
- i+=1
- pp post
+ i+=1
+ pp post
  end

  === Multiple site with multiple section/search enumeration of posts

- In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
+ In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
  the 'crg' category and for the search 'artist needed'

  require 'rubygems'
  require 'libcraigscrape'
  require 'pp'
-
+
  non_sfl_sites = CraigScrape.new('us/fl', '- us/fl/miami', '- us/fl/keys')
  non_sfl_sites.each_post('crg', 'search/sss?query=artist+needed') do |post|
- pp post
+ pp post
  end

  === Scrape Single Craigslist Posting
@@ -66,7 +66,7 @@ This grabs the full details under the specific post http://miami.craigslist.org/

  require 'rubygems'
  require 'libcraigscrape'
-
+
  post = CraigScrape::Posting.new 'http://miami.craigslist.org/mdc/sys/1140808860.html'
  puts "(%s) %s:\n %s" % [ post.post_time.strftime('%b %d'), post.title, post.contents_as_plain ]

@@ -76,7 +76,7 @@ This grabs the post summaries of the single listings at http://miami.craigslist.

  require 'rubygems'
  require 'libcraigscrape'
-
+
  listing = CraigScrape::Listings.new 'http://miami.craigslist.org/search/sss?query=laptop'
  puts 'Found %d posts for the search "laptop" on this page' % listing.posts.length

data/Rakefile CHANGED
@@ -1,8 +1,8 @@
  require 'rake'
  require 'rake/clean'
- require 'rake/gempackagetask'
- require 'rake/rdoctask'
+ require 'rdoc/task'
  require 'rake/testtask'
+ require 'rubygems/package_task'
  require 'fileutils'
  require 'tempfile'

@@ -11,7 +11,7 @@ include FileUtils
  RbConfig = Config unless defined? RbConfig

  NAME = "olek-libcraigscrape"
- VERS = ENV['VERSION'] || "1.0.3"
+ VERS = ENV['VERSION'] || "1.1.0"
  PKG = "#{NAME}-#{VERS}"

  RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
@@ -35,15 +35,8 @@ SPEC =
  s.homepage = 'http://www.derosetechnologies.com/community/libcraigscrape'
  s.rubyforge_project = 'libcraigwatch'
  s.files = PKG_FILES
- s.require_paths = ["lib"]
+ s.require_paths = ["lib"]
  s.test_files = FileList['test/test_*.rb']
- s.add_dependency 'nokogiri', '>= 1.4.4'
- s.add_dependency 'htmlentities', '>= 4.0.0'
- s.add_dependency 'activesupport','>= 2.3.0', '< 3'
- s.add_dependency 'activerecord', '>= 2.3.0', '< 3'
- s.add_dependency 'actionmailer', '>= 2.3.0', '< 3'
- s.add_dependency 'kwalify', '>= 0.7.2'
- s.add_dependency 'sqlite3'
  end

  desc "Run all the tests"
@@ -61,7 +54,7 @@ Rake::RDocTask.new do |rdoc|
  rdoc.rdoc_files.add RDOC_FILES+Dir.glob('lib/*.rb').sort_by{|a,b| (a == 'lib/libcraigscrape.rb') ? -1 : 0 }
  end

- Rake::GemPackageTask.new(SPEC) do |p|
+ Gem::PackageTask.new(SPEC) do |p|
  p.need_tar = false
  p.need_tar_gz = false
  p.need_tar_bz2 = false
@@ -81,45 +74,3 @@ end
  task :uninstall => [:clean] do
  sh %{sudo gem uninstall #{NAME}}
  end
-
- require 'roodi'
- require 'roodi_task'
-
- namespace :code_tests do
- desc "Analyze for code complexity"
- task :flog do
- require 'flog'
-
- flog = Flog.new
- flog.flog_files ['lib']
- threshold = 105
-
- bad_methods = flog.totals.select do |name, score|
- score > threshold
- end
-
- bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
- puts "%8.1f: %s" % [score, name]
- end
-
- puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
- end
-
- desc "Analyze for code duplication"
- require 'flay'
- task :flay do
- threshold = 25
- flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
- flay.process(*Flay.expand_dirs_to_files(['lib']))
-
- flay.report
-
- raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
- end
-
- RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
- end
-
- desc "Run all code tests"
- task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
-
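
The Rakefile changes above track the reshuffling of these helpers out of Rake itself: Rake::GemPackageTask became Gem::PackageTask (from rubygems/package_task) and Rake::RDocTask became RDoc::Task (from rdoc/task). A minimal standalone sketch of the modern pair, independent of this project's SPEC and RDOC_OPTS:

    require 'rubygems/package_task'
    require 'rdoc/task'

    # Placeholder gemspec purely for illustration.
    spec = Gem::Specification.new do |s|
      s.name    = 'example'
      s.version = '0.0.1'
      s.summary = 'illustrative spec'
      s.authors = ['example']
    end

    # `rake gem` / `rake package` -> builds pkg/example-0.0.1.gem
    Gem::PackageTask.new(spec) { |p| p.need_tar = false }

    # `rake rdoc` -> generates API documentation
    RDoc::Task.new { |rd| rd.rdoc_files.include 'lib/**/*.rb' }
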
@@ -25,7 +25,7 @@ mapping:
  mapping:
  "adapter": { type: str, required: yes }
  "dbfile": { type: str, required: no }
- "host": { type: str, required: no }
+ "host": { type: str, required: no }
  "username": { type: str, required: no }
  "password": { type: str, required: no }
  "socket": { type: str, required: no }
@@ -50,7 +50,7 @@ mapping:
  "summary_or_full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
  "location_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
  "location_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
- "sites":
+ "sites":
  type: seq
  required: yes
  sequence:
@@ -62,7 +62,7 @@ mapping:
  sequence:
  - type: str
  unique: yes
- "starting":
+ "starting":
  type: str
  required: no
  pattern: /^[\d]{1,2}\/[\d]{1,2}\/(?:[\d]{2}|[\d]{4})$/
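
The "starting" field above has to match a numeric month/day/year date such as 9/10/2011, and further down in this diff craigwatch's starting_at stops relying on Time.parse and reads that value with an explicit format. A small sketch of the difference, using a made-up date and only Ruby's standard time library, not craigwatch code:

    require 'time'

    # Explicit format: no guessing about day/month order.
    Time.strptime('9/10/2011', '%m/%d/%Y')   # September 10, 2011, midnight local time

    # Time.parse has to infer the format from the string itself,
    # which is where the ambiguity this change avoids comes from.
    Time.parse('9/10/2011')
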
data/bin/craigwatch CHANGED
@@ -1,4 +1,5 @@
- #!/usr/bin/ruby
+ #!/usr/bin/env ruby
+ # encoding: UTF-8
  #
  # =craigwatch - A email-based "post monitoring" solution
  #
@@ -160,9 +161,9 @@ $: << File.dirname(__FILE__) + '/../lib'

  require 'rubygems'

- gem 'kwalify', '~> 0.7'
- gem 'activerecord', '~> 2.3'
- gem 'actionmailer', '~> 2.3'
+ gem 'kwalify'
+ gem 'activerecord'
+ gem 'actionmailer'

  require 'kwalify'
  require 'active_record'
@@ -252,7 +253,7 @@ class CraigReportDefinition #:nodoc:

  def starting_at
  (@starting) ?
- Time.parse(@starting) :
+ Time.strptime(@starting, "%m/%d/%Y") :
  Time.now.yesterday.beginning_of_day
  end

@@ -290,17 +291,23 @@ class CraigReportDefinition #:nodoc:
  private

  def matches_all?(conditions, against)
- against = against.to_a.compact
- (conditions.nil? or conditions.all?{|c| against.any?{|a| match_against c, a } }) ? true : false
+ (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| match_against c, a } }) ? true : false
  end

  def doesnt_match_any?(conditions, against)
- against = against.to_a.compact
- (conditions.nil? or conditions.all?{|c| against.any?{|a| !match_against c, a } }) ? true : false
+ (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| !match_against c, a } }) ? true : false
  end

  def match_against(condition, against)
- (against.scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+ (CraigScrape::Scraper.he_decode(against).scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+ end
+
+ # This is kind of a hack to deal with ruby 1.9. Really the filtering mechanism
+ # needs to be factored out and tested....
+ def sanitized_against(against)
+ against = against.lines if against.respond_to? :lines
+ against = against.to_a if against.respond_to? :to_a
+ (against.nil?) ? [] : against.compact
  end
  end
  end
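
The new sanitized_against helper exists because Ruby 1.9 removed String#to_a (a String is no longer enumerable line by line), so the old against.to_a.compact only worked on 1.8. A standalone sketch of the behavior the helper is aiming for, with hypothetical inputs:

    # Ruby 1.8:  "a\nb".to_a        # => ["a\n", "b"]
    # Ruby 1.9+: "a\nb".to_a        # raises NoMethodError
    #            "a\nb".lines.to_a  # => ["a\n", "b"]

    def sanitized(against)
      against = against.lines if against.respond_to? :lines  # String -> line enumerator
      against = against.to_a  if against.respond_to? :to_a   # Enumerator/Array/nil -> Array
      against.nil? ? [] : against.compact
    end

    sanitized "line one\nline two\n"  # => ["line one\n", "line two\n"]
    sanitized ['x', nil, 'y']         # => ["x", "y"]
    sanitized nil                     # => []
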
@@ -353,24 +360,12 @@ class TrackedPost < ActiveRecord::Base #:nodoc:
  end

  class ReportMailer < ActionMailer::Base #:nodoc:
- def report(to, sender, subject_template, report_tmpl)
-
- formatted_subject = Time.now.strftime(subject_template)
-
- recipients to
- from sender
- subject formatted_subject
+ # default :template_path => File.dirname(__FILE__)

- generate_view_parts 'craigslist_report', report_tmpl.merge({:subject =>formatted_subject})
- end
-
- def generate_view_parts(view_name, tmpl)
- part( :content_type => "multipart/alternative" ) do |p|
- [
- { :content_type => "text/plain", :body => render_message("#{view_name.to_s}.plain.erb", tmpl) },
- { :content_type => "text/html", :body => render_message("#{view_name.to_s}.html.erb", tmpl.merge({:part_container => p})) }
- ].each { |parms| p.part parms.merge( { :charset => "UTF-8", :transfer_encoding => "7bit" } ) }
- end
+ def report(to, sender, subject_template, report_tmpl)
+ subject = Time.now.strftime subject_template
+ @summaries = report_tmpl[:summaries]
+ mail :to => to, :subject => subject, :from => sender
  end
  end

@@ -405,13 +400,14 @@ parser.errors.each do |e|
  end and exit if parser.errors.length > 0

  # Initialize Action Mailer:
+ ActionMailer::Base.prepend_view_path(File.dirname(__FILE__))
  ActionMailer::Base.logger = Logger.new STDERR if craig_report.debug_mailer?
  if craig_report.smtp_settings
- ReportMailer.smtp_settings = craig_report.smtp_settings.symbolize_keys
+ ActionMailer::Base.smtp_settings = craig_report.smtp_settings
+ ActionMailer::Base.delivery_method = :smtp
  else
- ReportMailer.delivery_method = :sendmail
+ ActionMailer::Base.delivery_method = :sendmail
  end
- ReportMailer.template_root = File.dirname __FILE__

  # Initialize the database:
  ActiveRecord::Base.logger = Logger.new STDERR if craig_report.debug_database?
@@ -517,7 +513,7 @@ report_summaries = craig_report.searches.collect do |search|
  # Now let's add these urls to the database so as to reduce memory overhead.
  # Keep in mind - they're not active until the email goes out.
  # also - we shouldn't have to worry about putting 'irrelevant' posts in the db, since
- # the nbewest are always the first ones parsed:
+ # the newest are always the first ones parsed:
  tracked_listing.posts.create(
  :url => post.url,
  :created_at => newest_post_date
@@ -530,18 +526,10 @@ report_summaries = craig_report.searches.collect do |search|
  end
  end

-
+

  # Let's flatten the unique'd hash into a more useable array:
- # NOTE: The reason we included a reject is a little complicated, but here's the gist:
- # * We try not to load the whole post if we don't have to
- # * Its possible that we met all the criterion of the passes_filter? with merely a header, and
- # if so we add a url to the summaries stack
- # * Unfortunately, when we later load that post in full, we may find that the post was posting_has_expired?
- # or flagged_for_removal?, etc.
- # * If this was the case, below we'll end up sorting against nil post_dates. This would fail.
- # * So - before we sort, we run a quick reject on nil post_dates
- new_summaries = new_summaries.values.reject{|v| v.post_date.nil? }.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
+ new_summaries = new_summaries.values.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom

  # Now Let's manage the tracking database:
  if new_summaries.length > 0
@@ -562,13 +550,13 @@ report_summaries = craig_report.searches.collect do |search|
  end

  # Time to send the email (maybe):
- unless report_summaries.select { |s| ! s[:postings].empty? }.empty?
- ReportMailer.deliver_report(
+ unless report_summaries.select { |s| !s[:postings].empty? }.empty?
+ ReportMailer.report(
  craig_report.email_to,
  craig_report.email_from,
  craig_report.report_name,
  {:summaries => report_summaries, :definition => craig_report}
- )
+ ).deliver
  end

  # Commit (make 'active') all newly created tracked post urls:
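
The ReportMailer rewrite and the trailing .deliver call above follow the Rails 2 to Rails 3 ActionMailer API change: instead of setting recipients/from/subject and assembling multipart parts by hand, a Rails 3 mailer method assigns instance variables for its views, calls mail, and returns a Mail::Message that is delivered as a separate step. A generic sketch of that pattern, assuming ActionMailer 3; the class name, addresses, and inline body are illustrative and not craigwatch's actual templates:

    require 'action_mailer'

    ActionMailer::Base.delivery_method = :test  # deliveries collect in ActionMailer::Base.deliveries

    class ExampleMailer < ActionMailer::Base
      # Rails 3 style: ivars feed the views; without the block, the text and
      # HTML parts would come from templates named after this method.
      def report(to, summaries)
        @summaries = summaries
        mail(:to => to, :from => 'noreply@example.com', :subject => 'Example report') do |format|
          format.text { render :text => @summaries.join("\n") }  # inline body instead of a template file
        end
      end
    end

    # The mailer method returns a Mail::Message; delivery happens explicitly.
    ExampleMailer.report('user@example.com', ['hello']).deliver
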
@@ -0,0 +1,17 @@
+ <h2><%=h @subject %></h2>
+ <%@summaries.each do |summary| %>
+ <h3><%=h summary[:search].name%></h3>
+ <% if summary[:postings].length > 0 %>
+ <%summary[:postings].each do |post|%>
+ <%=('<p>%s <a href="%s">%s -</a>%s%s</p>' % [
+ h(post.post_date.strftime('%b %d')),
+ post.url,
+ h(post.label),
+ (post.location) ? '<font size="-1"> (%s)</font>' % h(post.location) : '',
+ (post.has_pic_or_img?) ? ' <span style="color: orange"> img</span>': ''
+ ]).html_safe -%>
+ <% end %>
+ <% else %>
+ <p><i>No new postings were found, which matched the search criteria.</i></p>
+ <% end %>
+ <% end %>
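
The new template above leans on Rails 3's escaped-by-default ERB: user-controlled fields still go through h(), while the hand-assembled markup string is explicitly tagged html_safe so it is rendered as HTML rather than escaped a second time. A tiny standalone sketch of that distinction (ActiveSupport provides String#html_safe):

    require 'erb'
    require 'active_support/core_ext/string/output_safety'  # String#html_safe

    include ERB::Util  # provides h()

    label  = '<b>2 laptops</b> for sale'          # user-controlled text: always escape it
    markup = ('<p>%s</p>' % h(label)).html_safe   # markup we built ourselves: mark it safe once

    puts h(label)  # &lt;b&gt;2 laptops&lt;/b&gt; for sale
    puts markup    # <p>&lt;b&gt;2 laptops&lt;/b&gt; for sale</p>, rendered verbatim by Rails 3 ERB
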
@@ -1,15 +1,15 @@
  CRAIGSLIST REPORTER

- <%@summaries.each do |summary| -%>
+ <% @summaries.each do |summary| -%>
  <%=summary[:search].name %>
  <% summary[:postings].collect do |post| -%>
  <% if summary[:postings].length > 0 %>
  <%='%s : %s %s %s %s' % [
- post.post_date.strftime('%b %d'),
- post.label,
- (post.location) ? " (#{post.location})" : '',
- (post.has_pic_or_img?) ? ' [img]': '',
- post.url
+ post.post_date.strftime('%b %d'),
+ post.label,
+ (post.location) ? " (#{post.location})" : '',
+ (post.has_pic_or_img?) ? ' [img]': '',
+ post.url
  ] -%>
  <% else %>
  No new postings were found, which matched the search criteria.
data/lib/geo_listings.rb CHANGED
@@ -1,19 +1,19 @@
  # = About geo_listings.rb
  #
  # This file contains the parsing code, and logic relating to geographic site pages and paths. You
- # should never need to include this file directly, as all of libcraigscrape's objects and methods
+ # should never need to include this file directly, as all of libcraigscrape's objects and methods
  # are loaded when you use <tt>require 'libcraigscrape'</tt> in your code.
  #

  require 'scraper'

  class CraigScrape
-
- # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
+
+ # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
  # These list all the craigslist sites in a given region.
  class GeoListings < Scraper
  GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
-
+
  LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
  PATH_SCANNER = /(?:\\\/|[^\/])+/
  URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
@@ -31,18 +31,18 @@ class CraigScrape
  # Validate that required fields are present, at least - if we've downloaded it from a url
  parse_error! unless location
  end
-
+
  # Returns the GeoLocation's full name
  def location
  unless @location
  cursor = html % 'h3 > b > a:first-of-type'
- cursor = cursor.next if cursor
+ cursor = cursor.next if cursor
  @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
  end
-
+
  @location
  end
-
+
  # Returns a hash of site name to urls in the current listing
  def sites
  unless @sites
@@ -52,27 +52,27 @@ class CraigScrape
  @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
  end
  end
-
+
  @sites
  end
-
+
  # This method will return an array of all possible sites that match the specified location path.
  # Sample location paths:
  # - us/ca
  # - us/fl/miami
  # - jp/fukuoka
  # - mx
- # Here's how location paths work.
+ # Here's how location paths work.
  # - The components of the path are to be separated by '/' 's.
  # - Up to (and optionally, not including) the last component, the path should correspond against a valid GeoLocation url with the prefix of 'http://geo.craigslist.org/iso/'
  # - the last component can either be a site's 'prefix' on a GeoLocation page, or, the last component can just be a geolocation page itself, in which case all the sites on that page are selected.
  # - the site prefix is the first dns record in a website listed on a GeoLocation page. (So, for the case of us/fl/miami , the last 'miami' corresponds to the 'south florida' link on {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
  def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
  # the base_url parameter is mostly so we can test this method
-
- # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
+
+ # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
  # these geolocations. Watch what happens when you request us/fl/non-existant/page/here.
- # I also made this a little forgiving in a couple ways not specified with official support, per
+ # I also made this a little forgiving in a couple ways not specified with official support, per
  # the rules above.
  full_path_parts = full_path.scan PATH_SCANNER

@@ -82,15 +82,15 @@ class CraigScrape
  full_path_parts.each_with_index do |part, i|

  # Let's un-escape the path-part, if needed:
- part.gsub! "\\/", "/"
+ part.gsub! "\\/", "/"

  # If they're specifying a single site, this will catch and return it immediately
- site = geo_listing.sites.find{ |n,s|
+ site = geo_listing.sites.find{ |n,s|
  (SITE_PREFIX.match s and $1 == part) or n == part
  } if geo_listing

  # This returns the site component of the found array
- return [site.last] if site
+ return [site.last] if site

  begin
  # The URI escape is mostly needed to translate the space characters
@@ -109,9 +109,9 @@ class CraigScrape
  geo_listing.sites.collect{|n,s| s }
  end

- # find_sites takes a single array of strings as an argument. Each string is to be either a location path
+ # find_sites takes a single array of strings as an argument. Each string is to be either a location path
  # (see sites_in_path), or a full site (in canonical form - ie "memphis.craigslist.org"). Optionally,
- # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
+ # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
  # include sites from the master list, or remove them from the list. If no '+' or'-' is
  # specified, the default assumption is '+'. Strings are processed from left to right, which gives
  # a high degree of control over the selection set. Examples:
@@ -122,23 +122,23 @@ class CraigScrape
  # There's a lot of flexibility here, you get the idea.
  def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
  ret = []
-
+
  specs.each do |spec|
  (op,spec = $1,$2) if FIND_SITES_PARTS.match spec

- spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
+ spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)

  (op == '-') ? ret -= spec : ret |= spec
  end
-
+
  ret
  end

  private
-
+
  def self.bad_geo_path!(path)
  raise BadGeoListingPath, "Unable to load path #{path.inspect}, either you're having problems connecting to Craiglist, or your path is invalid."
  end
-
+
  end
  end
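
The find_sites documentation above describes the '+'/'-' spec strings, and the README examples use the same syntax through CraigScrape.new. A usage sketch built only from those documented examples; the actual sites returned depend on what craigslist serves at the time:

    require 'libcraigscrape'

    # Every Florida site except the Miami and Keys sites, per the '-' prefix rules:
    sites = CraigScrape::GeoListings.find_sites ['us/fl', '- us/fl/miami', '- us/fl/keys']
    sites.each { |site| puts site }   # e.g. "jacksonville.craigslist.org", ...
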