olek-libcraigscrape 1.0.3 → 1.1.0

This diff shows the changes between publicly released versions of the package, as they appear in their respective public registries, and is provided for informational purposes only.
data/CHANGELOG CHANGED
@@ -1,34 +1,40 @@
  == Change Log
 
+ === Release 1.1
+ - ruby 1.9.3 support
+ - migrated from rails 2 gems to rails 3
+ - fixed some new parsing bugs introduced by craigslist template changes
+ - Replaced Net:Http with typhoeus
+
  === Release 1.0
  - Replaced hpricot dependency with Nokogiri. Nokogiri should be faster and more reliable. Whoo-hoo!
 
  === Release 0.9.1
  - Added support for posting_has_expired? and expired post recognition
- - Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
+ - Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
 
  === Release 0.9 (Oct 01, 2010)
  - Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
  - Added gem version specifiers to the Gem spec and to the require statements
  - Moved repo to github
- - Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
+ - Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
  - Took all those extra package-building tasts out of the Rakefile since this is 2010 and we only party with gemfiles
  - Ruby 1.9 compatibility adjustments
 
  === Release 0.8.4 (Sep 6, 2010)
  - Someone found a way to screw up hpricot's to_s method (posting1938291834-090610.html) and fixed by added html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. Its a better way to go anyways, and re-wrote a couple incidentals to use the html_source method...
- - Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
+ - Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
 
  === Release 0.8.3 (August 2, 2010)
  - Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you're soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
- - Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
+ - Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
 
  === Release 0.8.2 (April 17, 2010)
  - Found another odd parsing bug. Scrape sample is in 'listing_samples/mia_search_kitten.3.15.10.html', Adjusted CraigScrape::Listings::HEADER_DATE to fix.
  - Craigslist started added <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710
 
  === Release 0.8.1 (Feb 10, 2010)
- - Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
+ - Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
  - Switched to require "active_support" per the deprecation notices
  - Little adjustments to fix the rdoc readibility
 
@@ -83,7 +89,7 @@
  - Adjusted the examples in the readme, added a "require 'rubygems'" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
  - Restructured some of the parsing to be less leinient when scraped values aren't matching their regexp's in the PostSummary
  - It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that wont take no for an answer, unless we get a defineable number of them in a row
- - Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
+ - Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
  - Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there's an error, the user wont miss any results on a re-run
  - Added a FetchError for http requests that don't return 200 or redirect...
  - Adjusted craigwatch to use scrape_until instead of scrape_since, this new approach cuts down on the url fetching by assuming that if we come across something we've already tracked, we dont need to keep going any further. NOTE: We still can't use a 'last_scraped_url' on the TrackedSearch model b/c sometimes posts get deleted.
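Note on the "Replaced Net:Http with typhoeus" entry: a minimal sketch of the kind of Typhoeus-based fetch this implies, with a retry loop in the spirit of the retries_on_fetch_fail accessor mentioned further down the changelog. The helper name, retry count, and options are illustrative assumptions, not the gem's actual internals.

    require 'rubygems'
    require 'typhoeus'

    MAX_RETRIES = 4  # hypothetical cap, stand-in for retries_on_fetch_fail

    # Hypothetical helper: GET the url, follow redirects, retry a few times on failure.
    def fetch_url(url)
      MAX_RETRIES.times do
        response = Typhoeus.get(url, :followlocation => true)
        return response.body if response.code == 200
        sleep 1  # crude pause between attempts
      end
      raise "Unable to fetch #{url} after #{MAX_RETRIES} attempts"
    end

    puts fetch_url('http://miami.craigslist.org/search/sss?query=laptop').length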
data/COPYING.LESSER CHANGED
@@ -1,4 +1,4 @@
- GNU LESSER GENERAL PUBLIC LICENSE
+ GNU LESSER GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
 
  Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
data/README CHANGED
@@ -25,10 +25,10 @@ On the 'miami.craigslist.org' site, using the query "search/sss?query=apple"
  require 'libcraigscrape'
  require 'date'
  require 'pp'
-
+
  miami_cl = CraigScrape.new 'us/fl/miami'
  miami_cl.posts_since(Time.parse('Sep 10'), 'search/sss?query=apple').each do |post|
- pp post
+ pp post
  end
 
  === Scrape Last 225 Craigslist Listings
@@ -38,26 +38,26 @@ On the 'miami.craigslist.org' under the 'apa' category
  require 'rubygems'
  require 'libcraigscrape'
  require 'pp'
-
+
  i=1
  CraigScrape.new('us/fl/miami').each_post('apa') do |post|
  break if i > 225
- i+=1
- pp post
+ i+=1
+ pp post
  end
 
  === Multiple site with multiple section/search enumeration of posts
 
- In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
+ In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
  the 'crg' category and for the search 'artist needed'
 
  require 'rubygems'
  require 'libcraigscrape'
  require 'pp'
-
+
  non_sfl_sites = CraigScrape.new('us/fl', '- us/fl/miami', '- us/fl/keys')
  non_sfl_sites.each_post('crg', 'search/sss?query=artist+needed') do |post|
- pp post
+ pp post
  end
 
  === Scrape Single Craigslist Posting
@@ -66,7 +66,7 @@ This grabs the full details under the specific post http://miami.craigslist.org/
 
  require 'rubygems'
  require 'libcraigscrape'
-
+
  post = CraigScrape::Posting.new 'http://miami.craigslist.org/mdc/sys/1140808860.html'
  puts "(%s) %s:\n %s" % [ post.post_time.strftime('%b %d'), post.title, post.contents_as_plain ]
 
@@ -76,7 +76,7 @@ This grabs the post summaries of the single listings at http://miami.craigslist.
 
  require 'rubygems'
  require 'libcraigscrape'
-
+
  listing = CraigScrape::Listings.new 'http://miami.craigslist.org/search/sss?query=laptop'
  puts 'Found %d posts for the search "laptop" on this page' % listing.posts.length
 
data/Rakefile CHANGED
@@ -1,8 +1,8 @@
  require 'rake'
  require 'rake/clean'
- require 'rake/gempackagetask'
- require 'rake/rdoctask'
+ require 'rdoc/task'
  require 'rake/testtask'
+ require 'rubygems/package_task'
  require 'fileutils'
  require 'tempfile'
 
@@ -11,7 +11,7 @@ include FileUtils
  RbConfig = Config unless defined? RbConfig
 
  NAME = "olek-libcraigscrape"
- VERS = ENV['VERSION'] || "1.0.3"
+ VERS = ENV['VERSION'] || "1.1.0"
  PKG = "#{NAME}-#{VERS}"
 
  RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
@@ -35,15 +35,8 @@ SPEC =
  s.homepage = 'http://www.derosetechnologies.com/community/libcraigscrape'
  s.rubyforge_project = 'libcraigwatch'
  s.files = PKG_FILES
- s.require_paths = ["lib"]
+ s.require_paths = ["lib"]
  s.test_files = FileList['test/test_*.rb']
- s.add_dependency 'nokogiri', '>= 1.4.4'
- s.add_dependency 'htmlentities', '>= 4.0.0'
- s.add_dependency 'activesupport','>= 2.3.0', '< 3'
- s.add_dependency 'activerecord', '>= 2.3.0', '< 3'
- s.add_dependency 'actionmailer', '>= 2.3.0', '< 3'
- s.add_dependency 'kwalify', '>= 0.7.2'
- s.add_dependency 'sqlite3'
  end
 
  desc "Run all the tests"
@@ -61,7 +54,7 @@ Rake::RDocTask.new do |rdoc|
  rdoc.rdoc_files.add RDOC_FILES+Dir.glob('lib/*.rb').sort_by{|a,b| (a == 'lib/libcraigscrape.rb') ? -1 : 0 }
  end
 
- Rake::GemPackageTask.new(SPEC) do |p|
+ Gem::PackageTask.new(SPEC) do |p|
  p.need_tar = false
  p.need_tar_gz = false
  p.need_tar_bz2 = false
@@ -81,45 +74,3 @@ end
  task :uninstall => [:clean] do
  sh %{sudo gem uninstall #{NAME}}
  end
-
- require 'roodi'
- require 'roodi_task'
-
- namespace :code_tests do
- desc "Analyze for code complexity"
- task :flog do
- require 'flog'
-
- flog = Flog.new
- flog.flog_files ['lib']
- threshold = 105
-
- bad_methods = flog.totals.select do |name, score|
- score > threshold
- end
-
- bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
- puts "%8.1f: %s" % [score, name]
- end
-
- puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
- end
-
- desc "Analyze for code duplication"
- require 'flay'
- task :flay do
- threshold = 25
- flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
- flay.process(*Flay.expand_dirs_to_files(['lib']))
-
- flay.report
-
- raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
- end
-
- RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
- end
-
- desc "Run all code tests"
- task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
-
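Pulling the Rakefile changes above together: the release swaps the removed rake/gempackagetask and rake/rdoctask requires for rubygems/package_task and rdoc/task. A condensed, standalone sketch of that setup follows; the spec fields below are placeholders rather than the real PKG_FILES-based spec.

    require 'rubygems/package_task'
    require 'rdoc/task'

    spec = Gem::Specification.new do |s|
      s.name    = 'olek-libcraigscrape'
      s.version = '1.1.0'
      s.summary = 'placeholder summary'     # placeholder, not the gem's real metadata
      s.author  = 'placeholder author'      # placeholder
      s.files   = Dir.glob('lib/**/*.rb')   # stand-in for PKG_FILES
    end

    # Gem::PackageTask (rubygems/package_task) replaces the old Rake::GemPackageTask
    Gem::PackageTask.new(spec) do |pkg|
      pkg.need_tar = false
    end

    # RDoc::Task (rdoc/task) replaces the old rake/rdoctask
    RDoc::Task.new do |rdoc|
      rdoc.rdoc_dir = 'doc'
      rdoc.rdoc_files.include 'README', 'lib/*.rb'
    end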
@@ -25,7 +25,7 @@ mapping:
  mapping:
  "adapter": { type: str, required: yes }
  "dbfile": { type: str, required: no }
- "host": { type: str, required: no }
+ "host": { type: str, required: no }
  "username": { type: str, required: no }
  "password": { type: str, required: no }
  "socket": { type: str, required: no }
@@ -50,7 +50,7 @@ mapping:
  "summary_or_full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
  "location_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
  "location_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
- "sites":
+ "sites":
  type: seq
  required: yes
  sequence:
@@ -62,7 +62,7 @@ mapping:
  sequence:
  - type: str
  unique: yes
- "starting":
+ "starting":
  type: str
  required: no
  pattern: /^[\d]{1,2}\/[\d]{1,2}\/(?:[\d]{2}|[\d]{4})$/
data/bin/craigwatch CHANGED
@@ -1,4 +1,5 @@
- #!/usr/bin/ruby
+ #!/usr/bin/env ruby
+ # encoding: UTF-8
  #
  # =craigwatch - A email-based "post monitoring" solution
  #
@@ -160,9 +161,9 @@ $: << File.dirname(__FILE__) + '/../lib'
 
  require 'rubygems'
 
- gem 'kwalify', '~> 0.7'
- gem 'activerecord', '~> 2.3'
- gem 'actionmailer', '~> 2.3'
+ gem 'kwalify'
+ gem 'activerecord'
+ gem 'actionmailer'
 
  require 'kwalify'
  require 'active_record'
@@ -252,7 +253,7 @@ class CraigReportDefinition #:nodoc:
 
  def starting_at
  (@starting) ?
- Time.parse(@starting) :
+ Time.strptime(@starting, "%m/%d/%Y") :
  Time.now.yesterday.beginning_of_day
  end
 
@@ -290,17 +291,23 @@ class CraigReportDefinition #:nodoc:
  private
 
  def matches_all?(conditions, against)
- against = against.to_a.compact
- (conditions.nil? or conditions.all?{|c| against.any?{|a| match_against c, a } }) ? true : false
+ (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| match_against c, a } }) ? true : false
  end
 
  def doesnt_match_any?(conditions, against)
- against = against.to_a.compact
- (conditions.nil? or conditions.all?{|c| against.any?{|a| !match_against c, a } }) ? true : false
+ (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| !match_against c, a } }) ? true : false
  end
 
  def match_against(condition, against)
- (against.scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+ (CraigScrape::Scraper.he_decode(against).scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+ end
+
+ # This is kind of a hack to deal with ruby 1.9. Really the filtering mechanism
+ # needs to be factored out and tested....
+ def sanitized_against(against)
+ against = against.lines if against.respond_to? :lines
+ against = against.to_a if against.respond_to? :to_a
+ (against.nil?) ? [] : against.compact
  end
  end
  end
@@ -353,24 +360,12 @@ class TrackedPost < ActiveRecord::Base #:nodoc:
  end
 
  class ReportMailer < ActionMailer::Base #:nodoc:
- def report(to, sender, subject_template, report_tmpl)
-
- formatted_subject = Time.now.strftime(subject_template)
-
- recipients to
- from sender
- subject formatted_subject
+ # default :template_path => File.dirname(__FILE__)
 
- generate_view_parts 'craigslist_report', report_tmpl.merge({:subject =>formatted_subject})
- end
-
- def generate_view_parts(view_name, tmpl)
- part( :content_type => "multipart/alternative" ) do |p|
- [
- { :content_type => "text/plain", :body => render_message("#{view_name.to_s}.plain.erb", tmpl) },
- { :content_type => "text/html", :body => render_message("#{view_name.to_s}.html.erb", tmpl.merge({:part_container => p})) }
- ].each { |parms| p.part parms.merge( { :charset => "UTF-8", :transfer_encoding => "7bit" } ) }
- end
+ def report(to, sender, subject_template, report_tmpl)
+ subject = Time.now.strftime subject_template
+ @summaries = report_tmpl[:summaries]
+ mail :to => to, :subject => subject, :from => sender
  end
  end
 
@@ -405,13 +400,14 @@ parser.errors.each do |e|
  end and exit if parser.errors.length > 0
 
  # Initialize Action Mailer:
+ ActionMailer::Base.prepend_view_path(File.dirname(__FILE__))
  ActionMailer::Base.logger = Logger.new STDERR if craig_report.debug_mailer?
  if craig_report.smtp_settings
- ReportMailer.smtp_settings = craig_report.smtp_settings.symbolize_keys
+ ActionMailer::Base.smtp_settings = craig_report.smtp_settings
+ ActionMailer::Base.delivery_method = :smtp
  else
- ReportMailer.delivery_method = :sendmail
+ ActionMailer::Base.delivery_method = :sendmail
  end
- ReportMailer.template_root = File.dirname __FILE__
 
  # Initialize the database:
  ActiveRecord::Base.logger = Logger.new STDERR if craig_report.debug_database?
@@ -517,7 +513,7 @@ report_summaries = craig_report.searches.collect do |search|
  # Now let's add these urls to the database so as to reduce memory overhead.
  # Keep in mind - they're not active until the email goes out.
  # also - we shouldn't have to worry about putting 'irrelevant' posts in the db, since
- # the nbewest are always the first ones parsed:
+ # the newest are always the first ones parsed:
  tracked_listing.posts.create(
  :url => post.url,
  :created_at => newest_post_date
@@ -530,18 +526,10 @@ report_summaries = craig_report.searches.collect do |search|
  end
  end
 
-
+
 
  # Let's flatten the unique'd hash into a more useable array:
- # NOTE: The reason we included a reject is a little complicated, but here's the gist:
- # * We try not to load the whole post if we don't have to
- # * Its possible that we met all the criterion of the passes_filter? with merely a header, and
- # if so we add a url to the summaries stack
- # * Unfortunately, when we later load that post in full, we may find that the post was posting_has_expired?
- # or flagged_for_removal?, etc.
- # * If this was the case, below we'll end up sorting against nil post_dates. This would fail.
- # * So - before we sort, we run a quick reject on nil post_dates
- new_summaries = new_summaries.values.reject{|v| v.post_date.nil? }.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
+ new_summaries = new_summaries.values.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
 
  # Now Let's manage the tracking database:
  if new_summaries.length > 0
@@ -562,13 +550,13 @@ report_summaries = craig_report.searches.collect do |search|
  end
 
  # Time to send the email (maybe):
- unless report_summaries.select { |s| ! s[:postings].empty? }.empty?
- ReportMailer.deliver_report(
+ unless report_summaries.select { |s| !s[:postings].empty? }.empty?
+ ReportMailer.report(
  craig_report.email_to,
  craig_report.email_from,
  craig_report.report_name,
  {:summaries => report_summaries, :definition => craig_report}
- )
+ ).deliver
  end
 
  # Commit (make 'active') all newly created tracked post urls:
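An aside on the sanitized_against helper added above: it papers over a Ruby 1.8 to 1.9 difference, since a 1.8 String responds to to_a (splitting on newlines) while 1.9 drops String#to_a in favor of String#lines. A small standalone illustration; the sample string is arbitrary.

    # On ruby 1.8, `against.to_a.compact` worked because String#to_a split on newlines.
    # On ruby 1.9+, String#to_a is gone, hence the respond_to? probing:
    against = "first line\nsecond line\n"

    against = against.lines if against.respond_to? :lines  # String -> its lines
    against = against.to_a  if against.respond_to? :to_a   # Enumerator -> Array where needed
    p against.compact       # => ["first line\n", "second line\n"]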
@@ -0,0 +1,17 @@
+ <h2><%=h @subject %></h2>
+ <%@summaries.each do |summary| %>
+ <h3><%=h summary[:search].name%></h3>
+ <% if summary[:postings].length > 0 %>
+ <%summary[:postings].each do |post|%>
+ <%=('<p>%s <a href="%s">%s -</a>%s%s</p>' % [
+ h(post.post_date.strftime('%b %d')),
+ post.url,
+ h(post.label),
+ (post.location) ? '<font size="-1"> (%s)</font>' % h(post.location) : '',
+ (post.has_pic_or_img?) ? ' <span style="color: orange"> img</span>': ''
+ ]).html_safe -%>
+ <% end %>
+ <% else %>
+ <p><i>No new postings were found, which matched the search criteria.</i></p>
+ <% end %>
+ <% end %>
@@ -1,15 +1,15 @@
  CRAIGSLIST REPORTER
 
- <%@summaries.each do |summary| -%>
+ <% @summaries.each do |summary| -%>
  <%=summary[:search].name %>
  <% summary[:postings].collect do |post| -%>
  <% if summary[:postings].length > 0 %>
  <%='%s : %s %s %s %s' % [
- post.post_date.strftime('%b %d'),
- post.label,
- (post.location) ? " (#{post.location})" : '',
- (post.has_pic_or_img?) ? ' [img]': '',
- post.url
+ post.post_date.strftime('%b %d'),
+ post.label,
+ (post.location) ? " (#{post.location})" : '',
+ (post.has_pic_or_img?) ? ' [img]': '',
+ post.url
  ] -%>
  <% else %>
  No new postings were found, which matched the search criteria.
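The ReportMailer rewrite in bin/craigwatch above is the usual Rails 2 to Rails 3 ActionMailer migration: the action builds the message with mail and the caller delivers it explicitly. A self-contained sketch of that calling pattern follows; the :test delivery method and inline :body are illustrative only, since craigwatch itself renders the multipart body from the .erb templates it finds via prepend_view_path.

    require 'rubygems'
    require 'action_mailer'

    ActionMailer::Base.delivery_method = :test   # collect mail in memory for this sketch

    class ReportMailer < ActionMailer::Base
      def report(to, sender, subject_template, report_tmpl)
        @summaries = report_tmpl[:summaries]     # instance variables are exposed to the views
        mail :to      => to,
             :from    => sender,
             :subject => Time.now.strftime(subject_template),
             :body    => "#{@summaries.length} search summaries"  # stand-in for the erb views
      end
    end

    # Rails 3 style: the mailer action returns a Mail::Message; delivery is a separate step.
    ReportMailer.report('you@example.com', 'craigwatch@example.com',
                        'Craigslist Report %m/%d/%y', :summaries => []).deliver

    puts ActionMailer::Base.deliveries.size      # => 1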
data/lib/geo_listings.rb CHANGED
@@ -1,19 +1,19 @@
  # = About geo_listings.rb
  #
  # This file contains the parsing code, and logic relating to geographic site pages and paths. You
- # should never need to include this file directly, as all of libcraigscrape's objects and methods
+ # should never need to include this file directly, as all of libcraigscrape's objects and methods
  # are loaded when you use <tt>require 'libcraigscrape'</tt> in your code.
  #
 
  require 'scraper'
 
  class CraigScrape
-
- # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
+
+ # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
  # These list all the craigslist sites in a given region.
  class GeoListings < Scraper
  GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
-
+
  LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
  PATH_SCANNER = /(?:\\\/|[^\/])+/
  URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
@@ -31,18 +31,18 @@ class CraigScrape
  # Validate that required fields are present, at least - if we've downloaded it from a url
  parse_error! unless location
  end
-
+
  # Returns the GeoLocation's full name
  def location
  unless @location
  cursor = html % 'h3 > b > a:first-of-type'
- cursor = cursor.next if cursor
+ cursor = cursor.next if cursor
  @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
  end
-
+
  @location
  end
-
+
  # Returns a hash of site name to urls in the current listing
  def sites
  unless @sites
@@ -52,27 +52,27 @@ class CraigScrape
  @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
  end
  end
-
+
  @sites
  end
-
+
  # This method will return an array of all possible sites that match the specified location path.
  # Sample location paths:
  # - us/ca
  # - us/fl/miami
  # - jp/fukuoka
  # - mx
- # Here's how location paths work.
+ # Here's how location paths work.
  # - The components of the path are to be separated by '/' 's.
  # - Up to (and optionally, not including) the last component, the path should correspond against a valid GeoLocation url with the prefix of 'http://geo.craigslist.org/iso/'
  # - the last component can either be a site's 'prefix' on a GeoLocation page, or, the last component can just be a geolocation page itself, in which case all the sites on that page are selected.
  # - the site prefix is the first dns record in a website listed on a GeoLocation page. (So, for the case of us/fl/miami , the last 'miami' corresponds to the 'south florida' link on {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
  def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
  # the base_url parameter is mostly so we can test this method
-
- # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
+
+ # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
  # these geolocations. Watch what happens when you request us/fl/non-existant/page/here.
- # I also made this a little forgiving in a couple ways not specified with official support, per
+ # I also made this a little forgiving in a couple ways not specified with official support, per
  # the rules above.
  full_path_parts = full_path.scan PATH_SCANNER
 
@@ -82,15 +82,15 @@ class CraigScrape
  full_path_parts.each_with_index do |part, i|
 
  # Let's un-escape the path-part, if needed:
- part.gsub! "\\/", "/"
+ part.gsub! "\\/", "/"
 
  # If they're specifying a single site, this will catch and return it immediately
- site = geo_listing.sites.find{ |n,s|
+ site = geo_listing.sites.find{ |n,s|
  (SITE_PREFIX.match s and $1 == part) or n == part
  } if geo_listing
 
  # This returns the site component of the found array
- return [site.last] if site
+ return [site.last] if site
 
  begin
  # The URI escape is mostly needed to translate the space characters
@@ -109,9 +109,9 @@ class CraigScrape
  geo_listing.sites.collect{|n,s| s }
  end
 
- # find_sites takes a single array of strings as an argument. Each string is to be either a location path
+ # find_sites takes a single array of strings as an argument. Each string is to be either a location path
  # (see sites_in_path), or a full site (in canonical form - ie "memphis.craigslist.org"). Optionally,
- # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
+ # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
  # include sites from the master list, or remove them from the list. If no '+' or'-' is
  # specified, the default assumption is '+'. Strings are processed from left to right, which gives
  # a high degree of control over the selection set. Examples:
@@ -122,23 +122,23 @@ class CraigScrape
  # There's a lot of flexibility here, you get the idea.
  def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
  ret = []
-
+
  specs.each do |spec|
  (op,spec = $1,$2) if FIND_SITES_PARTS.match spec
 
- spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
+ spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
 
  (op == '-') ? ret -= spec : ret |= spec
  end
-
+
  ret
  end
 
  private
-
+
  def self.bad_geo_path!(path)
  raise BadGeoListingPath, "Unable to load path #{path.inspect}, either you're having problems connecting to Craiglist, or your path is invalid."
  end
-
+
  end
  end