olek-libcraigscrape 1.0.3 → 1.1.0
This diff shows the changes between two publicly released versions of this package, as they appear in the package's public registry. It is provided for informational purposes only.
- data/CHANGELOG +12 -6
- data/COPYING.LESSER +1 -1
- data/README +10 -10
- data/Rakefile +5 -54
- data/bin/craig_report_schema.yml +3 -3
- data/bin/craigwatch +32 -44
- data/bin/report_mailer/report.html.erb +17 -0
- data/bin/report_mailer/{craigslist_report.plain.erb → report.text.erb} +6 -6
- data/lib/geo_listings.rb +24 -24
- data/lib/libcraigscrape.rb +6 -11
- data/lib/listings.rb +62 -45
- data/lib/posting.rb +153 -106
- data/lib/scraper.rb +37 -94
- data/test/libcraigscrape_test_helpers.rb +10 -10
- data/test/test_craigslist_geolisting.rb +53 -53
- data/test/test_craigslist_listing.rb +26 -26
- data/test/test_craigslist_posting.rb +39 -38
- metadata +38 -114
- data/bin/report_mailer/craigslist_report.html.erb +0 -17
data/CHANGELOG
CHANGED
@@ -1,34 +1,40 @@
 == Change Log
 
+=== Release 1.1
+- ruby 1.9.3 support
+- migrated from rails 2 gems to rails 3
+- fixed some new parsing bugs introduced by craigslist template changes
+- Replaced Net:Http with typhoeus
+
 === Release 1.0
 - Replaced hpricot dependency with Nokogiri. Nokogiri should be faster and more reliable. Whoo-hoo!
 
 === Release 0.9.1
 - Added support for posting_has_expired? and expired post recognition
-- Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
+- Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
 
 === Release 0.9 (Oct 01, 2010)
 - Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
 - Added gem version specifiers to the Gem spec and to the require statements
 - Moved repo to github
-- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
+- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
 - Took all those extra package-building tasts out of the Rakefile since this is 2010 and we only party with gemfiles
 - Ruby 1.9 compatibility adjustments
 
 === Release 0.8.4 (Sep 6, 2010)
 - Someone found a way to screw up hpricot's to_s method (posting1938291834-090610.html) and fixed by added html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. Its a better way to go anyways, and re-wrote a couple incidentals to use the html_source method...
-- Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
+- Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
 
 === Release 0.8.3 (August 2, 2010)
 - Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you're soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
-- Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
+- Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
 
 === Release 0.8.2 (April 17, 2010)
 - Found another odd parsing bug. Scrape sample is in 'listing_samples/mia_search_kitten.3.15.10.html', Adjusted CraigScrape::Listings::HEADER_DATE to fix.
 - Craigslist started added <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710
 
 === Release 0.8.1 (Feb 10, 2010)
-- Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
+- Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
 - Switched to require "active_support" per the deprecation notices
 - Little adjustments to fix the rdoc readibility
 
@@ -83,7 +89,7 @@
 - Adjusted the examples in the readme, added a "require 'rubygems'" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
 - Restructured some of the parsing to be less leinient when scraped values aren't matching their regexp's in the PostSummary
 - It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that wont take no for an answer, unless we get a defineable number of them in a row
-- Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
+- Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
 - Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there's an error, the user wont miss any results on a re-run
 - Added a FetchError for http requests that don't return 200 or redirect...
 - Adjusted craigwatch to use scrape_until instead of scrape_since, this new approach cuts down on the url fetching by assuming that if we come across something we've already tracked, we dont need to keep going any further. NOTE: We still can't use a 'last_scraped_url' on the TrackedSearch model b/c sometimes posts get deleted.
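Of the Release 1.1 notes above, the swap of Net::HTTP for Typhoeus is the one most visible to downstream code. As a rough illustration of what that swap looks like at a call site (a hypothetical sketch against the current Typhoeus API, not code from this package; the URL is just an example):

  require 'typhoeus'

  # Where Net::HTTP would do: Net::HTTP.get_response(URI.parse(url)).body
  # Typhoeus issues the request through libcurl; the response object
  # exposes #success? and #body:
  response = Typhoeus.get 'http://miami.craigslist.org/search/sss?query=apple'
  puts response.body if response.success?

Typhoeus can also queue many requests on a Typhoeus::Hydra and run them in parallel, which is a common reason for this kind of migration.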
data/COPYING.LESSER
CHANGED
data/README
CHANGED
@@ -25,10 +25,10 @@ On the 'miami.craigslist.org' site, using the query "search/sss?query=apple"
   require 'libcraigscrape'
   require 'date'
   require 'pp'
-
+
   miami_cl = CraigScrape.new 'us/fl/miami'
   miami_cl.posts_since(Time.parse('Sep 10'), 'search/sss?query=apple').each do |post|
-    pp post
+    pp post
   end
 
 === Scrape Last 225 Craigslist Listings
@@ -38,26 +38,26 @@ On the 'miami.craigslist.org' under the 'apa' category
   require 'rubygems'
   require 'libcraigscrape'
   require 'pp'
-
+
   i=1
   CraigScrape.new('us/fl/miami').each_post('apa') do |post|
     break if i > 225
-
-
+    i+=1
+    pp post
   end
 
 === Multiple site with multiple section/search enumeration of posts
 
-In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
+In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
 the 'crg' category and for the search 'artist needed'
 
   require 'rubygems'
   require 'libcraigscrape'
   require 'pp'
-
+
   non_sfl_sites = CraigScrape.new('us/fl', '- us/fl/miami', '- us/fl/keys')
   non_sfl_sites.each_post('crg', 'search/sss?query=artist+needed') do |post|
-
+    pp post
   end
 
 === Scrape Single Craigslist Posting
@@ -66,7 +66,7 @@ This grabs the full details under the specific post http://miami.craigslist.org/
 
   require 'rubygems'
   require 'libcraigscrape'
-
+
   post = CraigScrape::Posting.new 'http://miami.craigslist.org/mdc/sys/1140808860.html'
   puts "(%s) %s:\n %s" % [ post.post_time.strftime('%b %d'), post.title, post.contents_as_plain ]
 
@@ -76,7 +76,7 @@ This grabs the post summaries of the single listings at http://miami.craigslist.
 
   require 'rubygems'
   require 'libcraigscrape'
-
+
   listing = CraigScrape::Listings.new 'http://miami.craigslist.org/search/sss?query=laptop'
   puts 'Found %d posts for the search "laptop" on this page' % listing.posts.length
 
data/Rakefile
CHANGED
@@ -1,8 +1,8 @@
 require 'rake'
 require 'rake/clean'
-require '
-require 'rake/rdoctask'
+require 'rdoc/task'
 require 'rake/testtask'
+require 'rubygems/package_task'
 require 'fileutils'
 require 'tempfile'
 
@@ -11,7 +11,7 @@ include FileUtils
 RbConfig = Config unless defined? RbConfig
 
 NAME = "olek-libcraigscrape"
-VERS = ENV['VERSION'] || "1.0
+VERS = ENV['VERSION'] || "1.1.0"
 PKG = "#{NAME}-#{VERS}"
 
 RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
@@ -35,15 +35,8 @@ SPEC =
   s.homepage = 'http://www.derosetechnologies.com/community/libcraigscrape'
   s.rubyforge_project = 'libcraigwatch'
   s.files = PKG_FILES
-  s.require_paths = ["lib"]
+  s.require_paths = ["lib"]
   s.test_files = FileList['test/test_*.rb']
-  s.add_dependency 'nokogiri', '>= 1.4.4'
-  s.add_dependency 'htmlentities', '>= 4.0.0'
-  s.add_dependency 'activesupport','>= 2.3.0', '< 3'
-  s.add_dependency 'activerecord', '>= 2.3.0', '< 3'
-  s.add_dependency 'actionmailer', '>= 2.3.0', '< 3'
-  s.add_dependency 'kwalify', '>= 0.7.2'
-  s.add_dependency 'sqlite3'
 end
 
 desc "Run all the tests"
@@ -61,7 +54,7 @@ Rake::RDocTask.new do |rdoc|
   rdoc.rdoc_files.add RDOC_FILES+Dir.glob('lib/*.rb').sort_by{|a,b| (a == 'lib/libcraigscrape.rb') ? -1 : 0 }
 end
 
-
+Gem::PackageTask.new(SPEC) do |p|
   p.need_tar = false
   p.need_tar_gz = false
   p.need_tar_bz2 = false
@@ -81,45 +74,3 @@ end
 task :uninstall => [:clean] do
   sh %{sudo gem uninstall #{NAME}}
 end
-
-require 'roodi'
-require 'roodi_task'
-
-namespace :code_tests do
-  desc "Analyze for code complexity"
-  task :flog do
-    require 'flog'
-
-    flog = Flog.new
-    flog.flog_files ['lib']
-    threshold = 105
-
-    bad_methods = flog.totals.select do |name, score|
-      score > threshold
-    end
-
-    bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
-      puts "%8.1f: %s" % [score, name]
-    end
-
-    puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
-  end
-
-  desc "Analyze for code duplication"
-  require 'flay'
-  task :flay do
-    threshold = 25
-    flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
-    flay.process(*Flay.expand_dirs_to_files(['lib']))
-
-    flay.report
-
-    raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
-  end
-
-  RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
-end
-
-desc "Run all code tests"
-task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
-
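The require changes above track two renames in the Ruby/Rake toolchain: Rake::RDocTask (from 'rake/rdoctask') became RDoc::Task in 'rdoc/task', and Rake::GemPackageTask (the truncated removed require was presumably 'rake/gempackagetask') became Gem::PackageTask in 'rubygems/package_task'. A minimal standalone sketch of the replacement task classes (the gemspec path is hypothetical; this Rakefile actually builds its spec inline):

  require 'rdoc/task'
  require 'rubygems/package_task'

  # RDoc::Task is the successor to the deprecated Rake::RDocTask:
  RDoc::Task.new do |rdoc|
    rdoc.rdoc_dir = 'doc'
    rdoc.main = 'README'
    rdoc.rdoc_files.include 'README', 'lib/**/*.rb'
  end

  # Gem::PackageTask succeeds Rake::GemPackageTask and defines the
  # :gem and :package tasks:
  spec = Gem::Specification.load 'example.gemspec' # hypothetical path
  Gem::PackageTask.new(spec) { |p| p.need_tar = false }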
data/bin/craig_report_schema.yml
CHANGED
@@ -25,7 +25,7 @@ mapping:
     mapping:
       "adapter":  { type: str, required: yes }
      "dbfile":   { type: str, required: no }
-      "host":
+      "host":     { type: str, required: no }
       "username": { type: str, required: no }
       "password": { type: str, required: no }
       "socket":   { type: str, required: no }
@@ -50,7 +50,7 @@ mapping:
       "summary_or_full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
       "location_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
       "location_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
-      "sites":
+      "sites":
         type: seq
         required: yes
         sequence:
@@ -62,7 +62,7 @@ mapping:
         sequence:
           - type: str
             unique: yes
-      "starting":
+      "starting":
         type: str
         required: no
         pattern: /^[\d]{1,2}\/[\d]{1,2}\/(?:[\d]{2}|[\d]{4})$/
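craigwatch checks each report definition YAML against this schema with Kwalify before running a scrape. A minimal sketch of that validation flow, assuming Kwalify's standard API (the report file name is a placeholder):

  require 'kwalify'

  schema   = Kwalify::Yaml.load_file 'craig_report_schema.yml'
  document = Kwalify::Yaml.load_file 'my_report.yml' # placeholder

  # Validator#validate returns an array of Kwalify::ValidationError's,
  # each carrying the YAML path and a message:
  errors = Kwalify::Validator.new(schema).validate(document)
  errors.each { |e| puts "[#{e.path}] #{e.message}" } if errors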
data/bin/craigwatch
CHANGED
@@ -1,4 +1,5 @@
-#!/usr/bin/ruby
+#!/usr/bin/env ruby
+# encoding: UTF-8
 #
 # =craigwatch - A email-based "post monitoring" solution
 #
@@ -160,9 +161,9 @@ $: << File.dirname(__FILE__) + '/../lib'
 
 require 'rubygems'
 
-gem 'kwalify'
-gem 'activerecord'
-gem 'actionmailer'
+gem 'kwalify'
+gem 'activerecord'
+gem 'actionmailer'
 
 require 'kwalify'
 require 'active_record'
@@ -252,7 +253,7 @@ class CraigReportDefinition #:nodoc:
 
   def starting_at
     (@starting) ?
-      Time.
+      Time.strptime(@starting, "%m/%d/%Y") :
       Time.now.yesterday.beginning_of_day
   end
 
@@ -290,17 +291,23 @@ class CraigReportDefinition #:nodoc:
   private
 
   def matches_all?(conditions, against)
-
-    (conditions.nil? or conditions.all?{|c| against.any?{|a| match_against c, a } }) ? true : false
+    (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| match_against c, a } }) ? true : false
   end
 
   def doesnt_match_any?(conditions, against)
-
-    (conditions.nil? or conditions.all?{|c| against.any?{|a| !match_against c, a } }) ? true : false
+    (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| !match_against c, a } }) ? true : false
   end
 
   def match_against(condition, against)
-    (against.scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+    (CraigScrape::Scraper.he_decode(against).scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+  end
+
+  # This is kind of a hack to deal with ruby 1.9. Really the filtering mechanism
+  # needs to be factored out and tested....
+  def sanitized_against(against)
+    against = against.lines if against.respond_to? :lines
+    against = against.to_a if against.respond_to? :to_a
+    (against.nil?) ? [] : against.compact
   end
 end
 end
@@ -353,24 +360,12 @@ class TrackedPost < ActiveRecord::Base #:nodoc:
 end
 
 class ReportMailer < ActionMailer::Base #:nodoc:
-
-
-    formatted_subject = Time.now.strftime(subject_template)
-
-    recipients to
-    from sender
-    subject formatted_subject
+  # default :template_path => File.dirname(__FILE__)
 
-
-
-
-
-    part( :content_type => "multipart/alternative" ) do |p|
-      [
-        { :content_type => "text/plain", :body => render_message("#{view_name.to_s}.plain.erb", tmpl) },
-        { :content_type => "text/html", :body => render_message("#{view_name.to_s}.html.erb", tmpl.merge({:part_container => p})) }
-      ].each { |parms| p.part parms.merge( { :charset => "UTF-8", :transfer_encoding => "7bit" } ) }
-    end
+  def report(to, sender, subject_template, report_tmpl)
+    subject = Time.now.strftime subject_template
+    @summaries = report_tmpl[:summaries]
+    mail :to => to, :subject => subject, :from => sender
   end
 end
 
@@ -405,13 +400,14 @@ parser.errors.each do |e|
 end and exit if parser.errors.length > 0
 
 # Initialize Action Mailer:
+ActionMailer::Base.prepend_view_path(File.dirname(__FILE__))
 ActionMailer::Base.logger = Logger.new STDERR if craig_report.debug_mailer?
 if craig_report.smtp_settings
-
+  ActionMailer::Base.smtp_settings = craig_report.smtp_settings
+  ActionMailer::Base.delivery_method = :smtp
 else
-
+  ActionMailer::Base.delivery_method = :sendmail
 end
-ReportMailer.template_root = File.dirname __FILE__
 
 # Initialize the database:
 ActiveRecord::Base.logger = Logger.new STDERR if craig_report.debug_database?
@@ -517,7 +513,7 @@ report_summaries = craig_report.searches.collect do |search|
   # Now let's add these urls to the database so as to reduce memory overhead.
   # Keep in mind - they're not active until the email goes out.
   # also - we shouldn't have to worry about putting 'irrelevant' posts in the db, since
-  # the
+  # the newest are always the first ones parsed:
   tracked_listing.posts.create(
     :url => post.url,
     :created_at => newest_post_date
@@ -530,18 +526,10 @@ report_summaries = craig_report.searches.collect do |search|
     end
   end
 
-
+
 
   # Let's flatten the unique'd hash into a more useable array:
-
-  # * We try not to load the whole post if we don't have to
-  # * Its possible that we met all the criterion of the passes_filter? with merely a header, and
-  #   if so we add a url to the summaries stack
-  # * Unfortunately, when we later load that post in full, we may find that the post was posting_has_expired?
-  #   or flagged_for_removal?, etc.
-  # * If this was the case, below we'll end up sorting against nil post_dates. This would fail.
-  # * So - before we sort, we run a quick reject on nil post_dates
-  new_summaries = new_summaries.values.reject{|v| v.post_date.nil? }.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
+  new_summaries = new_summaries.values.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
 
   # Now Let's manage the tracking database:
   if new_summaries.length > 0
@@ -562,13 +550,13 @@ report_summaries = craig_report.searches.collect do |search|
 end
 
 # Time to send the email (maybe):
-unless report_summaries.select { |s| !
-  ReportMailer.
+unless report_summaries.select { |s| !s[:postings].empty? }.empty?
+  ReportMailer.report(
     craig_report.email_to,
     craig_report.email_from,
     craig_report.report_name,
     {:summaries => report_summaries, :definition => craig_report}
-  )
+  ).deliver
 end
 
 # Commit (make 'active') all newly created tracked post urls:
data/bin/report_mailer/report.html.erb
ADDED
@@ -0,0 +1,17 @@
+<h2><%=h @subject %></h2>
+<%@summaries.each do |summary| %>
+  <h3><%=h summary[:search].name%></h3>
+  <% if summary[:postings].length > 0 %>
+    <%summary[:postings].each do |post|%>
+      <%=('<p>%s <a href="%s">%s -</a>%s%s</p>' % [
+        h(post.post_date.strftime('%b %d')),
+        post.url,
+        h(post.label),
+        (post.location) ? '<font size="-1"> (%s)</font>' % h(post.location) : '',
+        (post.has_pic_or_img?) ? ' <span style="color: orange"> img</span>': ''
+      ]).html_safe -%>
+    <% end %>
+  <% else %>
+    <p><i>No new postings were found, which matched the search criteria.</i></p>
+  <% end %>
+<% end %>
data/bin/report_mailer/{craigslist_report.plain.erb → report.text.erb}
RENAMED
@@ -1,15 +1,15 @@
 CRAIGSLIST REPORTER
 
-
+<% @summaries.each do |summary| -%>
 <%=summary[:search].name %>
 <% summary[:postings].collect do |post| -%>
 <% if summary[:postings].length > 0 %>
 <%='%s : %s %s %s %s' % [
-
-
-
-
-
+  post.post_date.strftime('%b %d'),
+  post.label,
+  (post.location) ? " (#{post.location})" : '',
+  (post.has_pic_or_img?) ? ' [img]': '',
+  post.url
 ] -%>
 <% else %>
 No new postings were found, which matched the search criteria.
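The ReportMailer rewrite above is the core of the rails 2 → rails 3 migration called out in the CHANGELOG: the hand-built part(:content_type => "multipart/alternative") block disappears, and the Rails 3 mailer just assigns instance variables and calls mail, letting ActionMailer find report.text.erb and report.html.erb on the view path and compose the multipart message by naming convention. A condensed sketch of that pattern (addresses are placeholders; it assumes the two templates exist on the view path):

  require 'action_mailer'

  class ReportMailer < ActionMailer::Base
    # Rails 3 style: set ivars for the templates, then call #mail.
    def report(to, sender, subject_template, report_tmpl)
      @summaries = report_tmpl[:summaries]
      mail :to => to, :from => sender,
           :subject => Time.now.strftime(subject_template)
    end
  end

  # Rails 3 mailer methods return a Mail::Message; nothing is sent
  # until #deliver is called -- hence the ).deliver change above:
  ReportMailer.report('to@example.com', 'from@example.com',
                      'Report %m/%d', :summaries => []).deliver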
data/lib/geo_listings.rb
CHANGED
@@ -1,19 +1,19 @@
 # = About geo_listings.rb
 #
 # This file contains the parsing code, and logic relating to geographic site pages and paths. You
-# should never need to include this file directly, as all of libcraigscrape's objects and methods
+# should never need to include this file directly, as all of libcraigscrape's objects and methods
 # are loaded when you use <tt>require 'libcraigscrape'</tt> in your code.
 #
 
 require 'scraper'
 
 class CraigScrape
-
-  # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
+
+  # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
   # These list all the craigslist sites in a given region.
   class GeoListings < Scraper
     GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
-
+
     LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
     PATH_SCANNER = /(?:\\\/|[^\/])+/
     URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
@@ -31,18 +31,18 @@ class CraigScrape
       # Validate that required fields are present, at least - if we've downloaded it from a url
       parse_error! unless location
     end
-
+
     # Returns the GeoLocation's full name
     def location
       unless @location
         cursor = html % 'h3 > b > a:first-of-type'
-        cursor = cursor.next if cursor
+        cursor = cursor.next if cursor
         @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
       end
-
+
       @location
     end
-
+
     # Returns a hash of site name to urls in the current listing
     def sites
       unless @sites
@@ -52,27 +52,27 @@ class CraigScrape
         @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
        end
      end
-
+
      @sites
    end
-
+
    # This method will return an array of all possible sites that match the specified location path.
    # Sample location paths:
    # - us/ca
    # - us/fl/miami
    # - jp/fukuoka
    # - mx
-    # Here's how location paths work.
+    # Here's how location paths work.
    # - The components of the path are to be separated by '/' 's.
    # - Up to (and optionally, not including) the last component, the path should correspond against a valid GeoLocation url with the prefix of 'http://geo.craigslist.org/iso/'
    # - the last component can either be a site's 'prefix' on a GeoLocation page, or, the last component can just be a geolocation page itself, in which case all the sites on that page are selected.
    # - the site prefix is the first dns record in a website listed on a GeoLocation page. (So, for the case of us/fl/miami , the last 'miami' corresponds to the 'south florida' link on {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
    def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
      # the base_url parameter is mostly so we can test this method
-
-      # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
+
+      # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
      # these geolocations. Watch what happens when you request us/fl/non-existant/page/here.
-      # I also made this a little forgiving in a couple ways not specified with official support, per
+      # I also made this a little forgiving in a couple ways not specified with official support, per
      # the rules above.
      full_path_parts = full_path.scan PATH_SCANNER
 
@@ -82,15 +82,15 @@ class CraigScrape
      full_path_parts.each_with_index do |part, i|
 
        # Let's un-escape the path-part, if needed:
-        part.gsub! "\\/", "/"
+        part.gsub! "\\/", "/"
 
        # If they're specifying a single site, this will catch and return it immediately
-        site = geo_listing.sites.find{ |n,s|
+        site = geo_listing.sites.find{ |n,s|
          (SITE_PREFIX.match s and $1 == part) or n == part
        } if geo_listing
 
        # This returns the site component of the found array
-        return [site.last] if site
+        return [site.last] if site
 
        begin
          # The URI escape is mostly needed to translate the space characters
@@ -109,9 +109,9 @@ class CraigScrape
      geo_listing.sites.collect{|n,s| s }
    end
 
-    # find_sites takes a single array of strings as an argument. Each string is to be either a location path
+    # find_sites takes a single array of strings as an argument. Each string is to be either a location path
    # (see sites_in_path), or a full site (in canonical form - ie "memphis.craigslist.org"). Optionally,
-    # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
+    # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
    # include sites from the master list, or remove them from the list. If no '+' or'-' is
    # specified, the default assumption is '+'. Strings are processed from left to right, which gives
    # a high degree of control over the selection set. Examples:
@@ -122,23 +122,23 @@ class CraigScrape
    # There's a lot of flexibility here, you get the idea.
    def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
      ret = []
-
+
      specs.each do |spec|
        (op,spec = $1,$2) if FIND_SITES_PARTS.match spec
 
-        spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
+        spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
 
        (op == '-') ? ret -= spec : ret |= spec
      end
-
+
      ret
    end
 
    private
-
+
    def self.bad_geo_path!(path)
      raise BadGeoListingPath, "Unable to load path #{path.inspect}, either you're having problems connecting to Craiglist, or your path is invalid."
    end
-
+
  end
 end
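The geo_listings.rb changes in this release appear to be whitespace-only in this view, but the find_sites comments above document behavior worth restating: inclusion and exclusion specs are processed left to right against a running list of sites. A short usage sketch of that documented behavior, mirroring the README's Florida example:

  require 'libcraigscrape'

  # Every Florida site, minus South Florida and the Keys; '-' entries
  # subtract from the running result, unprefixed entries add to it:
  sites = CraigScrape::GeoListings.find_sites ['us/fl', '- us/fl/miami', '- us/fl/keys']
  sites.each { |s| puts s } # canonical site names, e.g. "jacksonville.craigslist.org"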