olek-libcraigscrape 1.0.3 → 1.1.0
- data/CHANGELOG +12 -6
- data/COPYING.LESSER +1 -1
- data/README +10 -10
- data/Rakefile +5 -54
- data/bin/craig_report_schema.yml +3 -3
- data/bin/craigwatch +32 -44
- data/bin/report_mailer/report.html.erb +17 -0
- data/bin/report_mailer/{craigslist_report.plain.erb → report.text.erb} +6 -6
- data/lib/geo_listings.rb +24 -24
- data/lib/libcraigscrape.rb +6 -11
- data/lib/listings.rb +62 -45
- data/lib/posting.rb +153 -106
- data/lib/scraper.rb +37 -94
- data/test/libcraigscrape_test_helpers.rb +10 -10
- data/test/test_craigslist_geolisting.rb +53 -53
- data/test/test_craigslist_listing.rb +26 -26
- data/test/test_craigslist_posting.rb +39 -38
- metadata +38 -114
- data/bin/report_mailer/craigslist_report.html.erb +0 -17
data/CHANGELOG
CHANGED
@@ -1,34 +1,40 @@
 == Change Log
 
+=== Release 1.1
+- ruby 1.9.3 support
+- migrated from rails 2 gems to rails 3
+- fixed some new parsing bugs introduced by craigslist template changes
+- Replaced Net:Http with typhoeus
+
 === Release 1.0
 - Replaced hpricot dependency with Nokogiri. Nokogiri should be faster and more reliable. Whoo-hoo!
 
 === Release 0.9.1
 - Added support for posting_has_expired? and expired post recognition
-- Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
+- Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
 
 === Release 0.9 (Oct 01, 2010)
 - Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
 - Added gem version specifiers to the Gem spec and to the require statements
 - Moved repo to github
-- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
+- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
 - Took all those extra package-building tasts out of the Rakefile since this is 2010 and we only party with gemfiles
 - Ruby 1.9 compatibility adjustments
 
 === Release 0.8.4 (Sep 6, 2010)
 - Someone found a way to screw up hpricot's to_s method (posting1938291834-090610.html) and fixed by added html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. Its a better way to go anyways, and re-wrote a couple incidentals to use the html_source method...
-- Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
+- Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
 
 === Release 0.8.3 (August 2, 2010)
 - Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you're soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
-- Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
+- Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
 
 === Release 0.8.2 (April 17, 2010)
 - Found another odd parsing bug. Scrape sample is in 'listing_samples/mia_search_kitten.3.15.10.html', Adjusted CraigScrape::Listings::HEADER_DATE to fix.
 - Craigslist started added <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710
 
 === Release 0.8.1 (Feb 10, 2010)
-- Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
+- Found an odd parsing bug occured for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html', Adjusted CraigScrape::Listings::LABEL to fix.
 - Switched to require "active_support" per the deprecation notices
 - Little adjustments to fix the rdoc readibility
 
@@ -83,7 +89,7 @@
 - Adjusted the examples in the readme, added a "require 'rubygems'" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
 - Restructured some of the parsing to be less leinient when scraped values aren't matching their regexp's in the PostSummary
 - It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that wont take no for an answer, unless we get a defineable number of them in a row
-- Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
+- Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
 - Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there's an error, the user wont miss any results on a re-run
 - Added a FetchError for http requests that don't return 200 or redirect...
 - Adjusted craigwatch to use scrape_until instead of scrape_since, this new approach cuts down on the url fetching by assuming that if we come across something we've already tracked, we dont need to keep going any further. NOTE: We still can't use a 'last_scraped_url' on the TrackedSearch model b/c sometimes posts get deleted.
data/COPYING.LESSER
CHANGED
data/README
CHANGED
@@ -25,10 +25,10 @@ On the 'miami.craigslist.org' site, using the query "search/sss?query=apple"
   require 'libcraigscrape'
   require 'date'
   require 'pp'
-
+
   miami_cl = CraigScrape.new 'us/fl/miami'
   miami_cl.posts_since(Time.parse('Sep 10'), 'search/sss?query=apple').each do |post|
-    pp post
+    pp post
   end
 
 === Scrape Last 225 Craigslist Listings
@@ -38,26 +38,26 @@ On the 'miami.craigslist.org' under the 'apa' category
   require 'rubygems'
   require 'libcraigscrape'
   require 'pp'
-
+
   i=1
   CraigScrape.new('us/fl/miami').each_post('apa') do |post|
     break if i > 225
-    i+=1
-    pp post
+    i+=1
+    pp post
   end
 
 === Multiple site with multiple section/search enumeration of posts
 
-In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
+In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
 the 'crg' category and for the search 'artist needed'
 
   require 'rubygems'
   require 'libcraigscrape'
   require 'pp'
-
+
   non_sfl_sites = CraigScrape.new('us/fl', '- us/fl/miami', '- us/fl/keys')
   non_sfl_sites.each_post('crg', 'search/sss?query=artist+needed') do |post|
-    pp post
+    pp post
   end
 
 === Scrape Single Craigslist Posting
@@ -66,7 +66,7 @@ This grabs the full details under the specific post http://miami.craigslist.org/
 
   require 'rubygems'
   require 'libcraigscrape'
-
+
   post = CraigScrape::Posting.new 'http://miami.craigslist.org/mdc/sys/1140808860.html'
   puts "(%s) %s:\n %s" % [ post.post_time.strftime('%b %d'), post.title, post.contents_as_plain ]
 
@@ -76,7 +76,7 @@ This grabs the post summaries of the single listings at http://miami.craigslist.
 
   require 'rubygems'
   require 'libcraigscrape'
-
+
   listing = CraigScrape::Listings.new 'http://miami.craigslist.org/search/sss?query=laptop'
   puts 'Found %d posts for the search "laptop" on this page' % listing.posts.length
 
data/Rakefile
CHANGED
@@ -1,8 +1,8 @@
 require 'rake'
 require 'rake/clean'
-require '
-require 'rake/rdoctask'
+require 'rdoc/task'
 require 'rake/testtask'
+require 'rubygems/package_task'
 require 'fileutils'
 require 'tempfile'
 
@@ -11,7 +11,7 @@ include FileUtils
 RbConfig = Config unless defined? RbConfig
 
 NAME = "olek-libcraigscrape"
-VERS = ENV['VERSION'] || "1.0
+VERS = ENV['VERSION'] || "1.1.0"
 PKG = "#{NAME}-#{VERS}"
 
 RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
@@ -35,15 +35,8 @@ SPEC =
   s.homepage = 'http://www.derosetechnologies.com/community/libcraigscrape'
   s.rubyforge_project = 'libcraigwatch'
   s.files = PKG_FILES
-  s.require_paths = ["lib"]
+  s.require_paths = ["lib"]
   s.test_files = FileList['test/test_*.rb']
-  s.add_dependency 'nokogiri', '>= 1.4.4'
-  s.add_dependency 'htmlentities', '>= 4.0.0'
-  s.add_dependency 'activesupport','>= 2.3.0', '< 3'
-  s.add_dependency 'activerecord', '>= 2.3.0', '< 3'
-  s.add_dependency 'actionmailer', '>= 2.3.0', '< 3'
-  s.add_dependency 'kwalify', '>= 0.7.2'
-  s.add_dependency 'sqlite3'
 end
 
 desc "Run all the tests"
@@ -61,7 +54,7 @@ Rake::RDocTask.new do |rdoc|
   rdoc.rdoc_files.add RDOC_FILES+Dir.glob('lib/*.rb').sort_by{|a,b| (a == 'lib/libcraigscrape.rb') ? -1 : 0 }
 end
 
-
+Gem::PackageTask.new(SPEC) do |p|
   p.need_tar = false
   p.need_tar_gz = false
   p.need_tar_bz2 = false
@@ -81,45 +74,3 @@ end
 task :uninstall => [:clean] do
   sh %{sudo gem uninstall #{NAME}}
 end
-
-require 'roodi'
-require 'roodi_task'
-
-namespace :code_tests do
-  desc "Analyze for code complexity"
-  task :flog do
-    require 'flog'
-
-    flog = Flog.new
-    flog.flog_files ['lib']
-    threshold = 105
-
-    bad_methods = flog.totals.select do |name, score|
-      score > threshold
-    end
-
-    bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
-      puts "%8.1f: %s" % [score, name]
-    end
-
-    puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
-  end
-
-  desc "Analyze for code duplication"
-  require 'flay'
-  task :flay do
-    threshold = 25
-    flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
-    flay.process(*Flay.expand_dirs_to_files(['lib']))
-
-    flay.report
-
-    raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
-  end
-
-  RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
-end
-
-desc "Run all code tests"
-task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
-
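The Rakefile migration above swaps the requires that were removed from rake (rake/rdoctask and the truncated rake/... package-task require) for their current homes, rdoc/task and rubygems/package_task. A minimal standalone sketch of the same setup, assuming the rdoc and rubygems default gems are available; the spec fields here are placeholders, not the gem's real spec:

```ruby
require 'rdoc/task'              # replaces rake/rdoctask
require 'rubygems/package_task'  # replaces the old rake gem-package task

# Placeholder spec, for illustration only:
spec = Gem::Specification.new do |s|
  s.name    = 'example'
  s.version = '0.0.1'
  s.summary = 'packaging sketch'
  s.authors = ['example']
end

# These define the same :rdoc and :package/:gem tasks the old
# Rake::RDocTask / Rake::GemPackageTask classes did:
RDoc::Task.new { |rdoc| rdoc.rdoc_dir = 'doc' }
Gem::PackageTask.new(spec) { |p| p.need_tar = false }
```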
data/bin/craig_report_schema.yml
CHANGED
@@ -25,7 +25,7 @@ mapping:
 mapping:
   "adapter": { type: str, required: yes }
   "dbfile": { type: str, required: no }
-  "host":
+  "host": { type: str, required: no }
   "username": { type: str, required: no }
   "password": { type: str, required: no }
   "socket": { type: str, required: no }
@@ -50,7 +50,7 @@ mapping:
 "summary_or_full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
 "location_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
 "location_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
-"sites":
+"sites":
   type: seq
   required: yes
   sequence:
@@ -62,7 +62,7 @@ mapping:
 sequence:
   - type: str
     unique: yes
-"starting":
+"starting":
   type: str
   required: no
   pattern: /^[\d]{1,2}\/[\d]{1,2}\/(?:[\d]{2}|[\d]{4})$/
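The `starting` field above is constrained by the schema to a month/day date with a 2- or 4-digit year, and craigwatch parses a matching value with Time.strptime (see the starting_at change in craigwatch). A minimal stdlib-only sketch of that validate-then-parse step, with a sample value chosen for illustration:

```ruby
require 'time'

# The same pattern the schema declares for "starting":
# 1-2 digit month and day, 2- or 4-digit year.
STARTING = /^[\d]{1,2}\/[\d]{1,2}\/(?:[\d]{2}|[\d]{4})$/

value = '9/10/2010'  # illustrative input
raise 'starting date fails schema pattern' unless value =~ STARTING

# craigwatch feeds the validated value to Time.strptime with "%m/%d/%Y":
t = Time.strptime(value, '%m/%d/%Y')
puts t.strftime('%Y-%m-%d')  # => 2010-09-10
```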
data/bin/craigwatch
CHANGED
@@ -1,4 +1,5 @@
-#!/usr/bin/ruby
+#!/usr/bin/env ruby
+# encoding: UTF-8
 #
 # =craigwatch - A email-based "post monitoring" solution
 #
@@ -160,9 +161,9 @@ $: << File.dirname(__FILE__) + '/../lib'
 
 require 'rubygems'
 
-gem 'kwalify'
-gem 'activerecord'
-gem 'actionmailer'
+gem 'kwalify'
+gem 'activerecord'
+gem 'actionmailer'
 
 require 'kwalify'
 require 'active_record'
@@ -252,7 +253,7 @@ class CraigReportDefinition #:nodoc:
 
   def starting_at
     (@starting) ?
-      Time.
+      Time.strptime(@starting, "%m/%d/%Y") :
       Time.now.yesterday.beginning_of_day
   end
 
@@ -290,17 +291,23 @@ class CraigReportDefinition #:nodoc:
   private
 
   def matches_all?(conditions, against)
-
-    (conditions.nil? or conditions.all?{|c| against.any?{|a| match_against c, a } }) ? true : false
+    (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| match_against c, a } }) ? true : false
   end
 
   def doesnt_match_any?(conditions, against)
-
-    (conditions.nil? or conditions.all?{|c| against.any?{|a| !match_against c, a } }) ? true : false
+    (conditions.nil? or conditions.all?{|c| sanitized_against(against).any?{|a| !match_against c, a } }) ? true : false
   end
 
   def match_against(condition, against)
-    (against.scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+    (CraigScrape::Scraper.he_decode(against).scan( condition.is_re? ? condition.to_re : /#{condition}/i).length > 0) ? true : false
+  end
+
+  # This is kind of a hack to deal with ruby 1.9. Really the filtering mechanism
+  # needs to be factored out and tested....
+  def sanitized_against(against)
+    against = against.lines if against.respond_to? :lines
+    against = against.to_a if against.respond_to? :to_a
+    (against.nil?) ? [] : against.compact
   end
 end
 end
@@ -353,24 +360,12 @@ class TrackedPost < ActiveRecord::Base #:nodoc:
 end
 
 class ReportMailer < ActionMailer::Base #:nodoc:
-
-
-    formatted_subject = Time.now.strftime(subject_template)
-
-    recipients to
-    from sender
-    subject formatted_subject
+  # default :template_path => File.dirname(__FILE__)
 
-
-
-
-
-    part( :content_type => "multipart/alternative" ) do |p|
-      [
-        { :content_type => "text/plain", :body => render_message("#{view_name.to_s}.plain.erb", tmpl) },
-        { :content_type => "text/html", :body => render_message("#{view_name.to_s}.html.erb", tmpl.merge({:part_container => p})) }
-      ].each { |parms| p.part parms.merge( { :charset => "UTF-8", :transfer_encoding => "7bit" } ) }
-    end
+  def report(to, sender, subject_template, report_tmpl)
+    subject = Time.now.strftime subject_template
+    @summaries = report_tmpl[:summaries]
+    mail :to => to, :subject => subject, :from => sender
   end
 end
 
@@ -405,13 +400,14 @@ parser.errors.each do |e|
 end and exit if parser.errors.length > 0
 
 # Initialize Action Mailer:
+ActionMailer::Base.prepend_view_path(File.dirname(__FILE__))
 ActionMailer::Base.logger = Logger.new STDERR if craig_report.debug_mailer?
 if craig_report.smtp_settings
-
+  ActionMailer::Base.smtp_settings = craig_report.smtp_settings
+  ActionMailer::Base.delivery_method = :smtp
 else
-
+  ActionMailer::Base.delivery_method = :sendmail
 end
-ReportMailer.template_root = File.dirname __FILE__
 
 # Initialize the database:
 ActiveRecord::Base.logger = Logger.new STDERR if craig_report.debug_database?
@@ -517,7 +513,7 @@ report_summaries = craig_report.searches.collect do |search|
   # Now let's add these urls to the database so as to reduce memory overhead.
   # Keep in mind - they're not active until the email goes out.
   # also - we shouldn't have to worry about putting 'irrelevant' posts in the db, since
-  # the
+  # the newest are always the first ones parsed:
   tracked_listing.posts.create(
     :url => post.url,
     :created_at => newest_post_date
@@ -530,18 +526,10 @@ report_summaries = craig_report.searches.collect do |search|
     end
   end
 
-
+
 
   # Let's flatten the unique'd hash into a more useable array:
-
-  # * We try not to load the whole post if we don't have to
-  # * Its possible that we met all the criterion of the passes_filter? with merely a header, and
-  #   if so we add a url to the summaries stack
-  # * Unfortunately, when we later load that post in full, we may find that the post was posting_has_expired?
-  #   or flagged_for_removal?, etc.
-  # * If this was the case, below we'll end up sorting against nil post_dates. This would fail.
-  # * So - before we sort, we run a quick reject on nil post_dates
-  new_summaries = new_summaries.values.reject{|v| v.post_date.nil? }.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
+  new_summaries = new_summaries.values.sort{|a,b| a.post_date <=> b.post_date} # oldest goes to bottom
 
   # Now Let's manage the tracking database:
   if new_summaries.length > 0
@@ -562,13 +550,13 @@ report_summaries = craig_report.searches.collect do |search|
 end
 
 # Time to send the email (maybe):
-unless report_summaries.select { |s| !
-  ReportMailer.
+unless report_summaries.select { |s| !s[:postings].empty? }.empty?
+  ReportMailer.report(
     craig_report.email_to,
     craig_report.email_from,
     craig_report.report_name,
     {:summaries => report_summaries, :definition => craig_report}
-  )
+  ).deliver
 end
 
 # Commit (make 'active') all newly created tracked post urls:
data/bin/report_mailer/report.html.erb
ADDED
@@ -0,0 +1,17 @@
+<h2><%=h @subject %></h2>
+<%@summaries.each do |summary| %>
+<h3><%=h summary[:search].name%></h3>
+<% if summary[:postings].length > 0 %>
+<%summary[:postings].each do |post|%>
+<%=('<p>%s <a href="%s">%s -</a>%s%s</p>' % [
+  h(post.post_date.strftime('%b %d')),
+  post.url,
+  h(post.label),
+  (post.location) ? '<font size="-1"> (%s)</font>' % h(post.location) : '',
+  (post.has_pic_or_img?) ? ' <span style="color: orange"> img</span>': ''
+]).html_safe -%>
+<% end %>
+<% else %>
+<p><i>No new postings were found, which matched the search criteria.</i></p>
+<% end %>
+<% end %>
data/bin/report_mailer/{craigslist_report.plain.erb → report.text.erb}
RENAMED
@@ -1,15 +1,15 @@
 CRAIGSLIST REPORTER
 
-
+<% @summaries.each do |summary| -%>
 <%=summary[:search].name %>
 <% summary[:postings].collect do |post| -%>
 <% if summary[:postings].length > 0 %>
 <%='%s : %s %s %s %s' % [
-
-
-
-
-
+  post.post_date.strftime('%b %d'),
+  post.label,
+  (post.location) ? " (#{post.location})" : '',
+  (post.has_pic_or_img?) ? ' [img]': '',
+  post.url
 ] -%>
 <% else %>
 No new postings were found, which matched the search criteria.
data/lib/geo_listings.rb
CHANGED
@@ -1,19 +1,19 @@
 # = About geo_listings.rb
 #
 # This file contains the parsing code, and logic relating to geographic site pages and paths. You
-# should never need to include this file directly, as all of libcraigscrape's objects and methods
+# should never need to include this file directly, as all of libcraigscrape's objects and methods
 # are loaded when you use <tt>require 'libcraigscrape'</tt> in your code.
 #
 
 require 'scraper'
 
 class CraigScrape
-
-  # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
+
+  # GeoListings represents a parsed Craigslist geo lisiting page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
   # These list all the craigslist sites in a given region.
   class GeoListings < Scraper
     GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
-
+
     LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
     PATH_SCANNER = /(?:\\\/|[^\/])+/
     URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
@@ -31,18 +31,18 @@ class CraigScrape
       # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! unless location
     end
-
+
     # Returns the GeoLocation's full name
     def location
       unless @location
        cursor = html % 'h3 > b > a:first-of-type'
-       cursor = cursor.next if cursor
+       cursor = cursor.next if cursor
        @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
       end
-
+
       @location
     end
-
+
     # Returns a hash of site name to urls in the current listing
     def sites
       unless @sites
@@ -52,27 +52,27 @@ class CraigScrape
        @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
       end
      end
-
+
      @sites
     end
-
+
     # This method will return an array of all possible sites that match the specified location path.
     # Sample location paths:
     # - us/ca
     # - us/fl/miami
     # - jp/fukuoka
     # - mx
-    # Here's how location paths work.
+    # Here's how location paths work.
     # - The components of the path are to be separated by '/' 's.
     # - Up to (and optionally, not including) the last component, the path should correspond against a valid GeoLocation url with the prefix of 'http://geo.craigslist.org/iso/'
     # - the last component can either be a site's 'prefix' on a GeoLocation page, or, the last component can just be a geolocation page itself, in which case all the sites on that page are selected.
     # - the site prefix is the first dns record in a website listed on a GeoLocation page. (So, for the case of us/fl/miami , the last 'miami' corresponds to the 'south florida' link on {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
     def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
       # the base_url parameter is mostly so we can test this method
-
-      # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
+
+      # Unfortunately - the easiest way to understand much of this is to see how craigslist returns
       # these geolocations. Watch what happens when you request us/fl/non-existant/page/here.
-      # I also made this a little forgiving in a couple ways not specified with official support, per
+      # I also made this a little forgiving in a couple ways not specified with official support, per
       # the rules above.
       full_path_parts = full_path.scan PATH_SCANNER
 
@@ -82,15 +82,15 @@ class CraigScrape
       full_path_parts.each_with_index do |part, i|
 
         # Let's un-escape the path-part, if needed:
-        part.gsub! "\\/", "/"
+        part.gsub! "\\/", "/"
 
         # If they're specifying a single site, this will catch and return it immediately
-        site = geo_listing.sites.find{ |n,s|
+        site = geo_listing.sites.find{ |n,s|
          (SITE_PREFIX.match s and $1 == part) or n == part
         } if geo_listing
 
         # This returns the site component of the found array
-        return [site.last] if site
+        return [site.last] if site
 
         begin
           # The URI escape is mostly needed to translate the space characters
@@ -109,9 +109,9 @@ class CraigScrape
       geo_listing.sites.collect{|n,s| s }
     end
 
-    # find_sites takes a single array of strings as an argument. Each string is to be either a location path
+    # find_sites takes a single array of strings as an argument. Each string is to be either a location path
     # (see sites_in_path), or a full site (in canonical form - ie "memphis.craigslist.org"). Optionally,
-    # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
+    # each of this may/should contain a '+' or '-' prefix to indicate whether the string is supposed to
     # include sites from the master list, or remove them from the list. If no '+' or'-' is
     # specified, the default assumption is '+'. Strings are processed from left to right, which gives
     # a high degree of control over the selection set. Examples:
@@ -122,23 +122,23 @@ class CraigScrape
     # There's a lot of flexibility here, you get the idea.
     def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
       ret = []
-
+
       specs.each do |spec|
        (op,spec = $1,$2) if FIND_SITES_PARTS.match spec
 
-        spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
+        spec = (spec.include? '.') ? [spec] : sites_in_path(spec, base_url)
 
        (op == '-') ? ret -= spec : ret |= spec
       end
-
+
       ret
     end
 
     private
-
+
     def self.bad_geo_path!(path)
       raise BadGeoListingPath, "Unable to load path #{path.inspect}, either you're having problems connecting to Craiglist, or your path is invalid."
     end
-
+
   end
 end
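The find_sites documentation above describes a small left-to-right set algebra: each spec either adds sites to the running set (the default, or an explicit '+') or removes them ('-'), and a spec containing a '.' is taken as a literal hostname rather than a location path to expand. A hypothetical standalone sketch of just that selection logic, with the network lookup stubbed out; resolve_site_specs and the expand stub are illustrative, not the gem's API:

```ruby
# Process specs left to right: a '-' prefix removes sites from the running
# set, anything else unions them in; specs containing '.' are literal
# hostnames, others are location paths expanded via the supplied lookup.
def resolve_site_specs(specs, expand)
  specs.inject([]) do |ret, spec|
    op, body = spec =~ /\A([+\-])\s*(.+)\z/ ? [$1, $2] : [nil, spec]
    sites = body.include?('.') ? [body] : expand.call(body)
    op == '-' ? ret - sites : ret | sites
  end
end

# Stub standing in for GeoListings.sites_in_path, which really scrapes
# http://geo.craigslist.org/iso/... (hostnames here are examples):
expand = lambda do |path|
  { 'us/fl'       => %w[miami.craigslist.org keys.craigslist.org orlando.craigslist.org],
    'us/fl/miami' => %w[miami.craigslist.org] }[path] || []
end

p resolve_site_specs(['us/fl', '- us/fl/miami'], expand)
# => ["keys.craigslist.org", "orlando.craigslist.org"]
```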