libcraigscrape 0.7.0 → 0.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/CHANGELOG +19 -0
- data/README +27 -11
- data/Rakefile +44 -2
- data/bin/craig_report_schema.yml +30 -21
- data/bin/craigwatch +232 -67
- data/bin/report_mailer/craigslist_report.html.erb +12 -9
- data/bin/report_mailer/craigslist_report.plain.erb +4 -1
- data/lib/geo_listings.rb +144 -0
- data/lib/libcraigscrape.rb +158 -650
- data/lib/listings.rb +144 -0
- data/lib/posting.rb +293 -0
- data/lib/scraper.rb +203 -0
- data/test/geolisting_samples/hierarchy_test071009/index.html +31 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/ft%20myers%20%5C/%20SW%20florida/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/ft%20myers%20%5C/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/miami/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/miami/nonsense/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/miami/nonsense/more-nonsense/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/nonexist/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/nonsense/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/fl/south%20florida/index.html +46 -0
- data/test/geolisting_samples/hierarchy_test071009/us/index.html +355 -0
- data/test/test_craigslist_geolisting.rb +476 -380
- metadata +28 -2
data/CHANGELOG
CHANGED
@@ -1,5 +1,24 @@
|
|
1
1
|
== Change Log
|
2
2
|
|
3
|
+
=== Release 0.8.0 (Oct 22, 2009)
|
4
|
+
- Lots of substantial changes to the API & craigwatch (though backwards compatibility is mostly there)
|
5
|
+
- Added :code_tests to the rakefile
|
6
|
+
- Report definitions don't need a full path to the :dbfile, the parameter here can now be relative to the yaml file itself
|
7
|
+
- Added a Listings::next_page
|
8
|
+
- craigwatch: When not specifying a regex in _has or _has_no, we now perform an insensitive search
|
9
|
+
- Created CraigScrape::GeoListings.find_sites, & CraigScrape::GeoListings.sites_in_path methods
|
10
|
+
- <b>Large API changes</b> Added a constructor to CraigScrape, and changed a number of ways that sites are scraped
|
11
|
+
- Changed the format of the craigwatch tracking db - you'll need to delete any db's you already have and let the migrations re-run
|
12
|
+
- craigwatch is *much* more efficient with memory. Feel free to scrape the whole world now!
|
13
|
+
- craigwatch's yml changed a bit - documented in craigwatch
|
14
|
+
- We'll more or less automatically figure out the tracking_database in craigwatch if none is specified (will default to sqlite and auto-generated filename)
|
15
|
+
- craigwatch report_name is optional too now and can largely figure itself out
|
16
|
+
- Added summary_or_full_post_has and summary_or_full_post_has_no as craigwatch report parameters
|
17
|
+
- If a craigwatch search comes up empty - we now indicate that no results were found...
|
18
|
+
- Added location_has, location_has_no to craigwatch
|
19
|
+
- Cleaned up the rdoc to clarify all the new syntax/features
|
20
|
+
- Added Scraper::retries_on_404_fail, Scraper::sleep_between_404_retries to help deal with some of the subtleties in handling connection reset errors different than the 404's
|
21
|
+
|
3
22
|
=== Release 0.7.0 (Jul 5, 2009)
|
4
23
|
- A good bit of refactoring
|
5
24
|
- Eager-loading in the Post object without the need of the full_post method
|
data/README
CHANGED
@@ -17,31 +17,47 @@ Install via RubyGems:
|
|
17
17
|
|
18
18
|
== Usage
|
19
19
|
|
20
|
-
=== Scrape Craigslist Listings since
|
20
|
+
=== Scrape Craigslist Listings since Sep 10
|
21
21
|
|
22
|
-
|
22
|
+
On the 'miami.craigslist.org' site, using the query "search/sss?query=apple"
|
23
23
|
|
24
24
|
require 'rubygems'
|
25
25
|
require 'libcraigscrape'
|
26
26
|
require 'date'
|
27
27
|
require 'pp'
|
28
28
|
|
29
|
-
|
30
|
-
|
31
|
-
pp post
|
29
|
+
miami_cl = CraigScrape.new 'us/fl/miami'
|
30
|
+
miami_cl.posts_since(Time.parse('Sep 10'), 'search/sss?query=apple').each do |post|
|
31
|
+
pp post
|
32
32
|
end
|
33
33
|
|
34
34
|
=== Scrape Last 225 Craigslist Listings
|
35
35
|
|
36
|
-
|
36
|
+
On the 'miami.craigslist.org' under the 'apa' category
|
37
37
|
|
38
38
|
require 'rubygems'
|
39
39
|
require 'libcraigscrape'
|
40
40
|
require 'pp'
|
41
41
|
|
42
|
-
|
43
|
-
|
44
|
-
|
42
|
+
i=1
|
43
|
+
CraigScrape.new('us/fl/miami').each_post('apa') do |post|
|
44
|
+
break if i > 225
|
45
|
+
i+=1
|
46
|
+
pp post
|
47
|
+
end
|
48
|
+
|
49
|
+
=== Multiple sites with multiple section/search enumeration of posts
|
50
|
+
|
51
|
+
In Florida, with the exception of 'miami.craigslist.org' & 'keys.craigslist.org' sites, output each post in
|
52
|
+
the 'crg' category and for the search 'artist needed'
|
53
|
+
|
54
|
+
require 'rubygems'
|
55
|
+
require 'libcraigscrape'
|
56
|
+
require 'pp'
|
57
|
+
|
58
|
+
non_sfl_sites = CraigScrape.new('us/fl', '- us/fl/miami', '- us/fl/keys')
|
59
|
+
non_sfl_sites.each_post('crg', 'search/sss?query=artist+needed') do |post|
|
60
|
+
pp post
|
45
61
|
end
|
46
62
|
|
47
63
|
=== Scrape Single Craigslist Posting
|
@@ -51,7 +67,7 @@ This grabs the full details under the specific post http://miami.craigslist.org/
|
|
51
67
|
require 'rubygems'
|
52
68
|
require 'libcraigscrape'
|
53
69
|
|
54
|
-
post = CraigScrape.
|
70
|
+
post = CraigScrape::Posting.new 'http://miami.craigslist.org/mdc/sys/1140808860.html'
|
55
71
|
puts "(%s) %s:\n %s" % [ post.post_time.strftime('%b %d'), post.title, post.contents_as_plain ]
|
56
72
|
|
57
73
|
=== Scrape Single Craigslist Listing
|
@@ -61,7 +77,7 @@ This grabs the post summaries of the single listings at http://miami.craigslist.
|
|
61
77
|
require 'rubygems'
|
62
78
|
require 'libcraigscrape'
|
63
79
|
|
64
|
-
listing = CraigScrape.
|
80
|
+
listing = CraigScrape::Listings.new 'http://miami.craigslist.org/search/sss?query=laptop'
|
65
81
|
puts 'Found %d posts for the search "laptop" on this page' % listing.posts.length
|
66
82
|
|
67
83
|
== Author
|
data/Rakefile
CHANGED
@@ -11,7 +11,7 @@ include FileUtils
|
|
11
11
|
RbConfig = Config unless defined? RbConfig
|
12
12
|
|
13
13
|
NAME = "libcraigscrape"
|
14
|
-
VERS = ENV['VERSION'] || "0.
|
14
|
+
VERS = ENV['VERSION'] || "0.8.0"
|
15
15
|
PKG = "#{NAME}-#{VERS}"
|
16
16
|
|
17
17
|
RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
|
@@ -53,7 +53,8 @@ Rake::RDocTask.new do |rdoc|
|
|
53
53
|
rdoc.rdoc_dir = 'doc/rdoc'
|
54
54
|
rdoc.options += RDOC_OPTS
|
55
55
|
rdoc.main = "README"
|
56
|
-
|
56
|
+
# NOTE: If you don't put libcraigscrape.rb at the beginning, the rdoc ends up looking a little screwy
|
57
|
+
rdoc.rdoc_files.add RDOC_FILES+Dir.glob('lib/*.rb').sort_by{|a,b| (a == 'lib/libcraigscrape.rb') ? -1 : 0 }
|
57
58
|
end
|
58
59
|
|
59
60
|
Rake::GemPackageTask.new(SPEC) do |p|
|
@@ -77,3 +78,44 @@ task :uninstall => [:clean] do
|
|
77
78
|
sh %{sudo gem uninstall #{NAME}}
|
78
79
|
end
|
79
80
|
|
81
|
+
require 'roodi'
|
82
|
+
require 'roodi_task'
|
83
|
+
|
84
|
+
namespace :code_tests do
|
85
|
+
desc "Analyze for code complexity"
|
86
|
+
task :flog do
|
87
|
+
require 'flog'
|
88
|
+
|
89
|
+
flog = Flog.new
|
90
|
+
flog.flog_files ['lib']
|
91
|
+
threshold = 105
|
92
|
+
|
93
|
+
bad_methods = flog.totals.select do |name, score|
|
94
|
+
score > threshold
|
95
|
+
end
|
96
|
+
|
97
|
+
bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
|
98
|
+
puts "%8.1f: %s" % [score, name]
|
99
|
+
end
|
100
|
+
|
101
|
+
puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
|
102
|
+
end
|
103
|
+
|
104
|
+
desc "Analyze for code duplication"
|
105
|
+
require 'flay'
|
106
|
+
task :flay do
|
107
|
+
threshold = 25
|
108
|
+
flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
|
109
|
+
flay.process(*Flay.expand_dirs_to_files(['lib']))
|
110
|
+
|
111
|
+
flay.report
|
112
|
+
|
113
|
+
raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
|
114
|
+
end
|
115
|
+
|
116
|
+
RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
|
117
|
+
end
|
118
|
+
|
119
|
+
desc "Run all code tests"
|
120
|
+
task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
|
121
|
+
|
data/bin/craig_report_schema.yml
CHANGED
@@ -5,19 +5,19 @@ mapping:
|
|
5
5
|
"debug_mailer": { type: bool, required: no }
|
6
6
|
"debug_craigscrape": { type: bool, required: no }
|
7
7
|
|
8
|
-
"report_name":
|
9
|
-
"email_to":
|
10
|
-
"email_from":
|
8
|
+
"report_name": { type: str, required: no }
|
9
|
+
"email_to": { type: str, required: yes }
|
10
|
+
"email_from": { type: str, required: no }
|
11
11
|
"smtp_settings":
|
12
12
|
type: map
|
13
13
|
required: no
|
14
14
|
mapping:
|
15
|
-
"address":
|
16
|
-
"port":
|
17
|
-
"user_name":
|
18
|
-
"domain":
|
19
|
-
"password":
|
20
|
-
"authentication":
|
15
|
+
"address": { type: str, required: yes }
|
16
|
+
"port": { type: int, required: no, default: 25 }
|
17
|
+
"user_name": { type: str, required: no }
|
18
|
+
"domain": { type: str, required: no }
|
19
|
+
"password": { type: str, required: no }
|
20
|
+
"authentication": { type: str, required: no }
|
21
21
|
"tracking_database":
|
22
22
|
type: map
|
23
23
|
mapping:
|
@@ -34,22 +34,31 @@ mapping:
|
|
34
34
|
- type: map
|
35
35
|
class: CraigReportDefinition::SearchDefinition
|
36
36
|
mapping:
|
37
|
-
"name":
|
38
|
-
"has_image":
|
39
|
-
"newest_first":
|
40
|
-
"price_required":
|
41
|
-
"price_greater_than":
|
42
|
-
"price_less_than":
|
43
|
-
"full_post_has":
|
44
|
-
"full_post_has_no":
|
45
|
-
"summary_post_has":
|
46
|
-
"summary_post_has_no":
|
47
|
-
"
|
37
|
+
"name": {type: str, required: yes, unique: yes}
|
38
|
+
"has_image": {type: bool, required: no}
|
39
|
+
"newest_first": {type: bool, required: no, default: no}
|
40
|
+
"price_required": {type: bool, required: no, default: no}
|
41
|
+
"price_greater_than": {type: int, required: no}
|
42
|
+
"price_less_than": {type: int, required: no}
|
43
|
+
"full_post_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
44
|
+
"full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
45
|
+
"summary_post_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
46
|
+
"summary_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
47
|
+
"summary_or_full_post_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
48
|
+
"summary_or_full_post_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
49
|
+
"location_has": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
50
|
+
"location_has_no": {type: seq, required: no, sequence: [ {type: str, unique: yes} ]}
|
51
|
+
"sites":
|
52
|
+
type: seq
|
53
|
+
required: yes
|
54
|
+
sequence:
|
55
|
+
- type: str
|
56
|
+
unique: yes
|
57
|
+
"listings":
|
48
58
|
type: seq
|
49
59
|
required: yes
|
50
60
|
sequence:
|
51
61
|
- type: str
|
52
|
-
pattern: /^http[s]?\:\/\//
|
53
62
|
unique: yes
|
54
63
|
"starting":
|
55
64
|
type: str
|