statement 1.9.9 → 2.0
- checksums.yaml +4 -4
- data/README.md +11 -7
- data/lib/statement/scraper.rb +140 -88
- data/lib/statement/version.rb +1 -1
- data/scraper_guide.md +49 -0
- data/spec/butterfield_press.html +407 -0
- data/spec/drupal_press.html +524 -0
- data/spec/ed_perlmutter_press.html +5032 -0
- data/spec/keating_press.html +2211 -0
- data/spec/statement_spec.rb +62 -10
- metadata +11 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 11dcba16755ef54dff1c0c48db50aca841485abd
+  data.tar.gz: dda7e3b05004b1d7bf59411c902ed2a670436914
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 20f0513c7aa7a3d70b9e3c3f8d8eb6c30e66b2b88a3dec326d0516e031d3450b201e7a2db8f98b90d29a1d2fb2a384ed096d9ea9f9940cfa41abf9ef79a84590
+  data.tar.gz: 358a8cfa517caf462ebe20119a81a898ac46b9806b1614e2ed6f34ceabc89bcd5dd40ecf43dc38cd989f8fd17449af2a32e63c2d636cd54f92830919bf569533
data/README.md
CHANGED
@@ -1,10 +1,10 @@
 # Statement
 
-Statement parses RSS feeds and HTML pages containing press releases and other official statements from members of Congress, and produces hashes with information about those pages. It has been tested under Ruby 1.9.
+Statement parses RSS feeds and HTML pages containing press releases and other official statements from members of Congress, and produces hashes with information about those pages. It has been tested under Ruby 1.9.3 and 2.x.
 
 ## Coverage
 
-Statement currently parses press releases for members of the House and Senate. For members with RSS feeds, you can pass the feed URL into Statement. For members without RSS feeds, HTML scrapers are provided, as are methods for
+Statement currently parses press releases for members of the House and Senate. For members with RSS feeds, you can pass the feed URL into Statement. For members without RSS feeds (or with broken ones), HTML scrapers are provided, as are methods for special groups, such as House Republicans. Suggestions are welcomed.
 
 ## Installation
 
@@ -28,7 +28,7 @@ $ gem install statement
 
 ## Usage
 
-Statement provides access to press releases, Facebook status updates and tweets from members of Congress. Most congressional offices have RSS feeds but some require HTML scraping.
+Statement provides access to press releases, Facebook status updates and tweets from members of Congress. Most congressional offices have RSS feeds but some require HTML scraping.
 
 To configure Statement to pull from the Twitter and Facebook APIs, you can pass in configuration values via a hash or a `config.yml` file:
 
@@ -48,7 +48,7 @@ To parse an RSS feed, simply pass the URL to Statement's Feed class:
 ```ruby
 require 'rubygems'
 require 'statement'
-
+
 results = Statement::Feed.from_rss('http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1')
 puts results.first
 {:source=>"http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1", :url=>"http://blumenauer.house.gov/index.php?option=com_content&view=article&id=2203:blumenauer-qwe-need-a-national-system-that-speaks-to-the-transportation-challenges-of-todayq&catid=66:2013-press-releases", :title=>"Blumenauer: "We need a national system that speaks to the transportation challenges of ...", :date=>#<Date: 2013-04-24 ((2456407j,0s,0n),+0s,2299161j)>, :domain=>"blumenauer.house.gov"}
@@ -121,6 +121,8 @@ $ rake test
 
 ## Contributing
 
+Statement would not be nearly the library it is without our contributors, and we sincerely thank them for their generosity and interest in making congressional press release data more available.
+
 1. Fork it
 2. Create your feature branch (`git checkout -b my-new-feature`)
 3. Commit your changes (`git commit -am 'Add some feature'`)
@@ -131,6 +133,8 @@ If you write a new scraper, please use Nokogiri for parsing - see some of the ex
 
 ## Authors
 
-* Derek Willis
-* Jacob Harris
-
+* [Derek Willis](https://github.com/dwillis)
+* [Jacob Harris](https://github.com/harrisj)
+* [Mick O'Brien](https://github.com/mickaobrien)
+* [Tyler Pearson](https://github.com/tylerpearson)
+* [Sam Sweeney](https://github.com/shubik22)
data/lib/statement/scraper.rb
CHANGED
@@ -30,9 +30,9 @@ module Statement
 
     def self.member_methods
       [:crenshaw, :capuano, :cold_fusion, :conaway, :chabot, :freshman_senators, :klobuchar, :billnelson, :crapo, :boxer,
-      :vitter, :inhofe, :
-      :
-      :bennie_thompson, :speier, :poe, :grassley]
+      :vitter, :inhofe, :document_query, :swalwell, :fischer, :clark, :edwards, :culberson_chabot_grisham, :barton,
+      :welch, :sessions, :gabbard, :costa, :farr, :mcclintock, :olson, :schumer, :lamborn, :walden,
+      :bennie_thompson, :speier, :poe, :grassley, :bennet, :shaheen, :keating, :drupal, :jenkins]
     end
 
     def self.committee_methods
@@ -41,21 +41,21 @@ module Statement
 
     def self.member_scrapers
       year = Date.today.year
-      results = [crenshaw, capuano, cold_fusion(year, nil), conaway, chabot, klobuchar(year),
-      document_query(page=1), document_query(page=2), swalwell(page=1), crapo, boxer
-      vitter(year=year), inhofe(year=year), fischer, clark(year=year), edwards, culberson_chabot_grisham(page=1), barton,
-      sessions(year=year), gabbard,
-      poe(year=year, month=0)].flatten
+      results = [crenshaw, capuano, cold_fusion(year, nil), conaway, chabot, klobuchar(year), billnelson(page=0),
+      document_query(page=1), document_query(page=2), swalwell(page=1), crapo, boxer, grassley(page=0),
+      vitter(year=year), inhofe(year=year), fischer, clark(year=year), edwards, culberson_chabot_grisham(page=1), barton, welch,
+      sessions(year=year), gabbard, costa, farr, olson, schumer, lamborn(limit=10), walden, bennie_thompson, speier,
+      poe(year=year, month=0), bennet(page=1), shaheen(page=1), perlmutter, keating, drupal, jenkins].flatten
       results = results.compact
       Utils.remove_generic_urls!(results)
     end
 
     def self.backfill_from_scrapers
       results = [cold_fusion(2012, 0), cold_fusion(2011, 0), cold_fusion(2010, 0), billnelson(year=2012), document_query(page=3),
-      document_query(page=4),
-
-
-
+      document_query(page=4), grassley(page=1), grassley(page=2), grassley(page=3),
+      vitter(year=2012), vitter(year=2011), swalwell(page=2), swalwell(page=3), clark(year=2013), culberson_chabot_grisham(page=2),
+      sessions(year=2013), pryor(page=1), farr(year=2013), farr(year=2012), farr(year=2011),
+      olson(year=2013), schumer(page=2), schumer(page=3), poe(year=2015, month=2),
       poe(year=2015, month=1)].flatten
       Utils.remove_generic_urls!(results)
     end
@@ -391,14 +391,14 @@ module Statement
       results
     end
 
-    def self.billnelson(
+    def self.billnelson(page=0)
       results = []
-
-
-      doc = open_html(year_url)
+      url = "http://www.billnelson.senate.gov/newsroom/press-releases?page=#{page}"
+      doc = open_html(url)
       return if doc.nil?
-      doc.xpath(
-
+      dates = doc.xpath("//div[@class='date-box']").map{|d| Date.parse(d.children.map{|x| x.text.strip}.join(" "))}
+      (doc/:h3).each_with_index do |row, index|
+        results << { :source => url, :url => "http://www.billnelson.senate.gov" + row.children.first['href'], :title => row.children.first.text.strip, :date => dates[index], :domain => "billnelson.senate.gov" }
       end
       results
     end
@@ -451,14 +451,15 @@ module Statement
       results
     end
 
-    def self.boxer
+    def self.boxer
       results = []
-      url = "http://www.boxer.senate.gov/
+      url = "http://www.boxer.senate.gov/press/release"
       domain = 'www.boxer.senate.gov'
       doc = open_html(url)
       return if doc.nil?
-      doc.
-
+      doc.css("tr")[1..-1].each do |row|
+        next if row.children[1].text == "Sat, January 1st 0000 "
+        results << { :source => url, :url => "http://"+domain + row.children[3].children[1]['href'], :title => row.children[3].children[1].text.strip, :date => Date.parse(row.children[1].text), :domain => domain}
       end
       results
     end
@@ -505,30 +506,6 @@ module Statement
       results
     end
 
-    def self.palazzo(page=1)
-      results = []
-      domain = "palazzo.house.gov"
-      url = "http://palazzo.house.gov/news/documentquery.aspx?DocumentTypeID=2519&Page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='middlecopy']//li").each do |row|
-        results << { :source => url, :url => "http://palazzo.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
-      end
-      results
-    end
-
-    def self.roe(page=1)
-      results = []
-      domain = 'roe.house.gov'
-      url = "http://roe.house.gov/news/documentquery.aspx?DocumentTypeID=1532&Page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='middlecopy']//li").each do |row|
-        results << { :source => url, :url => "http://roe.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
-      end
-      results
-    end
-
     def self.clark(year=Date.today.year)
       results = []
       domain = 'katherineclark.house.gov'
@@ -596,22 +573,6 @@ module Statement
       results
     end
 
-    def self.sherman_mccaul(page=0)
-      results = []
-      domains = ['sherman.house.gov', 'mccaul.house.gov']
-      domains.each do |domain|
-        url = "http://#{domain}/media-center/press-releases?page=#{page}"
-        doc = open_html(url)
-        return if doc.nil?
-        dates = doc.xpath('//span[@class="field-content"]').map {|s| s.text if s.text.strip.include?("201")}.compact!
-        (doc/:h3).first(10).each_with_index do |row, i|
-          date = Date.parse(dates[i])
-          results << {:source => url, :url => "http://"+domain+row.children.first['href'], :title => row.children.first.text.strip, :date => date, :domain => domain}
-        end
-      end
-      results.flatten
-    end
-
     def self.welch
       results = []
       domain = 'welch.house.gov'
@@ -636,19 +597,6 @@ module Statement
       results
     end
 
-    def self.ellison(page=0)
-      results = []
-      domain = 'ellison.house.gov'
-      url = "http://ellison.house.gov/media-center/press-releases?page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='views-field views-field-created datebar']").each do |row|
-        next if row.nil?
-        results << { :source => url, :url => "http://ellison.house.gov" + row.next.next.children[1].children[0]['href'], :title => row.next.next.children[1].children[0].text.strip, :date => Date.parse(row.text.strip), :domain => domain}
-      end
-      results
-    end
-
     def self.costa
       results = []
       domain = 'costa.house.gov'
@@ -701,21 +649,9 @@ module Statement
       results
     end
 
-    def self.mcnerney(page=1)
-      results = []
-      domain = 'mcnerney.house.gov'
-      url = "http://mcnerney.house.gov/media-center/press-releases"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='views-field views-field-title']").each do |row|
-        results << {:source => url, :url => 'http://mcnerney.house.gov' + row.children[1].children[0]['href'], :title => row.children[1].children[0].text.strip, :date => Date.parse(row.next.next.text.strip), :domain => domain }
-      end
-      results
-    end
-
     def self.document_query(page=1)
       results = []
-      domains = [{"thornberry.house.gov" => 1776}, {"wenstrup.house.gov" => 2491}, {"clawson.house.gov" => 2641}]
+      domains = [{"thornberry.house.gov" => 1776}, {"wenstrup.house.gov" => 2491}, {"clawson.house.gov" => 2641}, {"palazzo.house.gov" => 2519}, {"roe.house.gov" => 1532}, {"perry.house.gov" => 2608}, {"rodneydavis.house.gov" => 2427}, {"kevinbrady.house.gov" => 2657}]
       domains.each do |domain|
         doc = open_html("http://"+domain.keys.first+"/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}")
         return if doc.nil?
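This release retires several standalone scrapers (palazzo, roe) by folding their domains into `document_query`, which only needs each site's domain and its DocumentTypeID to build a listing URL. A minimal sketch of that URL construction, using a hypothetical `document_query_urls` helper for illustration:

```ruby
# Each entry pairs a House site domain with its DocumentTypeID, following
# the shape of document_query's domains array.
domains = [{"thornberry.house.gov" => 1776}, {"palazzo.house.gov" => 2519}, {"roe.house.gov" => 1532}]

# Hypothetical helper: builds the documentquery.aspx listing URL for each site.
def document_query_urls(domains, page)
  domains.map do |domain|
    "http://#{domain.keys.first}/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}"
  end
end

urls = document_query_urls(domains, 1)
# urls.first => "http://thornberry.house.gov/news/documentquery.aspx?DocumentTypeID=1776&Page=1"
```

Adding coverage for another office that uses this layout is then just one more hash in the list, rather than a new method.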
@@ -739,6 +675,31 @@ module Statement
       results
     end
 
+    def self.bennet(page=1)
+      results = []
+      domain = 'www.bennet.senate.gov'
+      url = "http://www.bennet.senate.gov/?p=releases&pg=#{page}"
+      doc = open_html(url)
+      return if doc.nil?
+      (doc/:h2).each do |row|
+        results << {:source => url, :url => 'http://www.bennet.senate.gov' + row.children.first['href'], :title => row.text.strip, :date => Date.parse(row.previous.previous.text), :domain => domain }
+      end
+      results
+    end
+
+    def self.shaheen(page=1)
+      results = []
+      domain = 'www.shaheen.senate.gov'
+      url = "http://www.shaheen.senate.gov/news/press/index.cfm?PageNum_rs=#{page}"
+      doc = open_html(url)
+      return if doc.nil?
+      (doc/:ul)[3].children.each do |row|
+        next if row.text.strip == ''
+        results << {:source => url, :url => row.children[2].children[0]['href'], :title => row.children[2].text.strip, :date => Date.parse(row.children.first.text), :domain => domain }
+      end
+      results
+    end
+
     def self.lamborn(limit=nil)
       results = []
       domain = 'lamborn.house.gov'
@@ -756,6 +717,18 @@ module Statement
       results
     end
 
+    def self.jenkins
+      results = []
+      domain = 'lynnjenkins.house.gov/'
+      url = "http://lynnjenkins.house.gov/index.cfm?sectionid=186"
+      doc = open_html(url)
+      return if doc.nil?
+      doc.xpath("//ul[@class='sectionitems']//li").each do |row|
+        results << {:source => url, :url => 'http://lynnjenkins.house.gov' + row.children[3].children[1]['href'], :title => row.children[3].text.strip, :date => Date.parse(row.children[5].text), :domain => domain }
+      end
+      results
+    end
+
     def self.walden
       results = []
       domain = 'walden.house.gov'
@@ -812,5 +785,84 @@ module Statement
 
     end
 
+    def self.perlmutter
+      results = []
+      domain = "perlmutter.house.gov"
+      url = "http://#{domain}/index.php/media-center/press-releases-86821"
+      doc = open_html(url)
+      return if doc.nil?
+
+      doc.css("#adminForm tr")[0..-1].each do |row|
+        results << { :source => url, :url => "http://" + domain + row.children[1].children[1]['href'], :title => row.children[1].children[1].text.strip, :date => Date.parse(row.children[3].text), :domain => domain}
+      end
+      results
+    end
+
+    def self.keating
+      results = []
+      domain = "keating.house.gov"
+      source_url = "http://#{domain}/index.php?option=com_content&view=category&id=14&Itemid=13"
+      doc = open_html(source_url)
+      return if doc.nil?
+
+      doc.css("#adminForm tr")[0..-1].each do |row|
+        url = 'http://' + domain + row.children[1].children[1]['href']
+        title = row.children[1].children[1].text.strip
+        results << { :source => source_url, :url => url, :title => title, :date => Date.parse(row.children[3].text), :domain => domain}
+      end
+      results
+    end
+
+    def self.drupal(urls=[], page=0)
+      if urls.empty?
+        urls = [
+          "http://sherman.house.gov/media-center/press-releases",
+          "http://mccaul.house.gov/media-center/press-releases",
+          "https://ellison.house.gov/media-center/press-releases",
+          "http://mcnerney.house.gov/media-center/press-releases",
+          "http://sanford.house.gov/media-center/press-releases",
+          "http://butterfield.house.gov/media-center/press-releases",
+          "http://walz.house.gov/media-center/press-releases",
+          "https://pingree.house.gov/media-center/press-releases",
+          "http://sarbanes.house.gov/media-center/press-releases",
+          "http://wilson.house.gov/media-center/press-releases",
+          "https://bilirakis.house.gov/press-releases",
+          "http://quigley.house.gov/media-center/press-releases"
+        ]
+      end
+
+      results = []
+
+      urls.each do |url|
+        source_url = "#{url}?page=#{page}"
+
+        domain = URI.parse(source_url).host
+        doc = open_html(source_url)
+        return if doc.nil?
+
+        doc.css("#region-content .views-row").each do |row|
+          title_anchor = row.css("h3 a")
+          title = title_anchor.text
+          release_url = "http://#{domain + title_anchor.attr('href')}"
+          raw_date = row.css(".views-field-created").text
+          results << { :source => source_url,
+                       :url => release_url,
+                       :title => title,
+                       :date => begin Date.parse(raw_date) rescue nil end,
+                       :domain => domain }
+        end
+
+        # mike quigley's release page doesn't have dates, so we fetch those individually
+        if url == "http://quigley.house.gov/media-center/press-releases"
+          results.select{|r| r[:source] == source_url}.each do |result|
+            doc = open_html(result[:url])
+            result[:date] = Date.parse(doc.css(".pane-content").children[0].text.strip)
+          end
+        end
+      end
+      results
+    end
+
   end
+
 end
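The new `drupal` scraper guards its date parsing with an inline `rescue`, since some of these Drupal listing pages omit dates (which is also why Quigley's releases are fetched individually). A stdlib-only sketch of that pattern, using a hypothetical `safe_parse` helper:

```ruby
require 'date'

# Hypothetical helper mirroring the scraper's
#   :date => begin Date.parse(raw_date) rescue nil end
# A failed parse yields nil instead of raising, so one bad row
# doesn't abort the whole scrape.
def safe_parse(raw_date)
  begin Date.parse(raw_date) rescue nil end
end

safe_parse("April 24, 2013")  # a Date for 2013-04-24
safe_parse("")                # nil, not an ArgumentError
```

Rows with a nil date can then be backfilled later, as the Quigley branch of `drupal` does.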
data/lib/statement/version.rb
CHANGED
data/scraper_guide.md
ADDED
@@ -0,0 +1,49 @@
+## Contributing Scrapers
+
+Some members of Congress either don't have RSS feeds of their press releases, or the ones they have are broken. That's where scraping comes in. Unfortunately, members also tend to change the layouts of their sites more often than you might think, so it's not always a matter of writing a single scraper and forgetting about it.
+
+That doesn't mean that writing member-specific scrapers is particularly difficult. Many lawmakers have similar sites, so you can either build off an existing scraper or even add to an existing one. Here's the basic process:
+
+### Setup
+
+1. Ruby: if you don't have it, install Ruby (version 2.x) and run `gem install bundler` from the command line.
+2. Fork the [repository](https://github.com/TheUpshot/statement) and clone it to a directory on your computer.
+3. cd into that directory and run `bundle install` to install the gems used by Statement.
+4. Enter the Ruby console by typing `irb` and then require the libraries we'll need:
+
+```ruby
+require 'uri'
+require 'open-uri'
+require 'american_date'
+require 'nokogiri'
+```
+Then pick a lawmaker that needs a scraper written from [our issues page](https://github.com/TheUpshot/statement/issues).
+
+### Scraper Design
+
+Most lawmakers have press release sections of their sites that display the date, title and link of a press release. Take Barbara Boxer, the California Democratic senator. Her [press release page](http://www.boxer.senate.gov/press/release/) is somewhat typical in that it features a table of releases, 10 to a page. The goal is to scrape that page, and optionally others if the site is paginated (most congressional press release sites are), and to build an Array of Ruby hashes that contain each release's url, date and title, along with two other pieces of information: the source page of press release urls and the domain of the site (which helps to identify the lawmaker).
+
+To do this, we use Nokogiri, a Ruby HTML and XML parser, rather than regular expressions. One of Nokogiri's strengths is that it can parse HTML documents based on CSS classes, XPath or via HTML entity search. Statement has a helper method, `open_html`, that loads the press release url into Nokogiri's parser. Senator Boxer's scraper might look like this:
+
+```ruby
+def self.boxer
+  results = []
+  url = "http://www.boxer.senate.gov/press/release"
+  domain = 'www.boxer.senate.gov'
+  doc = open_html(url)
+  return if doc.nil?
+  doc.css("tr")[1..-1].each do |row|
+    results << { :source => url, :url => "http://"+domain + row.children[3].children[1]['href'], :title => row.children[3].children[1].text.strip, :date => Date.parse(row.children[1].text), :domain => domain}
+  end
+  results
+end
+```
+For the first row that would produce the following hash:
+
+```ruby
+=> {:source=>"http://www.boxer.senate.gov/press/release", :url=>"http://www.boxer.senate.gov/press/release/boxer-feinstein-colleagues-introduces-bill-in-support-of-positive-train-control/", :title=>"Boxer, Feinstein, Colleagues Introduces Bill in Support of Positive Train Control", :date=>#<Date: 2015-04-17 ((2457130j,0s,0n),+0s,2299161j)>, :domain=>"www.boxer.senate.gov"}
+```
+
+For people new to Nokogiri, perhaps the hardest part is navigating its nodes - a `tr` node will have children `td` nodes, for example. The best advice we can provide is to spend time in the console trying to navigate up and down an HTML document's nodes. Calling the `text` method on any Nokogiri object will return its text contents.
+
+Overall, it's best to work off an existing [member scraper](https://github.com/TheUpshot/statement/blob/master/lib/statement/scraper.rb). You don't need to write anything except the scraper method; we'll take care of the rest once you submit your pull request.