statement 1.9.9 → 2.0
- checksums.yaml +4 -4
- data/README.md +11 -7
- data/lib/statement/scraper.rb +140 -88
- data/lib/statement/version.rb +1 -1
- data/scraper_guide.md +49 -0
- data/spec/butterfield_press.html +407 -0
- data/spec/drupal_press.html +524 -0
- data/spec/ed_perlmutter_press.html +5032 -0
- data/spec/keating_press.html +2211 -0
- data/spec/statement_spec.rb +62 -10
- metadata +11 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 11dcba16755ef54dff1c0c48db50aca841485abd
+  data.tar.gz: dda7e3b05004b1d7bf59411c902ed2a670436914
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 20f0513c7aa7a3d70b9e3c3f8d8eb6c30e66b2b88a3dec326d0516e031d3450b201e7a2db8f98b90d29a1d2fb2a384ed096d9ea9f9940cfa41abf9ef79a84590
+  data.tar.gz: 358a8cfa517caf462ebe20119a81a898ac46b9806b1614e2ed6f34ceabc89bcd5dd40ecf43dc38cd989f8fd17449af2a32e63c2d636cd54f92830919bf569533
data/README.md
CHANGED
@@ -1,10 +1,10 @@
 # Statement
 
-Statement parses RSS feeds and HTML pages containing press releases and other official statements from members of Congress, and produces hashes with information about those pages. It has been tested under Ruby 1.9.
+Statement parses RSS feeds and HTML pages containing press releases and other official statements from members of Congress, and produces hashes with information about those pages. It has been tested under Ruby 1.9.3 and 2.x.
 
 ## Coverage
 
-Statement currently parses press releases for members of the House and Senate. For members with RSS feeds, you can pass the feed URL into Statement. For members without RSS feeds, HTML scrapers are provided, as are methods for
+Statement currently parses press releases for members of the House and Senate. For members with RSS feeds, you can pass the feed URL into Statement. For members without RSS feeds (or with broken ones), HTML scrapers are provided, as are methods for special groups, such as House Republicans. Suggestions are welcomed.
 
 ## Installation
 
@@ -28,7 +28,7 @@ $ gem install statement
 
 ## Usage
 
-Statement provides access to press releases, Facebook status updates and tweets from members of Congress. Most congressional offices have RSS feeds but some require HTML scraping.
+Statement provides access to press releases, Facebook status updates and tweets from members of Congress. Most congressional offices have RSS feeds but some require HTML scraping.
 
 To configure Statement to pull from the Twitter and Facebook APIs, you can pass in configuration values via a hash or a `config.yml` file:
 
@@ -48,7 +48,7 @@ To parse an RSS feed, simply pass the URL to Statement's Feed class:
 ```ruby
 require 'rubygems'
 require 'statement'
-
+
 results = Statement::Feed.from_rss('http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1')
 puts results.first
 {:source=>"http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1", :url=>"http://blumenauer.house.gov/index.php?option=com_content&view=article&id=2203:blumenauer-qwe-need-a-national-system-that-speaks-to-the-transportation-challenges-of-todayq&catid=66:2013-press-releases", :title=>"Blumenauer: "We need a national system that speaks to the transportation challenges of ...", :date=>#<Date: 2013-04-24 ((2456407j,0s,0n),+0s,2299161j)>, :domain=>"blumenauer.house.gov"}
@@ -121,6 +121,8 @@ $ rake test
 
 ## Contributing
 
+Statement would not be nearly the library it is without our contributors, and we sincerely thank them for their generosity and interest in making congressional press release data more available.
+
 1. Fork it
 2. Create your feature branch (`git checkout -b my-new-feature`)
 3. Commit your changes (`git commit -am 'Add some feature'`)
@@ -131,6 +133,8 @@ If you write a new scraper, please use Nokogiri for parsing - see some of the ex
 
 ## Authors
 
-* Derek Willis
-* Jacob Harris
-
+* [Derek Willis](https://github.com/dwillis)
+* [Jacob Harris](https://github.com/harrisj)
+* [Mick O'Brien](https://github.com/mickaobrien)
+* [Tyler Pearson](https://github.com/tylerpearson)
+* [Sam Sweeney](https://github.com/shubik22)
data/lib/statement/scraper.rb
CHANGED
@@ -30,9 +30,9 @@ module Statement
 
     def self.member_methods
       [:crenshaw, :capuano, :cold_fusion, :conaway, :chabot, :freshman_senators, :klobuchar, :billnelson, :crapo, :boxer,
-      :vitter, :inhofe, :
-      :
-      :bennie_thompson, :speier, :poe, :grassley]
+      :vitter, :inhofe, :document_query, :swalwell, :fischer, :clark, :edwards, :culberson_chabot_grisham, :barton,
+      :welch, :sessions, :gabbard, :costa, :farr, :mcclintock, :olson, :schumer, :lamborn, :walden,
+      :bennie_thompson, :speier, :poe, :grassley, :bennet, :shaheen, :keating, :drupal, :jenkins]
     end
 
     def self.committee_methods
@@ -41,21 +41,21 @@ module Statement
 
     def self.member_scrapers
       year = Date.today.year
-      results = [crenshaw, capuano, cold_fusion(year, nil), conaway, chabot, klobuchar(year),
-      document_query(page=1), document_query(page=2), swalwell(page=1), crapo, boxer
-      vitter(year=year), inhofe(year=year), fischer, clark(year=year), edwards, culberson_chabot_grisham(page=1), barton,
-      sessions(year=year), gabbard,
-      poe(year=year, month=0)].flatten
+      results = [crenshaw, capuano, cold_fusion(year, nil), conaway, chabot, klobuchar(year), billnelson(page=0),
+      document_query(page=1), document_query(page=2), swalwell(page=1), crapo, boxer, grassley(page=0),
+      vitter(year=year), inhofe(year=year), fischer, clark(year=year), edwards, culberson_chabot_grisham(page=1), barton, welch,
+      sessions(year=year), gabbard, costa, farr, olson, schumer, lamborn(limit=10), walden, bennie_thompson, speier,
+      poe(year=year, month=0), bennet(page=1), shaheen(page=1), perlmutter, keating, drupal, jenkins].flatten
       results = results.compact
       Utils.remove_generic_urls!(results)
     end
 
     def self.backfill_from_scrapers
       results = [cold_fusion(2012, 0), cold_fusion(2011, 0), cold_fusion(2010, 0), billnelson(year=2012), document_query(page=3),
-      document_query(page=4),
-
-
-
+      document_query(page=4), grassley(page=1), grassley(page=2), grassley(page=3),
+      vitter(year=2012), vitter(year=2011), swalwell(page=2), swalwell(page=3), clark(year=2013), culberson_chabot_grisham(page=2),
+      sessions(year=2013), pryor(page=1), farr(year=2013), farr(year=2012), farr(year=2011),
+      olson(year=2013), schumer(page=2), schumer(page=3), poe(year=2015, month=2),
       poe(year=2015, month=1)].flatten
       Utils.remove_generic_urls!(results)
     end
@@ -391,14 +391,14 @@ module Statement
       results
     end
 
-    def self.billnelson(
+    def self.billnelson(page=0)
       results = []
-
-
-      doc = open_html(year_url)
+      url = "http://www.billnelson.senate.gov/newsroom/press-releases?page=#{page}"
+      doc = open_html(url)
       return if doc.nil?
-      doc.xpath(
-
+      dates = doc.xpath("//div[@class='date-box']").map{|d| Date.parse(d.children.map{|x| x.text.strip}.join(" "))}
+      (doc/:h3).each_with_index do |row, index|
+        results << { :source => url, :url => "http://www.billnelson.senate.gov" + row.children.first['href'], :title => row.children.first.text.strip, :date => dates[index], :domain => "billnelson.senate.gov" }
       end
       results
     end
@@ -451,14 +451,15 @@ module Statement
       results
     end
 
-    def self.boxer
+    def self.boxer
       results = []
-      url = "http://www.boxer.senate.gov/
+      url = "http://www.boxer.senate.gov/press/release"
       domain = 'www.boxer.senate.gov'
       doc = open_html(url)
       return if doc.nil?
-      doc.
-
+      doc.css("tr")[1..-1].each do |row|
+        next if row.children[1].text == "Sat, January 1st 0000 "
+        results << { :source => url, :url => "http://"+domain + row.children[3].children[1]['href'], :title => row.children[3].children[1].text.strip, :date => Date.parse(row.children[1].text), :domain => domain}
       end
       results
     end
@@ -505,30 +506,6 @@ module Statement
       results
     end
 
-    def self.palazzo(page=1)
-      results = []
-      domain = "palazzo.house.gov"
-      url = "http://palazzo.house.gov/news/documentquery.aspx?DocumentTypeID=2519&Page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='middlecopy']//li").each do |row|
-        results << { :source => url, :url => "http://palazzo.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
-      end
-      results
-    end
-
-    def self.roe(page=1)
-      results = []
-      domain = 'roe.house.gov'
-      url = "http://roe.house.gov/news/documentquery.aspx?DocumentTypeID=1532&Page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='middlecopy']//li").each do |row|
-        results << { :source => url, :url => "http://roe.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
-      end
-      results
-    end
-
     def self.clark(year=Date.today.year)
       results = []
       domain = 'katherineclark.house.gov'
@@ -596,22 +573,6 @@ module Statement
       results
     end
 
-    def self.sherman_mccaul(page=0)
-      results = []
-      domains = ['sherman.house.gov', 'mccaul.house.gov']
-      domains.each do |domain|
-        url = "http://#{domain}/media-center/press-releases?page=#{page}"
-        doc = open_html(url)
-        return if doc.nil?
-        dates = doc.xpath('//span[@class="field-content"]').map {|s| s.text if s.text.strip.include?("201")}.compact!
-        (doc/:h3).first(10).each_with_index do |row, i|
-          date = Date.parse(dates[i])
-          results << {:source => url, :url => "http://"+domain+row.children.first['href'], :title => row.children.first.text.strip, :date => date, :domain => domain}
-        end
-      end
-      results.flatten
-    end
-
     def self.welch
       results = []
       domain = 'welch.house.gov'
@@ -636,19 +597,6 @@ module Statement
       results
     end
 
-    def self.ellison(page=0)
-      results = []
-      domain = 'ellison.house.gov'
-      url = "http://ellison.house.gov/media-center/press-releases?page=#{page}"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='views-field views-field-created datebar']").each do |row|
-        next if row.nil?
-        results << { :source => url, :url => "http://ellison.house.gov" + row.next.next.children[1].children[0]['href'], :title => row.next.next.children[1].children[0].text.strip, :date => Date.parse(row.text.strip), :domain => domain}
-      end
-      results
-    end
-
     def self.costa
       results = []
       domain = 'costa.house.gov'
@@ -701,21 +649,9 @@ module Statement
       results
     end
 
-    def self.mcnerney(page=1)
-      results = []
-      domain = 'mcnerney.house.gov'
-      url = "http://mcnerney.house.gov/media-center/press-releases"
-      doc = open_html(url)
-      return if doc.nil?
-      doc.xpath("//div[@class='views-field views-field-title']").each do |row|
-        results << {:source => url, :url => 'http://mcnerney.house.gov' + row.children[1].children[0]['href'], :title => row.children[1].children[0].text.strip, :date => Date.parse(row.next.next.text.strip), :domain => domain }
-      end
-      results
-    end
-
     def self.document_query(page=1)
       results = []
-      domains = [{"thornberry.house.gov" => 1776}, {"wenstrup.house.gov" => 2491}, {"clawson.house.gov" => 2641}]
+      domains = [{"thornberry.house.gov" => 1776}, {"wenstrup.house.gov" => 2491}, {"clawson.house.gov" => 2641}, {"palazzo.house.gov" => 2519}, {"roe.house.gov" => 1532}, {"perry.house.gov" => 2608}, {"rodneydavis.house.gov" => 2427}, {"kevinbrady.house.gov" => 2657}]
       domains.each do |domain|
         doc = open_html("http://"+domain.keys.first+"/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}")
         return if doc.nil?
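This release retires several standalone scrapers (palazzo, roe) by folding their domains into `document_query`, which only needs each site's domain and its DocumentTypeID to build a listing URL. A minimal sketch of that URL construction, using a hypothetical `document_query_urls` helper for illustration:

```ruby
# Each entry pairs a House site domain with its DocumentTypeID, following
# the shape of document_query's domains array.
domains = [{"thornberry.house.gov" => 1776}, {"palazzo.house.gov" => 2519}, {"roe.house.gov" => 1532}]

# Hypothetical helper: builds the documentquery.aspx listing URL for each site.
def document_query_urls(domains, page)
  domains.map do |domain|
    "http://#{domain.keys.first}/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}"
  end
end

urls = document_query_urls(domains, 1)
# urls.first => "http://thornberry.house.gov/news/documentquery.aspx?DocumentTypeID=1776&Page=1"
```

Adding coverage for another office that uses this layout is then just one more hash in the list, rather than a new method.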
@@ -739,6 +675,31 @@ module Statement
       results
     end
 
+    def self.bennet(page=1)
+      results = []
+      domain = 'www.bennet.senate.gov'
+      url = "http://www.bennet.senate.gov/?p=releases&pg=#{page}"
+      doc = open_html(url)
+      return if doc.nil?
+      (doc/:h2).each do |row|
+        results << {:source => url, :url => 'http://www.bennet.senate.gov' + row.children.first['href'], :title => row.text.strip, :date => Date.parse(row.previous.previous.text), :domain => domain }
+      end
+      results
+    end
+
+    def self.shaheen(page=1)
+      results = []
+      domain = 'www.shaheen.senate.gov'
+      url = "http://www.shaheen.senate.gov/news/press/index.cfm?PageNum_rs=#{page}"
+      doc = open_html(url)
+      return if doc.nil?
+      (doc/:ul)[3].children.each do |row|
+        next if row.text.strip == ''
+        results << {:source => url, :url => row.children[2].children[0]['href'], :title => row.children[2].text.strip, :date => Date.parse(row.children.first.text), :domain => domain }
+      end
+      results
+    end
+
     def self.lamborn(limit=nil)
       results = []
       domain = 'lamborn.house.gov'
@@ -756,6 +717,18 @@ module Statement
       results
     end
 
+    def self.jenkins
+      results = []
+      domain = 'lynnjenkins.house.gov/'
+      url = "http://lynnjenkins.house.gov/index.cfm?sectionid=186"
+      doc = open_html(url)
+      return if doc.nil?
+      doc.xpath("//ul[@class='sectionitems']//li").each do |row|
+        results << {:source => url, :url => 'http://lynnjenkins.house.gov' + row.children[3].children[1]['href'], :title => row.children[3].text.strip, :date => Date.parse(row.children[5].text), :domain => domain }
+      end
+      results
+    end
+
     def self.walden
       results = []
       domain = 'walden.house.gov'
@@ -812,5 +785,84 @@ module Statement
 
     end
 
+    def self.perlmutter
+      results = []
+      domain = "perlmutter.house.gov"
+      url = "http://#{domain}/index.php/media-center/press-releases-86821"
+      doc = open_html(url)
+      return if doc.nil?
+
+      doc.css("#adminForm tr")[0..-1].each do |row|
+        results << { :source => url, :url => "http://" + domain + row.children[1].children[1]['href'], :title => row.children[1].children[1].text.strip, :date => Date.parse(row.children[3].text), :domain => domain}
+      end
+      results
+    end
+
+    def self.keating
+      results = []
+      domain = "keating.house.gov"
+      source_url = "http://#{domain}/index.php?option=com_content&view=category&id=14&Itemid=13"
+      doc = open_html(source_url)
+      return if doc.nil?
+
+      doc.css("#adminForm tr")[0..-1].each do |row|
+        url = 'http://' + domain + row.children[1].children[1]['href']
+        title = row.children[1].children[1].text.strip
+        results << { :source => source_url, :url => url, :title => title, :date => Date.parse(row.children[3].text), :domain => domain}
+      end
+      results
+    end
+
+    def self.drupal(urls=[], page=0)
+      if urls.empty?
+        urls = [
+          "http://sherman.house.gov/media-center/press-releases",
+          "http://mccaul.house.gov/media-center/press-releases",
+          "https://ellison.house.gov/media-center/press-releases",
+          "http://mcnerney.house.gov/media-center/press-releases",
+          "http://sanford.house.gov/media-center/press-releases",
+          "http://butterfield.house.gov/media-center/press-releases",
+          "http://walz.house.gov/media-center/press-releases",
+          "https://pingree.house.gov/media-center/press-releases",
+          "http://sarbanes.house.gov/media-center/press-releases",
+          "http://wilson.house.gov/media-center/press-releases",
+          "https://bilirakis.house.gov/press-releases",
+          "http://quigley.house.gov/media-center/press-releases"
+        ]
+      end
+
+      results = []
+
+      urls.each do |url|
+        source_url = "#{url}?page=#{page}"
+
+        domain = URI.parse(source_url).host
+        doc = open_html(source_url)
+        return if doc.nil?
+
+        doc.css("#region-content .views-row").each do |row|
+          title_anchor = row.css("h3 a")
+          title = title_anchor.text
+          release_url = "http://#{domain + title_anchor.attr('href')}"
+          raw_date = row.css(".views-field-created").text
+          results << { :source => source_url,
+                       :url => release_url,
+                       :title => title,
+                       :date => begin Date.parse(raw_date) rescue nil end,
+                       :domain => domain }
+        end
+
+        # mike quigley's release page doesn't have dates, so we fetch those individually
+        if url == "http://quigley.house.gov/media-center/press-releases"
+          results.select{|r| r[:source] == source_url}.each do |result|
+            doc = open_html(result[:url])
+            result[:date] = Date.parse(doc.css(".pane-content").children[0].text.strip)
+          end
+        end
+      end
+      results
+    end
+
   end
+
 end
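The new `drupal` scraper guards its date parsing with an inline `rescue`, since some of these Drupal listing pages omit dates (which is also why Quigley's releases are fetched individually). A stdlib-only sketch of that pattern, using a hypothetical `safe_parse` helper:

```ruby
require 'date'

# Hypothetical helper mirroring the scraper's
#   :date => begin Date.parse(raw_date) rescue nil end
# A failed parse yields nil instead of raising, so one bad row
# doesn't abort the whole scrape.
def safe_parse(raw_date)
  begin Date.parse(raw_date) rescue nil end
end

safe_parse("April 24, 2013")  # a Date for 2013-04-24
safe_parse("")                # nil, not an ArgumentError
```

Rows with a nil date can then be backfilled later, as the Quigley branch of `drupal` does.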
data/lib/statement/version.rb
CHANGED
data/scraper_guide.md
ADDED
@@ -0,0 +1,49 @@
+## Contributing Scrapers
+
+Some members of Congress either don't have RSS feeds of their press releases, or the ones they have are broken. That's where scraping comes in. Unfortunately, members also tend to change the layouts of their sites more often than you might think, so it's not always a matter of writing a single scraper and forgetting about it.
+
+That doesn't mean that writing member-specific scrapers is particularly difficult. Many lawmakers have similar sites, so you can either build off an existing scraper or even add to an existing one. Here's the basic process:
+
+### Setup
+
+1. Ruby: if you don't have it, install Ruby (version 2.x) and run `gem install bundler` from the command line.
+2. Fork the [repository](https://github.com/TheUpshot/statement) and clone it to a directory on your computer.
+3. cd into that directory and run `bundle install` to install the gems used by Statement.
+4. Enter the Ruby console by typing `irb` and then require the libraries we'll need:
+
+```ruby
+require 'uri'
+require 'open-uri'
+require 'american_date'
+require 'nokogiri'
+```
+Then pick a lawmaker that needs a scraper written from [our issues page](https://github.com/TheUpshot/statement/issues).
+
+### Scraper Design
+
+Most lawmakers have press release sections of their sites that display the date, title and link of a press release. Take Barbara Boxer, the California Democratic senator. Her [press release page](http://www.boxer.senate.gov/press/release/) is somewhat typical in that it features a table of releases, 10 to a page. The goal is to scrape that page, and optionally others if the site is paginated (most congressional press release sites are), and to build an Array of Ruby hashes that contain each release's url, date and title, along with two other pieces of information: the source page of press release urls and the domain of the site (which helps to identify the lawmaker).
+
+To do this, we use Nokogiri, a Ruby HTML and XML parser, rather than regular expressions. One of Nokogiri's strengths is that it can parse HTML documents based on CSS classes, XPath or via HTML entity search. Statement has a helper method, `open_html`, that loads the press release url into Nokogiri's parser. Senator Boxer's scraper might look like this:
+
+```ruby
+def self.boxer
+  results = []
+  url = "http://www.boxer.senate.gov/press/release"
+  domain = 'www.boxer.senate.gov'
+  doc = open_html(url)
+  return if doc.nil?
+  doc.css("tr")[1..-1].each do |row|
+    results << { :source => url, :url => "http://"+domain + row.children[3].children[1]['href'], :title => row.children[3].children[1].text.strip, :date => Date.parse(row.children[1].text), :domain => domain}
+  end
+  results
+end
+```
+For the first row that would produce the following hash:
+
+```ruby
+=> {:source=>"http://www.boxer.senate.gov/press/release", :url=>"http://www.boxer.senate.gov/press/release/boxer-feinstein-colleagues-introduces-bill-in-support-of-positive-train-control/", :title=>"Boxer, Feinstein, Colleagues Introduces Bill in Support of Positive Train Control", :date=>#<Date: 2015-04-17 ((2457130j,0s,0n),+0s,2299161j)>, :domain=>"www.boxer.senate.gov"}
+```
+
+For people new to Nokogiri, perhaps the hardest part is navigating its nodes - a `tr` node will have children `td` nodes, for example. The best advice we can provide is to spend time in the console trying to navigate up and down an HTML document's nodes. Calling the `text` method on any Nokogiri object will return its text contents.
+
+Overall, it's best to work off an existing [member scraper](https://github.com/TheUpshot/statement/blob/master/lib/statement/scraper.rb). You don't need to write anything except the scraper method; we'll take care of the rest once you submit your pull request.