uk_parliament 0.1.0

data/README.md ADDED
@@ -0,0 +1,88 @@
1
+ # UkParliament
2
+
3
+ A gem that scrapes the contact details (addresses, phone, email, Twitter, Facebook etc.) of current UK parliamentarians (members of the House of Commons and House of Lords) from the [parliament.uk](http://parliament.uk) web site and stores the data in JSON file(s).
4
+
5
+ Each member of the House of Commons and House of Lords has a publicly available profile on the parliament.uk site. Each profile contains varying amounts of contact information, so the results for each member will differ.
6
+
7
+ Why scrape this info? Yes, there is an API, but it doesn't/didn't appear particularly straightforward to get the contact info required.
8
+
9
+ ## Installation
10
+
11
+ Add this line to your application's Gemfile:
12
+
13
+ ```ruby
14
+ gem 'uk_parliament'
15
+ ```
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install uk_parliament
24
+
25
+ ## Usage
26
+
27
+
28
+
29
+ Trigger a scrape, which saves the data to file, either a) by running a Rake task:
30
+
31
+ rake scrape_parliament
32
+
33
+ Or b), by opening `bin/console` and running:
34
+
35
+ parliament = UkParliament::Parliament.new(false, false)
36
+
37
+ This will run for approx. 7-10 minutes (depending on machine/network etc.) and output two files, `commons.json` and `lords.json`, in the user's `$HOME/uk_parliament` directory.
38
+
39
+ Progress of the data being scraped is output to a log file, `uk_parliament.log`, also in the user `$HOME/uk_parliament` directory.
40
+
41
+ To monitor progress, run:
42
+
43
+ tail -f ~/uk_parliament/uk_parliament.log
44
+
45
+ The log will tell you which, if any, requests fail, and at the end tell you how many have failed.
46
+
47
+ If there are failures, they are recorded in an error queue. Just re-run the same command, and the contents of the error queue will be used as the source of what to scrape, rather than attempting to scrape the whole set of data again.
48
+
49
+ Data for members that are scraped successfully on a retry is merged with the previous data in the JSON file(s).
50
+
51
+ Errors appear to be few and intermittent. Keep reprocessing the error queue until it is empty.
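The "re-run until the error queue is empty" cycle can be illustrated with a small, self-contained sketch. It does not use the gem's own queue classes; the helper name `process_with_error_queue` and the simulated failure are assumptions made purely for illustration.

```ruby
# Hypothetical helper (not part of the gem): process every id once and
# return the ids whose block raised, i.e. the "error queue" for the next run.
def process_with_error_queue(ids)
  failed = []
  ids.each do |id|
    begin
      yield id
    rescue StandardError
      failed << id
    end
  end
  failed
end

attempts = Hash.new(0)
pending = [1, 2, 3]
runs = 0

# Keep reprocessing only the failures until none are left.
until pending.empty?
  runs += 1
  pending = process_with_error_queue(pending) do |id|
    attempts[id] += 1
    # Simulated intermittent failure: id 2 fails on its first attempt only.
    raise 'temporary failure' if id == 2 && attempts[id] == 1
  end
end

puts "Finished after #{runs} runs"
```

Each pass only touches the entries that failed previously, which is why a retry is much quicker than a full scrape.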
52
+
53
+ Each time you re-run the scraping, the output of the previous run (the `*.json` files) will be backed up, with a timestamp taken from the file's modification time, e.g. `commons.json` will be copied to `commons-20161231_061221.json`.
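As a rough sketch of how such a backup name can be derived, the snippet below builds the timestamped name from a path and a modification time. The helper name `backup_name_for` and the example path are hypothetical; the gem's own implementation may differ in detail.

```ruby
# Hypothetical helper: build a timestamped backup name for a .json data file.
def backup_name_for(path, mtime)
  dir = File.dirname(path)
  base = File.basename(path, '.json')
  File.join(dir, "#{base}-#{mtime.strftime('%Y%m%d_%H%M%S')}.json")
end

# Example path and time chosen to match the file name shown above.
example = backup_name_for('/home/user/uk_parliament/commons.json',
                          Time.new(2016, 12, 31, 6, 12, 21))
puts example
```

Using `File.basename` with the extension, rather than splitting on the first `.`, keeps the result correct when a directory in the path contains a dot.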
54
+
55
+ Once you have created the output files, running:
56
+
57
+ parliament = UkParliament::Parliament.new
58
+
59
+ or:
60
+
61
+ parliament = UkParliament::Parliament.new(true, true)
62
+
63
+ will load the data from those files for you to process as you wish.
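If you only want to inspect the output without loading the gem, the files can also be read with Ruby's standard `json` library. This is a minimal sketch that assumes each file holds a JSON array of member hashes; the helper name `load_members` is hypothetical.

```ruby
require 'json'

# Hypothetical helper: read a scraped data file, returning an empty list
# if the file has not been created yet.
def load_members(path)
  return [] unless File.exist?(path)

  JSON.parse(File.read(path))
end

# Example call, using the location described above:
#   load_members(File.join(Dir.home, 'uk_parliament', 'commons.json'))
puts load_members('does_not_exist.json').size
```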
64
+
65
+ For example, you can access just the House of Commons member data with:
66
+
67
+ parliament.houses[:commons].members
68
+
69
+ The first entry in that list of members is then:
70
+
71
+ irb(main):013:0> parliament.houses[:commons].members[0]
72
+ => {"alphabetical_name"=>"Abbott, Ms Diane", ...}
73
+
74
+ For quick results, there is a _very_ basic member name lookup that can be run with:
75
+
76
+ irb(main):007:0> parliament.parliamentarians_named('Ahmed')
77
+ => [{"alphabetical_name"=>"Ahmed-Sheikh, Ms Tasmina", ...}, {"alphabetical_name"=>"Ahmed, Lord", ...}]
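The lookup is a case-insensitive substring match. The sketch below shows the same idea on plain hashes, assuming each member hash keeps its name under `'name' => { 'full_name' => ... }`; the helper name `members_named` and the sample entries are placeholders, not real scraped data.

```ruby
# Hypothetical helper: case-insensitive substring match on a member's full name.
def members_named(members, search_name)
  needle = search_name.strip.downcase
  # Very short search terms are ignored, to avoid matching almost everything.
  return [] if needle.size < 2

  members.select do |member|
    full_name = member.dig('name', 'full_name')
    full_name && full_name.downcase.include?(needle)
  end
end

# Placeholder data for illustration only.
sample = [
  { 'name' => { 'full_name' => 'Example Ahmed' } },
  { 'name' => { 'full_name' => 'Sample Person' } },
  { 'alphabetical_name' => 'Entry, Without Name Key' }
]

puts members_named(sample, ' AHMED ').size
```

Entries without a `'name'` key are skipped rather than raising, which matters because profiles contain varying amounts of information.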
78
+
79
+ ## Development
80
+
81
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
82
+
83
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
84
+
85
+ ## Contributing
86
+
87
+ Bug reports and pull requests are welcome on GitHub at [https://github.com/a-sansom/uk_parliament](https://github.com/a-sansom/uk_parliament).
88
+
data/Rakefile ADDED
@@ -0,0 +1,15 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+ require "uk_parliament"
4
+
5
+ RSpec::Core::RakeTask.new(:spec)
6
+
7
+ task :default => :spec
8
+
9
+ task :scrape_parliament do
10
+ puts 'Scraping Parliament data'
11
+ puts 'This will take some time, approx. 10 mins'
12
+ puts "Check/tail the log file in your user '$HOME/uk_parliament' directory for progress."
13
+ parliament = UkParliament::Parliament.new(false, false)
14
+ puts "Finished scraping. Check user '$HOME/uk_parliament' directory for .json files."
15
+ end
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "uk_parliament"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,99 @@
1
+ require 'uk_parliament/version'
2
+ require 'uk_parliament/commons'
3
+ require 'uk_parliament/doc_pipeline'
4
+ require 'uk_parliament/house_members'
5
+ require 'uk_parliament/file_house_members'
6
+ require 'uk_parliament/house_members_manager'
7
+ require 'uk_parliament/house_members_source_factory'
8
+ require 'uk_parliament/http_house_members'
9
+ require 'uk_parliament/lords'
10
+ require 'uk_parliament/member_list_doc_pipeline'
11
+ require 'uk_parliament/member_summary_doc_pipeline'
12
+ require 'uk_parliament/queue_manager'
13
+ require 'fileutils'
+ require 'logger'
14
+
15
+ # Module defining classes and methods enabling scraping of UK Parliament
16
+ # members contact data from parliament.uk web site, or loading of scraped
17
+ # data from file.
18
+ module UkParliament
19
+ # Constants representing where data can come from.
20
+ DATA_SOURCE_FILE = 'file'
21
+ DATA_SOURCE_HTTP = 'http'
22
+
23
+ # Set up module-wide access to the file logger.
24
+ def log
25
+ UkParliament.log
26
+ end
27
+
28
+ # Set up a Logger instance, if one doesn't already exist.
29
+ def self.log
30
+ if @log.nil?
31
+ config = configuration
32
+ @log = Logger.new(File.join(config[:log_file_path], 'uk_parliament.log'), 'daily')
33
+ @log.level = Logger::INFO
34
+ end
35
+
36
+ @log
37
+ end
38
+
39
+ # Set up module-wide access to a set of configuration values.
40
+ def configuration
41
+ UkParliament.configuration
42
+ end
43
+
44
+ # Define set of configuration values for the module.
45
+ def self.configuration
46
+ if @configuration.nil?
47
+ base_dir = File.join(Dir.home, 'uk_parliament')
48
+ FileUtils.mkdir_p(base_dir) unless Dir.exist?(base_dir)
49
+
50
+ @configuration = {
51
+ :log_file_path => base_dir,
52
+ :data_file_path => base_dir,
53
+ :queue_file_path => base_dir,
54
+ :scrape_no_of_threads => 4,
55
+ :scrape_request_delay => 2,
56
+ :backup_before_write => true
57
+ }
58
+ end
59
+
60
+ @configuration
61
+ end
62
+
63
+ # Class representing Parliament.
64
+ class Parliament
65
+ include UkParliament
66
+
67
+ # Instance data accessor(s).
68
+ attr_reader :houses
69
+
70
+ # Initialise the class instance variables.
71
+ def initialize(load_commons_file = true, load_lords_file = true)
72
+ @houses = {
73
+ :commons => Commons.new(load_commons_file),
74
+ :lords => Lords.new(load_lords_file)
75
+ }
76
+ end
77
+
78
+ # Simple lookup of members with a particular name (or part of).
79
+ def parliamentarians_named(search_name)
80
+ search_name = search_name.strip.downcase
81
+ results = []
82
+
83
+ if search_name.size > 1
84
+ @houses.each_value { |house_data|
85
+ house_data.members.each { |member|
86
+ if member.key?('name')
87
+ if member['name']['full_name'].downcase.include?(search_name)
88
+ results << member
89
+ end
90
+ end
91
+ }
92
+ }
93
+ end
94
+
95
+ results
96
+ end
97
+ end
98
+
99
+ end
@@ -0,0 +1,20 @@
1
+ module UkParliament
2
+ # Class representing the House of Commons.
3
+ class Commons
4
+ include UkParliament
5
+
6
+ # Unique identifier for House of Commons.
7
+ HOUSE_ID = 'commons'
8
+ # URL of where to look for the list of Commons members.
9
+ MEMBER_LIST_URL = 'http://www.parliament.uk/mps-lords-and-offices/mps/'
10
+
11
+ # Instance data accessor(s).
12
+ attr_reader :members
13
+
14
+ # Initialise the class populating the Commons member data.
15
+ def initialize(load_from_file = false)
16
+ @members = HouseMembersManager.new(HOUSE_ID, load_from_file).members
17
+ end
18
+ end
19
+
20
+ end
@@ -0,0 +1,44 @@
1
+ module UkParliament
2
+ # Class defining the pipeline process for a scraped document.
3
+ class DocPipeline
4
+ include UkParliament
5
+
6
+ # Initialise the class instance variables.
7
+ def initialize(house_id, document)
8
+ @house_id = house_id
9
+ @document = document
10
+
11
+ define_commons_tasks
12
+ define_lords_tasks
13
+ end
14
+
15
+ private
16
+
17
+ # Define the tasks that will be performed for a commons pipeline.
18
+ def define_commons_tasks
19
+ @commons_tasks = []
20
+ end
21
+
22
+ # Define the tasks that will be performed for a lords pipeline.
23
+ def define_lords_tasks
24
+ @lords_tasks = []
25
+ end
26
+
27
+ protected
28
+
29
+ # Execute the relevant pipeline's tasks.
30
+ def execute
31
+ # TODO We can do this better.
32
+ if @house_id == Commons::HOUSE_ID
33
+ @commons_tasks.each { |function_name|
34
+ send(function_name)
35
+ }
36
+ elsif @house_id == Lords::HOUSE_ID
37
+ @lords_tasks.each { |function_name|
38
+ send(function_name)
39
+ }
40
+ end
41
+ end
42
+ end
43
+
44
+ end
@@ -0,0 +1,11 @@
1
+ module UkParliament
2
+ # Class to load house member data from file.
3
+ class FileHouseMembers < HouseMembers
4
+ # Initialise the parent and load the correct file.
5
+ def initialize(house_id)
6
+ super
7
+
8
+ load_file
9
+ end
10
+ end
11
+ end
@@ -0,0 +1,53 @@
1
+ require 'fileutils'
2
+ require 'json'
3
+
4
+ module UkParliament
5
+ class HouseMembers
6
+ include UkParliament
7
+
8
+ attr_reader :members
9
+
10
+ def initialize(house_id)
11
+ @house_id = house_id
12
+ @members = []
13
+ @config = configuration
14
+ @backup = @config[:backup_before_write]
15
+ end
16
+
17
+ protected
18
+
19
+ # Load a house's .json file from disk.
20
+ def load_file
21
+ filename = File.join(@config[:data_file_path], "#{@house_id}.json")
22
+ raise "'#{filename}' Does not exist. Have you scraped '#{@house_id}' data yet? See README" unless File.exist?(filename)
23
+ json = File.read(filename)
24
+ @members = JSON.parse(json)
25
+ end
26
+
27
+ # Save a new version of a house's .json file to disk, optionally backing
28
+ # up any previous file beforehand.
29
+ def save_file
30
+ if @backup
31
+ backup_file
32
+ end
33
+
34
+ filename = File.join(@config[:data_file_path], "#{@house_id}.json")
35
+ File.open(filename, 'w') do |json_file|
36
+ json_file.write(JSON.pretty_generate(@members))
37
+ end
38
+
39
+ log.info("'#{@house_id}' saved to file")
40
+ end
41
+
42
+ # Back up an existing house's .json file.
43
+ def backup_file
44
+ filename = File.join(@config[:data_file_path], "#{@house_id}.json")
45
+
46
+ if File.exist?(filename)
47
+ backup_filename = "#{filename.chomp('.json')}-#{File.mtime(filename).strftime('%Y%m%d_%H%M%S')}.json"
48
+ FileUtils.cp(filename, backup_filename)
49
+ log.info("Previous '#{@house_id}' file was backed up")
50
+ end
51
+ end
52
+ end
53
+ end
@@ -0,0 +1,19 @@
1
+ module UkParliament
2
+ # Manages creation of the correct member data source class and makes the
3
+ # member data available to the caller.
4
+ class HouseMembersManager
5
+ include UkParliament
6
+
7
+ attr_reader :members
8
+
9
+ # Create the factory class instance and return its member data.
10
+ def initialize(house_id, load_from_file)
11
+ log.info('------------------------------------------------------------')
12
+ data_source_id = load_from_file ? DATA_SOURCE_FILE : DATA_SOURCE_HTTP
13
+ log.info("Using '#{data_source_id}' data source for '#{house_id}' members")
14
+ source = HouseMembersSourceFactory.init_data_source(data_source_id, house_id)
15
+ log.info("'#{house_id}' has #{source.members.size} members")
16
+ @members = source.members
17
+ end
18
+ end
19
+ end
@@ -0,0 +1,18 @@
1
+ module UkParliament
2
+ # Factory taking responsibility for instantiating correct data source class
3
+ # for a given data source ID/house ID pair.
4
+ class HouseMembersSourceFactory
5
+ # Create correct type of class for the IDs passed in.
6
+ def self.init_data_source(data_source_id, house_id)
7
+ source = nil
8
+
9
+ if data_source_id == DATA_SOURCE_FILE
10
+ source = FileHouseMembers.new(house_id)
11
+ elsif data_source_id == DATA_SOURCE_HTTP
12
+ source = HttpHouseMembers.new(house_id)
13
+ end
14
+
15
+ source
16
+ end
17
+ end
18
+ end
@@ -0,0 +1,103 @@
1
+ require 'nokogiri'
2
+ require 'open-uri'
3
+ require 'thread'
4
+
5
+ module UkParliament
6
+ # Class to load house member data from the web.
7
+ class HttpHouseMembers < HouseMembers
8
+ # Initialise our parent class and set about scraping data from the web.
9
+ def initialize(house_id)
10
+ super
11
+
12
+ @q_manager = QueueManager.new(house_id)
13
+
14
+ retrieve_members_list
15
+ assemble_members_data
16
+ end
17
+
18
+ private
19
+
20
+ # Gets the list of house members. Depending on the circumstance, we either
21
+ # just load a list from existing file or we go to the parliament.uk site, and
22
+ # scrape the list from there.
23
+ #
24
+ # In the case of loading the file, the errors processed will be merged into
25
+ # the existing file data, and saved. This behaviour will continue until
26
+ # there are no more errors to process.
27
+ def retrieve_members_list
28
+ if @q_manager.scrape_errors?
29
+ load_file
30
+ else
31
+ scrape_members_list
32
+ end
33
+ end
34
+
35
+ # Scrape a particular house's membership list from its list page.
36
+ def scrape_members_list
37
+ url = (@house_id == Lords::HOUSE_ID) ? Lords::MEMBER_LIST_URL : Commons::MEMBER_LIST_URL
38
+ log.info("Fetching '#{@house_id}' member list from #{url}")
39
+
40
+ document = Nokogiri::HTML(URI.open(url))
41
+ pipeline = MemberListDocPipeline.new(@house_id, document)
42
+ pipeline.house_member_list(@members)
43
+ rescue => e
44
+ log.info("Error retrieving '#{@house_id}' member list, URL #{url}, Exception #{e.message}")
45
+ end
46
+
47
+ # Scrape more detailed house member's info from their specific page.
48
+ def scrape_member_summary(member)
49
+ log.info("Fetching (#{member['id']}) #{member['alphabetical_name']}")
50
+
51
+ document = Nokogiri::HTML(URI.open(member['url']))
52
+ pipeline = MemberSummaryDocPipeline.new(@house_id, document)
53
+ pipeline.enrich_member_data(member)
54
+
55
+ member['timestamp'] = Time.now.strftime('%FT%T%:z')
56
+ rescue => e
57
+ log.info("Error processing '#{@house_id}' member ID #{member['id'].to_s}, URL #{member['url']}, Exception #{e.message}")
58
+ @q_manager.error_queue.push(member['id'].to_s)
59
+ end
60
+
61
+ # Trigger scraping of more detailed house member information and save the
62
+ # results to file.
63
+ def assemble_members_data
64
+ @q_manager.enqueue(@members)
65
+
66
+ process_members_list { |member|
67
+ scrape_member_summary(member)
68
+ }
69
+
70
+ save_file
71
+
72
+ if @q_manager.error_queue_size > 0
73
+ log.info("#{@q_manager.error_queue.length} entries in the error queue to reprocess")
74
+ end
75
+ end
76
+
77
+ # Process the house members list, to retrieve more info about each member.
78
+ # Splits the work across multiple threads, to diminish the time taken.
79
+ def process_members_list
80
+ threads = []
81
+
82
+ @config[:scrape_no_of_threads].times do
83
+ threads << Thread.new do
84
+ until @q_manager.main_queue.empty?
85
+ id = @q_manager.main_queue.pop
86
+
87
+ if id
88
+ member = @members.find { |item|
89
+ item['id'] == id.to_i
90
+ }
91
+
92
+ yield member
93
+
94
+ sleep(@config[:scrape_request_delay])
95
+ end
96
+ end
97
+ end
98
+ end
99
+
100
+ threads.each { |t| t.join }
101
+ end
102
+ end
103
+ end