uk_parliament 0.1.0

Files changed (25)

checksums.yaml +7 -0
data/.gitignore +10 -0
data/.rspec +2 -0
data/.travis.yml +5 -0
data/Gemfile +4 -0
data/LICENSE +674 -0
data/README.md +88 -0
data/Rakefile +15 -0
data/bin/console +14 -0
data/bin/setup +8 -0
data/lib/uk_parliament.rb +99 -0
data/lib/uk_parliament/commons.rb +20 -0
data/lib/uk_parliament/doc_pipeline.rb +44 -0
data/lib/uk_parliament/file_house_members.rb +11 -0
data/lib/uk_parliament/house_members.rb +53 -0
data/lib/uk_parliament/house_members_manager.rb +19 -0
data/lib/uk_parliament/house_members_source_factory.rb +18 -0
data/lib/uk_parliament/http_house_members.rb +103 -0
data/lib/uk_parliament/lords.rb +20 -0
data/lib/uk_parliament/member_list_doc_pipeline.rb +103 -0
data/lib/uk_parliament/member_summary_doc_pipeline.rb +215 -0
data/lib/uk_parliament/queue_manager.rb +104 -0
data/lib/uk_parliament/version.rb +3 -0
data/uk_parliament.gemspec +29 -0
metadata +137 -0

data/README.md ADDED

@@ -0,0 +1,88 @@
+# UkParliament
+A gem that scrapes the contact details (addresses, phone, email, Twitter, Facebook, etc.) of current UK parliamentarians (members of the House of Commons and House of Lords) from the [parliament.uk](http://parliament.uk) web site and stores the data in JSON file(s).
+Each member of the House of Commons and House of Lords has a publicly available profile on the parliament.uk site. Each profile contains a varying amount of contact information, so the results for each member will differ.
+Why scrape this info? There is an API, but getting the required contact info from it doesn't (or didn't) appear particularly straightforward.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'uk_parliament'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install uk_parliament
+## Usage
+Trigger scraping of the data, so that it can be saved to file, either a) by running a Rake task:
+    rake scrape_parliament
+or b) by opening `bin/console` and running:
+    parliament = UkParliament::Parliament.new(false, false)
+This will run for approx. 7-10 minutes (depending on machine, network, etc.) and output two files, `commons.json` and `lords.json`, in the user's `$HOME/uk_parliament` directory.
+Scraping progress is written to a log file, `uk_parliament.log`, in the same `$HOME/uk_parliament` directory.
+To monitor progress, run:
+    tail -f ~/uk_parliament/uk_parliament.log
+The log will tell you which requests, if any, fail, and at the end will tell you how many have failed.
+Failures are recorded in an error queue. Re-run the same command and the contents of the error queue will be used as the source of what to scrape, rather than the whole data set being scraped again.
+Entries from the error queue that are scraped successfully are merged with the previous data in the JSON file(s).
+Errors appear to be few and intermittent. Keep reprocessing the error queue until it is empty.
+Each time you re-run the scraping, the output of the previous run (the `*.json` files) is backed up under a name that includes the timestamp of when the file was written. E.g. `commons.json` will become `commons-20161231_061221.json`.
+Once you have created the output files, running:
+    parliament = UkParliament::Parliament.new
+or:
+    parliament = UkParliament::Parliament.new(true, true)
+will load the data from those files for you to process as you wish.
+For example, you can access just the members of the House of Commons data with:
+    parliament.houses[:commons].members
+The first entry in that list of members is then:
+    irb(main):013:0> parliament.houses[:commons].members[0]
+    => {"alphabetical_name"=>"Abbott, Ms Diane", ...}
+For quick results, there is a _very_ basic member name lookup that can be run with:
+    irb(main):007:0> parliament.parliamentarians_named('Ahmed')
+    => [{"alphabetical_name"=>"Ahmed-Sheikh, Ms Tasmina", ...}, {"alphabetical_name"=>"Ahmed, Lord", ...}]
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at [https://github.com/a-sansom/uk_parliament](https://github.com/a-sansom/uk_parliament).
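
The README says the scraped data ends up as plain JSON in `$HOME/uk_parliament`, so it can also be read without the gem. The following is a minimal sketch of that, using only Ruby's standard library. `load_members` is a hypothetical helper, not part of the gem, and the temporary file is a stand-in for `~/uk_parliament/commons.json` holding a cut-down copy of the record shown in the README's `irb` output. Unlike the gem, which raises when the file is missing, this sketch returns an empty list.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: read a house's members from a JSON file.
# Returns an empty array when the file has not been created yet.
def load_members(path)
  return [] unless File.exist?(path)

  JSON.parse(File.read(path))
end

# Stand-in for ~/uk_parliament/commons.json so the sketch runs anywhere.
sample_path = File.join(Dir.mktmpdir, 'commons.json')
File.write(sample_path, JSON.generate([{ 'alphabetical_name' => 'Abbott, Ms Diane' }]))

members = load_members(sample_path)
names = members.map { |member| member['alphabetical_name'] }
puts names
```

To read real data, the path would be `File.join(Dir.home, 'uk_parliament', 'commons.json')`, assuming a scrape has already been run.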

data/Rakefile ADDED

@@ -0,0 +1,15 @@
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+require "uk_parliament"
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec
+task :scrape_parliament do
+  puts 'Scraping Parliament data'
+  puts 'This will take some time, approx. 10 mins'
+  puts "Check/tail the log file in your user '$HOME/uk_parliament' directory for progress."
+  parliament = UkParliament::Parliament.new(false, false)
+  puts "Finished scraping. Check user '$HOME/uk_parliament' directory for .json files."
+end

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "uk_parliament"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/lib/uk_parliament.rb ADDED

@@ -0,0 +1,99 @@
+require 'uk_parliament/version'
+require 'uk_parliament/commons'
+require 'uk_parliament/doc_pipeline'
+require 'uk_parliament/house_members'
+require 'uk_parliament/file_house_members'
+require 'uk_parliament/house_members_manager'
+require 'uk_parliament/house_members_source_factory'
+require 'uk_parliament/http_house_members'
+require 'uk_parliament/lords'
+require 'uk_parliament/member_list_doc_pipeline'
+require 'uk_parliament/member_summary_doc_pipeline'
+require 'uk_parliament/queue_manager'
+require 'logger'
+# Module defining classes and methods enabling scraping of UK Parliament
+# members contact data from parliament.uk web site, or loading of scraped
+# data from file.
+module UkParliament
+  # Constants representing where data can come from.
+  DATA_SOURCE_FILE = 'file'
+  DATA_SOURCE_HTTP = 'http'
+  # Set up module-wide access to the file logger.
+  def log
+    UkParliament.log
+  end
+  # Set up a Logger instance, if one doesn't already exist.
+  def self.log
+    if @log.nil?
+      config = configuration
+      @log = Logger.new(File.join(config[:log_file_path], 'uk_parliament.log'), 'daily')
+      @log.level = Logger::INFO
+    end
+    @log
+  end
+  # Set up module-wide access to a set of configuration values.
+  def configuration
+    UkParliament.configuration
+  end
+  # Define set of configuration values for the module.
+  def self.configuration
+    if @configuration.nil?
+      base_dir = File.join(Dir.home, 'uk_parliament')
+      FileUtils.mkdir_p(base_dir) unless Dir.exist?(base_dir)
+      @configuration = {
+        :log_file_path => base_dir,
+        :data_file_path => base_dir,
+        :queue_file_path => base_dir,
+        :scrape_no_of_threads => 4,
+        :scrape_request_delay => 2,
+        :backup_before_write => true
+      }
+    end
+    @configuration
+  end
+  # Class representing Parliament.
+  class Parliament
+    include UkParliament
+    # Instance data accessor(s).
+    attr_reader :houses
+    # Initialise the class instance variables.
+    def initialize(load_commons_file = true, load_lords_file = true)
+      @houses = {
+        :commons => Commons.new(load_commons_file),
+        :lords => Lords.new(load_lords_file)
+      }
+    end
+    # Simple lookup of members with a particular name (or part of).
+    def parliamentarians_named(search_name)
+      search_name = search_name.strip.downcase
+      results = []
+      if search_name.size > 1
+        @houses.each_value { |house_data|
+          house_data.members.each { |member|
+            if member.key?('name')
+              if member['name']['full_name'].downcase.include?(search_name)
+                results << member
+              end
+            end
+          }
+        }
+      end
+      results
+    end
+  end
+end
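
The name lookup in `parliamentarians_named` is a case-insensitive substring match that ignores search terms shorter than two characters. A minimal sketch of the same idea on plain hashes follows, written with `flatten` and `select` instead of nested loops. `members_named` and the sample records are hypothetical; only the `'name'`/`'full_name'` keys are taken from the code above.

```ruby
# Case-insensitive substring search across several lists of member hashes.
# Members without a name entry are skipped rather than raising.
def members_named(houses, search_name)
  needle = search_name.strip.downcase
  return [] if needle.size < 2

  houses.values.flatten(1).select do |member|
    full_name = member.dig('name', 'full_name')
    full_name && full_name.downcase.include?(needle)
  end
end

# Hypothetical sample data shaped like the hashes the lookup expects.
houses = {
  commons: [{ 'name' => { 'full_name' => 'Ms Example Ahmed-Jones' } }],
  lords: [{ 'name' => { 'full_name' => 'Lord Example Ahmed' } }, { 'id' => 3 }]
}

matches = members_named(houses, ' ahmed ')
puts matches.size
```

Using `dig` means a record with a `'name'` key but no `'full_name'` is skipped, whereas the nested `member['name']['full_name']` access would raise on `nil`.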

data/lib/uk_parliament/commons.rb ADDED

@@ -0,0 +1,20 @@
+module UkParliament
+  # Class representing the House of Commons.
+  class Commons
+    include UkParliament
+    # Unique identifier for House of Commons.
+    HOUSE_ID = 'commons'
+    # URL of where to look for the list of Commons members.
+    MEMBER_LIST_URL = 'http://www.parliament.uk/mps-lords-and-offices/mps/'
+    # Instance data accessor(s).
+    attr_reader :members
+    # Initialise the class populating the Commons member data.
+    def initialize(load_from_file = false)
+      @members = HouseMembersManager.new(HOUSE_ID, load_from_file).members
+    end
+  end
+end

data/lib/uk_parliament/doc_pipeline.rb ADDED

@@ -0,0 +1,44 @@
+module UkParliament
+  # Class defining the pipeline process for a scraped document.
+  class DocPipeline
+    include UkParliament
+    # Initialise the class instance variables.
+    def initialize(house_id, document)
+      @house_id = house_id
+      @document = document
+      define_commons_tasks
+      define_lords_tasks
+    end
+    private
+    # Define the tasks that will be performed for a commons pipeline.
+    def define_commons_tasks
+      @commons_tasks = []
+    end
+    # Define the tasks that will be performed for a lords pipeline.
+    def define_lords_tasks
+      @lords_tasks = []
+    end
+    protected
+    # Execute the relevant pipeline's tasks.
+    def execute
+      # TODO We can do this better.
+      if @house_id == Commons::HOUSE_ID
+        @commons_tasks.each { |function_name|
+          send(function_name)
+        }
+      elsif @house_id == Lords::HOUSE_ID
+        @lords_tasks.each { |function_name|
+          send(function_name)
+        }
+      end
+    end
+  end
+end
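
`DocPipeline#execute` runs an ordered list of method names with `send`, choosing the list by house ID. One possible way to address the TODO is to key the task lists by house ID in a single hash, sketched below. `SketchPipeline`, its task names and the `steps_run` accessor are all hypothetical and exist only to make the example observable; this is not how the gem's subclasses are written.

```ruby
# Minimal task pipeline: each house ID maps to an ordered list of method
# names, and execute calls them in turn with send.
class SketchPipeline
  attr_reader :steps_run

  def initialize(house_id)
    @house_id = house_id
    @steps_run = []
    # Hypothetical task lists; real pipelines would define their own.
    @tasks = {
      'commons' => [:strip_whitespace, :extract_names],
      'lords' => [:extract_names]
    }
  end

  # Run the tasks registered for this house; unknown houses run nothing.
  def execute
    @tasks.fetch(@house_id, []).each { |task| send(task) }
    self
  end

  private

  def strip_whitespace
    @steps_run << :strip_whitespace
  end

  def extract_names
    @steps_run << :extract_names
  end
end

puts SketchPipeline.new('commons').execute.steps_run.inspect
```

Because `send` bypasses visibility checks, the task methods can stay private, as they are here.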

data/lib/uk_parliament/file_house_members.rb ADDED

@@ -0,0 +1,11 @@
+module UkParliament
+  # Class to load house member data from file.
+  class FileHouseMembers < HouseMembers
+    # Initialise the parent and load the correct file.
+    def initialize(house_id)
+      super
+      load_file
+    end
+  end
+end

data/lib/uk_parliament/house_members.rb ADDED

@@ -0,0 +1,53 @@
+require 'fileutils'
+require 'json'
+module UkParliament
+  class HouseMembers
+    include UkParliament
+    attr_reader :members
+    def initialize(house_id)
+      @house_id = house_id
+      @members = []
+      @config = configuration
+      @backup = @config[:backup_before_write]
+    end
+    protected
+    # Load a house's .json file from disk.
+    def load_file
+      filename = File.join(@config[:data_file_path], "#{@house_id}.json")
+      raise "'#{filename}' Does not exist. Have you scraped '#{@house_id}' data yet? See README" unless File.exist?(filename)
+      json = File.read(filename)
+      @members = JSON.parse(json)
+    end
+    # Save a new version of a house's .json file to disk, optionally backing
+    # up any previous file beforehand.
+    def save_file
+      if @backup
+        backup_file
+      end
+      filename = File.join(@config[:data_file_path], "#{@house_id}.json")
+      File.open(filename, 'w') do |json_file|
+        json_file.write(JSON.pretty_generate(@members))
+      end
+      log.info("'#{@house_id}' saved to file")
+    end
+    # Back up an existing house's .json file.
+    def backup_file
+      filename = File.join(@config[:data_file_path], "#{@house_id}.json")
+      if File.exist?(filename)
+        backup_filename = "#{filename.chomp('.json')}-#{File.mtime(filename).strftime('%Y%m%d_%H%M%S')}.json"
+        FileUtils.cp(filename, backup_filename)
+        log.info("Previous '#{@house_id}' file was backed up")
+      end
+    end
+  end
+end
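
Backup names such as `commons-20161231_061221.json` combine the original basename, a timestamp and the extension. The sketch below derives the name from the basename and extension only, so it stays correct when a directory in the path contains a dot (for example a home directory like `/home/jane.doe`). `backup_filename` is a hypothetical standalone helper, and the path and timestamp are made up for illustration.

```ruby
# Build "<name>-<YYYYmmdd_HHMMSS><ext>" next to the original file, touching
# only the basename so dots elsewhere in the path are left alone.
def backup_filename(filename, timestamp)
  ext = File.extname(filename)
  base = File.basename(filename, ext)
  stamp = timestamp.strftime('%Y%m%d_%H%M%S')
  File.join(File.dirname(filename), "#{base}-#{stamp}#{ext}")
end

stamp = Time.new(2016, 12, 31, 6, 12, 21)
puts backup_filename('/home/jane.doe/uk_parliament/commons.json', stamp)
# prints "/home/jane.doe/uk_parliament/commons-20161231_061221.json"
```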

data/lib/uk_parliament/house_members_manager.rb ADDED

@@ -0,0 +1,19 @@
+module UkParliament
+  # Manages creation of the correct member data source class and makes the
+  # member data available to the caller.
+  class HouseMembersManager
+    include UkParliament
+    attr_reader :members
+    # Create the factory class instance and return its member data.
+    def initialize(house_id, load_from_file)
+      log.info('------------------------------------------------------------')
+      data_source_id = load_from_file ? DATA_SOURCE_FILE : DATA_SOURCE_HTTP
+      log.info("Using '#{data_source_id}' data source for '#{house_id}' members")
+      source = HouseMembersSourceFactory.init_data_source(data_source_id, house_id)
+      log.info("'#{house_id}' has #{source.members.size} members")
+      @members = source.members
+    end
+  end
+end

data/lib/uk_parliament/house_members_source_factory.rb ADDED

@@ -0,0 +1,18 @@
+module UkParliament
+  # Factory taking responsibility for instantiating correct data source class
+  # for a given data source ID/house ID pair.
+  class HouseMembersSourceFactory
+    # Create correct type of class for the IDs passed in.
+    def self.init_data_source(data_source_id, house_id)
+      source = nil
+      if data_source_id == DATA_SOURCE_FILE
+        source = FileHouseMembers.new(house_id)
+      elsif data_source_id == DATA_SOURCE_HTTP
+        source = HttpHouseMembers.new(house_id)
+      end
+      source
+    end
+  end
+end
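
The factory returns `nil` for an unrecognised data source ID, which the caller would only notice when it calls `members` on the result. A hedged alternative is a lookup table that fails immediately, sketched here. `SketchFileSource`, `SketchHttpSource` and `SKETCH_SOURCES` are stand-ins invented for this example; only the `'file'` and `'http'` IDs come from the module's constants.

```ruby
# Stand-in source classes so the sketch runs on its own.
SketchFileSource = Struct.new(:house_id)
SketchHttpSource = Struct.new(:house_id)

# Map each data source ID to the class that handles it.
SKETCH_SOURCES = { 'file' => SketchFileSource, 'http' => SketchHttpSource }.freeze

# Instantiate the source for the given ID, or raise for an unknown ID
# instead of returning nil.
def init_data_source(data_source_id, house_id)
  source_class = SKETCH_SOURCES.fetch(data_source_id) do
    raise ArgumentError, "Unknown data source '#{data_source_id}'"
  end
  source_class.new(house_id)
end

puts init_data_source('file', 'commons').class
```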

data/lib/uk_parliament/http_house_members.rb ADDED

@@ -0,0 +1,103 @@
+require 'nokogiri'
+require 'open-uri'
+require 'thread'
+module UkParliament
+  # Class to load house member data from the web.
+  class HttpHouseMembers < HouseMembers
+    # Initialise our parent class and set about scraping data from the web.
+    def initialize(house_id)
+      super
+      @q_manager = QueueManager.new(house_id)
+      retrieve_members_list
+      assemble_members_data
+    end
+    private
+    # Gets the list of house members. Depending on the circumstances, we either
+    # just load the list from an existing file or we go to the parliament.uk
+    # site and scrape the list from there.
+    #
+    # In the case of loading the file, the errors processed will be merged into
+    # the existing file data, and saved. This behaviour will continue until
+    # there are no more errors to process.
+    def retrieve_members_list
+      if @q_manager.scrape_errors?
+        load_file
+      else
+        scrape_members_list
+      end
+    end
+    # Scrape a particular house's membership list from its list page.
+    def scrape_members_list
+      url = (@house_id == Lords::HOUSE_ID) ? Lords::MEMBER_LIST_URL : Commons::MEMBER_LIST_URL
+      log.info("Fetching '#{@house_id}' member list from #{url}")
+      document = Nokogiri::HTML(open(url))
+      pipeline = MemberListDocPipeline.new(@house_id, document)
+      pipeline.house_member_list(@members)
+    rescue => e
+      log.info("Error retrieving '#{@house_id}' member list, URL #{url}, Exception #{e.message}")
+    end
+    # Scrape more detailed house member's info from their specific page.
+    def scrape_member_summary(member)
+      log.info("Fetching (#{member['id']}) #{member['alphabetical_name']}")
+      document = Nokogiri::HTML(open(member['url']))
+      pipeline = MemberSummaryDocPipeline.new(@house_id, document)
+      pipeline.enrich_member_data(member)
+      member['timestamp'] = Time.now.strftime('%FT%T%:z')
+    rescue => e
+      log.info("Error processing '#{@house_id}' member ID #{member['id'].to_s}, URL #{member['url']}, Exception #{e.message}")
+      @q_manager.error_queue.push(member['id'].to_s)
+    end
+    # Trigger scraping of more detailed house member information and save the
+    # results to file.
+    def assemble_members_data
+      @q_manager.enqueue(@members)
+      process_members_list { |member|
+        scrape_member_summary(member)
+      }
+      save_file
+      if @q_manager.error_queue_size > 0
+        log.info("#{@q_manager.error_queue.length} entries in the error queue to reprocess")
+      end
+    end
+    # Process the house members list, to retrieve more info about each member.
+    # Splits the work across multiple threads, to diminish the time taken.
+    def process_members_list
+      threads = []
+      @config[:scrape_no_of_threads].times do
+        threads << Thread.new do
+          until @q_manager.main_queue.empty?
+            id = @q_manager.main_queue.pop
+            if id
+              member = @members.find { |item|
+                item['id'] == id.to_i
+              }
+              yield member
+              sleep(@config[:scrape_request_delay])
+            end
+          end
+        end
+      end
+      threads.each { |t| t.join }
+    end
+  end
+end
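
`process_members_list` shares one queue of member IDs between several threads. Assuming `main_queue` behaves like Ruby's `Queue`, checking `empty?` and then calling a blocking `pop` leaves a small window in which another thread can take the last item. The sketch below shows one way to avoid that with the non-blocking `pop(true)`, which raises `ThreadError` on an empty queue. `drain` and the sample IDs are hypothetical, and the request delay is omitted.

```ruby
# Drain a Queue with a fixed number of worker threads, passing each item to
# the caller's block and collecting the block's results.
def drain(queue, thread_count)
  results = Queue.new
  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          item = queue.pop(true) # non-blocking: raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << yield(item)
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

queue = Queue.new
%w[1 2 3 4 5].each { |id| queue << id }
processed = drain(queue, 4) { |id| id.to_i * 10 }
puts processed.sort.inspect
# prints "[10, 20, 30, 40, 50]"
```

Results arrive in whatever order the threads finish, which is why the example sorts them before printing.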