slaw 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +4 -4
data/README.md +154 -17
data/lib/slaw/act.rb +151 -20
data/lib/slaw/bylaw.rb +36 -20
data/lib/slaw/schemas/akomantoso20.xsd +6834 -0
data/lib/slaw/schemas/xml.xsd +120 -0
data/lib/slaw/version.rb +1 -1
data/spec/bylaw_spec.rb +68 -0
data/spec/fixtures/community-fire-safety.xml +3838 -0
metadata +8 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 405a0b941536c74c13588e1bfb4350c566337626
-  data.tar.gz: 809e1fd9fd4ada655d3531b4eb31702398490a42
+  metadata.gz: 30603c7c9387a2f1c2fc9d617f667b41824e0b68
+  data.tar.gz: 2b153cb4679f469f4b0b18e4ba8b6da239d69016
 SHA512:
-  metadata.gz: a636be697e3589db697232bc01876a864ee8c02eb4548232b9db8addc2c3d9fb0a5004ffeb94f12494e88889b2954c3edd61100a865ce9329b5eddad7381fbe8
-  data.tar.gz: 57e78d5489aa950436b2e7dc3ebe7d19a639b86c3f4fa3b95add5edfa9adbb0185b6f191a012f74ba4fc35100432096ac1ea09b928327afa214aaedd1a2c070c
+  metadata.gz: a1fb11223dfbd14614eafaf1436e2b73c17fbdbc3fc0511f54492e3cab616c0a1db4f6ebdff0d0e56b498c23f3de1c8ce81ff7cd67e1784253dd6f22457976a2
+  data.tar.gz: 5d46d60e58c26cc44fef10a23f81de366361d538bfe35f649b6f6a81ea121ade63785d76d081fba26fc6413981ba831771f1c32aa5014bb8a02159122f3f40d5

data/README.md CHANGED

@@ -1,7 +1,18 @@
 # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
-Slaw is a lightweight library for rendering and generating Akoma Ntoso acts from plain text and PDF documents.
-It is used to power [openbylaws.org.za](http://openbylaws.org.za).
+Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
+and uses grammars developed for South African acts and by-laws.
+Slaw allows you to:
+1. extract plain text from PDFs and clean up that text
+2. parse plain text and transform it into an Akoma Ntoso Act XML document
+3. render the XML document into HTML
+Slaw is lightweight because it wraps around a Nokogiri XML representation of
+the parsed document. It provides some support methods for manipulating these
+documents, but anything advanced must manipulate the XML directly.
 ## Installation
@@ -13,37 +24,163 @@ And then execute:
     $ bundle
-Or install it yourself as:
+Or install it with:
     $ gem install slaw
-## Usage
+To run PDF extraction you will also need [xpdf](http://www.foolabs.com/xpdf/).
+If you're on a Mac, you can use:
-TODO: Write usage instructions here
+    brew install xpdf
-### Extracting text from PDFs
+## Overview
-You will need [xpdf](http://www.foolabs.com/xpdf/) to run PDF extraction. If you're
-on a Mac you can use
+Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
+standard for legislative documents. It first parses plain text using a grammar
+and then generates XML from the resulting syntax tree.
-    brew install xpdf
+Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
+for extracting and cleaning up text from PDFs before parsing it. Extracting text from
+PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of
+rules-of-thumb for correcting these. These rules are based on South African
+by-laws and may not be suitable for all regions.
+The grammar is expressed as a [Treetop](https://github.com/nathansobo/treetop/) grammar
+and has been developed specifically for the format of South African acts and by-laws.
+Grammars for other regions could be developed depending on the complexity of a region's
+formats.
-Extracting PDFs often break lines in odd places (or doesn't break them when it should). Slaw gets around
-this by running some cleanup routines on the extracted text.
+The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering --
+so Slaw performs some post-processing on the XML produced by the parser. In particular,
+it nests lists correctly and looks for specially defined terms and their occurrences in the document.
+## Quick Start
+Install the gem using
+    gem install slaw
+Extract text from a PDF and parse it as a South African by-law:
 ```ruby
+require 'slaw'
+# extract text from a PDF file and clean it up
 extractor = Slaw::Extract::Extractor.new
+text = extractor.extract_from_pdf('/path/to/file.pdf')
-# to guess the filetype by extension
-text = extractor.extract_from_file('/path/to/file.pdf')
+# parse the text into XML and print it
+generator = Slaw::ZA::ByLawGenerator.new
+bylaw = generator.generate_from_text(text)
+puts bylaw.to_xml(indent: 2)
-# or if you know it's a PDF
-text = extractor.extract_from_pdf('/path/to/file.pdf')
+# render the by-law as HTML, using / as the root
+# for relative URLs
+renderer = Slaw::Render::HTMLRenderer.new
+puts renderer.render(bylaw.doc, '/')
+```
+## Extraction
+Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
+PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
+the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
+text. As a result, the extraction can produce oddities, such as lines broken in weird
+places (or not broken when they should be). Slaw gets around this by running
+some cleanup routines on the extracted text.
+For example, it knows that these lines:
+    (b) any wall, swimming pool, reservoir or bridge
+    or any other structure connected therewith; (c) any fuel pump or any
+    tank used in connection therewith
+should probably be broken at the section numbers:
+    (b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
+    (c) any fuel pump or any tank used in connection therewith
+If your region's numbering format differs significantly from this, these rules might not work.
+Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):
-# You can also "extract" text from a plain-text file
-text = extractor.extract_from_text('/path/to/file.txt')
+* changing newlines to `\n`, and normalising quotation characters
+* removing page numbers and other boilerplate
+* stripping the table of contents (we can generate our own from the parsed document)
+* changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
+## Parsing
+Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
+tree, each node of which knows how to serialize itself in XML format.
+While most South African by-laws are superficially very similar, there are sufficient differences
+in their typesetting to make parsing them difficult. The grammar handles most
+edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
+between section titles before and after a section number:
+    1. Definitions
+    In this by-law, the following words ...
+    Definitions
+    1. In this by-law, the following words ...
+This must be set by the user before parsing.
+The parser does its best not to choke on input it doesn't understand, preferring a best effort
+to a completely accurate result. For example it may not be able to work out a section heading
+and so will treat it as simply another statement in the previous section. This causes the parser
+to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.
+The grammar supports a number of subsection numbering formats, which are often mixed
+in a document to indicate different levels of nesting.
+    (a)
+    (2)
+    (3b)
+    (ii)
+    3.4
+During post-processing it works out how to nest these appropriately.
+For more information see the South African by-law grammar at
+[lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
+at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).
+## Rendering
+Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
+Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
+elements with a class attribute derived from the name of the AN element and an ID attribute taken
+from the node, if any. This makes it both fast and flexible, since it's easy to
+apply layout rules with CSS.
+Slaw can render either an entire document like this, or just a portion of the XML tree.
+## Meta-data
+Acts and by-laws have metadata which it is not possible to get from their plain text representations,
+such as their title, date and format of publication or act number. Slaw provides some helpers
+for manipulating this meta-data. For example,
+```ruby
+bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
+print bylaw.id_uri
+bylaw.title = 'A new title'
+bylaw.name = 'a-new-title'
+bylaw.published!(date: '2014-09-28')
+print bylaw.id_uri
 ```
+## Schedules
+South African acts and by-laws can have addendums called schedules. They are technically a part of
+the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
+by looking for section headings, but makes no effort to capture the format of their contents.
+Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
+Akoma Ntoso `component` element at the end of the XML document, with a name of `schedules`.
 ## Contributing
 1. Fork it at http://github.com/longhotsummer/slaw/fork
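
Reviewer's sketch for the README's Extraction section: the re-breaking of wrapped lines at item numbers can be illustrated in plain Ruby. This is not the code in `Slaw::Parse::Cleanser`. The method name `rebreak_at_item_numbers` and its regular expression are assumptions, chosen only to reproduce the README's `(b)` / `(c)` example.

```ruby
# Hypothetical helper, not part of Slaw: join wrapped lines into one string,
# then start a new line wherever a clause ending in ";" is followed by an
# item number such as "(c)".
def rebreak_at_item_numbers(text)
  joined = text.lines.map(&:strip).reject(&:empty?).join(' ')
  joined.gsub(/; (?=\([a-z]+\) )/, ";\n")
end

extracted = <<~TEXT
  (b) any wall, swimming pool, reservoir or bridge
  or any other structure connected therewith; (c) any fuel pump or any
  tank used in connection therewith
TEXT

puts rebreak_at_item_numbers(extracted)
```

A rule this narrow only handles lowercase letter items that follow a semicolon, which is one reason the README warns that the cleanup rules may not suit other regions' numbering formats.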

data/lib/slaw/act.rb CHANGED

@@ -18,25 +18,31 @@ module Slaw
     attr_accessor :doc
     # [Nokogiri::XML::Node] The `meta` XML node
-    attr_accessor :meta
+    attr_reader :meta
     # [Nokogiri::XML::Node] The `body` XML node
-    attr_accessor :body
+    attr_reader :body
     # [String] The year this act was published
-    attr_accessor :year
+    attr_reader :year
     # [String] The act number in the year this act was published
-    attr_accessor :num
+    attr_reader :num
     # [String] The FRBR URI of this act, which uniquely identifies it globally
-    attr_accessor :id_uri
+    attr_reader :id_uri
     # [String, nil] The source filename, or nil
-    attr_accessor :filename
+    attr_reader :filename
     # [Time, nil] The mtime of when the source file was last modified
-    attr_accessor :mtime
+    attr_reader :mtime
+    # [String] The underlying nature of this act, usually `act` although subclasses may override this.
+    attr_reader :nature
+    # [Nokogiri::XML::Schema] schema to validate against
+    attr_accessor :schema
     # Get the act that wraps the document that owns this XML node
     # @param node [Nokogiri::XML::Node]
@@ -49,6 +55,7 @@ module Slaw
     # @param filename [String] filename to load XML from
     def initialize(filename=nil)
       self.load(filename) if filename
+      @schema = nil
     end
     # Load the XML in `filename` into this instance
@@ -60,8 +67,9 @@ module Slaw
       File.open(filename) { |f| parse(f) }
     end
-    # Parse the XML contained in the file-like object `io`
-    # @param io [file-like] io object with XML
+    # Parse the XML contained in the file-like or String object `io`
+    #
+    # @param io [String, file-like] io object or String with XML
     def parse(io)
       self.doc = Nokogiri::XML(io)
     end
@@ -76,26 +84,90 @@ module Slaw
       @@acts[@doc] = self
-      _extract_id
+      extract_id_uri
     end
-    # Parse the FRBR Uri into its constituent parts
-    def _extract_id
-      @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
-      empty, @country, type, date, @num = @id_uri.split('/')
+    # Directly set the FRBR URI of this act. This must be a well-formed URI,
+    # such as `/za/act/2002/2`. This will, in turn, update the {#year}, {#nature},
+    # {#country} and {#num} attributes.
+    #
+    # You probably don't want to use this method. Instead, set each component
+    # (such as {#date}) manually.
+    #
+    # @param uri [String] new URI
+    def id_uri=(uri)
+      for component, xpath in [['main',      '//a:act/a:meta/a:identification'],
+                               ['schedules', '//a:component/a:doc/a:meta/a:identification']] do
+        ident = @doc.at_xpath(xpath, a: NS)
+        next if not ident
+        # work
+        ident.at_xpath('a:FRBRWork/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}"
+        ident.at_xpath('a:FRBRWork/a:FRBRuri', a: NS)['value'] = uri
+        # expression
+        ident.at_xpath('a:FRBRExpression/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
+        ident.at_xpath('a:FRBRExpression/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
+        # manifestation
+        ident.at_xpath('a:FRBRManifestation/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
+        ident.at_xpath('a:FRBRManifestation/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
+      end
-      # yyyy-mm-dd
-      @year = date.split('-', 2)[0]
+      extract_id_uri
+    end
+    # The date at which this act was first created/promulgated.
+    #
+    # @return [String] date, YYYY-MM-DD
+    def date
+      node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRdate[@name="Generation"]', a: NS)
+      node && node['date']
+    end
+    # Set the date at which this act was first created/promulgated. This is usually the same
+    # as the publication date but this is not enforced.
+    #
+    # This also updates the {#year} of this act, which in turn updates the {#id_uri}.
+    #
+    # @param value [String] date, YYYY-MM-DD
+    def date=(value)
+      for frbr in ['FRBRWork', 'FRBRExpression'] do
+        @meta.at_xpath("./a:identification/a:#{frbr}/a:FRBRdate[@name=\"Generation\"]", a: NS)['date'] = value
+      end
+      self.year = value.split('-')[0]
+    end
+    # Set the year for this act. You probably want to call {#date=} instead.
+    #
+    # This will also update the {#id_uri} but will not change {#date} at all.
+    #
+    # @param year [String, Number] year
+    def year=(year)
+      @year = year.to_s
+      rebuild_id_uri
     end
     # An applicable short title for this act, either from the `FRBRalias` element
     # or based on the act number and year.
     # @return [String]
-    def short_title
+    def title
       node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
       node ? node['value'] : "Act #{num} of #{year}"
     end
+    # Change the title of this act.
+    def title=(value)
+      node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
+      unless node
+        node = @doc.create_element('FRBRalias')
+        @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS).after(node)
+      end
+      node['value'] = value
+    end
     # Has this act been amended? This is determined by testing the `contains`
     # attribute of the `act` root element.
     #
@@ -250,6 +322,24 @@ module Slaw
       @meta.at_xpath('./a:publication', a: NS)
     end
+    # Update the publication details of the act. All elements are optional.
+    #
+    # @option details [String] :name name of the publication
+    # @option details [String] :number publication number
+    # @option details [String] :date date of publication (YYYY-MM-DD)
+    def published!(details)
+      node = @meta.at_xpath('./a:publication', a: NS)
+      unless node
+        node = @doc.create_element('publication')
+        @meta.at_xpath('./a:identification', a: NS).after(node)
+      end
+      node['showAs'] = details[:name] if details.has_key? :name
+      node['name'] = details[:name] if details.has_key? :name
+      node['date'] = details[:date] if details.has_key? :date
+      node['number'] = details[:number] if details.has_key? :number
+    end
     # Has this by-law been repealed?
     #
     # @return [Boolean]
@@ -297,14 +387,55 @@ module Slaw
       node && node['date']
     end
-    # The underlying nature of this act, usually `act` although subclasses my override this.
-    def nature
-      "act"
+    # Validate the XML behind this document against the Akoma Ntoso schema and return
+    # any errors.
+    #
+    # @return [Object] array of errors, possibly empty
+    def validate
+      @schema ||= Dir.chdir(File.dirname(__FILE__) + "/schemas") { Nokogiri::XML::Schema(File.read('akomantoso20.xsd')) }
+      @schema.validate(@doc)
+    end
+    # Does this document validate against the schema?
+    #
+    # @see {#validate}
+    def validates?
+      validate.empty?
+    end
+    # Serialise the XML for this act, passing `args` to the Nokogiri serialiser.
+    # The most useful argument is usually `indent: 2` if you like your XML perdy.
+    #
+    # @return [String] serialized XML
+    def to_xml(*args)
+      @doc.to_xml(*args)
     end
     def inspect
       "<#{self.class.name} @id_uri=\"#{@id_uri}\">"
     end
+    protected
+    # Parse the FRBR Uri into its constituent parts
+    def extract_id_uri
+      @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
+      empty, @country, @nature, date, @num = @id_uri.split('/')
+      # yyyy-mm-dd
+      @year = date.split('-', 2)[0]
+    end
+    def build_id_uri
+      # /za/act/2002/3
+      "/#{@country}/#{@nature}/#{@year}/#{@num}"
+    end
+    # This rebuilds the FRBR URI for this document using its constituent components. It will
+    # update the XML then re-split the URI and grab its components.
+    def rebuild_id_uri
+      self.id_uri = build_id_uri
+    end
   end
 end
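
Reviewer's sketch of the split-and-rebuild pattern that `extract_id_uri` and `build_id_uri` introduce in this file. It works on a bare string and touches no XML. `ActUri` is a hypothetical stand-in for illustration, not a class in Slaw.

```ruby
# Hypothetical stand-in for the URI handling in this diff.
ActUri = Struct.new(:country, :nature, :year, :num) do
  # Split a URI such as /za/act/2002/2 into its components.
  def self.parse(uri)
    _empty, country, nature, date, num = uri.split('/')
    # the date component may be yyyy or yyyy-mm-dd; keep only the year
    new(country, nature, date.split('-', 2)[0], num)
  end

  # Rebuild the URI from the current components.
  def to_s
    "/#{country}/#{nature}/#{year}/#{num}"
  end
end

uri = ActUri.parse('/za/act/2002/2')
uri.year = '2003' # changing one component changes the rebuilt URI
puts uri.to_s
```

In the diff itself the rebuilt URI is also written back into the FRBR elements of the XML before being re-split, which this sketch leaves out.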

data/lib/slaw/bylaw.rb CHANGED

@@ -7,40 +7,56 @@ module Slaw
   # is not identified by a year and a number, and therefore has a different FRBR uri structure.
   class ByLaw < Act
-    # [String] The region this by-law applies to
-    attr_accessor :region
+    # [String] The code of the region this by-law applies to
+    attr_reader :region
     # [String] A short file-like name of this by-law, unique within its year and region
-    attr_accessor :name
-    def _extract_id
-      # /za/by-law/cape-town/2010/public-parks
-      @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
-      empty, @country, type, @region, date, @name = @id_uri.split('/')
-      # yyyy[-mm-dd]
-      @year = date.split('-', 2)[0]
-    end
+    attr_reader :name
     # ByLaws don't have numbers, use their short-name instead
     def num
       name
     end
-    def short_title
+    def title
       node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
-      short_title = node ? node['value'] : "(Unknown)"
+      title = node ? node['value'] : "(Unknown)"
-      if amended? and not short_title.end_with?("as amended")
-        short_title = short_title + " as amended"
+      if amended? and not title.end_with?("as amended")
+        title = title + " as amended"
       end
-      short_title
+      title
+    end
+    # Set the short (file-like) name for this bylaw. This changes the {#id_uri}.
+    def name=(value)
+      @name = value
+      rebuild_id_uri
     end
-    def nature
-      "by-law"
+    # Set the region code for this bylaw. This changes the {#id_uri}.
+    def region=(value)
+      @region = value
+      rebuild_id_uri
     end
+    protected
+    def extract_id_uri
+      # /za/by-law/cape-town/2010/public-parks
+      @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
+      empty, @country, @nature, @region, date, @name = @id_uri.split('/')
+      # yyyy[-mm-dd]
+      @year = date.split('-', 2)[0]
+    end
+    def build_id_uri
+      # /za/by-law/cape-town/2010/public-parks
+      "/#{@country}/#{@nature}/#{@region}/#{@year}/#{@name}"
+    end
   end
 end
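
Reviewer's sketch of how the new `name=` and `region=` setters keep the by-law URI in step with its components. `ByLawUri` is a hypothetical plain-Ruby class used for illustration. It assumes the six-part URI layout shown in the diff's comment and does not update any XML.

```ruby
# Hypothetical stand-in for the by-law URI logic; not Slaw's ByLaw class.
class ByLawUri
  attr_reader :region, :year, :name, :uri

  def initialize(uri)
    @uri = uri
    # /za/by-law/cape-town/2010/public-parks
    _empty, @country, @nature, @region, date, @name = uri.split('/')
    @year = date.split('-', 2)[0]
  end

  # Changing a component rebuilds the whole URI.
  def name=(value)
    @name = value
    rebuild
  end

  def region=(value)
    @region = value
    rebuild
  end

  private

  def rebuild
    @uri = "/#{@country}/#{@nature}/#{@region}/#{@year}/#{@name}"
  end
end

bylaw = ByLawUri.new('/za/by-law/cape-town/2010/public-parks')
bylaw.name = 'a-new-title'
puts bylaw.uri
```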