slaw 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 405a0b941536c74c13588e1bfb4350c566337626
4
- data.tar.gz: 809e1fd9fd4ada655d3531b4eb31702398490a42
3
+ metadata.gz: 30603c7c9387a2f1c2fc9d617f667b41824e0b68
4
+ data.tar.gz: 2b153cb4679f469f4b0b18e4ba8b6da239d69016
5
5
  SHA512:
6
- metadata.gz: a636be697e3589db697232bc01876a864ee8c02eb4548232b9db8addc2c3d9fb0a5004ffeb94f12494e88889b2954c3edd61100a865ce9329b5eddad7381fbe8
7
- data.tar.gz: 57e78d5489aa950436b2e7dc3ebe7d19a639b86c3f4fa3b95add5edfa9adbb0185b6f191a012f74ba4fc35100432096ac1ea09b928327afa214aaedd1a2c070c
6
+ metadata.gz: a1fb11223dfbd14614eafaf1436e2b73c17fbdbc3fc0511f54492e3cab616c0a1db4f6ebdff0d0e56b498c23f3de1c8ce81ff7cd67e1784253dd6f22457976a2
7
+ data.tar.gz: 5d46d60e58c26cc44fef10a23f81de366361d538bfe35f649b6f6a81ea121ade63785d76d081fba26fc6413981ba831771f1c32aa5014bb8a02159122f3f40d5
data/README.md CHANGED
@@ -1,7 +1,18 @@
1
1
  # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
2
2
 
3
- Slaw is a lightweight library for rendering and generating Akoma Ntoso acts from plain text and PDF documents.
4
- It is used to power [openbylaws.org.za](http://openbylaws.org.za).
3
+ Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
4
+ It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
5
+ and uses grammars developed for South African acts and by-laws.
6
+
7
+ Slaw allows you to:
8
+
9
+ 1. extract plain text from PDFs and clean up that text
10
+ 2. parse plain text and transform it into an Akoma Ntoso Act XML document
11
+ 3. render the XML document into HTML
12
+
13
+ Slaw is lightweight because it wraps around a Nokogiri XML representation of
14
+ the parsed document. It provides some support methods for manipulating these
15
+ documents, but anything advanced must manipulate the XML directly.
5
16
 
6
17
  ## Installation
7
18
 
@@ -13,37 +24,163 @@ And then execute:
13
24
 
14
25
  $ bundle
15
26
 
16
- Or install it yourself as:
27
+ Or install it with:
17
28
 
18
29
  $ gem install slaw
19
30
 
20
- ## Usage
31
+ To run PDF extraction you will also need [xpdf](http://www.foolabs.com/xpdf/).
32
+ If you're on a Mac, you can use:
21
33
 
22
- TODO: Write usage instructions here
34
+ brew install xpdf
23
35
 
24
- ### Extracting text from PDFs
36
+ ## Overview
25
37
 
26
- You will need [xpdf](http://www.foolabs.com/xpdf/) to run PDF extraction. If you're
27
- on a Mac you can use
38
+ Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
39
+ standard for legislative documents. It first parses plain text using a grammar
40
+ and then generates XML from the resulting syntax tree.
28
41
 
29
- brew install xpdf
42
+ Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
43
+ for extracting and cleaning up text from PDFs before parsing it. Extracting text from
44
+ PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of
45
+ rules-of-thumb for correcting these. These rules are based on South African
46
+ by-laws and may not be suitable for all regions.
47
+
48
+ The grammar is expressed as a [Treetop](https://github.com/nathansobo/treetop/) grammar
49
+ and has been developed specifically for the format of South African acts and by-laws.
50
+ Grammars for other regions could be developed depending on the complexity of a region's
51
+ formats.
30
52
 
31
- Extracting PDFs often break lines in odd places (or doesn't break them when it should). Slaw gets around
32
- this by running some cleanup routines on the extracted text.
53
+ The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering --
54
+ so Slaw performs some post-processing on the XML produced by the parser. In particular,
55
+ it nests lists correctly and looks for specially defined terms and their occurrences in the document.
56
+
57
+ ## Quick Start
58
+
59
+ Install the gem using
60
+
61
+ gem install slaw
62
+
63
+ Extract text from a PDF and parse it as a South African by-law:
33
64
 
34
65
  ```ruby
66
+ require 'slaw'
67
+
68
+ # extract text from a PDF file and clean it up
35
69
  extractor = Slaw::Extract::Extractor.new
70
+ text = extractor.extract_from_pdf('/path/to/file.pdf')
36
71
 
37
- # to guess the filetype by extension
38
- text = extractor.extract_from_file('/path/to/file.pdf')
72
+ # parse the text into XML
73
+ generator = Slaw::ZA::ByLawGenerator.new
74
+ bylaw = generator.generate_from_text(text)
75
+ puts bylaw.to_xml(indent: 2)
39
76
 
40
- # or if you know it's a PDF
41
- text = extractor.extract_from_pdf('/path/to/file.pdf')
77
+ # render the by-law as HTML, using / as the root
78
+ # for relative URLs
79
+ renderer = Slaw::Render::HTMLRenderer.new
80
+ puts renderer.render(bylaw.doc, '/')
81
+ ```
82
+
83
+ ## Extraction
84
+
85
+ Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
86
+ PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
87
+ the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
88
+ text. As a result, the extraction can produce oddities, such as lines broken in weird
89
+ places (or not broken when they should be). Slaw gets around this by running
90
+ some cleanup routines on the extracted text.
91
+
92
+ For example, it knows that these lines:
93
+
94
+ (b) any wall, swimming pool, reservoir or bridge
95
+ or any other structure connected therewith; (c) any fuel pump or any
96
+ tank used in connection therewith
97
+
98
+ should probably be broken at the section numbers:
99
+
100
+ (b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
101
+ (c) any fuel pump or any tank used in connection therewith
102
+
103
+ If your region's numbering format differs significantly from this, these rules might not work.
104
+
105
+ Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):
42
106
 
43
- # You can also "extract" text from a plain-text file
44
- text = extractor.extract_from_text('/path/to/file.txt')
107
+ * changing newlines to `\n`, and normalising quotation characters
108
+ * removing page numbers and other boilerplate
109
+ * stripping the table of contents (we can generate our own from the parsed document)
110
+ * changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
111
+
112
+ ## Parsing
113
+
114
+ Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
115
+ tree, each node of which knows how to serialize itself in XML format.
116
+
117
+ While most South African by-laws are superficially very similar, there are sufficient differences
118
+ in their typesetting to make parsing them difficult. The grammar handles most
119
+ edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
120
+ between section titles before and after a section number:
121
+
122
+ 1. Definitions
123
+ In this by-law, the following words ...
124
+
125
+ Definitions
126
+ 1. In this by-law, the following words ...
127
+
128
+ This must be set by the user before parsing.
129
+
130
+ The parser does its best not to choke on input it doesn't understand, preferring a best effort
131
+ to a completely accurate result. For example it may not be able to work out a section heading
132
+ and so will treat it as simply another statement in the previous section. This causes the parser
133
+ to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.
134
+
135
+ The grammar supports a number of subsection numbering formats, which are often mixed
136
+ in a document to indicate different levels of nesting.
137
+
138
+ (a)
139
+ (2)
140
+ (3b)
141
+ (ii)
142
+ 3.4
143
+
144
+ During post-processing it works out how to nest these appropriately.
145
+
146
+ For more information see the South African by-law grammar at
147
+ [lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
148
+ at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).
149
+
150
+ ## Rendering
151
+
152
+ Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
153
+ Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
154
+ elements with a class attribute derived from the name of the AN element and an ID element taken
155
+ from the node, if any. This makes it both fast and flexible, since it's easy to
156
+ apply layout rules with CSS.
157
+
158
+ Slaw can render either an entire document like this, or just a portion of the XML tree.
159
+
160
+ ## Meta-data
161
+
162
+ Acts and by-laws have metadata which it is not possible to get from their plain text representations,
163
+ such as their title, date and format of publication or act number. Slaw provides some helpers
164
+ for manipulating this meta-data. For example,
165
+
166
+ ```ruby
167
+ bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
168
+ print bylaw.id_uri
169
+ bylaw.title = 'A new title'
170
+ bylaw.name = 'a-new-title'
171
+ bylaw.published!(date: '2014-09-28')
172
+ print bylaw.id_uri
45
173
  ```
46
174
 
175
+ ## Schedules
176
+
177
+ South African acts and by-laws can have addendums called schedules. They are technically a part of
178
+ the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
179
+ by looking for section headings, but makes no effort to capture the format of their contents.
180
+
181
+ Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
182
+ Akoma Ntoso `component` element at the end of the XML document, with a name of `schedules`.
183
+
47
184
  ## Contributing
48
185
 
49
186
  1. Fork it at http://github.com/longhotsummer/slaw/fork
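
The line re-breaking described in the README's Extraction section can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the code in `Slaw::Parse::Cleanser`: the method name `rebreak_at_numbers` and the regular expression are hypothetical, and the sketch naively joins every line before splitting again at `(a)`-style numbers that follow a semicolon.

```ruby
# Hypothetical helper, not part of Slaw: re-break text so that each
# "(a)"-style subsection number starts its own line.
def rebreak_at_numbers(text)
  # Join all non-empty lines into one long line, dropping stray whitespace.
  joined = text.split(/\r?\n/).map(&:strip).reject(&:empty?).join(' ')

  # Start a new line wherever a semicolon is followed by a number such as "(c)".
  joined.gsub(/; (?=\([a-z0-9]+\) )/, ";\n")
end

sample = <<~TEXT
  (b) any wall, swimming pool, reservoir or bridge
  or any other structure connected therewith; (c) any fuel pump or any
  tank used in connection therewith
TEXT

puts rebreak_at_numbers(sample)
```

A real cleanser would need to be more conservative, since joining every line also merges headings and paragraphs that should stay separate.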
data/lib/slaw/act.rb CHANGED
@@ -18,25 +18,31 @@ module Slaw
18
18
  attr_accessor :doc
19
19
 
20
20
  # [Nokogiri::XML::Node] The `meta` XML node
21
- attr_accessor :meta
21
+ attr_reader :meta
22
22
 
23
23
  # [Nokogiri::XML::Node] The `body` XML node
24
- attr_accessor :body
24
+ attr_reader :body
25
25
 
26
26
  # [String] The year this act was published
27
- attr_accessor :year
27
+ attr_reader :year
28
28
 
29
29
  # [String] The act number in the year this act was published
30
- attr_accessor :num
30
+ attr_reader :num
31
31
 
32
32
  # [String] The FRBR URI of this act, which uniquely identifies it globally
33
- attr_accessor :id_uri
33
+ attr_reader :id_uri
34
34
 
35
35
  # [String, nil] The source filename, or nil
36
- attr_accessor :filename
36
+ attr_reader :filename
37
37
 
38
38
  # [Time, nil] The mtime of when the source file was last modified
39
- attr_accessor :mtime
39
+ attr_reader :mtime
40
+
41
+ # [String] The underlying nature of this act, usually `act` although subclasses may override this.
42
+ attr_reader :nature
43
+
44
+ # [Nokogiri::XML::Schema] schema to validate against
45
+ attr_accessor :schema
40
46
 
41
47
  # Get the act that wraps the document that owns this XML node
42
48
  # @param node [Nokogiri::XML::Node]
@@ -49,6 +55,7 @@ module Slaw
49
55
  # @param filename [String] filename to load XML from
50
56
  def initialize(filename=nil)
51
57
  self.load(filename) if filename
58
+ @schema = nil
52
59
  end
53
60
 
54
61
  # Load the XML in `filename` into this instance
@@ -60,8 +67,9 @@ module Slaw
60
67
  File.open(filename) { |f| parse(f) }
61
68
  end
62
69
 
63
- # Parse the XML contained in the file-like object `io`
64
- # @param io [file-like] io object with XML
70
+ # Parse the XML contained in the file-like or String object `io`
71
+ #
72
+ # @param io [String, file-like] io object or String with XML
65
73
  def parse(io)
66
74
  self.doc = Nokogiri::XML(io)
67
75
  end
@@ -76,26 +84,90 @@ module Slaw
76
84
 
77
85
  @@acts[@doc] = self
78
86
 
79
- _extract_id
87
+ extract_id_uri
80
88
  end
81
89
 
82
- # Parse the FRBR Uri into its constituent parts
83
- def _extract_id
84
- @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
85
- empty, @country, type, date, @num = @id_uri.split('/')
90
+ # Directly set the FRBR URI of this act. This must be a well-formed URI,
91
+ # such as `/za/act/2002/2`. This will, in turn, update the {#year}, {#nature},
92
+ # {#country} and {#num} attributes.
93
+ #
94
+ # You probably don't want to use this method. Instead, set each component
95
+ # (such as {#date}) manually.
96
+ #
97
+ # @param uri [String] new URI
98
+ def id_uri=(uri)
99
+ for component, xpath in [['main', '//a:act/a:meta/a:identification'],
100
+ ['schedules', '//a:component/a:doc/a:meta/a:identification']] do
101
+ ident = @doc.at_xpath(xpath, a: NS)
102
+ next if not ident
103
+
104
+ # work
105
+ ident.at_xpath('a:FRBRWork/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}"
106
+ ident.at_xpath('a:FRBRWork/a:FRBRuri', a: NS)['value'] = uri
107
+
108
+ # expression
109
+ ident.at_xpath('a:FRBRExpression/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
110
+ ident.at_xpath('a:FRBRExpression/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
111
+
112
+ # manifestation
113
+ ident.at_xpath('a:FRBRManifestation/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
114
+ ident.at_xpath('a:FRBRManifestation/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
115
+ end
86
116
 
87
- # yyyy-mm-dd
88
- @year = date.split('-', 2)[0]
117
+ extract_id_uri
118
+ end
119
+
120
+ # The date at which this act was first created/promulgated.
121
+ #
122
+ # @return [String] date, YYYY-MM-DD
123
+ def date
124
+ node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRdate[@name="Generation"]', a: NS)
125
+ node && node['date']
126
+ end
127
+
128
+ # Set the date at which this act was first created/promulgated. This is usually the same
129
+ # as the publication date but this is not enforced.
130
+ #
131
+ # This also updates the {#year} of this act, which in turn updates the {#id_uri}.
132
+ #
133
+ # @param value [String] date, YYYY-MM-DD
134
+ def date=(value)
135
+ for frbr in ['FRBRWork', 'FRBRExpression'] do
136
+ @meta.at_xpath("./a:identification/a:#{frbr}/a:FRBRdate[@name=\"Generation\"]", a: NS)['date'] = value
137
+ end
138
+
139
+ self.year = value.split('-')[0]
140
+ end
141
+
142
+ # Set the year for this act. You probably want to call {#date=} instead.
143
+ #
144
+ # This will also update the {#id_uri} but will not change {#date} at all.
145
+ #
146
+ # @param year [String, Number] year
147
+ def year=(year)
148
+ @year = year.to_s
149
+ rebuild_id_uri
89
150
  end
90
151
 
91
152
  # An applicable short title for this act, either from the `FRBRalias` element
92
153
  # or based on the act number and year.
93
154
  # @return [String]
94
- def short_title
155
+ def title
95
156
  node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
96
157
  node ? node['value'] : "Act #{num} of #{year}"
97
158
  end
98
159
 
160
+ # Change the title of this act.
161
+ def title=(value)
162
+ node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
163
+ unless node
164
+ node = @doc.create_element('FRBRalias')
165
+ @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS).after(node)
166
+ end
167
+
168
+ node['value'] = value
169
+ end
170
+
99
171
  # Has this act been amended? This is determined by testing the `contains`
100
172
  # attribute of the `act` root element.
101
173
  #
@@ -250,6 +322,24 @@ module Slaw
250
322
  @meta.at_xpath('./a:publication', a: NS)
251
323
  end
252
324
 
325
+ # Update the publication details of the act. All elements are optional.
326
+ #
327
+ # @option details [String] :name name of the publication
328
+ # @option details [String] :number publication number
329
+ # @option details [String] :date date of publication (YYYY-MM-DD)
330
+ def published!(details)
331
+ node = @meta.at_xpath('./a:publication', a: NS)
332
+ unless node
333
+ node = @doc.create_element('publication')
334
+ @meta.at_xpath('./a:identification', a: NS).after(node)
335
+ end
336
+
337
+ node['showAs'] = details[:name] if details.has_key? :name
338
+ node['name'] = details[:name] if details.has_key? :name
339
+ node['date'] = details[:date] if details.has_key? :date
340
+ node['number'] = details[:number] if details.has_key? :number
341
+ end
342
+
253
343
  # Has this by-law been repealed?
254
344
  #
255
345
  # @return [Boolean]
@@ -297,14 +387,55 @@ module Slaw
297
387
  node && node['date']
298
388
  end
299
389
 
300
- # The underlying nature of this act, usually `act` although subclasses may override this.
301
- def nature
302
- "act"
390
+ # Validate the XML behind this document against the Akoma Ntoso schema and return
391
+ # any errors.
392
+ #
393
+ # @return [Object] array of errors, possibly empty
394
+ def validate
395
+ @schema ||= Dir.chdir(File.dirname(__FILE__) + "/schemas") { Nokogiri::XML::Schema(File.read('akomantoso20.xsd')) }
396
+ @schema.validate(@doc)
397
+ end
398
+
399
+ # Does this document validate against the schema?
400
+ #
401
+ # @see {#validate}
402
+ def validates?
403
+ validate.empty?
404
+ end
405
+
406
+ # Serialise the XML for this act, passing `args` to the Nokogiri serialiser.
407
+ # The most useful argument is usually `indent: 2` if you like your XML perdy.
408
+ #
409
+ # @return [String] serialized XML
410
+ def to_xml(*args)
411
+ @doc.to_xml(*args)
303
412
  end
304
413
 
305
414
  def inspect
306
415
  "<#{self.class.name} @id_uri=\"#{@id_uri}\">"
307
416
  end
417
+
418
+ protected
419
+
420
+ # Parse the FRBR Uri into its constituent parts
421
+ def extract_id_uri
422
+ @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
423
+ empty, @country, @nature, date, @num = @id_uri.split('/')
424
+
425
+ # yyyy-mm-dd
426
+ @year = date.split('-', 2)[0]
427
+ end
428
+
429
+ def build_id_uri
430
+ # /za/act/2002/3
431
+ "/#{@country}/#{@nature}/#{@year}/#{@num}"
432
+ end
433
+
434
+ # This rebuilds the FRBR URI for this document using its constituent components. It will
435
+ # update the XML then re-split the URI and grab its components.
436
+ def rebuild_id_uri
437
+ self.id_uri = build_id_uri
438
+ end
308
439
  end
309
440
 
310
441
  end
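
The new setters in `act.rb` form a small cascade: `date=` sets the year, `year=` rebuilds the FRBR URI, and assigning the URI splits it back into its components. The sketch below imitates only that bookkeeping, assuming a plain string in place of the Akoma Ntoso XML that the real class updates. The class name `ActUriSketch` is hypothetical and is not part of Slaw.

```ruby
# Hypothetical stand-in for Slaw::Act that keeps only the URI bookkeeping;
# the real class stores these values in the Akoma Ntoso XML document.
class ActUriSketch
  attr_reader :country, :nature, :year, :num, :id_uri

  def initialize(uri)
    self.id_uri = uri
  end

  # Store the URI, then split it back into its components.
  def id_uri=(uri)
    @id_uri = uri
    _empty, @country, @nature, date, @num = uri.split('/')
    # The date part may be "yyyy" or "yyyy-mm-dd"; keep only the year.
    @year = date.split('-', 2)[0]
  end

  # Changing the year rebuilds the URI from the components.
  def year=(year)
    @year = year.to_s
    self.id_uri = "/#{@country}/#{@nature}/#{@year}/#{@num}"
  end

  # Changing the date feeds only its year part into the URI.
  def date=(value)
    self.year = value.split('-')[0]
  end
end

act = ActUriSketch.new('/za/act/2002/2')
act.date = '2014-09-28'
puts act.id_uri  # prints "/za/act/2014/2"
```

Routing every change through `id_uri=` keeps the stored URI and the individual components from drifting apart, which appears to be the intent of `rebuild_id_uri` in the diff.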
data/lib/slaw/bylaw.rb CHANGED
@@ -7,40 +7,56 @@ module Slaw
7
7
  # is not identified by a year and a number, and therefore has a different FRBR uri structure.
8
8
  class ByLaw < Act
9
9
 
10
- # [String] The region this by-law applies to
11
- attr_accessor :region
10
+ # [String] The code of the region this by-law applies to
11
+ attr_reader :region
12
12
 
13
13
  # [String] A short file-like name of this by-law, unique within its year and region
14
- attr_accessor :name
15
-
16
- def _extract_id
17
- # /za/by-law/cape-town/2010/public-parks
18
-
19
- @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
20
- empty, @country, type, @region, date, @name = @id_uri.split('/')
21
-
22
- # yyyy[-mm-dd]
23
- @year = date.split('-', 2)[0]
24
- end
14
+ attr_reader :name
25
15
 
26
16
  # ByLaws don't have numbers, use their short-name instead
27
17
  def num
28
18
  name
29
19
  end
30
20
 
31
- def short_title
21
+ def title
32
22
  node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
33
- short_title = node ? node['value'] : "(Unknown)"
23
+ title = node ? node['value'] : "(Unknown)"
34
24
 
35
- if amended? and not short_title.end_with?("as amended")
36
- short_title = short_title + " as amended"
25
+ if amended? and not title.end_with?("as amended")
26
+ title = title + " as amended"
37
27
  end
38
28
 
39
- short_title
29
+ title
30
+ end
31
+
32
+ # Set the short (file-like) name for this bylaw. This changes the {#id_uri}.
33
+ def name=(value)
34
+ @name = value
35
+ rebuild_id_uri
40
36
  end
41
37
 
42
- def nature
43
- "by-law"
38
+ # Set the region code for this bylaw. This changes the {#id_uri}.
39
+ def region=(value)
40
+ @region = value
41
+ rebuild_id_uri
44
42
  end
43
+
44
+ protected
45
+
46
+ def extract_id_uri
47
+ # /za/by-law/cape-town/2010/public-parks
48
+
49
+ @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
50
+ empty, @country, @nature, @region, date, @name = @id_uri.split('/')
51
+
52
+ # yyyy[-mm-dd]
53
+ @year = date.split('-', 2)[0]
54
+ end
55
+
56
+ def build_id_uri
57
+ # /za/by-law/cape-town/2010/public-parks
58
+ "/#{@country}/#{@nature}/#{@region}/#{@year}/#{@name}"
59
+ end
60
+
45
61
  end
46
62
  end
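
The renamed `ByLaw#title` applies two rules: fall back to `(Unknown)` when there is no `FRBRalias`, and append `as amended` once for amended by-laws. A minimal free-standing sketch of that rule follows. The function `bylaw_title` and its two parameters are hypothetical, standing in for the XML lookup and the `amended?` check in the real method.

```ruby
# Hypothetical free-standing version of the by-law title rule.
# alias_value: the FRBRalias value, or nil when the element is missing.
# amended: whether the by-law has been amended.
def bylaw_title(alias_value, amended)
  title = alias_value || '(Unknown)'
  # Append the suffix only once, even if the alias already carries it.
  title += ' as amended' if amended && !title.end_with?('as amended')
  title
end

puts bylaw_title('Public Parks', true)
```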