slaw 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 405a0b941536c74c13588e1bfb4350c566337626
-   data.tar.gz: 809e1fd9fd4ada655d3531b4eb31702398490a42
+   metadata.gz: 30603c7c9387a2f1c2fc9d617f667b41824e0b68
+   data.tar.gz: 2b153cb4679f469f4b0b18e4ba8b6da239d69016
  SHA512:
-   metadata.gz: a636be697e3589db697232bc01876a864ee8c02eb4548232b9db8addc2c3d9fb0a5004ffeb94f12494e88889b2954c3edd61100a865ce9329b5eddad7381fbe8
-   data.tar.gz: 57e78d5489aa950436b2e7dc3ebe7d19a639b86c3f4fa3b95add5edfa9adbb0185b6f191a012f74ba4fc35100432096ac1ea09b928327afa214aaedd1a2c070c
+   metadata.gz: a1fb11223dfbd14614eafaf1436e2b73c17fbdbc3fc0511f54492e3cab616c0a1db4f6ebdff0d0e56b498c23f3de1c8ce81ff7cd67e1784253dd6f22457976a2
+   data.tar.gz: 5d46d60e58c26cc44fef10a23f81de366361d538bfe35f649b6f6a81ea121ade63785d76d081fba26fc6413981ba831771f1c32aa5014bb8a02159122f3f40d5
data/README.md CHANGED
@@ -1,7 +1,18 @@
  # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
 
- Slaw is a lightweight library for rendering and generating Akoma Ntoso acts from plain text and PDF documents.
- It is used to power [openbylaws.org.za](http://openbylaws.org.za).
+ Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+ It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
+ and uses grammars developed for South African acts and by-laws.
+
+ Slaw allows you to:
+
+ 1. extract plain text from PDFs and clean up that text
+ 2. parse plain text and transform it into an Akoma Ntoso Act XML document
+ 3. render the XML document into HTML
+
+ Slaw is lightweight because it wraps around a Nokogiri XML representation of
+ the parsed document. It provides some support methods for manipulating these
+ documents, but anything advanced must manipulate the XML directly.
 
  ## Installation
 
@@ -13,37 +24,163 @@ And then execute:
 
      $ bundle
 
- Or install it yourself as:
+ Or install it with:
 
      $ gem install slaw
 
- ## Usage
+ To run PDF extraction you will also need [xpdf](http://www.foolabs.com/xpdf/).
+ If you're on a Mac, you can use:
 
- TODO: Write usage instructions here
+     brew install xpdf
 
- ### Extracting text from PDFs
+ ## Overview
 
- You will need [xpdf](http://www.foolabs.com/xpdf/) to run PDF extraction. If you're
- on a Mac you can use
+ Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
+ standard for legislative documents. It first parses plain text using a grammar
+ and then generates XML from the resulting syntax tree.
 
-     brew install xpdf
+ Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
+ for extracting and cleaning up text from PDFs before parsing it. Extracting text from
+ PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of
+ rules-of-thumb for correcting these. These rules are based on South African
+ by-laws and may not be suitable for all regions.
+
+ The grammar is expressed as a [Treetop](https://github.com/nathansobo/treetop/) grammar
+ and has been developed specifically for the format of South African acts and by-laws.
+ Grammars for other regions could be developed depending on the complexity of a region's
+ formats.
 
- Extracting PDFs often break lines in odd places (or doesn't break them when it should). Slaw gets around
- this by running some cleanup routines on the extracted text.
+ The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering --
+ so Slaw performs some post-processing on the XML produced by the parser. In particular,
+ it nests lists correctly and looks for specially defined terms and their occurrences in the document.
+
+ ## Quick Start
+
+ Install the gem using
+
+     gem install slaw
+
+ Extract text from a PDF and parse it as a South African by-law:
 
  ```ruby
+ require 'slaw'
+
+ # extract text from a PDF file and clean it up
  extractor = Slaw::Extract::Extractor.new
+ text = extractor.extract_from_pdf('/path/to/file.pdf')
 
- # to guess the filetype by extension
- text = extractor.extract_from_file('/path/to/file.pdf')
+ # parse the text and generate Akoma Ntoso XML
+ generator = Slaw::ZA::ByLawGenerator.new
+ bylaw = generator.generate_from_text(text)
+ puts bylaw.to_xml(indent: 2)
 
- # or if you know it's a PDF
- text = extractor.extract_from_pdf('/path/to/file.pdf')
+ # render the by-law as HTML, using / as the root
+ # for relative URLs
+ renderer = Slaw::Render::HTMLRenderer.new
+ puts renderer.render(bylaw.doc, '/')
+ ```
+
+ ## Extraction
+
+ Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
+ PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
+ the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
+ text. As a result, the extraction can produce oddities, such as lines broken in weird
+ places (or not broken when they should be). Slaw gets around this by running
+ some cleanup routines on the extracted text.
+
+ For example, it knows that these lines:
+
+     (b) any wall, swimming pool, reservoir or bridge
+     or any other structure connected therewith; (c) any fuel pump or any
+     tank used in connection therewith
+
+ should probably be broken at the section numbers:
+
+     (b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
+     (c) any fuel pump or any tank used in connection therewith
+
+ If your region's numbering format differs significantly from this, these rules might not work.
+
+ Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):
 
- # You can also "extract" text from a plain-text file
- text = extractor.extract_from_text('/path/to/file.txt')
+ * changing newlines to `\n`, and normalising quotation characters
+ * removing page numbers and other boilerplate
+ * stripping the table of contents (we can generate our own from the parsed document)
+ * changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
+
+ ## Parsing
+
+ Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
+ tree, each node of which knows how to serialize itself in XML format.
+
+ While most South African by-laws are superficially very similar, there are sufficient differences
+ in their typesetting to make parsing them difficult. The grammar handles most
+ edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
+ between section titles before and after a section number:
+
+     1. Definitions
+     In this by-law, the following words ...
+
+     Definitions
+     1. In this by-law, the following words ...
+
+ This must be set by the user before parsing.
+
+ The parser does its best not to choke on input it doesn't understand, preferring a best effort
+ to a completely accurate result. For example, it may not be able to work out a section heading
+ and so will treat it as simply another statement in the previous section. This causes the parser
+ to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.
+
+ The grammar supports a number of subsection numbering formats, which are often mixed
+ in a document to indicate different levels of nesting.
+
+     (a)
+     (2)
+     (3b)
+     (ii)
+     3.4
+
+ During post-processing it works out how to nest these appropriately.
+
+ For more information see the South African by-law grammar at
+ [lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
+ at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).
+
+ ## Rendering
+
+ Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
+ Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
+ elements with a class attribute derived from the name of the AN element and an `id` attribute taken
+ from the node, if any. This makes it both fast and flexible, since it's easy to
+ apply layout rules with CSS.
+
+ Slaw can render either an entire document like this, or just a portion of the XML tree.
+
+ ## Meta-data
+
+ Acts and by-laws have metadata which it is not possible to get from their plain text representations,
+ such as their title, date and format of publication or act number. Slaw provides some helpers
+ for manipulating this meta-data. For example,
+
+ ```ruby
+ bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
+ print bylaw.id_uri
+ bylaw.title = 'A new title'
+ bylaw.name = 'a-new-title'
+ bylaw.published!(date: '2014-09-28')
+ print bylaw.id_uri
  ```
 
+ ## Schedules
+
+ South African acts and by-laws can have addendums called schedules. They are technically a part of
+ the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
+ by looking for section headings, but makes no effort to capture the format of their contents.
+
+ Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
+ Akoma Ntoso `component` element at the end of the XML document, with a name of `schedules`.
+
  ## Contributing
 
  1. Fork it at http://github.com/longhotsummer/slaw/fork
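The line re-breaking that the new README describes under "Extraction" can be illustrated with a few lines of plain Ruby. This is only a sketch of the idea, not the code in `Slaw::Parse::Cleanser`: the helper name `rebreak_subsections` and its regular expression are hypothetical, and it assumes a new item always follows a semicolon and starts with a marker such as `(c)`.

```ruby
# Hypothetical helper, not part of Slaw: re-flow text so that each
# subsection marker such as "(b)" or "(c)" starts its own line.
def rebreak_subsections(text)
  # Join the wrapped lines into a single run of text.
  joined = text.split(/\r\n|\r|\n/).map(&:strip).reject(&:empty?).join(' ')
  # Start a new line wherever a semicolon is followed by a "(x)" marker.
  joined.gsub(/;\s+(?=\([a-z0-9]+\)\s)/, ";\n")
end

raw = <<~TEXT
  (b) any wall, swimming pool, reservoir or bridge
  or any other structure connected therewith; (c) any fuel pump or any
  tank used in connection therewith
TEXT

puts rebreak_subsections(raw)
```

Run on the README's own sample, this prints the `(b)` and `(c)` items on separate lines. A rule this simple would also split on a semicolon followed by a parenthesised cross-reference, which is one reason the README warns that such heuristics may not suit other regions.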
data/lib/slaw/act.rb CHANGED
@@ -18,25 +18,31 @@ module Slaw
      attr_accessor :doc
 
      # [Nokogiri::XML::Node] The `meta` XML node
-     attr_accessor :meta
+     attr_reader :meta
 
      # [Nokogiri::XML::Node] The `body` XML node
-     attr_accessor :body
+     attr_reader :body
 
      # [String] The year this act was published
-     attr_accessor :year
+     attr_reader :year
 
      # [String] The act number in the year this act was published
-     attr_accessor :num
+     attr_reader :num
 
      # [String] The FRBR URI of this act, which uniquely identifies it globally
-     attr_accessor :id_uri
+     attr_reader :id_uri
 
      # [String, nil] The source filename, or nil
-     attr_accessor :filename
+     attr_reader :filename
 
      # [Time, nil] The mtime of when the source file was last modified
-     attr_accessor :mtime
+     attr_reader :mtime
+
+     # [String] The underlying nature of this act, usually `act` although subclasses may override this.
+     attr_reader :nature
+
+     # [Nokogiri::XML::Schema] schema to validate against
+     attr_accessor :schema
 
      # Get the act that wraps the document that owns this XML node
      # @param node [Nokogiri::XML::Node]
@@ -49,6 +55,7 @@ module Slaw
      # @param filename [String] filename to load XML from
      def initialize(filename=nil)
        self.load(filename) if filename
+       @schema = nil
      end
 
      # Load the XML in `filename` into this instance
@@ -60,8 +67,9 @@ module Slaw
        File.open(filename) { |f| parse(f) }
      end
 
-     # Parse the XML contained in the file-like object `io`
-     # @param io [file-like] io object with XML
+     # Parse the XML contained in the file-like or String object `io`
+     #
+     # @param io [String, file-like] io object or String with XML
      def parse(io)
        self.doc = Nokogiri::XML(io)
      end
@@ -76,26 +84,90 @@ module Slaw
 
        @@acts[@doc] = self
 
-       _extract_id
+       extract_id_uri
      end
 
-     # Parse the FRBR Uri into its constituent parts
-     def _extract_id
-       @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
-       empty, @country, type, date, @num = @id_uri.split('/')
+     # Directly set the FRBR URI of this act. This must be a well-formed URI,
+     # such as `/za/act/2002/2`. This will, in turn, update the {#year}, {#nature},
+     # {#country} and {#num} attributes.
+     #
+     # You probably don't want to use this method. Instead, set each component
+     # (such as {#date}) manually.
+     #
+     # @param uri [String] new URI
+     def id_uri=(uri)
+       for component, xpath in [['main', '//a:act/a:meta/a:identification'],
+                                ['schedules', '//a:component/a:doc/a:meta/a:identification']] do
+         ident = @doc.at_xpath(xpath, a: NS)
+         next if not ident
+
+         # work
+         ident.at_xpath('a:FRBRWork/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}"
+         ident.at_xpath('a:FRBRWork/a:FRBRuri', a: NS)['value'] = uri
+
+         # expression
+         ident.at_xpath('a:FRBRExpression/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
+         ident.at_xpath('a:FRBRExpression/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
+
+         # manifestation
+         ident.at_xpath('a:FRBRManifestation/a:FRBRthis', a: NS)['value'] = "#{uri}/#{component}/eng@"
+         ident.at_xpath('a:FRBRManifestation/a:FRBRuri', a: NS)['value'] = "#{uri}/eng@"
+       end
 
-       # yyyy-mm-dd
-       @year = date.split('-', 2)[0]
+       extract_id_uri
+     end
+
+     # The date at which this act was first created/promulgated.
+     #
+     # @return [String] date, YYYY-MM-DD
+     def date
+       node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRdate[@name="Generation"]', a: NS)
+       node && node['date']
+     end
+
+     # Set the date at which this act was first created/promulgated. This is usually the same
+     # as the publication date but this is not enforced.
+     #
+     # This also updates the {#year} of this act, which in turn updates the {#id_uri}.
+     #
+     # @param value [String] date, YYYY-MM-DD
+     def date=(value)
+       for frbr in ['FRBRWork', 'FRBRExpression'] do
+         @meta.at_xpath("./a:identification/a:#{frbr}/a:FRBRdate[@name=\"Generation\"]", a: NS)['date'] = value
+       end
+
+       self.year = value.split('-')[0]
+     end
+
+     # Set the year for this act. You probably want to call {#date=} instead.
+     #
+     # This will also update the {#id_uri} but will not change {#date} at all.
+     #
+     # @param year [String, Integer] year
+     def year=(year)
+       @year = year.to_s
+       rebuild_id_uri
      end
 
      # An applicable short title for this act, either from the `FRBRalias` element
      # or based on the act number and year.
      # @return [String]
-     def short_title
+     def title
        node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
        node ? node['value'] : "Act #{num} of #{year}"
      end
 
+     # Change the title of this act.
+     def title=(value)
+       node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
+       unless node
+         node = @doc.create_element('FRBRalias')
+         @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS).after(node)
+       end
+
+       node['value'] = value
+     end
+
      # Has this act been amended? This is determined by testing the `contains`
      # attribute of the `act` root element.
      #
@@ -250,6 +322,24 @@ module Slaw
        @meta.at_xpath('./a:publication', a: NS)
      end
 
+     # Update the publication details of the act. All elements are optional.
+     #
+     # @option details [String] :name name of the publication
+     # @option details [String] :number publication number
+     # @option details [String] :date date of publication (YYYY-MM-DD)
+     def published!(details)
+       node = @meta.at_xpath('./a:publication', a: NS)
+       unless node
+         node = @doc.create_element('publication')
+         @meta.at_xpath('./a:identification', a: NS).after(node)
+       end
+
+       node['showAs'] = details[:name] if details.has_key? :name
+       node['name'] = details[:name] if details.has_key? :name
+       node['date'] = details[:date] if details.has_key? :date
+       node['number'] = details[:number] if details.has_key? :number
+     end
+
      # Has this by-law been repealed?
      #
      # @return [Boolean]
@@ -297,14 +387,55 @@ module Slaw
        node && node['date']
      end
 
-     # The underlying nature of this act, usually `act` although subclasses my override this.
-     def nature
-       "act"
+     # Validate the XML behind this document against the Akoma Ntoso schema and return
+     # any errors.
+     #
+     # @return [Array<Nokogiri::XML::SyntaxError>] array of errors, possibly empty
+     def validate
+       @schema ||= Dir.chdir(File.dirname(__FILE__) + "/schemas") { Nokogiri::XML::Schema(File.read('akomantoso20.xsd')) }
+       @schema.validate(@doc)
+     end
+
+     # Does this document validate against the schema?
+     #
+     # @see #validate
+     def validates?
+       validate.empty?
+     end
+
+     # Serialise the XML for this act, passing `args` to the Nokogiri serialiser.
+     # The most useful argument is usually `indent: 2` if you like your XML perdy.
+     #
+     # @return [String] serialized XML
+     def to_xml(*args)
+       @doc.to_xml(*args)
      end
 
      def inspect
        "<#{self.class.name} @id_uri=\"#{@id_uri}\">"
      end
+
+     protected
+
+     # Parse the FRBR Uri into its constituent parts
+     def extract_id_uri
+       @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
+       empty, @country, @nature, date, @num = @id_uri.split('/')
+
+       # yyyy-mm-dd
+       @year = date.split('-', 2)[0]
+     end
+
+     def build_id_uri
+       # /za/act/2002/3
+       "/#{@country}/#{@nature}/#{@year}/#{@num}"
+     end
+
+     # This rebuilds the FRBR uri for this document using its constituent components. It will
+     # update the XML then re-split the URI and grab its components.
+     def rebuild_id_uri
+       self.id_uri = build_id_uri
+     end
    end
 
  end
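The new `published!` method only writes the attributes whose keys are present in `details`, and `:name` fills both `showAs` and `name`. A minimal sketch of that behaviour follows, with a plain Hash standing in for the Nokogiri `publication` node. The helper name `apply_publication_details` and the sample values are made up for illustration.

```ruby
# Hypothetical stand-in for the attribute updates in published!;
# `node` is a plain Hash here rather than a Nokogiri::XML::Node.
def apply_publication_details(node, details)
  # :name is written to two attributes, the others to one each.
  node['showAs'] = details[:name] if details.key?(:name)
  node['name']   = details[:name] if details.key?(:name)
  node['date']   = details[:date] if details.key?(:date)
  node['number'] = details[:number] if details.key?(:number)
  node
end

# Made-up starting attributes for an existing publication node.
node = { 'name' => 'Example Gazette', 'number' => '1234' }

# Only the date is supplied, so name and number are left untouched.
apply_publication_details(node, date: '2014-09-28')
p node
```

Because absent keys are skipped rather than cleared, calling the method with a single key is a safe way to patch one detail. Passing `name: nil` is different: the key is present, so both attributes would be set to `nil`.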
data/lib/slaw/bylaw.rb CHANGED
@@ -7,40 +7,56 @@ module Slaw
    # is not identified by a year and a number, and therefore has a different FRBR uri structure.
    class ByLaw < Act
 
-     # [String] The region this by-law applies to
-     attr_accessor :region
+     # [String] The code of the region this by-law applies to
+     attr_reader :region
 
      # [String] A short file-like name of this by-law, unique within its year and region
-     attr_accessor :name
-
-     def _extract_id
-       # /za/by-law/cape-town/2010/public-parks
-
-       @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
-       empty, @country, type, @region, date, @name = @id_uri.split('/')
-
-       # yyyy[-mm-dd]
-       @year = date.split('-', 2)[0]
-     end
+     attr_reader :name
 
      # ByLaws don't have numbers, use their short-name instead
      def num
        name
      end
 
-     def short_title
+     def title
        node = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRalias', a: NS)
-       short_title = node ? node['value'] : "(Unknown)"
+       title = node ? node['value'] : "(Unknown)"
 
-       if amended? and not short_title.end_with?("as amended")
-         short_title = short_title + " as amended"
+       if amended? and not title.end_with?("as amended")
+         title = title + " as amended"
        end
 
-       short_title
+       title
+     end
+
+     # Set the short (file-like) name for this bylaw. This changes the {#id_uri}.
+     def name=(value)
+       @name = value
+       rebuild_id_uri
      end
 
-     def nature
-       "by-law"
+     # Set the region code for this bylaw. This changes the {#id_uri}.
+     def region=(value)
+       @region = value
+       rebuild_id_uri
      end
+
+     protected
+
+     def extract_id_uri
+       # /za/by-law/cape-town/2010/public-parks
+
+       @id_uri = @meta.at_xpath('./a:identification/a:FRBRWork/a:FRBRuri', a: NS)['value']
+       empty, @country, @nature, @region, date, @name = @id_uri.split('/')
+
+       # yyyy[-mm-dd]
+       @year = date.split('-', 2)[0]
+     end
+
+     def build_id_uri
+       # /za/by-law/cape-town/2010/public-parks
+       "/#{@country}/#{@nature}/#{@region}/#{@year}/#{@name}"
+     end
+
    end
  end
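Taken together, the `act.rb` and `bylaw.rb` changes make every component setter funnel through `rebuild_id_uri`, which assigns `id_uri` and then re-extracts the components. The sketch below reproduces only that bookkeeping so the flow can be followed without any XML. `SketchAct` and `SketchByLaw` are hypothetical stand-ins, not Slaw classes, and unlike the real `id_uri=` they do not write the URI into FRBR elements.

```ruby
# Simplified stand-ins for Slaw::Act and Slaw::ByLaw that keep only the
# URI bookkeeping; the real classes also update the XML document.
class SketchAct
  attr_reader :id_uri, :country, :nature, :year, :num

  def initialize(uri)
    self.id_uri = uri
  end

  # Setting the URI re-derives every component from it.
  def id_uri=(uri)
    @id_uri = uri
    extract_id_uri
  end

  # Changing one component rebuilds the URI from all of them.
  def year=(year)
    @year = year.to_s
    rebuild_id_uri
  end

  protected

  def extract_id_uri
    # The leading slash produces an empty first element.
    _empty, @country, @nature, date, @num = @id_uri.split('/')
    @year = date.split('-', 2)[0]
  end

  def build_id_uri
    "/#{@country}/#{@nature}/#{@year}/#{@num}"
  end

  def rebuild_id_uri
    self.id_uri = build_id_uri
  end
end

class SketchByLaw < SketchAct
  attr_reader :region, :name

  def name=(value)
    @name = value
    rebuild_id_uri
  end

  protected

  # By-law URIs carry a region and a name instead of a number.
  def extract_id_uri
    _empty, @country, @nature, @region, date, @name = @id_uri.split('/')
    @year = date.split('-', 2)[0]
  end

  def build_id_uri
    "/#{@country}/#{@nature}/#{@region}/#{@year}/#{@name}"
  end
end

bylaw = SketchByLaw.new('/za/by-law/cape-town/2010/public-parks')
bylaw.name = 'a-new-title'
bylaw.year = 2014
puts bylaw.id_uri
```

The subclass overrides only the two methods that know the URI layout, so the inherited `year=` works unchanged. This is also why the diff can drop the hard-coded `nature` methods: the nature is now read back from the URI itself.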