RubyGems - slaw - Versions diffs - 1.0.0.alpha.6 → 1.0.0 - Mend

slaw 1.0.0.alpha.6 → 1.0.0

Files changed (20)

checksums.yaml +4 -4
data/README.md +13 -147
data/bin/slaw +2 -1
data/lib/slaw.rb +0 -6
data/lib/slaw/generator.rb +2 -8
data/lib/slaw/grammars/pl/act.treetop +10 -14
data/lib/slaw/grammars/pl/act_text.xsl +271 -0
data/lib/slaw/version.rb +1 -1
data/slaw.gemspec +3 -3
metadata +6 -17
data/lib/slaw/act.rb +0 -452
data/lib/slaw/bylaw.rb +0 -62
data/lib/slaw/collection.rb +0 -60
data/lib/slaw/lifecycle_event.rb +0 -23
data/lib/slaw/render/html.rb +0 -70
data/lib/slaw/render/xsl/act.xsl +0 -15
data/lib/slaw/render/xsl/elements.xsl +0 -120
data/lib/slaw/render/xsl/fragment.xsl +0 -16
data/spec/act_spec.rb +0 -56
data/spec/bylaw_spec.rb +0 -49

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 928f7f4c5bb0b3b22d52b7aa210d575aadbd6907
-  data.tar.gz: 817f3529d2105b802139759231e95c56bd7f86b8
+  metadata.gz: 96bb9bd00dc6e71518da515b6595f4a8b5c9a5b0
+  data.tar.gz: 793f2aeaedb2dc7e89d479348270c339639b9363
 SHA512:
-  metadata.gz: 19842f4e6f22ee116ff25ed8b800e847ab56f2c3ff5eb91d7d6654310f9e4c6e0c796d23dbda2e9480e3b2365d423961e80874c2d317c180e29f074cf26dede7
-  data.tar.gz: 9cdf14ea59433dba8e056ea111472da4e375a98f8b98ca07bd6b289234cff6339761378ff93024edff189d53ad9e15de90ff54466e47a898cc287e5ec41f3a9c
+  metadata.gz: 4e36394603f98a99668a868ea5458ef64a23585d4fc1794f42fc7e68c335c3186760bfcec83be04f40aa1636f6cb8a1b00c0be73b28515d5376d8838c8ceb3c7
+  data.tar.gz: d7e0cfbcf66ee8b1bbce81a5160c5f3302162f2a9df0e3c127ad576a0c0176929fdbd2d0abb0d948f63288d9afcf17d3bd7c9cd92f4168a0f41640b46221850a

data/README.md CHANGED

@@ -1,14 +1,16 @@
 # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
-Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
-It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
-and uses grammars developed for South African acts and by-laws.
+Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
+traditions in these countries:
+* South Africa
+* Poland
 Slaw allows you to:
-1. extract plain text from PDFs and clean up that text
-2. parse plain text and transform it into an Akoma Ntoso Act XML document
-3. unparse Akoma Ntoso XML into text that can be parsed backed into Akoma Ntoso.
+1. parse plain text and transform it into an Akoma Ntoso Act XML document
+2. unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing
 Slaw is lightweight because it wraps around a Nokogiri XML representation of
 the parsed document. It provides some support methods for manipulating these
@@ -40,7 +42,7 @@ installed by default on most systems (including Mac). On Ubuntu you can use:
 The simplest way to use Slaw is via the commandline:
-    $ slaw parse myfile.pdf
+    $ slaw parse myfile.pdf --grammar za
 ## Overview
@@ -63,150 +65,13 @@ The grammar cannot catch some subtleties of an act or by-law -- such as nested l
 so Slaw performs some post-processing on the XML produced by the parser. In particular,
 it nests lists correctly.
-## Quick Start
-Install the gem using
-    gem install slaw
-Extract text from a PDF and parse it as a South African by-law:
-```ruby
-require 'slaw'
-# extract text from a PDF file and clean it up
-extractor = Slaw::Extract::Extractor.new
-text = extractor.extract_from_pdf('/path/to/file.pdf')
-# parse the text into a XML and
-generator = Slaw::ActGenerator.new
-bylaw = generator.generate_from_text(text)
-puts bylaw.to_xml(indent: 2)
-# render the by-law as HTML, using / as the root
-# for relative URLs
-renderer = Slaw::Render::HTMLRenderer.new
-puts renderer.render(bylaw.doc, '/')
-```
-## Extraction
-Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
-PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
-the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
-text. As a result, the extraction can produce oddities, such as lines broken in weird
-places (or not broken when they should be). Slaw gets around this by running
-some cleanup routines on the extracted text.
-For example, it knows that these lines:
-    (b) any wall, swimming pool, reservoir or bridge
-    or any other structure connected therewith; (c) any fuel pump or any
-    tank used in connection therewith
-should probably be broken at the section numbers:
-    (b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
-    (c) any fuel pump or any tank used in connection therewith
-If your region's numbering format differs significantly from this, these rules might not work.
-Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):
-* changing newlines to `\n`, and normalising quotation characters
-* removing page numbers and other boilerplate
-* stripping the table of contents (we can generate our own from the parsed document)
-* changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
 ## Parsing
 Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
-tree, each node of which knows how to serialize itself in XML format.
-While most South African by-laws are superficially very similar, there are a sufficient differences
-in their typesetting to make parsing them difficult. The grammar handles most
-edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
-between section titles before and after a section number:
-    1. Definitions
-    In this by-law, the following words ...
-    Definitions
-    1. In this by-law, the following words ...
-This must be set by the user before parsing:
-```ruby
-generator = Slaw::ZA::BylawGenerator.new
-generator.parser.options = {section_number_after_title: true}
-```
-The parser does its best not to choke on input it doesn't understand, preferring a best effort
-to a completely accurate result. For example it may not be able to work out a section heading
-and so will treat it as simply another statement in the previous section. This causes the parser
-to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.
-The grammar supports a number of subsection numbering formats, which are often mixed
-in a document to indicate different levels of nesting.
-    (a)
-    (2)
-    (3b)
-    (ii)
-    3.4
-During post-processing it works out how to nest these appropriately.
-Special words, such as ``part`` and ``chapter`` are ignored if the line starts with a backslash ``\``.
-For more information see the South African by-law grammar at
-[lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
-at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).
-## Rendering
-Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
-Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
-elements with a class attribute derived from the name of the AN element and an ID element taken
-from the node, if any. This makes it both fast and flexible, since it's easy to
-apply layout rules with CSS.
-Slaw can render either an entire document like this, or just a portion of the XML tree.
-```ruby
-# render an entire document
-renderer = Slaw::Render::HTMLRenderer.new
-puts renderer.render(bylaw.doc, '/')
-# render the first section only
-puts renderer.render(bylaw.sections[0], '/')
-```
-For more information, see [/lib/slaw/render/html.rb](/lib/slaw/render/html.rb).
-## Meta-data
-Acts and by-laws have metadata which it is not possible to get from their plain text representations,
-such as their title, date and format of publication or act number. Slaw provides some helpers
-for manipulating this meta-data. For example,
-```ruby
-bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
-print bylaw.id_uri
-bylaw.title = 'A new title'
-bylaw.name = 'a-new-title'
-bylaw.published!(date: '2014-09-28')
-print bylaw.id_uri
-```
-## Schedules
-South African acts and by-laws can have addendums called schedules. They are technically a part of
-the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
-by looking for section headings, but makes no effort to capture the format of their contents.
+tree, the nodes of which know how to serialize themselves in XML format.
-Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
-Akoma Ntoso `component` elements at the end of the XML document, with a name of `schedules`.
+Supporting formats from other country's legal traditions probably requires creating a new grammar
+and parser.
 ## Contributing
@@ -225,6 +90,7 @@ Akoma Ntoso `component` elements at the end of the XML document, with a name of
 * Slaw no longer does too much introspection of a parsed document, since that can be so tradition-dependent.
 * Move reformatting out of Slaw since it's tradition-dependent.
 * Remove definition linking, Slaw no longer supports it.
+* Remove unused code for interacting with the internals of acts.
 ### 0.17.2

data/bin/slaw CHANGED

@@ -90,8 +90,9 @@ class SlawCLI < Thor
   end
   desc "unparse FILE", "Unparse FILE from Akoma Ntoso XML back into text suitable for re-parsing"
+  option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
   def unparse(name)
-    generator = Slaw::ActGenerator.new
+    generator = Slaw::ActGenerator.new(options[:grammar] || 'za')
     doc = File.open(name, 'r') { |f| doc = generator.builder.parse_xml(f.read) }
     puts generator.text_from_act(doc)

data/lib/slaw.rb CHANGED

@@ -4,14 +4,8 @@ require 'slaw/version'
 require 'slaw/namespace'
 require 'slaw/logging'
-require 'slaw/act'
-require 'slaw/bylaw'
-require 'slaw/collection'
 require 'slaw/xml_support'
-require 'slaw/lifecycle_event'
-require 'slaw/render/html'
 require 'slaw/parse/blocklists'
 require 'slaw/parse/builder'
 require 'slaw/parse/cleanser'

data/lib/slaw/generator.rb CHANGED

@@ -7,9 +7,6 @@ module Slaw
     # [Slaw::Parse::Builder] builder used by the generator
     attr_accessor :builder
-    # The type that will hold the generated document
-    attr_accessor :document_class
     @@parsers = {}
     def initialize(grammar)
@@ -19,7 +16,6 @@ module Slaw
       @builder = Slaw::Parse::Builder.new(parser: @parser)
       @parser = @builder.parser
       @cleanser = Slaw::Parse::Cleanser.new
-      @document_class = Slaw::Act
     end
     def build_parser
@@ -39,11 +35,9 @@ module Slaw
     #
     # @param text [String] plain text
     #
-    # @return [Slaw::Act] the resulting act
+    # @return [Nokogiri::Document] the resulting xml
     def generate_from_text(text)
-      act = @document_class.new
-      act.doc = @builder.parse_and_process_text(cleanup(text))
-      act
+      @builder.parse_and_process_text(cleanup(text))
     end
     # Run basic cleanup on text, such as ensuring clean newlines
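
Note on the hunk above: `generate_from_text` used to wrap the builder's result in a `Slaw::Act` (via `document_class`) and now returns the builder's result directly, which the updated doc comment describes as a Nokogiri document. The following is a minimal sketch of what that means for callers. `StubBuilder`, `ActWrapper`, `generate_old` and `generate_new` are hypothetical stand-ins so the sketch runs without the gem; they are not Slaw's classes, and a plain string stands in for the parsed XML document.

```ruby
# Hypothetical stand-in for Slaw::Parse::Builder, so this runs without the gem.
class StubBuilder
  def parse_and_process_text(text)
    # A plain string stands in for the parsed XML document.
    "<act>#{text}</act>"
  end
end

# Hypothetical wrapper playing the role of the removed Slaw::Act.
ActWrapper = Struct.new(:doc)

# Shape of generate_from_text in 1.0.0.alpha.6: the result is wrapped.
def generate_old(builder, text, document_class = ActWrapper)
  act = document_class.new
  act.doc = builder.parse_and_process_text(text)
  act
end

# Shape of generate_from_text in 1.0.0: the document is returned directly.
def generate_new(builder, text)
  builder.parse_and_process_text(text)
end

builder = StubBuilder.new
old_result = generate_old(builder, 'Definitions')
new_result = generate_new(builder, 'Definitions')

# Callers that previously read `.doc` would now use the return value itself.
puts old_result.doc == new_result
```

Under these assumptions, code that called `generate_from_text(text).doc` would drop the `.doc`, and code that relied on other `Slaw::Act` helpers would need a replacement, since `act.rb` is removed in this release.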

data/lib/slaw/grammars/pl/act.treetop CHANGED

@@ -111,32 +111,28 @@ module Slaw
         # these are used externally and provide support when parsing just
         # a particular portion of a document
-        rule divisions
-          children:division+ <GroupNode>
-        end
-        rule subdivisions
-          children:subdivision+ <GroupNode>
+        rule articles
+          children:article+ <GroupNode>
         end
         rule chapters
           children:chapter+ <GroupNode>
         end
-        rule articles
-          children:article+ <GroupNode>
-        end
-        rule sections
-          children:section+ <GroupNode>
+        rule divisions
+          children:division+ <GroupNode>
         end
         rule paragraphs
           children:paragraph+ <GroupNode>
         end
-        rule points
-          children:point+ <GroupNode>
+        rule sections
+          children:section+ <GroupNode>
+        end
+        rule subdivisions
+          children:subdivision+ <GroupNode>
         end
         ##########

data/lib/slaw/grammars/pl/act_text.xsl ADDED

@@ -0,0 +1,271 @@
+<?xml version="1.0"?>
+<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
+  xmlns:a="http://www.akomantoso.org/2.0"
+  exclude-result-prefixes="a">
+  <xsl:output method="text" indent="no" omit-xml-declaration="yes" />
+  <xsl:strip-space elements="*"/>
+  <!-- adds a backslash to the start of the value param, if necessary -->
+  <xsl:template name="escape">
+    <xsl:param name="value"/>
+    <xsl:variable name="prefix" select="translate(substring($value, 1, 10), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')" />
+    <xsl:variable name="numprefix" select="translate(translate(substring($prefix, 1, 3), '1234567890', 'NNNNNNNNNN'), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'NNNNNNNNNNNNNNNNNNNNNNNNNN')" />
+    <!-- p tags must escape initial content that looks like a block element marker.
+         Note that the two hyphens are different characters. -->
+    <xsl:if test="$prefix = 'BODY' or
+                  $prefix = 'PREAMBLE' or
+                  $prefix = 'PREFACE' or
+                  starts-with($prefix, 'ROZDZIA') or
+                  starts-with($prefix, 'DZIA') or
+                  starts-with($prefix, 'ODDZIA') or
+                  starts-with($prefix, 'ART.') or
+                  starts-with($prefix, '§') or
+                  starts-with($prefix, 'SCHEDULE ') or
+                  starts-with($prefix, '{|') or
+                  starts-with($numprefix, 'N)') or
+                  starts-with($numprefix, 'NN)') or
+                  starts-with($numprefix, 'N.') or
+                  starts-with($numprefix, 'NN.') or
+                  starts-with($numprefix, '-') or
+                  starts-with($numprefix, '–')">
+      <xsl:text>\</xsl:text>
+    </xsl:if>
+    <xsl:value-of select="$value"/>
+  </xsl:template>
+  <xsl:template match="a:act">
+    <xsl:apply-templates select="a:coverPage" />
+    <xsl:apply-templates select="a:preface" />
+    <xsl:apply-templates select="a:preamble" />
+    <xsl:apply-templates select="a:body" />
+    <xsl:apply-templates select="a:conclusions" />
+  </xsl:template>
+  <xsl:template match="a:preface">
+    <xsl:text>PREFACE</xsl:text>
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates />
+  </xsl:template>
+  <xsl:template match="a:preamble">
+    <xsl:text>PREAMBLE</xsl:text>
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates />
+  </xsl:template>
+  <xsl:template match="a:division">
+    <xsl:text>Dział </xsl:text>
+    <xsl:value-of select="./a:num" />
+    <xsl:text> - </xsl:text>
+    <xsl:value-of select="./a:heading" />
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num) and not(self::a:heading)]" />
+  </xsl:template>
+  <xsl:template match="a:chapter">
+    <xsl:text>Rozdział </xsl:text>
+    <xsl:value-of select="./a:num" />
+    <xsl:text> - </xsl:text>
+    <xsl:value-of select="./a:heading" />
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num) and not(self::a:heading)]" />
+  </xsl:template>
+  <xsl:template match="a:article">
+    <xsl:text>Art. </xsl:text>
+    <xsl:value-of select="a:num" />
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num)]" />
+  </xsl:template>
+  <xsl:template match="a:section">
+    <xsl:text>§ </xsl:text>
+    <xsl:value-of select="a:num" />
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num)]" />
+  </xsl:template>
+  <xsl:template match="a:paragraph">
+    <xsl:if test="a:num != ''">
+      <xsl:value-of select="a:num" />
+      <xsl:text> </xsl:text>
+    </xsl:if>
+    <xsl:apply-templates select="./*[not(self::a:num) and not(self::a:heading)]" />
+  </xsl:template>
+  <xsl:template match="a:indent">
+    <xsl:value-of select="a:num" />
+    <xsl:text>- </xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num)]" />
+  </xsl:template>
+  <!-- these are block elements and have a newline at the end -->
+  <xsl:template match="a:heading">
+    <xsl:apply-templates />
+    <xsl:text>
+</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:p">
+    <xsl:apply-templates/>
+    <!-- p tags must end with a newline -->
+    <xsl:text>
+</xsl:text>
+  </xsl:template>
+  <!-- numbered lists -->
+  <xsl:template match="a:item | a:alinea | a:point">
+    <xsl:value-of select="./a:num" />
+    <xsl:text> </xsl:text>
+    <xsl:apply-templates select="./*[not(self::a:num)]" />
+  </xsl:template>
+  <xsl:template match="a:list">
+    <xsl:if test="a:intro != ''">
+      <xsl:value-of select="a:intro" />
+      <xsl:text>
+</xsl:text>
+    </xsl:if>
+    <xsl:apply-templates select="./*[not(self::a:intro)]" />
+  </xsl:template>
+  <!-- first text nodes of these elems must be escaped if they have special chars -->
+  <xsl:template match="a:p[not(ancestor::a:table)]/text()[1] | a:intro/text()[1]">
+    <xsl:call-template name="escape">
+      <xsl:with-param name="value" select="." />
+    </xsl:call-template>
+  </xsl:template>
+  <!-- components/schedules -->
+  <xsl:template match="a:doc">
+    <xsl:text>Schedule - </xsl:text>
+    <xsl:value-of select="a:meta/a:identification/a:FRBRWork/a:FRBRalias/@value" />
+    <xsl:if test="a:mainBody/a:article/a:heading">
+      <xsl:text>
+</xsl:text>
+      <xsl:value-of select="a:mainBody/a:article/a:heading" />
+    </xsl:if>
+    <xsl:text>
+</xsl:text>
+    <xsl:apply-templates select="a:mainBody" />
+  </xsl:template>
+  <xsl:template match="a:mainBody/a:article/a:heading">
+    <!-- no-op, this is handled by the schedules template above -->
+  </xsl:template>
+  <!-- tables -->
+  <xsl:template match="a:table">
+    <xsl:text>{| </xsl:text>
+    <!-- attributes -->
+    <xsl:for-each select="@*[local-name()!='id']">
+      <xsl:value-of select="local-name(.)" />
+      <xsl:text>="</xsl:text>
+      <xsl:value-of select="." />
+      <xsl:text>" </xsl:text>
+    </xsl:for-each>
+    <xsl:text>
+|-</xsl:text>
+    <xsl:apply-templates />
+    <xsl:text>
+|}
+</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:tr">
+    <xsl:apply-templates />
+    <xsl:text>
+|-</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:th|a:td">
+    <xsl:choose>
+      <xsl:when test="local-name(.) = 'th'">
+        <xsl:text>
+! </xsl:text>
+      </xsl:when>
+      <xsl:when test="local-name(.) = 'td'">
+        <xsl:text>
+| </xsl:text>
+      </xsl:when>
+    </xsl:choose>
+    <!-- attributes -->
+    <xsl:if test="@*">
+      <xsl:for-each select="@*">
+        <xsl:value-of select="local-name(.)" />
+        <xsl:text>="</xsl:text>
+        <xsl:value-of select="." />
+        <xsl:text>" </xsl:text>
+      </xsl:for-each>
+      <xsl:text>| </xsl:text>
+    </xsl:if>
+    <xsl:apply-templates />
+  </xsl:template>
+  <!-- don't end p tags with newlines in tables -->
+  <xsl:template match="a:table//a:p">
+    <xsl:apply-templates />
+  </xsl:template>
+  <!-- END tables -->
+  <xsl:template match="a:remark">
+    <xsl:text>[</xsl:text>
+    <xsl:apply-templates />
+    <xsl:text>]</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:ref">
+    <xsl:text>[</xsl:text>
+    <xsl:apply-templates />
+    <xsl:text>](</xsl:text>
+    <xsl:value-of select="@href" />
+    <xsl:text>)</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:img">
+    <xsl:text>![</xsl:text>
+    <xsl:value-of select="@alt" />
+    <xsl:text>](</xsl:text>
+    <xsl:value-of select="@src" />
+    <xsl:text>)</xsl:text>
+  </xsl:template>
+  <xsl:template match="a:eol">
+    <xsl:text>
+</xsl:text>
+  </xsl:template>
+  <!-- for most nodes, just dump their text content -->
+  <xsl:template match="*">
+    <xsl:text/><xsl:apply-templates /><xsl:text/>
+  </xsl:template>
+</xsl:stylesheet>
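
Note on the `escape` template in the stylesheet above: it prefixes a backslash when the start of a text node could be mistaken for a block marker on re-parsing. The same rule can be sketched in Ruby to make the matching easier to follow. This is an illustrative mirror of the XSLT conditions only; the method name `escape_block_marker` is hypothetical and not part of Slaw.

```ruby
# Hypothetical Ruby mirror of the `escape` template in act_text.xsl.
# The method name is illustrative and not part of Slaw.
def escape_block_marker(value)
  exact_markers  = %w[BODY PREAMBLE PREFACE]
  prefix_markers = ['ROZDZIA', 'DZIA', 'ODDZIA', 'ART.', '§', 'SCHEDULE ', '{|']
  # The last two entries are different characters: a hyphen and an en dash.
  num_markers    = ['N)', 'NN)', 'N.', 'NN.', '-', '–']

  # XSLT translate() maps only the ASCII letters listed, so use tr rather
  # than upcase: a Polish "ł" is left untouched, as in the stylesheet.
  prefix = value[0, 10].to_s.tr('a-z', 'A-Z')
  # Collapse digits and ASCII capitals in the first three characters to "N".
  numprefix = prefix[0, 3].to_s.tr('0-9A-Z', 'N')

  needs_escape = exact_markers.include?(prefix) ||
                 prefix.start_with?(*prefix_markers) ||
                 numprefix.start_with?(*num_markers)
  needs_escape ? "\\#{value}" : value
end

puts escape_block_marker('Art. 5 stanowi')
puts escape_block_marker('1) punkt pierwszy')
puts escape_block_marker('Zwykły tekst')
```

Because letters as well as digits collapse to `N`, text that begins with one or two letters followed by `)` or `.` is also escaped, which matches the stylesheet's conditions as written.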