RubyGems - slaw - Versions diffs - 1.0.4 → 2.0.0 - Mend

slaw 1.0.4 → 2.0.0

Files changed (12)

checksums.yaml +4 -4
data/.travis.yml +3 -3
data/README.md +18 -14
data/bin/slaw +3 -15
data/lib/slaw/extract/extractor.rb +3 -102
data/lib/slaw/generator.rb +19 -6
data/lib/slaw/parse/builder.rb +0 -17
data/lib/slaw/version.rb +1 -1
data/slaw.gemspec +0 -5
data/spec/parse/builder_spec.rb +0 -38
metadata +2 -45
data/lib/slaw/extract/yomu_patch.rb +0 -9

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 963f970d7da2acd7fd2678973515073f7f9b5226
-  data.tar.gz: b5fcc0de969f1b6c286d70017bd15c61e4e959d4
+  metadata.gz: 1639f10e008ddcdd149e040767d97476c9183ad3
+  data.tar.gz: 95cce3c38e35910731bfe8776974a40ba3be7c32
 SHA512:
-  metadata.gz: 8617cd183a1af99370c17457ff19218a41d2ebf9f63aec1e1ad3101712bab518a9a89a272b912a631860f5c73d5918570c093dc024d6a53445268f196191cb59
-  data.tar.gz: 04e061c52b5ebbf4a21062ece4e35b1f8183f004d74996c8289b763cb5e931e9da4eb5bb5407059f18b234ef85e15018a05f20ae56c21dd7b83d77b48b566205
+  metadata.gz: 673521a6b0be293b57f7cd8279fb2df277d7c8263824fe5a007e69efaf81b76d8a2a3ef826589abe730d6ab4170f5d95ab42e915c30c44e18d7d15774715bdfa
+  data.tar.gz: c4e38899b280727459cbfd0982ceed856bbeb683b51e77839ba9c7f14b7cf7ff204f95eb2db52a015c1ffe8d116e0ca9bb87149d7f80168c65bf2965d051a196

data/.travis.yml CHANGED

@@ -1,7 +1,7 @@
 language: ruby
 rvm:
-  - 2.2.8
-  - 2.3.5
-  - 2.4.2
+  - 2.6.2
+  - 2.5.4
+  - 2.4.5
 before_install:
   - gem update bundler

data/README.md CHANGED

@@ -1,6 +1,6 @@
 # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
-Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text documents.
 It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
 traditions in these countries:
@@ -30,19 +30,9 @@ Or install it with:
     $ gem install slaw
-To run PDF extraction you will also need [poppler's pdftotext](https://poppler.freedesktop.org/).
-If you're on a Mac, you can use:
-    $ brew install poppler
-You may also need Ghostscript to remove password protection from PDF files. This is
-installed by default on most systems (including Mac). On Ubuntu you can use:
-    $ sudo apt-get install ghostscript
 The simplest way to use Slaw is via the commandline:
-    $ slaw parse myfile.pdf --grammar za
+    $ slaw parse myfile.text --grammar za
 ## Overview
@@ -50,8 +40,8 @@ Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
 standard for legislative documents. It first parses plain text using a grammar
 and then generates XML from the resulting syntax tree.
-Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
-for extracting and cleaning up text from PDFs before parsing it. Extracting text from
+Most by-laws in South Africa are available as PDF documents. You will therefore
+need to extract the text from the PDF first, using a tool like pdftotext.
 PDFs can product oddities (such as oddly wrapped lines) and Slaw has a number of
 rules-of-thumb for correcting these. These rules are based on South African
 by-laws and may not be suitable for all regions.
@@ -73,6 +63,14 @@ tree, the nodes of which know how to serialize themselves in XML format.
 Supporting formats from other country's legal traditions probably requires creating a new grammar
 and parser.
+## Adding your own grammar
+Slaw can dynamically load your custom Treetop grammars. When called with ``--grammar xy``, Slaw
+tries to require `slaw/grammars/xy/act` and instantiate the parser class ``Slaw::Grammars::XY::ActParser``.
+Slaw always uses the rule `act` as the root of the parser.
+You can create your own grammar by creating a gem that provides these files and classes.
 ## Contributing
 1. Fork it at http://github.com/longhotsummer/slaw/fork
@@ -86,6 +84,12 @@ and parser.
 ## Changelog
+### 2.0.0 (?)
+* Remove support for PDFs. Do text extraction from PDFs outside of this library.
+* Support dynamically loading grammars from other gems.
+* Don't change ALL CAPS headings to Sentence Case.
 ### 1.0.4 (5 February 2019)
 * SECURITY require Nokogiri 1.8.5 or greater to address https://nvd.nist.gov/vuln/detail/CVE-2018-14404
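
Note on the PDF removal: since 2.0.0 no longer extracts text from PDFs, callers need to run a tool such as pdftotext themselves before invoking `slaw parse`. The following is a minimal sketch of that step, reusing the flags from the `pdf_to_text_cmd` method that this release removes. The helper names `pdftotext_command` and `pdf_to_text` are hypothetical and not part of Slaw, and the sketch assumes a `pdftotext` binary is on the PATH.

```ruby
require 'open3'

# Hypothetical helper (not part of Slaw): build the same pdftotext
# invocation that the removed Extractor#pdf_to_text_cmd used to build.
def pdftotext_command(filename, cropbox: nil, binary: "pdftotext")
  cmd = [binary, "-enc", "UTF-8", "-nopgbrk"]
  if cropbox
    # cropbox is [left, top, width, height]
    cmd += %w[-x -y -W -H].zip(cropbox.map(&:to_s)).flatten
  end
  # "-" tells pdftotext to write the text to stdout
  cmd + [filename, "-"]
end

# Hypothetical helper: run pdftotext and return the extracted text,
# or nil if the command exits with a non-zero status.
def pdf_to_text(filename, **opts)
  stdout, status = Open3.capture2(*pdftotext_command(filename, **opts))
  status.success? ? stdout : nil
end
```

The returned text can be written to a file and passed to `slaw parse myfile.text --grammar za`. Unlike the removed code, this sketch does not retry after stripping password protection with Ghostscript.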

data/bin/slaw CHANGED

@@ -4,8 +4,6 @@ require 'thor'
 require 'slaw'
 class SlawCLI < Thor
-  # TODO: support different grammars and locales
   # Exit with non-zero exit code on failure.
   # See https://github.com/erikhuda/thor/issues/244
   def self.exit_on_failure?
@@ -15,29 +13,19 @@ class SlawCLI < Thor
   class_option :verbose, type: :boolean, desc: "Display log output on stderr"
   desc "parse FILE", "Parse FILE into Akoma Ntoso XML"
-  option :input, enum: ['text', 'pdf'], desc: "Type of input if it can't be determined automatically"
-  option :pdftotext, desc: "Location of the pdftotext binary if not in PATH"
+  option :input, enum: ['text', 'html'], desc: "Type of input if it can't be determined automatically"
   option :fragment, type: :string, desc: "Akoma Ntoso element name that the imported text represents. Support depends on the grammar."
   option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
   option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
-  option :crop, type: :string, desc: "Crop box for PDF files, as 'left,top,width,height'."
   option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
   def parse(name)
     logging
-    Slaw::Extract::Extractor.pdftotext_path = options[:pdftotext] if options[:pdftotext]
     extractor = Slaw::Extract::Extractor.new
-    if options[:crop]
-      extractor.cropbox = options[:crop].split(',').map(&:to_i)
-      if extractor.cropbox.length != 4
-        raise Thor::Error.new("--crop requires four comma-separated integers")
-      end
-    end
     case options[:input]
-    when 'pdf'
-      text = extractor.extract_from_pdf(name)
+    when 'html'
+      text = extractor.extract_from_html(name)
     when 'text'
       text = extractor.extract_from_text(name)
     else

data/lib/slaw/extract/extractor.rb CHANGED

@@ -1,24 +1,12 @@
-require 'open3'
-require 'tempfile'
 require 'mimemagic'
 module Slaw
   module Extract
-    # Routines for extracting and cleaning up context from other formats, such as PDF.
-    #
-    # You may need to set the location of the `pdftotext` binary.
-    #
-    # On Mac OS X, use `brew install xpdf` or download from http://www.foolabs.com/xpdf/download.html
-    #
-    # On Heroku, you'll need to do some hoop jumping, see http://theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
+    # Routines for extracting and cleaning up context from other formats, such as HTML.
     class Extractor
       include Slaw::Logging
-      @@pdftotext_path = "pdftotext"
-      attr_accessor :cropbox
       # Extract text from a file.
       #
       # @param filename [String] filename to extract from
@@ -28,61 +16,13 @@ module Slaw
         mimetype = get_mimetype(filename)
         case mimetype && mimetype.type
-        when 'application/pdf'
-          extract_from_pdf(filename)
-        when 'text/html', nil
+        when 'text/html'
           extract_from_html(filename)
         when 'text/plain', nil
           extract_from_text(filename)
         else
-          text = extract_via_tika(filename)
-          if text.empty? or text.nil?
-            raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
-          end
-          text
-        end
-      end
-      # Extract text from a PDF
-      #
-      # @param filename [String] filename to extract from
-      #
-      # @return [String] extracted text
-      def extract_from_pdf(filename)
-        retried = false
-        while true
-          cmd = pdf_to_text_cmd(filename)
-          logger.info("Executing: #{cmd}")
-          stdout, status = Open3.capture2(*cmd)
-          case status.exitstatus
-          when 0
-            return stdout
-          when 3
-            return nil if retried
-            retried = true
-            self.remove_pdf_password(filename)
-          else
-            return nil
-          end
-        end
-      end
-      # Build a command for the external PDF-to-text utility.
-      #
-      # @param filename [String] the pdf file
-      #
-      # @return [Array<String>] command and params to execute
-      def pdf_to_text_cmd(filename)
-        cmd = [Extractor.pdftotext_path, "-enc", "UTF-8", "-nopgbrk"]
-        if @cropbox
-          # left, top, width, height
-          cmd += "-x -y -W -H".split.zip(@cropbox.map(&:to_s)).flatten
+          raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
         end
-        cmd + [filename, "-"]
       end
       def extract_from_text(filename)
@@ -93,21 +33,6 @@ module Slaw
         html_to_text(File.read(filename))
       end
-      # Extract text from +filename+ by sending it to apache tika
-      # http://tika.apache.org/
-      def extract_via_tika(filename)
-        # the Yomu gem falls over when trying to write large amounts of data
-        # the JVM stdin, so we manually call java ourselves, relying on yomu
-        # to supply the gem
-        require 'slaw/extract/yomu_patch'
-        logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")
-        html = Yomu.text_from_file(filename)
-        logger.info("Tika returned #{html.length} bytes")
-        # transform html into text
-        html_to_text(html)
-      end
       def html_to_text(html)
         here = File.dirname(__FILE__)
         xslt = Nokogiri::XSLT(File.open(File.join([here, 'html_to_akn_text.xsl'])))
@@ -117,34 +42,10 @@ module Slaw
         text.sub(/^<\?xml [^>]*>/, '')
       end
-      def remove_pdf_password(filename)
-        file = Tempfile.new('steno')
-        begin
-          logger.info("Trying to remove password from #{filename}")
-          cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{file.path} -c .setpdfwrite -f #{filename}".split(" ")
-          logger.info("Executing: #{cmd}")
-          Open3.capture2(*cmd)
-          FileUtils.move(file.path, filename)
-        ensure
-          file.close
-          file.unlink
-        end
-      end
       def get_mimetype(filename)
         File.open(filename) { |f| MimeMagic.by_magic(f) } \
           || MimeMagic.by_path(filename)
       end
-      # Get location of the pdftotext executable for all instances.
-      def self.pdftotext_path
-        @@pdftotext_path
-      end
-      # Set location of the pdftotext executable for all instances.
-      def self.pdftotext_path=(val)
-        @@pdftotext_path = val
-      end
     end
   end
 end

data/lib/slaw/generator.rb CHANGED

@@ -1,3 +1,6 @@
+require 'polyglot'
+require 'treetop'
 module Slaw
   # Base class for generating Act documents
   class ActGenerator
@@ -20,15 +23,18 @@ module Slaw
     def build_parser
       unless @@parsers[@grammar]
-        # load the grammar
-        grammar_file = File.dirname(__FILE__) + "/grammars/#{@grammar}/act.treetop"
-        Treetop.load(grammar_file)
+        # load the grammar with polyglot and treetop
+        # this will ensure the class below is available
+        # see: http://cjheath.github.io/treetop/using_in_ruby.html
+        require "slaw/grammars/#{@grammar}/act"
         grammar_class = "Slaw::Grammars::#{@grammar.upcase}::ActParser"
         @@parsers[@grammar] = eval(grammar_class)
       end
       @parser = @@parsers[@grammar].new
+      @parser.root = :act
+      @parser
     end
     # Generate a Slaw::Act instance from plain text.
@@ -76,8 +82,15 @@ module Slaw
     # Transform an Akoma Ntoso XML document back into a plain-text version
     # suitable for re-parsing back into XML with no loss of structure.
     def text_from_act(doc)
-      xslt = Nokogiri::XSLT(File.read(File.join([File.dirname(__FILE__), "grammars/#{@grammar}/act_text.xsl"])))
-      xslt.transform(doc).child.to_xml
+      # look on the load path for an XSL file for this grammar
+      filename = "/slaw/grammars/#{@grammar}/act_text.xsl"
+      if dir = $LOAD_PATH.find { |p| File.exist?(p + filename) }
+        xslt = Nokogiri::XSLT(File.read(dir + filename))
+        xslt.transform(doc).child.to_xml
+      else
+        raise "Unable to find text XSL for grammar #{@grammar}: #{fragment}"
+      end
     end
   end
 end
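
Note on grammar loading: the hunks above replace a fixed path inside the gem with a `require` and a `$LOAD_PATH` search, which is what lets a separate gem supply a grammar. The sketch below restates the naming conventions visible in the diff as small functions. The helper names are hypothetical and not part of Slaw. Note also that `fragment` in the new error message is not defined in the lines shown.

```ruby
# Hypothetical helpers (not part of Slaw) that restate the conventions
# used by ActGenerator#build_parser and #text_from_act in 2.0.0.

# Path that is passed to `require` for a given grammar name.
def grammar_require_path(grammar)
  "slaw/grammars/#{grammar}/act"
end

# Name of the parser class expected to exist after the require.
def grammar_class_name(grammar)
  "Slaw::Grammars::#{grammar.upcase}::ActParser"
end

# Mirror of the load-path search in text_from_act: return the full path
# of the grammar's act_text.xsl, or nil if no load-path entry provides it.
def find_text_xsl(grammar, load_path = $LOAD_PATH)
  filename = "/slaw/grammars/#{grammar}/act_text.xsl"
  dir = load_path.find { |p| File.exist?(p.to_s + filename) }
  dir && dir.to_s + filename
end
```

Under these conventions, a gem providing a grammar `xy` would need its `lib` directory to contain something loadable as `slaw/grammars/xy/act` that defines `Slaw::Grammars::XY::ActParser`, plus `slaw/grammars/xy/act_text.xsl` if `text_from_act` is used.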

data/lib/slaw/parse/builder.rb CHANGED

@@ -151,28 +151,11 @@ module Slaw
       #
       # @return [Nokogiri::XML::Document] the updated document
       def postprocess(doc)
-        normalise_headings(doc)
         adjust_blocklists(doc)
         doc
       end
-      # Change CAPCASE headings into Sentence case.
-      #
-      # @param doc [Nokogiri::XML::Document]
-      def normalise_headings(doc)
-        logger.info("Normalising headings")
-        nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
-                doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)
-        nodes.each do |heading|
-          if !(heading.content =~ /[a-z]/)
-            heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
-          end
-        end
-      end
       # Adjust blocklists:
       #
       # - nest them correctly

data/lib/slaw/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Slaw
-  VERSION = "1.0.4"
+  VERSION = "2.0.0"
 end

data/slaw.gemspec CHANGED

@@ -18,7 +18,6 @@ Gem::Specification.new do |spec|
   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
-  spec.add_development_dependency "bundler", "~> 1.5"
   spec.add_development_dependency "rake", "~> 10.3.1"
   spec.add_development_dependency "rspec", "~> 2.14.1"
@@ -27,8 +26,4 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency "log4r", "~> 1.1.10"
   spec.add_runtime_dependency "thor", "~> 0.19.1"
   spec.add_runtime_dependency "mimemagic", "~> 0.2.1"
-  spec.add_runtime_dependency 'yomu', '~> 0.2.2'
-  # anchor twitter-text to avoid bug in 1.14.3
-  # https://github.com/twitter/twitter-text/issues/162
-  spec.add_runtime_dependency 'twitter-text', '~> 1.12.0'
 end

data/spec/parse/builder_spec.rb CHANGED

@@ -715,44 +715,6 @@ XML
     end
   end
-  describe '#normalise_headings' do
-    it 'should normalise ALL CAPS headings' do
-      doc = xml2doc(section(<<XML
-          <heading>DEFINITIONS FOR A.B.C.</heading>
-          <content>
-            <p></p>
-          </content>
-XML
-      ))
-      subject.normalise_headings(doc)
-      doc.to_s.should == section(<<XML
-        <heading>Definitions for a.b.c.</heading>
-        <content>
-          <p/>
-        </content>
-XML
-      )
-    end
-    it 'should not normalise normal headings' do
-      doc = xml2doc(section(<<XML
-          <heading>Definitions for A.B.C.</heading>
-          <content>
-            <p></p>
-          </content>
-XML
-      ))
-      subject.normalise_headings(doc)
-      doc.to_s.should == section(<<XML
-        <heading>Definitions for A.B.C.</heading>
-        <content>
-          <p/>
-        </content>
-XML
-      )
-    end
-  end
   describe '#preprocess' do
     it 'should split inline table cells into block table cells' do
       text = <<EOS

metadata CHANGED

@@ -1,29 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: slaw
 version: !ruby/object:Gem::Version
-  version: 1.0.4
+  version: 2.0.0
 platform: ruby
 authors:
 - Greg Kempe
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-02-05 00:00:00.000000000 Z
+date: 2019-03-15 00:00:00.000000000 Z
 dependencies:
-- !ruby/object:Gem::Dependency
-  name: bundler
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '1.5'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '1.5'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -122,34 +108,6 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.2.1
-- !ruby/object:Gem::Dependency
-  name: yomu
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.2.2
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.2.2
-- !ruby/object:Gem::Dependency
-  name: twitter-text
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.12.0
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.12.0
 description: Slaw is a lightweight library for rendering and generating Akoma Ntoso
   acts from plain text and PDF documents.
 email:
@@ -169,7 +127,6 @@ files:
 - lib/slaw.rb
 - lib/slaw/extract/extractor.rb
 - lib/slaw/extract/html_to_akn_text.xsl
-- lib/slaw/extract/yomu_patch.rb
 - lib/slaw/generator.rb
 - lib/slaw/grammars/core_nodes.rb
 - lib/slaw/grammars/inlines.treetop

data/lib/slaw/extract/yomu_patch.rb DELETED

@@ -1,9 +0,0 @@
-require 'yomu'
-class Yomu
-  def self.text_from_file(filename)
-    IO.popen("#{java} -Djava.awt.headless=true -jar #{Yomu::JARPATH} --html '#{filename}'", 'r') do |io|
-      io.read
-    end
-  end
-end