slaw 1.0.4 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 963f970d7da2acd7fd2678973515073f7f9b5226
- data.tar.gz: b5fcc0de969f1b6c286d70017bd15c61e4e959d4
+ metadata.gz: 1639f10e008ddcdd149e040767d97476c9183ad3
+ data.tar.gz: 95cce3c38e35910731bfe8776974a40ba3be7c32
  SHA512:
- metadata.gz: 8617cd183a1af99370c17457ff19218a41d2ebf9f63aec1e1ad3101712bab518a9a89a272b912a631860f5c73d5918570c093dc024d6a53445268f196191cb59
- data.tar.gz: 04e061c52b5ebbf4a21062ece4e35b1f8183f004d74996c8289b763cb5e931e9da4eb5bb5407059f18b234ef85e15018a05f20ae56c21dd7b83d77b48b566205
+ metadata.gz: 673521a6b0be293b57f7cd8279fb2df277d7c8263824fe5a007e69efaf81b76d8a2a3ef826589abe730d6ab4170f5d95ab42e915c30c44e18d7d15774715bdfa
+ data.tar.gz: c4e38899b280727459cbfd0982ceed856bbeb683b51e77839ba9c7f14b7cf7ff204f95eb2db52a015c1ffe8d116e0ca9bb87149d7f80168c65bf2965d051a196
data/.travis.yml CHANGED
@@ -1,7 +1,7 @@
  language: ruby
  rvm:
- - 2.2.8
- - 2.3.5
- - 2.4.2
+ - 2.6.2
+ - 2.5.4
+ - 2.4.5
  before_install:
  - gem update bundler
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
 
- Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+ Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text documents.
  It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
  traditions in these countries:
 
@@ -30,19 +30,9 @@ Or install it with:
 
  $ gem install slaw
 
- To run PDF extraction you will also need [poppler's pdftotext](https://poppler.freedesktop.org/).
- If you're on a Mac, you can use:
-
- $ brew install poppler
-
- You may also need Ghostscript to remove password protection from PDF files. This is
- installed by default on most systems (including Mac). On Ubuntu you can use:
-
- $ sudo apt-get install ghostscript
-
  The simplest way to use Slaw is via the commandline:
 
- $ slaw parse myfile.pdf --grammar za
+ $ slaw parse myfile.text --grammar za
 
  ## Overview
 
@@ -50,8 +40,8 @@ Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
  standard for legislative documents. It first parses plain text using a grammar
  and then generates XML from the resulting syntax tree.
 
- Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
- for extracting and cleaning up text from PDFs before parsing it. Extracting text from
+ Most by-laws in South Africa are available as PDF documents. You will therefore
+ need to extract the text from the PDF first, using a tool like pdftotext.
  PDFs can product oddities (such as oddly wrapped lines) and Slaw has a number of
  rules-of-thumb for correcting these. These rules are based on South African
  by-laws and may not be suitable for all regions.
@@ -73,6 +63,14 @@ tree, the nodes of which know how to serialize themselves in XML format.
  Supporting formats from other country's legal traditions probably requires creating a new grammar
  and parser.
 
+ ## Adding your own grammar
+
+ Slaw can dynamically load your custom Treetop grammars. When called with ``--grammar xy``, Slaw
+ tries to require `slaw/grammars/xy/act` and instantiate the parser class ``Slaw::Grammars::XY::ActParser``.
+ Slaw always uses the rule `act` as the root of the parser.
+
+ You can create your own grammar by creating a gem that provides these files and classes.
+
  ## Contributing
 
  1. Fork it at http://github.com/longhotsummer/slaw/fork
@@ -86,6 +84,12 @@ and parser.
 
  ## Changelog
 
+ ### 2.0.0 (?)
+
+ * Remove support for PDFs. Do text extraction from PDFs outside of this library.
+ * Support dynamically loading grammars from other gems.
+ * Don't change ALL CAPS headings to Sentence Case.
+
  ### 1.0.4 (5 February 2019)
 
  * SECURITY require Nokogiri 1.8.5 or greater to address https://nvd.nist.gov/vuln/detail/CVE-2018-14404
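Note on the README change above: as a rough illustration of the new "Adding your own grammar" section, the sketch below shows how a grammar name maps to the file Slaw requires and the parser class it instantiates. The helper `grammar_names` is hypothetical and not part of Slaw; only the two string patterns are taken from this diff.

```ruby
# Hypothetical helper (not part of Slaw): maps a grammar name to the
# require path and parser class name described in the README.
def grammar_names(grammar)
  {
    require_path: "slaw/grammars/#{grammar}/act",
    parser_class: "Slaw::Grammars::#{grammar.upcase}::ActParser"
  }
end

names = grammar_names("xy")
puts names[:require_path]
puts names[:parser_class]
```

Judging by the generator changes in this diff, a gem providing the `xy` grammar would presumably also need `slaw/grammars/xy/act_text.xsl` on the load path if `text_from_act` is used.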
data/bin/slaw CHANGED
@@ -4,8 +4,6 @@ require 'thor'
  require 'slaw'
 
  class SlawCLI < Thor
- # TODO: support different grammars and locales
-
  # Exit with non-zero exit code on failure.
  # See https://github.com/erikhuda/thor/issues/244
  def self.exit_on_failure?
@@ -15,29 +13,19 @@ class SlawCLI < Thor
  class_option :verbose, type: :boolean, desc: "Display log output on stderr"
 
  desc "parse FILE", "Parse FILE into Akoma Ntoso XML"
- option :input, enum: ['text', 'pdf'], desc: "Type of input if it can't be determined automatically"
- option :pdftotext, desc: "Location of the pdftotext binary if not in PATH"
+ option :input, enum: ['text', 'html'], desc: "Type of input if it can't be determined automatically"
  option :fragment, type: :string, desc: "Akoma Ntoso element name that the imported text represents. Support depends on the grammar."
  option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
  option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
- option :crop, type: :string, desc: "Crop box for PDF files, as 'left,top,width,height'."
  option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
  def parse(name)
  logging
 
- Slaw::Extract::Extractor.pdftotext_path = options[:pdftotext] if options[:pdftotext]
  extractor = Slaw::Extract::Extractor.new
 
- if options[:crop]
- extractor.cropbox = options[:crop].split(',').map(&:to_i)
- if extractor.cropbox.length != 4
- raise Thor::Error.new("--crop requires four comma-separated integers")
- end
- end
-
  case options[:input]
- when 'pdf'
- text = extractor.extract_from_pdf(name)
+ when 'html'
+ text = extractor.extract_from_html(name)
  when 'text'
  text = extractor.extract_from_text(name)
  else
@@ -1,24 +1,12 @@
- require 'open3'
- require 'tempfile'
  require 'mimemagic'
 
  module Slaw
  module Extract
 
- # Routines for extracting and cleaning up context from other formats, such as PDF.
- #
- # You may need to set the location of the `pdftotext` binary.
- #
- # On Mac OS X, use `brew install xpdf` or download from http://www.foolabs.com/xpdf/download.html
- #
- # On Heroku, you'll need to do some hoop jumping, see http://theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
+ # Routines for extracting and cleaning up context from other formats, such as HTML.
  class Extractor
  include Slaw::Logging
 
- @@pdftotext_path = "pdftotext"
-
- attr_accessor :cropbox
-
  # Extract text from a file.
  #
  # @param filename [String] filename to extract from
@@ -28,61 +16,13 @@ module Slaw
  mimetype = get_mimetype(filename)
 
  case mimetype && mimetype.type
- when 'application/pdf'
- extract_from_pdf(filename)
- when 'text/html', nil
+ when 'text/html'
  extract_from_html(filename)
  when 'text/plain', nil
  extract_from_text(filename)
  else
- text = extract_via_tika(filename)
- if text.empty? or text.nil?
- raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
- end
- text
- end
- end
-
- # Extract text from a PDF
- #
- # @param filename [String] filename to extract from
- #
- # @return [String] extracted text
- def extract_from_pdf(filename)
- retried = false
-
- while true
- cmd = pdf_to_text_cmd(filename)
- logger.info("Executing: #{cmd}")
- stdout, status = Open3.capture2(*cmd)
-
- case status.exitstatus
- when 0
- return stdout
- when 3
- return nil if retried
- retried = true
- self.remove_pdf_password(filename)
- else
- return nil
- end
- end
- end
-
- # Build a command for the external PDF-to-text utility.
- #
- # @param filename [String] the pdf file
- #
- # @return [Array<String>] command and params to execute
- def pdf_to_text_cmd(filename)
- cmd = [Extractor.pdftotext_path, "-enc", "UTF-8", "-nopgbrk"]
-
- if @cropbox
- # left, top, width, height
- cmd += "-x -y -W -H".split.zip(@cropbox.map(&:to_s)).flatten
+ raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
  end
-
- cmd + [filename, "-"]
  end
 
  def extract_from_text(filename)
@@ -93,21 +33,6 @@ module Slaw
  html_to_text(File.read(filename))
  end
 
- # Extract text from +filename+ by sending it to apache tika
- # http://tika.apache.org/
- def extract_via_tika(filename)
- # the Yomu gem falls over when trying to write large amounts of data
- # the JVM stdin, so we manually call java ourselves, relying on yomu
- # to supply the gem
- require 'slaw/extract/yomu_patch'
- logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")
-
- html = Yomu.text_from_file(filename)
- logger.info("Tika returned #{html.length} bytes")
- # transform html into text
- html_to_text(html)
- end
-
  def html_to_text(html)
  here = File.dirname(__FILE__)
  xslt = Nokogiri::XSLT(File.open(File.join([here, 'html_to_akn_text.xsl'])))
@@ -117,34 +42,10 @@ module Slaw
  text.sub(/^<\?xml [^>]*>/, '')
  end
 
- def remove_pdf_password(filename)
- file = Tempfile.new('steno')
- begin
- logger.info("Trying to remove password from #{filename}")
- cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{file.path} -c .setpdfwrite -f #{filename}".split(" ")
- logger.info("Executing: #{cmd}")
- Open3.capture2(*cmd)
- FileUtils.move(file.path, filename)
- ensure
- file.close
- file.unlink
- end
- end
-
  def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
  || MimeMagic.by_path(filename)
  end
-
- # Get location of the pdftotext executable for all instances.
- def self.pdftotext_path
- @@pdftotext_path
- end
-
- # Set location of the pdftotext executable for all instances.
- def self.pdftotext_path=(val)
- @@pdftotext_path = val
- end
  end
  end
  end
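Note on the extractor change above: since 2.0.0 removes `extract_from_pdf`, callers who still start from PDFs need to run the text extraction themselves before handing the result to Slaw. The following is a minimal sketch of one way to do that, assuming poppler's `pdftotext` is installed and on the PATH. It reuses the flags from the removed `pdf_to_text_cmd` (without the crop box); the method names `pdftotext_command` and `extract_pdf_text` are hypothetical and not part of Slaw.

```ruby
require 'open3'

# Hypothetical helper: builds the same pdftotext invocation the removed
# Extractor#pdf_to_text_cmd used (UTF-8 output, no page breaks, text
# written to stdout via the trailing "-").
def pdftotext_command(filename, binary: "pdftotext")
  [binary, "-enc", "UTF-8", "-nopgbrk", filename, "-"]
end

# Hypothetical helper: runs pdftotext and returns the extracted text,
# raising if the command exits with a non-zero status.
def extract_pdf_text(filename, binary: "pdftotext")
  stdout, status = Open3.capture2(*pdftotext_command(filename, binary: binary))
  raise "pdftotext failed with exit status #{status.exitstatus}" unless status.success?
  stdout
end
```

Unlike the removed code, this sketch does not retry after stripping a PDF password with Ghostscript; that step would also have to be handled outside the library.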
@@ -1,3 +1,6 @@
+ require 'polyglot'
+ require 'treetop'
+
  module Slaw
  # Base class for generating Act documents
  class ActGenerator
@@ -20,15 +23,18 @@ module Slaw
 
  def build_parser
  unless @@parsers[@grammar]
- # load the grammar
- grammar_file = File.dirname(__FILE__) + "/grammars/#{@grammar}/act.treetop"
- Treetop.load(grammar_file)
-
+ # load the grammar with polyglot and treetop
+ # this will ensure the class below is available
+ # see: http://cjheath.github.io/treetop/using_in_ruby.html
+ require "slaw/grammars/#{@grammar}/act"
  grammar_class = "Slaw::Grammars::#{@grammar.upcase}::ActParser"
  @@parsers[@grammar] = eval(grammar_class)
  end
 
  @parser = @@parsers[@grammar].new
+ @parser.root = :act
+
+ @parser
  end
 
  # Generate a Slaw::Act instance from plain text.
@@ -76,8 +82,15 @@ module Slaw
  # Transform an Akoma Ntoso XML document back into a plain-text version
  # suitable for re-parsing back into XML with no loss of structure.
  def text_from_act(doc)
- xslt = Nokogiri::XSLT(File.read(File.join([File.dirname(__FILE__), "grammars/#{@grammar}/act_text.xsl"])))
- xslt.transform(doc).child.to_xml
+ # look on the load path for an XSL file for this grammar
+ filename = "/slaw/grammars/#{@grammar}/act_text.xsl"
+
+ if dir = $LOAD_PATH.find { |p| File.exist?(p + filename) }
+ xslt = Nokogiri::XSLT(File.read(dir + filename))
+ xslt.transform(doc).child.to_xml
+ else
+ raise "Unable to find text XSL for grammar #{@grammar}: #{fragment}"
+ end
  end
  end
  end
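Note on the `text_from_act` change above: the new code searches every load-path entry for the grammar's XSL file, which is what lets a separate gem supply it. A minimal, self-contained sketch of that lookup follows. The helper name `find_on_load_path` is hypothetical, and it uses `File.join` where the diff concatenates strings.

```ruby
# Hypothetical helper (not part of Slaw): returns the full path of
# +relative+ under the first load-path entry that contains it, or nil
# when no entry does.
def find_on_load_path(relative, load_path = $LOAD_PATH)
  dir = load_path.find { |p| File.exist?(File.join(p.to_s, relative)) }
  dir && File.join(dir.to_s, relative)
end

# Example lookup for a grammar named "xy"; the result depends on which
# gems are on the load path, so it may well be nil.
path = find_on_load_path("slaw/grammars/xy/act_text.xsl")
puts(path || "no act_text.xsl found for grammar xy")
```

The first matching entry wins, so the order of the load path decides which copy is used when more than one gem ships a file for the same grammar.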
@@ -151,28 +151,11 @@ module Slaw
  #
  # @return [Nokogiri::XML::Document] the updated document
  def postprocess(doc)
- normalise_headings(doc)
  adjust_blocklists(doc)
 
  doc
  end
 
- # Change CAPCASE headings into Sentence case.
- #
- # @param doc [Nokogiri::XML::Document]
- def normalise_headings(doc)
- logger.info("Normalising headings")
-
- nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
- doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)
-
- nodes.each do |heading|
- if !(heading.content =~ /[a-z]/)
- heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
- end
- end
- end
-
  # Adjust blocklists:
  #
  # - nest them correctly
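Note on the removal of `normalise_headings` above: callers who relied on ALL CAPS headings being converted to sentence case would now have to apply that rule themselves. The sketch below applies the same rule as the removed method to a plain string rather than to Nokogiri text nodes; the method name `sentence_case_if_all_caps` is hypothetical.

```ruby
# Hypothetical helper: mirrors the rule of the removed
# normalise_headings. A heading containing any lowercase ASCII letter is
# returned unchanged; otherwise it is downcased and the first word
# character of each line is upcased.
def sentence_case_if_all_caps(heading)
  return heading if heading =~ /[a-z]/
  heading.downcase.gsub(/^\w/) { |first| first.upcase }
end

puts sentence_case_if_all_caps("DEFINITIONS FOR A.B.C.")
puts sentence_case_if_all_caps("Definitions for A.B.C.")
```

The two inputs are the ones used in the spec that this release deletes.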
data/lib/slaw/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Slaw
- VERSION = "1.0.4"
+ VERSION = "2.0.0"
  end
data/slaw.gemspec CHANGED
@@ -18,7 +18,6 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]
 
- spec.add_development_dependency "bundler", "~> 1.5"
  spec.add_development_dependency "rake", "~> 10.3.1"
  spec.add_development_dependency "rspec", "~> 2.14.1"
 
@@ -27,8 +26,4 @@ Gem::Specification.new do |spec|
  spec.add_runtime_dependency "log4r", "~> 1.1.10"
  spec.add_runtime_dependency "thor", "~> 0.19.1"
  spec.add_runtime_dependency "mimemagic", "~> 0.2.1"
- spec.add_runtime_dependency 'yomu', '~> 0.2.2'
- # anchor twitter-text to avoid bug in 1.14.3
- # https://github.com/twitter/twitter-text/issues/162
- spec.add_runtime_dependency 'twitter-text', '~> 1.12.0'
  end
@@ -715,44 +715,6 @@ XML
  end
  end
 
- describe '#normalise_headings' do
- it 'should normalise ALL CAPS headings' do
- doc = xml2doc(section(<<XML
- <heading>DEFINITIONS FOR A.B.C.</heading>
- <content>
- <p></p>
- </content>
- XML
- ))
- subject.normalise_headings(doc)
- doc.to_s.should == section(<<XML
- <heading>Definitions for a.b.c.</heading>
- <content>
- <p/>
- </content>
- XML
- )
- end
-
- it 'should not normalise normal headings' do
- doc = xml2doc(section(<<XML
- <heading>Definitions for A.B.C.</heading>
- <content>
- <p></p>
- </content>
- XML
- ))
- subject.normalise_headings(doc)
- doc.to_s.should == section(<<XML
- <heading>Definitions for A.B.C.</heading>
- <content>
- <p/>
- </content>
- XML
- )
- end
- end
-
  describe '#preprocess' do
  it 'should split inline table cells into block table cells' do
  text = <<EOS
metadata CHANGED
@@ -1,29 +1,15 @@
  --- !ruby/object:Gem::Specification
  name: slaw
  version: !ruby/object:Gem::Version
- version: 1.0.4
+ version: 2.0.0
  platform: ruby
  authors:
  - Greg Kempe
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-02-05 00:00:00.000000000 Z
+ date: 2019-03-15 00:00:00.000000000 Z
  dependencies:
- - !ruby/object:Gem::Dependency
- name: bundler
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.5'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.5'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -122,34 +108,6 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.2.1
- - !ruby/object:Gem::Dependency
- name: yomu
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 0.2.2
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 0.2.2
- - !ruby/object:Gem::Dependency
- name: twitter-text
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.12.0
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.12.0
  description: Slaw is a lightweight library for rendering and generating Akoma Ntoso
  acts from plain text and PDF documents.
  email:
@@ -169,7 +127,6 @@ files:
  - lib/slaw.rb
  - lib/slaw/extract/extractor.rb
  - lib/slaw/extract/html_to_akn_text.xsl
- - lib/slaw/extract/yomu_patch.rb
  - lib/slaw/generator.rb
  - lib/slaw/grammars/core_nodes.rb
  - lib/slaw/grammars/inlines.treetop
@@ -1,9 +0,0 @@
- require 'yomu'
-
- class Yomu
- def self.text_from_file(filename)
- IO.popen("#{java} -Djava.awt.headless=true -jar #{Yomu::JARPATH} --html '#{filename}'", 'r') do |io|
- io.read
- end
- end
- end