slaw 1.0.4 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 963f970d7da2acd7fd2678973515073f7f9b5226
- data.tar.gz: b5fcc0de969f1b6c286d70017bd15c61e4e959d4
+ metadata.gz: 1639f10e008ddcdd149e040767d97476c9183ad3
+ data.tar.gz: 95cce3c38e35910731bfe8776974a40ba3be7c32
  SHA512:
- metadata.gz: 8617cd183a1af99370c17457ff19218a41d2ebf9f63aec1e1ad3101712bab518a9a89a272b912a631860f5c73d5918570c093dc024d6a53445268f196191cb59
- data.tar.gz: 04e061c52b5ebbf4a21062ece4e35b1f8183f004d74996c8289b763cb5e931e9da4eb5bb5407059f18b234ef85e15018a05f20ae56c21dd7b83d77b48b566205
+ metadata.gz: 673521a6b0be293b57f7cd8279fb2df277d7c8263824fe5a007e69efaf81b76d8a2a3ef826589abe730d6ab4170f5d95ab42e915c30c44e18d7d15774715bdfa
+ data.tar.gz: c4e38899b280727459cbfd0982ceed856bbeb683b51e77839ba9c7f14b7cf7ff204f95eb2db52a015c1ffe8d116e0ca9bb87149d7f80168c65bf2965d051a196
data/.travis.yml CHANGED
@@ -1,7 +1,7 @@
  language: ruby
  rvm:
- - 2.2.8
- - 2.3.5
- - 2.4.2
+ - 2.6.2
+ - 2.5.4
+ - 2.4.5
  before_install:
  - gem update bundler
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
 
- Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+ Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text documents.
  It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
  traditions in these countries:
 
@@ -30,19 +30,9 @@ Or install it with:
 
  $ gem install slaw
 
- To run PDF extraction you will also need [poppler's pdftotext](https://poppler.freedesktop.org/).
- If you're on a Mac, you can use:
-
- $ brew install poppler
-
- You may also need Ghostscript to remove password protection from PDF files. This is
- installed by default on most systems (including Mac). On Ubuntu you can use:
-
- $ sudo apt-get install ghostscript
-
  The simplest way to use Slaw is via the commandline:
 
- $ slaw parse myfile.pdf --grammar za
+ $ slaw parse myfile.text --grammar za
 
  ## Overview
 
@@ -50,8 +40,8 @@ Slaw generates Acts in the [Akoma Ntoso](http://www.akomantoso.org) 2.0 XML
  standard for legislative documents. It first parses plain text using a grammar
  and then generates XML from the resulting syntax tree.
 
- Most by-laws in South Africa are available as PDF documents. Slaw therefore has support
- for extracting and cleaning up text from PDFs before parsing it. Extracting text from
+ Most by-laws in South Africa are available as PDF documents. You will therefore
+ need to extract the text from the PDF first, using a tool like pdftotext.
  PDFs can product oddities (such as oddly wrapped lines) and Slaw has a number of
  rules-of-thumb for correcting these. These rules are based on South African
  by-laws and may not be suitable for all regions.
@@ -73,6 +63,14 @@ tree, the nodes of which know how to serialize themselves in XML format.
  Supporting formats from other country's legal traditions probably requires creating a new grammar
  and parser.
 
+ ## Adding your own grammar
+
+ Slaw can dynamically load your custom Treetop grammars. When called with ``--grammar xy``, Slaw
+ tries to require `slaw/grammars/xy/act` and instantiate the parser class ``Slaw::Grammars::XY::ActParser``.
+ Slaw always uses the rule `act` as the root of the parser.
+
+ You can create your own grammar by creating a gem that provides these files and classes.
+
  ## Contributing
 
  1. Fork it at http://github.com/longhotsummer/slaw/fork
@@ -86,6 +84,12 @@ and parser.
 
  ## Changelog
 
+ ### 2.0.0 (?)
+
+ * Remove support for PDFs. Do text extraction from PDFs outside of this library.
+ * Support dynamically loading grammars from other gems.
+ * Don't change ALL CAPS headings to Sentence Case.
+
  ### 1.0.4 (5 February 2019)
 
  * SECURITY require Nokogiri 1.8.5 or greater to address https://nvd.nist.gov/vuln/detail/CVE-2018-14404
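
Reviewer note on the "Adding your own grammar" section added in the README hunk above: it states that `--grammar xy` makes Slaw require `slaw/grammars/xy/act` and instantiate `Slaw::Grammars::XY::ActParser`. The following is a minimal sketch of that naming convention only. The helper `grammar_names` and the stub `Demo::Grammars::XY::ActParser` are hypothetical names introduced for illustration, and the sketch resolves the class with `Object.const_get` instead of the `eval` that the gem itself uses.

```ruby
# Hypothetical helper, not part of Slaw: derive the require path and the
# parser class name that the README describes for a given grammar name.
def grammar_names(grammar, namespace = "Slaw::Grammars")
  {
    require_path: "slaw/grammars/#{grammar}/act",
    class_name: "#{namespace}::#{grammar.upcase}::ActParser"
  }
end

# Stand-in for a class that a grammar gem would provide (illustration only).
module Demo
  module Grammars
    module XY
      class ActParser; end
    end
  end
end

names = grammar_names("xy", "Demo::Grammars")
# Object.const_get resolves a namespaced constant name to the class itself.
parser_class = Object.const_get(names[:class_name])
puts names[:require_path]
puts parser_class
```

A grammar gem following this convention would ship `lib/slaw/grammars/xy/act.treetop` (or a compiled `act.rb`) so that the `require` succeeds once the gem is on the load path.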
data/bin/slaw CHANGED
@@ -4,8 +4,6 @@ require 'thor'
  require 'slaw'
 
  class SlawCLI < Thor
- # TODO: support different grammars and locales
-
  # Exit with non-zero exit code on failure.
  # See https://github.com/erikhuda/thor/issues/244
  def self.exit_on_failure?
@@ -15,29 +13,19 @@ class SlawCLI < Thor
  class_option :verbose, type: :boolean, desc: "Display log output on stderr"
 
  desc "parse FILE", "Parse FILE into Akoma Ntoso XML"
- option :input, enum: ['text', 'pdf'], desc: "Type of input if it can't be determined automatically"
- option :pdftotext, desc: "Location of the pdftotext binary if not in PATH"
+ option :input, enum: ['text', 'html'], desc: "Type of input if it can't be determined automatically"
  option :fragment, type: :string, desc: "Akoma Ntoso element name that the imported text represents. Support depends on the grammar."
  option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
  option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
- option :crop, type: :string, desc: "Crop box for PDF files, as 'left,top,width,height'."
  option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
  def parse(name)
  logging
 
- Slaw::Extract::Extractor.pdftotext_path = options[:pdftotext] if options[:pdftotext]
  extractor = Slaw::Extract::Extractor.new
 
- if options[:crop]
- extractor.cropbox = options[:crop].split(',').map(&:to_i)
- if extractor.cropbox.length != 4
- raise Thor::Error.new("--crop requires four comma-separated integers")
- end
- end
-
  case options[:input]
- when 'pdf'
- text = extractor.extract_from_pdf(name)
+ when 'html'
+ text = extractor.extract_from_html(name)
  when 'text'
  text = extractor.extract_from_text(name)
  else
data/lib/slaw/extract/extractor.rb CHANGED
@@ -1,24 +1,12 @@
- require 'open3'
- require 'tempfile'
  require 'mimemagic'
 
  module Slaw
  module Extract
 
- # Routines for extracting and cleaning up context from other formats, such as PDF.
- #
- # You may need to set the location of the `pdftotext` binary.
- #
- # On Mac OS X, use `brew install xpdf` or download from http://www.foolabs.com/xpdf/download.html
- #
- # On Heroku, you'll need to do some hoop jumping, see http://theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
+ # Routines for extracting and cleaning up context from other formats, such as HTML.
  class Extractor
  include Slaw::Logging
 
- @@pdftotext_path = "pdftotext"
-
- attr_accessor :cropbox
-
  # Extract text from a file.
  #
  # @param filename [String] filename to extract from
@@ -28,61 +16,13 @@ module Slaw
  mimetype = get_mimetype(filename)
 
  case mimetype && mimetype.type
- when 'application/pdf'
- extract_from_pdf(filename)
- when 'text/html', nil
+ when 'text/html'
  extract_from_html(filename)
  when 'text/plain', nil
  extract_from_text(filename)
  else
- text = extract_via_tika(filename)
- if text.empty? or text.nil?
- raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
- end
- text
- end
- end
-
- # Extract text from a PDF
- #
- # @param filename [String] filename to extract from
- #
- # @return [String] extracted text
- def extract_from_pdf(filename)
- retried = false
-
- while true
- cmd = pdf_to_text_cmd(filename)
- logger.info("Executing: #{cmd}")
- stdout, status = Open3.capture2(*cmd)
-
- case status.exitstatus
- when 0
- return stdout
- when 3
- return nil if retried
- retried = true
- self.remove_pdf_password(filename)
- else
- return nil
- end
- end
- end
-
- # Build a command for the external PDF-to-text utility.
- #
- # @param filename [String] the pdf file
- #
- # @return [Array<String>] command and params to execute
- def pdf_to_text_cmd(filename)
- cmd = [Extractor.pdftotext_path, "-enc", "UTF-8", "-nopgbrk"]
-
- if @cropbox
- # left, top, width, height
- cmd += "-x -y -W -H".split.zip(@cropbox.map(&:to_s)).flatten
+ raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
  end
-
- cmd + [filename, "-"]
  end
 
  def extract_from_text(filename)
@@ -93,21 +33,6 @@ module Slaw
  html_to_text(File.read(filename))
  end
 
- # Extract text from +filename+ by sending it to apache tika
- # http://tika.apache.org/
- def extract_via_tika(filename)
- # the Yomu gem falls over when trying to write large amounts of data
- # the JVM stdin, so we manually call java ourselves, relying on yomu
- # to supply the gem
- require 'slaw/extract/yomu_patch'
- logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")
-
- html = Yomu.text_from_file(filename)
- logger.info("Tika returned #{html.length} bytes")
- # transform html into text
- html_to_text(html)
- end
-
  def html_to_text(html)
  here = File.dirname(__FILE__)
  xslt = Nokogiri::XSLT(File.open(File.join([here, 'html_to_akn_text.xsl'])))
@@ -117,34 +42,10 @@ module Slaw
  text.sub(/^<\?xml [^>]*>/, '')
  end
 
- def remove_pdf_password(filename)
- file = Tempfile.new('steno')
- begin
- logger.info("Trying to remove password from #{filename}")
- cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{file.path} -c .setpdfwrite -f #{filename}".split(" ")
- logger.info("Executing: #{cmd}")
- Open3.capture2(*cmd)
- FileUtils.move(file.path, filename)
- ensure
- file.close
- file.unlink
- end
- end
-
  def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
  || MimeMagic.by_path(filename)
  end
-
- # Get location of the pdftotext executable for all instances.
- def self.pdftotext_path
- @@pdftotext_path
- end
-
- # Set location of the pdftotext executable for all instances.
- def self.pdftotext_path=(val)
- @@pdftotext_path = val
- end
  end
  end
  end
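
Reviewer note: with `extract_from_pdf` removed, callers that still start from PDFs have to run `pdftotext` themselves before handing the text to Slaw. The sketch below shows one way that could look, reusing the flags from the removed `pdf_to_text_cmd` (`-enc UTF-8 -nopgbrk`, optional crop box, output to `-`). The helper names `pdftotext_command` and `pdf_to_text` are hypothetical and not part of Slaw 2.0.0, and the sketch does not reproduce the removed Ghostscript password-removal retry.

```ruby
require 'open3'

# Hypothetical helper (not part of Slaw 2.0.0): build the pdftotext command,
# using the same flags as the pdf_to_text_cmd method removed in this release.
def pdftotext_command(filename, binary: "pdftotext", cropbox: nil)
  cmd = [binary, "-enc", "UTF-8", "-nopgbrk"]
  if cropbox
    # cropbox is [left, top, width, height]
    cmd += %w[-x -y -W -H].zip(cropbox.map(&:to_s)).flatten
  end
  # "-" tells pdftotext to write the text to stdout
  cmd + [filename, "-"]
end

# Run pdftotext and return the extracted text, or nil on a non-zero exit.
# Assumes the pdftotext binary is installed; Open3 raises Errno::ENOENT
# if the binary cannot be found.
def pdf_to_text(filename)
  stdout, status = Open3.capture2(*pdftotext_command(filename))
  status.success? ? stdout : nil
end
```

The returned text can be written to a file and passed to `slaw parse myfile.text --grammar za`, as in the updated README.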
data/lib/slaw/generator.rb CHANGED
@@ -1,3 +1,6 @@
+ require 'polyglot'
+ require 'treetop'
+
  module Slaw
  # Base class for generating Act documents
  class ActGenerator
@@ -20,15 +23,18 @@ module Slaw
 
  def build_parser
  unless @@parsers[@grammar]
- # load the grammar
- grammar_file = File.dirname(__FILE__) + "/grammars/#{@grammar}/act.treetop"
- Treetop.load(grammar_file)
-
+ # load the grammar with polyglot and treetop
+ # this will ensure the class below is available
+ # see: http://cjheath.github.io/treetop/using_in_ruby.html
+ require "slaw/grammars/#{@grammar}/act"
  grammar_class = "Slaw::Grammars::#{@grammar.upcase}::ActParser"
  @@parsers[@grammar] = eval(grammar_class)
  end
 
  @parser = @@parsers[@grammar].new
+ @parser.root = :act
+
+ @parser
  end
 
  # Generate a Slaw::Act instance from plain text.
@@ -76,8 +82,15 @@ module Slaw
  # Transform an Akoma Ntoso XML document back into a plain-text version
  # suitable for re-parsing back into XML with no loss of structure.
  def text_from_act(doc)
- xslt = Nokogiri::XSLT(File.read(File.join([File.dirname(__FILE__), "grammars/#{@grammar}/act_text.xsl"])))
- xslt.transform(doc).child.to_xml
+ # look on the load path for an XSL file for this grammar
+ filename = "/slaw/grammars/#{@grammar}/act_text.xsl"
+
+ if dir = $LOAD_PATH.find { |p| File.exist?(p + filename) }
+ xslt = Nokogiri::XSLT(File.read(dir + filename))
+ xslt.transform(doc).child.to_xml
+ else
+ raise "Unable to find text XSL for grammar #{@grammar}: #{fragment}"
+ end
  end
  end
  end
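
Reviewer note: `text_from_act` now searches `$LOAD_PATH` for `/slaw/grammars/<grammar>/act_text.xsl`, which is what lets a separate grammar gem supply its own XSL. The lookup can be sketched in isolation as follows. The helper `find_on_load_path` is a hypothetical name, and the temporary directory stands in for a grammar gem's `lib` directory.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: return the full path of the first file on the given
# load path that matches `relative`, or nil if no directory contains it.
def find_on_load_path(relative, load_path = $LOAD_PATH)
  dir = load_path.find { |p| File.exist?(File.join(p, relative)) }
  dir && File.join(dir, relative)
end

Dir.mktmpdir do |root|
  # Simulate a grammar gem's lib directory that ships an XSL file.
  relative = "slaw/grammars/xy/act_text.xsl"
  FileUtils.mkdir_p(File.join(root, File.dirname(relative)))
  File.write(File.join(root, relative), "<xsl/>")

  puts find_on_load_path(relative, [root]).inspect
  puts find_on_load_path("slaw/grammars/zz/act_text.xsl", [root]).inspect
end
```

Returning `nil` for a missing file leaves the error handling to the caller, which corresponds to the `else` branch that raises in the hunk above.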
@@ -151,28 +151,11 @@ module Slaw
  #
  # @return [Nokogiri::XML::Document] the updated document
  def postprocess(doc)
- normalise_headings(doc)
  adjust_blocklists(doc)
 
  doc
  end
 
- # Change CAPCASE headings into Sentence case.
- #
- # @param doc [Nokogiri::XML::Document]
- def normalise_headings(doc)
- logger.info("Normalising headings")
-
- nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
- doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)
-
- nodes.each do |heading|
- if !(heading.content =~ /[a-z]/)
- heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
- end
- end
- end
-
  # Adjust blocklists:
  #
  # - nest them correctly
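
Reviewer note: since `postprocess` no longer calls `normalise_headings`, ALL CAPS headings now pass through unchanged. A caller that still wants the old behaviour could apply the same rule to heading strings itself. The sketch below mirrors the removed logic on plain strings rather than Nokogiri nodes; `sentence_case_heading` is a hypothetical name, not a Slaw method.

```ruby
# Hypothetical helper mirroring the rule in the removed normalise_headings:
# a heading with no lowercase ASCII letters is lowercased and the first
# word character of each line is capitalised; any other heading is
# returned unchanged.
def sentence_case_heading(text)
  return text if text =~ /[a-z]/
  text.downcase.gsub(/^\w/) { |first| first.upcase }
end

puts sentence_case_heading("DEFINITIONS FOR A.B.C.")
puts sentence_case_heading("Definitions for A.B.C.")
```

As the removed spec shows, this rule also lowercases abbreviations ("A.B.C." becomes "a.b.c."), which is one side effect of the old behaviour.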
data/lib/slaw/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Slaw
- VERSION = "1.0.4"
+ VERSION = "2.0.0"
  end
data/slaw.gemspec CHANGED
@@ -18,7 +18,6 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]
 
- spec.add_development_dependency "bundler", "~> 1.5"
  spec.add_development_dependency "rake", "~> 10.3.1"
  spec.add_development_dependency "rspec", "~> 2.14.1"
 
@@ -27,8 +26,4 @@ Gem::Specification.new do |spec|
  spec.add_runtime_dependency "log4r", "~> 1.1.10"
  spec.add_runtime_dependency "thor", "~> 0.19.1"
  spec.add_runtime_dependency "mimemagic", "~> 0.2.1"
- spec.add_runtime_dependency 'yomu', '~> 0.2.2'
- # anchor twitter-text to avoid bug in 1.14.3
- # https://github.com/twitter/twitter-text/issues/162
- spec.add_runtime_dependency 'twitter-text', '~> 1.12.0'
  end
@@ -715,44 +715,6 @@ XML
  end
  end
 
- describe '#normalise_headings' do
- it 'should normalise ALL CAPS headings' do
- doc = xml2doc(section(<<XML
- <heading>DEFINITIONS FOR A.B.C.</heading>
- <content>
- <p></p>
- </content>
- XML
- ))
- subject.normalise_headings(doc)
- doc.to_s.should == section(<<XML
- <heading>Definitions for a.b.c.</heading>
- <content>
- <p/>
- </content>
- XML
- )
- end
-
- it 'should not normalise normal headings' do
- doc = xml2doc(section(<<XML
- <heading>Definitions for A.B.C.</heading>
- <content>
- <p></p>
- </content>
- XML
- ))
- subject.normalise_headings(doc)
- doc.to_s.should == section(<<XML
- <heading>Definitions for A.B.C.</heading>
- <content>
- <p/>
- </content>
- XML
- )
- end
- end
-
  describe '#preprocess' do
  it 'should split inline table cells into block table cells' do
  text = <<EOS
metadata CHANGED
@@ -1,29 +1,15 @@
  --- !ruby/object:Gem::Specification
  name: slaw
  version: !ruby/object:Gem::Version
- version: 1.0.4
+ version: 2.0.0
  platform: ruby
  authors:
  - Greg Kempe
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-02-05 00:00:00.000000000 Z
+ date: 2019-03-15 00:00:00.000000000 Z
  dependencies:
- - !ruby/object:Gem::Dependency
- name: bundler
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.5'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.5'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -122,34 +108,6 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.2.1
- - !ruby/object:Gem::Dependency
- name: yomu
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 0.2.2
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 0.2.2
- - !ruby/object:Gem::Dependency
- name: twitter-text
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.12.0
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.12.0
  description: Slaw is a lightweight library for rendering and generating Akoma Ntoso
  acts from plain text and PDF documents.
  email:
@@ -169,7 +127,6 @@ files:
  - lib/slaw.rb
  - lib/slaw/extract/extractor.rb
  - lib/slaw/extract/html_to_akn_text.xsl
- - lib/slaw/extract/yomu_patch.rb
  - lib/slaw/generator.rb
  - lib/slaw/grammars/core_nodes.rb
  - lib/slaw/grammars/inlines.treetop
@@ -1,9 +0,0 @@
- require 'yomu'
-
- class Yomu
- def self.text_from_file(filename)
- IO.popen("#{java} -Djava.awt.headless=true -jar #{Yomu::JARPATH} --html '#{filename}'", 'r') do |io|
- io.read
- end
- end
- end