derivative-rodeo 0.5.2 → 0.5.3
- checksums.yaml +4 -4
- data/README.md +23 -2
- data/lib/derivative_rodeo/generators/alto_generator.rb +3 -0
- data/lib/derivative_rodeo/generators/base_generator.rb +1 -0
- data/lib/derivative_rodeo/generators/hocr_generator.rb +28 -1
- data/lib/derivative_rodeo/generators/plain_text_generator.rb +3 -0
- data/lib/derivative_rodeo/generators/word_coordinates_generator.rb +3 -0
- data/lib/derivative_rodeo/services/convert_uri_via_template_service.rb +30 -5
- data/lib/derivative_rodeo/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 21db0b8b76cb0eadec2dab8848e7a7ac6240895eee96b5a247ed627fc682ac35
|
4
|
+
data.tar.gz: ab68faec3b4c8775f9243ae53090bb89191e4ea57c6ef2e8e8f40c1d4fa73f16
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 6605936719a33bd7f8bbb6d945b1aa1d87e6b08a53206890af462668a975f519f5a29235e963404d1c159d8486af2db1800015da30be93a9493938e4d69c8c09
|
7
|
+
data.tar.gz: 3ad8cfa0dd71d4167439277c958ccca3682dc87c095afcbcbeb45afe2cf399c70b19d887fec2faaf763fc54a1d2d643957c72124768db8efad76f67c63624449
|
data/README.md
CHANGED
@@ -17,6 +17,7 @@
|
|
17
17
|
- [Registered Generators](#registered-generators)
|
18
18
|
- [Storage Locations](#storage-locations)
|
19
19
|
- [Supported Storage Locations](#supported-storage-locations)
|
20
|
+
- [Templates](#templates)
|
20
21
|
- [Development](#development)
|
21
22
|
- [Logging in Test Environment](#logging-in-test-environment)
|
22
23
|
- [Contributing](#contributing)
|
@@ -34,9 +35,11 @@ The `DerivativeRodeo` "moves" files from one storage location (e.g. *input*) to
|
|
34
35
|
|
35
36
|
## Process Life Cycle
|
36
37
|
|
37
|
-
In the case of a *input* storage location, we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
|
38
|
+
In the case of a *input* storage location (e.g. `input_location`), we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
|
38
39
|
|
39
|
-
In the case of a *output* storage location, we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
|
40
|
+
In the case of a *output* storage location (e.g. `output_location`), we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
|
41
|
+
|
42
|
+
There is also the concept of the *pre\_processed* storage location; when the *pre\_processed* storage location exists for the given input, copy that *pre\_processed* file to the *output* location. And skip running the derivative generator on the *input* storage location. In other words, if we've already done the derivation elsewhere, use that.
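The life cycle described above can be sketched as a small decision procedure. This is an illustration only, not the gem's implementation: `sketch_derive` is a hypothetical helper, local file paths stand in for storage locations, and the order of the checks (existing output first, then the pre-processed copy) is an assumption.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper; plain file paths stand in for storage locations.
def sketch_derive(input:, output:, pre_processed:)
  # The input must exist: we can't move what we don't have.
  raise ArgumentError, "missing input: #{input}" unless File.exist?(input)

  # Assumption: an output that already exists needs no further work.
  return :already_exists if File.exist?(output)

  # If the derivation was already done elsewhere, copy it and skip the generator.
  if File.exist?(pre_processed)
    FileUtils.cp(pre_processed, output)
    return :copied_pre_processed
  end

  # Otherwise run the derivative generator, supplied here as a block.
  yield(input, output)
  :generated
end
```

A caller would pass the generator as a block, for example `sketch_derive(input: a, output: b, pre_processed: c) { |i, o| File.write(o, File.read(i)) }`.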
|
40
43
|
|
41
44
|
During the generator's process, we need to have a working copy of both the *input* and *output* file. This is done by creating a temporary file.
|
42
45
|
|
@@ -224,6 +227,24 @@ Storage locations follow a [URI pattern](https://en.wikipedia.org/wiki/Uniform_R
|
|
224
227
|
- `s3://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Storage Service">S3</abbr> storage system
|
225
228
|
- `sqs://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Queue Service">SQS</abbr>
|
226
229
|
|
230
|
+
#### Templates
|
231
|
+
|
232
|
+
Throughout the code you'll see reference to the following concepts:
|
233
|
+
|
234
|
+
- `input_location_template`
|
235
|
+
- `output_location_template`
|
236
|
+
- `preprocessed_location_template`
|
237
|
+
|
238
|
+
In [Process Life Cycle](#process-life-cycle) we discussed the `input_location`, `output_location`, and `preprocessed_location`. The concept of the template provides a flexibility in mapping a location to another location
|
239
|
+
|
240
|
+
Examples of mapping one file path to another are:
|
241
|
+
|
242
|
+
- I want to copy `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`.
|
243
|
+
- I want to transform `file:///tmp/GUID/file.jpg` to `file:///tmp/GUID/file.hocr`; that is run OCR on an image and write a `.hocr` file.
|
244
|
+
- I want to use the `file:///tmp/GUID/file.hocr` to generate a `file:///tmp/GUID/file.coordinates.json`; that is convert the HOCR file to a coordinates.json file.
|
245
|
+
|
246
|
+
See [DerivativeRodeo::Service::ConvertUriViaTemplateService](./lib/derivative_rodeo/services/convert_uri_via_template_service.rb) for more details.
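The three mappings listed above can be reproduced with a much-simplified stand-in for the template conversion. `sketch_convert_uri` is a hypothetical function, not the gem's API; it ignores query strings and adapters, and it assumes that `{{ extension }}` includes the leading dot, which the source does not confirm.

```ruby
# Hypothetical, simplified stand-in for template-based URI conversion.
# Handles only {{ dir_parts[a..b] }}, {{ filename }}, {{ basename }}
# and {{ extension }}; query strings are not considered.
def sketch_convert_uri(from_uri:, template:, separator: "/")
  _scheme, path = from_uri.split("://", 2)
  parts = path.split(separator).reject(&:empty?)
  filename = parts.last
  dir_parts = parts[0..-2]
  extension = File.extname(filename)            # assumption: keeps the dot, e.g. ".jpg"
  basename = File.basename(filename, extension) # e.g. "file"

  template
    .gsub(/\{\{\s*dir_parts\[(-?\d+)\.\.(-?\d+)\]\s*\}\}/) do
      left = Regexp.last_match(1).to_i
      right = Regexp.last_match(2).to_i
      dir_parts[left..right].join(separator)
    end
    .gsub(/\{\{\s*filename\s*\}\}/) { filename }
    .gsub(/\{\{\s*basename\s*\}\}/) { basename }
    .gsub(/\{\{\s*extension\s*\}\}/) { extension }
end

# Copy a remote file to a local path, keeping the GUID directory.
puts sketch_convert_uri(
  from_uri: "https://hello.com/world/GUID/file.jpg",
  template: "file:///tmp/{{ dir_parts[-1..-1] }}/{{ filename }}"
)
# prints "file:///tmp/GUID/file.jpg"

# Map an image to the .hocr file that OCR would write.
puts sketch_convert_uri(
  from_uri: "file:///tmp/GUID/file.jpg",
  template: "file:///tmp/{{ dir_parts[-1..-1] }}/{{ basename }}.hocr"
)
# prints "file:///tmp/GUID/file.hocr"
```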
|
247
|
+
|
227
248
|
## Development
|
228
249
|
|
229
250
|
- Checkout the repository: `git clone https://github.com/scientist-softserv/derivative_rodeo`
|
data/lib/derivative_rodeo/generators/alto_generator.rb
CHANGED
@@ -1,6 +1,7 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
3
|
require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
|
4
|
+
require_relative 'hocr_generator'
|
4
5
|
|
5
6
|
module DerivativeRodeo
|
6
7
|
module Generators
|
@@ -13,6 +14,8 @@ module DerivativeRodeo
|
|
13
14
|
|
14
15
|
class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService
|
15
16
|
|
17
|
+
include HocrGenerator::RequiresExistingFile
|
18
|
+
|
16
19
|
##
|
17
20
|
# @param output_location [StorageLocations::BaseLocation]
|
18
21
|
# @param input_tmp_file_path [String] the location of the file that we can use for processing.
|
data/lib/derivative_rodeo/generators/base_generator.rb
CHANGED
@@ -187,6 +187,7 @@ module DerivativeRodeo
|
|
187
187
|
#
|
188
188
|
# @see Generators::HocrGenerator
|
189
189
|
# @see Generators::PdfSplitGenerator
|
190
|
+
# @see Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from
|
190
191
|
def with_each_requisite_location_and_tmp_file_path
|
191
192
|
input_files.each do |input_location|
|
192
193
|
input_location.with_existing_tmp_path do |tmp_file_path|
|
data/lib/derivative_rodeo/generators/hocr_generator.rb
CHANGED
@@ -65,7 +65,8 @@ module DerivativeRodeo
|
|
65
65
|
#
|
66
66
|
# @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
|
67
67
|
def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
|
68
|
-
mono_location_template =
|
68
|
+
mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
|
69
|
+
|
69
70
|
requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
|
70
71
|
requisite_files.each do |input_location|
|
71
72
|
input_location.with_existing_tmp_path do |tmp_file_path|
|
@@ -107,6 +108,32 @@ module DerivativeRodeo
|
|
107
108
|
# TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
|
108
109
|
run(cmd)
|
109
110
|
end
|
111
|
+
|
112
|
+
##
|
113
|
+
# A mixin for generators that rely on hocr files.
|
114
|
+
#
|
115
|
+
# @see #with_each_requisite_location_and_tmp_file_path
|
116
|
+
module RequiresExistingFile
|
117
|
+
##
|
118
|
+
# @param builder [Class, #generated_files]
|
119
|
+
#
|
120
|
+
# When a generator depends on a hocr file, this method will ensure that we have the requisite
|
121
|
+
# hocr file.
|
122
|
+
#
|
123
|
+
# @yieldparam file [StorageLocations::BaseLocation]
|
124
|
+
# @yieldparam tmp_path [String]
|
125
|
+
#
|
126
|
+
# @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
|
127
|
+
def with_each_requisite_location_and_tmp_file_path(builder: HocrGenerator)
|
128
|
+
prereq_output_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
|
129
|
+
requisite_files ||= builder.new(input_uris: input_uris, output_location_template: prereq_output_location_template).generated_files
|
130
|
+
requisite_files.each do |input_location|
|
131
|
+
input_location.with_existing_tmp_path do |tmp_file_path|
|
132
|
+
yield(input_location, tmp_file_path)
|
133
|
+
end
|
134
|
+
end
|
135
|
+
end
|
136
|
+
end
|
110
137
|
end
|
111
138
|
end
|
112
139
|
end
|
data/lib/derivative_rodeo/generators/plain_text_generator.rb
CHANGED
@@ -1,6 +1,7 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
3
|
require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
|
4
|
+
require_relative 'hocr_generator'
|
4
5
|
|
5
6
|
module DerivativeRodeo
|
6
7
|
module Generators
|
@@ -13,6 +14,8 @@ module DerivativeRodeo
|
|
13
14
|
|
14
15
|
class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService
|
15
16
|
|
17
|
+
include HocrGenerator::RequiresExistingFile
|
18
|
+
|
16
19
|
##
|
17
20
|
# @param output_location [StorageLocations::BaseLocation]
|
18
21
|
# @param input_tmp_file_path [String] the location of the file that we can use for processing.
|
data/lib/derivative_rodeo/generators/word_coordinates_generator.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
|
+
require_relative 'hocr_generator'
|
3
4
|
module DerivativeRodeo
|
4
5
|
module Generators
|
5
6
|
##
|
@@ -9,6 +10,8 @@ module DerivativeRodeo
|
|
9
10
|
class WordCoordinatesGenerator < BaseGenerator
|
10
11
|
self.output_extension = "coordinates.json"
|
11
12
|
|
13
|
+
include HocrGenerator::RequiresExistingFile
|
14
|
+
|
12
15
|
##
|
13
16
|
# @param output_location [StorageLocations::BaseLocation]
|
14
17
|
# @param input_tmp_file_path [String] the location of the file that we can use for processing.
|
data/lib/derivative_rodeo/services/convert_uri_via_template_service.rb
CHANGED
@@ -7,6 +7,7 @@ module DerivativeRodeo
|
|
7
7
|
# A service to convert an array of :from_uris to :to_uris via a :template.
|
8
8
|
#
|
9
9
|
# @see .call
|
10
|
+
# @see .coerce_pre_requisite_template_from
|
10
11
|
class ConvertUriViaTemplateService
|
11
12
|
DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
|
12
13
|
FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze
|
@@ -16,6 +17,15 @@ module DerivativeRodeo
|
|
16
17
|
SCHEME_FOR_URI_REGEXP = %r{^(?<from_scheme>[^:]+)://}.freeze
|
17
18
|
attr_accessor :from_uri, :template, :adapter, :separator, :uri, :from_scheme, :path, :parts, :dir_parts, :filename, :basename, :extension, :template_without_query, :template_query
|
18
19
|
|
20
|
+
##
|
21
|
+
# @!group Class Attributes
|
22
|
+
#
|
23
|
+
# @!attribute separator [r|w]
|
24
|
+
# @return [String] the directory separator character; default: "/"
|
25
|
+
class_attribute :separator, default: '/', instance_accessor: false
|
26
|
+
# @!endgroup Class Attributes
|
27
|
+
##
|
28
|
+
|
19
29
|
##
|
20
30
|
# Convert the given :from_uris to a different list of uris based on the given :template.
|
21
31
|
#
|
@@ -32,7 +42,7 @@ module DerivativeRodeo
|
|
32
42
|
# @param from_uri [String] Of the form "scheme://dir/parts/basename.extension"
|
33
43
|
# @param template [String] Another URI that may contain path_parts or scheme template values.
|
34
44
|
# @param adapter [StorageLocations::Location]
|
35
|
-
# @param
|
45
|
+
# @param options [Hash<Symbol, Object>]
|
36
46
|
#
|
37
47
|
# @return [String]
|
38
48
|
#
|
@@ -46,16 +56,31 @@ module DerivativeRodeo
|
|
46
56
|
# from_uris: ["file:///path1/A/file.pdf", "aws:///path2/B/file.pdf"],
|
47
57
|
# template: "file:///dest1/{{dir_parts[-1..-1]}}/{{ filename }}")
|
48
58
|
# => ["file:///dest1/A/file.pdf", "aws:///dest1/B/file.pdf"]
|
49
|
-
def self.call(from_uri:, template:, adapter: nil,
|
50
|
-
new(from_uri: from_uri, template: template, adapter: adapter,
|
59
|
+
def self.call(from_uri:, template:, adapter: nil, **options)
|
60
|
+
new(from_uri: from_uri, template: template, adapter: adapter, **options).call
|
61
|
+
end
|
62
|
+
|
63
|
+
##
|
64
|
+
# There are generators that have requisite files necessary for processing.
|
65
|
+
#
|
66
|
+
# For example, before we run `tesseract` on an image, we would like to make sure it is
|
67
|
+
# monochrome. Hence the interplay between the {Generators::HocrGenerator} and the
|
68
|
+
# {Generators::MonochromeGenerator}.
|
69
|
+
#
|
70
|
+
# @param template [String]
|
71
|
+
# @return [String]
|
72
|
+
#
|
73
|
+
# @see Generators::BaseGenerator#with_each_requisite_location_and_tmp_file_path
|
74
|
+
def self.coerce_pre_requisite_template_from(template:)
|
75
|
+
template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
|
51
76
|
end
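To see what the one-line body of `coerce_pre_requisite_template_from` does to a template, the same expression can be run on its own. `sketch_coerce_pre_requisite_template` is a hypothetical standalone copy that takes the separator as an argument instead of reading the class attribute; the sample template is made up for illustration.

```ruby
# Standalone copy of the expression in the diff: drop the last path segment
# of the template and append "{{ basename }}{{ extension }}" in its place.
def sketch_coerce_pre_requisite_template(template:, separator: "/")
  template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
end

puts sketch_coerce_pre_requisite_template(
  template: "file:///tmp/{{ dir_parts[-1..-1] }}/{{ basename }}.coordinates.json"
)
# prints "file:///tmp/{{ dir_parts[-1..-1] }}/{{ basename }}{{ extension }}"
```

The directory portion of the dependent generator's output template is kept, so the prerequisite file lands next to the final derivative under its own name and extension.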
|
52
77
|
|
53
78
|
# rubocop:disable Metrics/MethodLength
|
54
|
-
def initialize(from_uri:, template:, adapter: nil,
|
79
|
+
def initialize(from_uri:, template:, adapter: nil, **options)
|
55
80
|
@from_uri = from_uri
|
56
81
|
@template = template
|
57
82
|
@adapter = adapter
|
58
|
-
@separator = separator
|
83
|
+
@separator = options.fetch(:separator) { self.class.separator }
|
59
84
|
|
60
85
|
@uri, _query = from_uri.split("?")
|
61
86
|
@from_scheme, @path = uri.split("://")
|
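The new `@separator` assignment falls back to a class-level default when no `:separator` option is passed. The pattern can be sketched in plain Ruby without ActiveSupport; `SketchService` is a hypothetical class, and its hand-written class accessor only approximates `class_attribute :separator, default: '/'`.

```ruby
class SketchService
  class << self
    # Plain-Ruby stand-in for the class-level attribute with a "/" default.
    attr_writer :separator

    def separator
      @separator || "/"
    end
  end

  attr_reader :separator

  def initialize(**options)
    # The block form of fetch means the class default is consulted only
    # when the :separator key is absent from the options.
    @separator = options.fetch(:separator) { self.class.separator }
  end
end

puts SketchService.new.separator                 # prints "/"
puts SketchService.new(separator: ":").separator # prints ":"
```

Because `fetch` checks for the presence of the key, passing `separator: nil` explicitly yields `nil` rather than the class default.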
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: derivative-rodeo
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.5.
|
4
|
+
version: 0.5.3
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Rob Kaufman
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: exe
|
11
11
|
cert_chain: []
|
12
|
-
date: 2023-
|
12
|
+
date: 2023-12-07 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: activesupport
|
@@ -337,7 +337,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
337
337
|
- !ruby/object:Gem::Version
|
338
338
|
version: '0'
|
339
339
|
requirements: []
|
340
|
-
rubygems_version: 3.
|
340
|
+
rubygems_version: 3.3.7
|
341
341
|
signing_key:
|
342
342
|
specification_version: 4
|
343
343
|
summary: An ETL Ecosystem for Derivative Processing.
|