derivative-rodeo 0.5.2 → 0.5.3
- checksums.yaml +4 -4
- data/README.md +23 -2
- data/lib/derivative_rodeo/generators/alto_generator.rb +3 -0
- data/lib/derivative_rodeo/generators/base_generator.rb +1 -0
- data/lib/derivative_rodeo/generators/hocr_generator.rb +28 -1
- data/lib/derivative_rodeo/generators/plain_text_generator.rb +3 -0
- data/lib/derivative_rodeo/generators/word_coordinates_generator.rb +3 -0
- data/lib/derivative_rodeo/services/convert_uri_via_template_service.rb +30 -5
- data/lib/derivative_rodeo/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 21db0b8b76cb0eadec2dab8848e7a7ac6240895eee96b5a247ed627fc682ac35
+  data.tar.gz: ab68faec3b4c8775f9243ae53090bb89191e4ea57c6ef2e8e8f40c1d4fa73f16
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6605936719a33bd7f8bbb6d945b1aa1d87e6b08a53206890af462668a975f519f5a29235e963404d1c159d8486af2db1800015da30be93a9493938e4d69c8c09
+  data.tar.gz: 3ad8cfa0dd71d4167439277c958ccca3682dc87c095afcbcbeb45afe2cf399c70b19d887fec2faaf763fc54a1d2d643957c72124768db8efad76f67c63624449
data/README.md
CHANGED
@@ -17,6 +17,7 @@
 - [Registered Generators](#registered-generators)
 - [Storage Locations](#storage-locations)
 - [Supported Storage Locations](#supported-storage-locations)
+- [Templates](#templates)
 - [Development](#development)
 - [Logging in Test Environment](#logging-in-test-environment)
 - [Contributing](#contributing)
@@ -34,9 +35,11 @@ The `DerivativeRodeo` "moves" files from one storage location (e.g. *input*) to

 ## Process Life Cycle

-In the case of a *input* storage location, we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
+In the case of a *input* storage location (e.g. `input_location`), we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.

-In the case of a *output* storage location, we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+In the case of a *output* storage location (e.g. `output_location`), we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+
+There is also the concept of the *pre\_processed* storage location; when the *pre\_processed* storage location exists for the given input, copy that *pre\_processed* file to the *output* location. And skip running the derivative generator on the *input* storage location. In other words, if we've already done the derivation elsewhere, use that.

 During the generator's process, we need to have a working copy of both the *input* and *output* file. This is done by creating a temporary file.

@@ -224,6 +227,24 @@ Storage locations follow a [URI pattern](https://en.wikipedia.org/wiki/Uniform_R
 - `s3://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Storage Service">S3</abbr> storage system
 - `sqs://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Queue Service">SQS</abbr>

+#### Templates
+
+Throughout the code you'll see reference to the following concepts:
+
+- `input_location_template`
+- `output_location_template`
+- `preprocessed_location_template`
+
+In [Process Life Cycle](#process-life-cycle) we discussed the `input_location`, `output_location`, and `preprocessed_location`. The concept of the template provides a flexibility in mapping a location to another location
+
+Examples of mapping one file path to another are:
+
+- I want to copy `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`.
+- I want to transform `file:///tmp/GUID/file.jpg` to `file:///tmp/GUID/file.hocr`; that is run OCR on an image and write a `.hocr` file.
+- I want to use the `file:///tmp/GUID/file.hocr` to generate a `file:///tmp/GUID/file.coordinates.json`; that is convert the HOCR file to a coordinates.json file.
+
+See [DerivativeRodeo::Service::ConvertUriViaTemplateService](./lib/derivative_rodeo/services/convert_uri_via_template_service.rb) for more details.
+
 ## Development

 - Checkout the repository: `git clone https://github.com/scientist-softserv/derivative_rodeo`
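To make the template idea from the README change concrete, here is a minimal Ruby sketch of the kind of substitution it describes. This is not the gem's implementation: `convert_uri_via_template_sketch` is a hypothetical helper that only handles the `{{ dir_parts[a..b] }}` and `{{ filename }}` placeholders, and it ignores adapters and any other template values the real service may support.

```ruby
# Hypothetical, simplified stand-in for the gem's template conversion.
# Only the two placeholders below are handled; everything else is left as-is.
DIR_PARTS_SKETCH_REGEXP = /\{\{\s*dir_parts\[(?<left>-?\d+)\.\.(?<right>-?\d+)\]\s*\}\}/.freeze
FILENAME_SKETCH_REGEXP = /\{\{\s*filename\s*\}\}/.freeze

def convert_uri_via_template_sketch(from_uri:, template:, separator: "/")
  # Drop any query string, then split off the scheme to get the path.
  uri, _query = from_uri.split("?")
  _scheme, path = uri.split("://")

  parts = path.split(separator)
  filename = parts.last
  dir_parts = parts[0..-2]

  with_dirs = template.gsub(DIR_PARTS_SKETCH_REGEXP) do
    match = Regexp.last_match
    dir_parts[match[:left].to_i..match[:right].to_i].join(separator)
  end

  with_dirs.gsub(FILENAME_SKETCH_REGEXP) { filename }
end

copied = convert_uri_via_template_sketch(
  from_uri: "https://hello.com/world/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ filename }}"
)
puts copied # => file:///tmp/GUID/file.jpg
```

The call above mirrors the first README example (copying a remote file to a local path that keeps the `GUID` directory); the template string itself is an assumption chosen for illustration.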
data/lib/derivative_rodeo/generators/alto_generator.rb
CHANGED
@@ -1,6 +1,7 @@
 # frozen_string_literal: true

 require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+require_relative 'hocr_generator'

 module DerivativeRodeo
   module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

       class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+      include HocrGenerator::RequiresExistingFile
+
       ##
       # @param output_location [StorageLocations::BaseLocation]
       # @param input_tmp_file_path [String] the location of the file that we can use for processing.
data/lib/derivative_rodeo/generators/base_generator.rb
CHANGED
@@ -187,6 +187,7 @@ module DerivativeRodeo
       #
       # @see Generators::HocrGenerator
       # @see Generators::PdfSplitGenerator
+      # @see Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from
       def with_each_requisite_location_and_tmp_file_path
         input_files.each do |input_location|
           input_location.with_existing_tmp_path do |tmp_file_path|
data/lib/derivative_rodeo/generators/hocr_generator.rb
CHANGED
@@ -65,7 +65,8 @@ module DerivativeRodeo
       #
       # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
       def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
-        mono_location_template =
+        mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+
         requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
         requisite_files.each do |input_location|
           input_location.with_existing_tmp_path do |tmp_file_path|
@@ -107,6 +108,32 @@ module DerivativeRodeo
         # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
         run(cmd)
       end
+
+      ##
+      # A mixin for generators that rely on hocr files.
+      #
+      # @see #with_each_requisite_location_and_tmp_file_path
+      module RequiresExistingFile
+        ##
+        # @param builder [Class, #generated_files]
+        #
+        # When a generator depends on a hocr file, this method will ensure that we have the requisite
+        # hocr file.
+        #
+        # @yieldparam file [StorageLocations::BaseLocation]
+        # @yieldparam tmp_path [String]
+        #
+        # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
+        def with_each_requisite_location_and_tmp_file_path(builder: HocrGenerator)
+          prereq_output_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+          requisite_files ||= builder.new(input_uris: input_uris, output_location_template: prereq_output_location_template).generated_files
+          requisite_files.each do |input_location|
+            input_location.with_existing_tmp_path do |tmp_file_path|
+              yield(input_location, tmp_file_path)
+            end
+          end
+        end
+      end
     end
   end
 end
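The `RequiresExistingFile` mixin can replace the inherited behaviour because an included module sits ahead of the superclass in Ruby's method lookup. The sketch below illustrates only that lookup rule; every class name in it is hypothetical, and the method is reduced to a single `yield` rather than the gem's template and temp-file handling.

```ruby
# Hypothetical stand-ins; none of these are the gem's classes.
class BaseGeneratorSketch
  # Default: the requisite file is the original input.
  def with_each_requisite
    yield(:original_input)
  end
end

class HocrGeneratorSketch < BaseGeneratorSketch
  # Mixin for generators that need an hocr file before they can run.
  module RequiresExistingFile
    def with_each_requisite
      yield(:hocr_file)
    end
  end
end

class WordCoordinatesGeneratorSketch < BaseGeneratorSketch
  # Including the module places it between this class and its superclass,
  # so the module's method wins over BaseGeneratorSketch's.
  include HocrGeneratorSketch::RequiresExistingFile
end

seen = []
WordCoordinatesGeneratorSketch.new.with_each_requisite { |file| seen << file }
puts seen.inspect # => [:hocr_file]
```

This is also why the generators that include the mixin need `require_relative 'hocr_generator'`: the constant `HocrGenerator::RequiresExistingFile` must be loaded before the `include` line runs.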
data/lib/derivative_rodeo/generators/plain_text_generator.rb
CHANGED
@@ -1,6 +1,7 @@
 # frozen_string_literal: true

 require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+require_relative 'hocr_generator'

 module DerivativeRodeo
   module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

       class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+      include HocrGenerator::RequiresExistingFile
+
       ##
       # @param output_location [StorageLocations::BaseLocation]
       # @param input_tmp_file_path [String] the location of the file that we can use for processing.
data/lib/derivative_rodeo/generators/word_coordinates_generator.rb
CHANGED
@@ -1,5 +1,6 @@
 # frozen_string_literal: true

+require_relative 'hocr_generator'
 module DerivativeRodeo
   module Generators
     ##
@@ -9,6 +10,8 @@ module DerivativeRodeo
     class WordCoordinatesGenerator < BaseGenerator
       self.output_extension = "coordinates.json"

+      include HocrGenerator::RequiresExistingFile
+
       ##
       # @param output_location [StorageLocations::BaseLocation]
       # @param input_tmp_file_path [String] the location of the file that we can use for processing.
data/lib/derivative_rodeo/services/convert_uri_via_template_service.rb
CHANGED
@@ -7,6 +7,7 @@ module DerivativeRodeo
     # A service to convert an array of :from_uris to :to_uris via a :template.
     #
     # @see .call
+    # @see .coerce_pre_requisite_template_from
     class ConvertUriViaTemplateService
       DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
       FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze
@@ -16,6 +17,15 @@ module DerivativeRodeo
       SCHEME_FOR_URI_REGEXP = %r{^(?<from_scheme>[^:]+)://}.freeze
       attr_accessor :from_uri, :template, :adapter, :separator, :uri, :from_scheme, :path, :parts, :dir_parts, :filename, :basename, :extension, :template_without_query, :template_query

+      ##
+      # @!group Class Attributes
+      #
+      # @!attribute separator [r|w]
+      # @return [String] the directory seperator character; default: "/"
+      class_attribute :separator, default: '/', instance_accessor: false
+      # @!endgroup Class Attributes
+      ##
+
       ##
       # Convert the given :from_uris to a different list of uris based on the given :template.
       #
@@ -32,7 +42,7 @@ module DerivativeRodeo
       # @param from_uri [String] Of the form "scheme://dir/parts/basename.extension"
       # @param template [String] Another URI that may contain path_parts or scheme template values.
       # @param adapter [StorageLocations::Location]
-      # @param
+      # @param options [Hash<Symbol, Object>]
       #
       # @return [String]
       #
@@ -46,16 +56,31 @@ module DerivativeRodeo
       # from_uris: ["file:///path1/A/file.pdf", "aws:///path2/B/file.pdf"],
       # template: "file:///dest1/{{dir_parts[-1..-1]}}/{{ filename }}")
       # => ["file:///dest1/A/file.pdf", "aws:///dest1/B/file.pdf"]
-      def self.call(from_uri:, template:, adapter: nil,
-        new(from_uri: from_uri, template: template, adapter: adapter,
+      def self.call(from_uri:, template:, adapter: nil, **options)
+        new(from_uri: from_uri, template: template, adapter: adapter, **options).call
+      end
+
+      ##
+      # There are generators that have requisite files necessary for processing.
+      #
+      # For example, before we run `tesseract` on an image, we would like to make sure it is
+      # monochrome. Hence the interplay between the {Generators::HocrGenerator} and the
+      # {Generators::MonochromeGenerator}.
+      #
+      # @param template [String]
+      # @return [String]
+      #
+      # @see Generators::BaseGenerator#with_each_requisite_location_and_tmp_file_path
+      def self.coerce_pre_requisite_template_from(template:)
+        template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
       end

       # rubocop:disable Metrics/MethodLength
-      def initialize(from_uri:, template:, adapter: nil,
+      def initialize(from_uri:, template:, adapter: nil, **options)
         @from_uri = from_uri
         @template = template
         @adapter = adapter
-        @separator = separator
+        @separator = options.fetch(:separator) { self.class.separator }

         @uri, _query = from_uri.split("?")
         @from_scheme, @path = uri.split("://")
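Two behaviours in this change are easy to check in isolation: the class-level `separator` default that an instance can override through `**options`, and the way `coerce_pre_requisite_template_from` swaps the last path segment of a template for `{{ basename }}{{ extension }}`. The sketch below is an assumption-laden stand-in, not the gem's class: `TemplateServiceSketch` is a hypothetical name, it uses plain Ruby accessors instead of ActiveSupport's `class_attribute`, and the example template is made up for illustration.

```ruby
# Hypothetical stand-in for the service; plain Ruby replaces class_attribute.
class TemplateServiceSketch
  class << self
    attr_writer :separator

    # Class-level default directory separator.
    def separator
      @separator ||= "/"
    end

    # Keep every segment but the last, then append the placeholders that
    # preserve the requisite file's own basename and extension.
    def coerce_pre_requisite_template_from(template:)
      template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
    end
  end

  attr_reader :separator

  # An explicit :separator option wins; otherwise fall back to the class default.
  def initialize(**options)
    @separator = options.fetch(:separator) { self.class.separator }
  end
end

coerced = TemplateServiceSketch.coerce_pre_requisite_template_from(
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"
)
puts coerced # => file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}{{ extension }}
```

Using `fetch` with a block means the class default is only consulted when the caller did not pass `:separator` at all, which is the same shape as the new `initialize` in the diff.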
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: derivative-rodeo
 version: !ruby/object:Gem::Version
-  version: 0.5.
+  version: 0.5.3
 platform: ruby
 authors:
 - Rob Kaufman
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-
+date: 2023-12-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -337,7 +337,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
 summary: An ETL Ecosystem for Derivative Processing.