derivative-rodeo 0.5.2 → 0.5.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ab1b83bcaef257fb1bf6759a69fef10c5f51753d7361b3ea42453134efc65ee7
- data.tar.gz: 726e7838955d8a1380ad18b24663ae0212d1eceae5c47f789050bda61f1460e1
+ metadata.gz: 21db0b8b76cb0eadec2dab8848e7a7ac6240895eee96b5a247ed627fc682ac35
+ data.tar.gz: ab68faec3b4c8775f9243ae53090bb89191e4ea57c6ef2e8e8f40c1d4fa73f16
  SHA512:
- metadata.gz: d35baa8c3020a5dadfe020cf091721a7132e034c671323c26cf89ad21923b4240bb1f1e6c4fc1675f0573f86e8d16eb9d67e17d08ba55d7e0cccc056ff0affce
- data.tar.gz: a5806370c9a0f797fb6f837536b47be9acbf8ef5a98f1d1c04f1886554d9834d5c97447ad3e701d1051391110f384a83128f2e8d7909ebaab4f6ba58e51ee7b0
+ metadata.gz: 6605936719a33bd7f8bbb6d945b1aa1d87e6b08a53206890af462668a975f519f5a29235e963404d1c159d8486af2db1800015da30be93a9493938e4d69c8c09
+ data.tar.gz: 3ad8cfa0dd71d4167439277c958ccca3682dc87c095afcbcbeb45afe2cf399c70b19d887fec2faaf763fc54a1d2d643957c72124768db8efad76f67c63624449
data/README.md CHANGED
@@ -17,6 +17,7 @@
  - [Registered Generators](#registered-generators)
  - [Storage Locations](#storage-locations)
  - [Supported Storage Locations](#supported-storage-locations)
+ - [Templates](#templates)
  - [Development](#development)
  - [Logging in Test Environment](#logging-in-test-environment)
  - [Contributing](#contributing)
@@ -34,9 +35,11 @@ The `DerivativeRodeo` "moves" files from one storage location (e.g. *input*) to
  
  ## Process Life Cycle
  
- In the case of a *input* storage location, we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
+ In the case of a *input* storage location (e.g. `input_location`), we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
  
- In the case of a *output* storage location, we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+ In the case of a *output* storage location (e.g. `output_location`), we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+
+ There is also the concept of the *pre\_processed* storage location; when the *pre\_processed* storage location exists for the given input, copy that *pre\_processed* file to the *output* location. And skip running the derivative generator on the *input* storage location. In other words, if we've already done the derivation elsewhere, use that.
  
  During the generator's process, we need to have a working copy of both the *input* and *output* file. This is done by creating a temporary file.
  
@@ -224,6 +227,24 @@ Storage locations follow a [URI pattern](https://en.wikipedia.org/wiki/Uniform_R
  - `s3://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Storage Service">S3</abbr> storage system
  - `sqs://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Queue Service">SQS</abbr>
  
+ #### Templates
+
+ Throughout the code you'll see reference to the following concepts:
+
+ - `input_location_template`
+ - `output_location_template`
+ - `preprocessed_location_template`
+
+ In [Process Life Cycle](#process-life-cycle) we discussed the `input_location`, `output_location`, and `preprocessed_location`. The concept of the template provides a flexibility in mapping a location to another location
+
+ Examples of mapping one file path to another are:
+
+ - I want to copy `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`.
+ - I want to transform `file:///tmp/GUID/file.jpg` to `file:///tmp/GUID/file.hocr`; that is run OCR on an image and write a `.hocr` file.
+ - I want to use the `file:///tmp/GUID/file.hocr` to generate a `file:///tmp/GUID/file.coordinates.json`; that is convert the HOCR file to a coordinates.json file.
+
+ See [DerivativeRodeo::Service::ConvertUriViaTemplateService](./lib/derivative_rodeo/services/convert_uri_via_template_service.rb) for more details.
+
  ## Development
  
  - Checkout the repository: `git clone https://github.com/scientist-softserv/derivative_rodeo`
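The first mapping example in the new Templates section of the README (copying `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`) can be illustrated with a small stand-alone sketch. It is not the gem's implementation: `map_uri` is a hypothetical helper, and it only handles the `{{dir_parts[a..b]}}` and `{{ filename }}` placeholders, ignoring scheme handling, query strings, and adapters. The two regular expressions are copied from `ConvertUriViaTemplateService` as it appears in this diff.

```ruby
# Simplified, hypothetical stand-in for the template mapping; not the gem's code.
# Both regular expressions are copied from ConvertUriViaTemplateService in this diff.
DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze

# map_uri is an illustrative name, not part of the gem.
def map_uri(from_uri:, template:, separator: "/")
  # Drop the scheme, then split the remaining path into directory parts and a filename.
  _scheme, path = from_uri.split("://", 2)
  parts = path.split(separator)
  filename = parts.last
  dir_parts = parts[0..-2]

  # Substitute the requested slice of directory parts.
  with_dirs = template.gsub(DIR_PARTS_REPLACEMENT_REGEXP) do
    match = Regexp.last_match
    dir_parts[match[:left].to_i..match[:right].to_i].join(separator)
  end

  # Substitute the filename (basename plus extension).
  with_dirs.gsub(FILENAME_REPLACEMENT_REGEXP) { filename }
end

puts map_uri(
  from_uri: "https://hello.com/world/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ filename }}"
)
# prints file:///tmp/GUID/file.jpg
```

Here `dir_parts[-1..-1]` selects only the last directory (`GUID`), which is how the template keeps the per-object directory while changing the root.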
@@ -1,6 +1,7 @@
  # frozen_string_literal: true
  
  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'
  
  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo
  
  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService
  
+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -187,6 +187,7 @@ module DerivativeRodeo
  #
  # @see Generators::HocrGenerator
  # @see Generators::PdfSplitGenerator
+ # @see Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from
  def with_each_requisite_location_and_tmp_file_path
  input_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
@@ -65,7 +65,8 @@ module DerivativeRodeo
  #
  # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
  def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
- mono_location_template = output_location_template.gsub(self.class.output_extension, builder.output_extension)
+ mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+
  requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
  requisite_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
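The hunk above replaces a `gsub` on the output extension with a call to `coerce_pre_requisite_template_from`. As a rough illustration of what that call returns, the sketch below restates the one-line method body that appears later in this diff as a stand-alone function. The `SEPARATOR` constant stands in for the service's class-level `separator`, and the sample template is a hypothetical value, not one taken from the gem.

```ruby
# Stand-alone restatement of the method body shown later in this diff.
# SEPARATOR stands in for the service's class-level `separator` (default "/").
SEPARATOR = "/"

def coerce_pre_requisite_template_from(template:)
  # Keep every segment except the last, then append generic filename placeholders.
  template.split(SEPARATOR)[0..-2].join(SEPARATOR) + "#{SEPARATOR}{{ basename }}{{ extension }}"
end

# Hypothetical output template for an hOCR derivative.
hocr_template = "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"

puts coerce_pre_requisite_template_from(template: hocr_template)
# prints file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}{{ extension }}
```

Because only the final path segment is rewritten, the result no longer depends on the generator's `output_extension` appearing literally in the template, which the removed `gsub` line relied on.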
@@ -107,6 +108,32 @@ module DerivativeRodeo
  # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
  run(cmd)
  end
+
+ ##
+ # A mixin for generators that rely on hocr files.
+ #
+ # @see #with_each_requisite_location_and_tmp_file_path
+ module RequiresExistingFile
+ ##
+ # @param builder [Class, #generated_files]
+ #
+ # When a generator depends on a hocr file, this method will ensure that we have the requisite
+ # hocr file.
+ #
+ # @yieldparam file [StorageLocations::BaseLocation]
+ # @yieldparam tmp_path [String]
+ #
+ # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
+ def with_each_requisite_location_and_tmp_file_path(builder: HocrGenerator)
+ prereq_output_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+ requisite_files ||= builder.new(input_uris: input_uris, output_location_template: prereq_output_location_template).generated_files
+ requisite_files.each do |input_location|
+ input_location.with_existing_tmp_path do |tmp_file_path|
+ yield(input_location, tmp_file_path)
+ end
+ end
+ end
+ end
  end
  end
  end
@@ -1,6 +1,7 @@
  # frozen_string_literal: true
  
  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'
  
  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo
  
  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService
  
+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -1,5 +1,6 @@
  # frozen_string_literal: true
  
+ require_relative 'hocr_generator'
  module DerivativeRodeo
  module Generators
  ##
@@ -9,6 +10,8 @@ module DerivativeRodeo
  class WordCoordinatesGenerator < BaseGenerator
  self.output_extension = "coordinates.json"
  
+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -7,6 +7,7 @@ module DerivativeRodeo
  # A service to convert an array of :from_uris to :to_uris via a :template.
  #
  # @see .call
+ # @see .coerce_pre_requisite_template_from
  class ConvertUriViaTemplateService
  DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
  FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze
@@ -16,6 +17,15 @@ module DerivativeRodeo
  SCHEME_FOR_URI_REGEXP = %r{^(?<from_scheme>[^:]+)://}.freeze
  attr_accessor :from_uri, :template, :adapter, :separator, :uri, :from_scheme, :path, :parts, :dir_parts, :filename, :basename, :extension, :template_without_query, :template_query
  
+ ##
+ # @!group Class Attributes
+ #
+ # @!attribute separator [r|w]
+ # @return [String] the directory seperator character; default: "/"
+ class_attribute :separator, default: '/', instance_accessor: false
+ # @!endgroup Class Attributes
+ ##
+
  ##
  # Convert the given :from_uris to a different list of uris based on the given :template.
  #
@@ -32,7 +42,7 @@ module DerivativeRodeo
  # @param from_uri [String] Of the form "scheme://dir/parts/basename.extension"
  # @param template [String] Another URI that may contain path_parts or scheme template values.
  # @param adapter [StorageLocations::Location]
- # @param separator [String]
+ # @param options [Hash<Symbol, Object>]
  #
  # @return [String]
  #
@@ -46,16 +56,31 @@ module DerivativeRodeo
  # from_uris: ["file:///path1/A/file.pdf", "aws:///path2/B/file.pdf"],
  # template: "file:///dest1/{{dir_parts[-1..-1]}}/{{ filename }}")
  # => ["file:///dest1/A/file.pdf", "aws:///dest1/B/file.pdf"]
- def self.call(from_uri:, template:, adapter: nil, separator: "/", **options)
- new(from_uri: from_uri, template: template, adapter: adapter, separator: separator, **options).call
+ def self.call(from_uri:, template:, adapter: nil, **options)
+ new(from_uri: from_uri, template: template, adapter: adapter, **options).call
+ end
+
+ ##
+ # There are generators that have requisite files necessary for processing.
+ #
+ # For example, before we run `tesseract` on an image, we would like to make sure it is
+ # monochrome. Hence the interplay between the {Generators::HocrGenerator} and the
+ # {Generators::MonochromeGenerator}.
+ #
+ # @param template [String]
+ # @return [String]
+ #
+ # @see Generators::BaseGenerator#with_each_requisite_location_and_tmp_file_path
+ def self.coerce_pre_requisite_template_from(template:)
+ template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
  end
  
  # rubocop:disable Metrics/MethodLength
- def initialize(from_uri:, template:, adapter: nil, separator: "/", **options)
+ def initialize(from_uri:, template:, adapter: nil, **options)
  @from_uri = from_uri
  @template = template
  @adapter = adapter
- @separator = separator
+ @separator = options.fetch(:separator) { self.class.separator }
  
  @uri, _query = from_uri.split("?")
  @from_scheme, @path = uri.split("://")
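This hunk moves `separator` from an explicit keyword argument to a class-level default that an instance can override through `options`. The pattern can be sketched in plain Ruby as follows. `TemplateConverter` is a hypothetical class, and the hand-written class-level reader and writer only approximate ActiveSupport's `class_attribute` (for instance, they are not inherited by subclasses).

```ruby
# Hypothetical class illustrating the "class default, per-instance override" pattern.
class TemplateConverter
  class << self
    # Plain-Ruby approximation of `class_attribute :separator, default: '/'`.
    attr_writer :separator

    def separator
      @separator ||= "/"
    end
  end

  attr_reader :separator

  def initialize(**options)
    # Use the caller's :separator when the key is given; otherwise fall back
    # to the class-level default at the moment of instantiation.
    @separator = options.fetch(:separator) { self.class.separator }
  end
end

puts TemplateConverter.new.separator                 # class default
puts TemplateConverter.new(separator: "|").separator # per-instance override

TemplateConverter.separator = ":"
puts TemplateConverter.new.separator                 # picks up the new class default
```

One consequence of `Hash#fetch` is that an explicitly passed `separator: nil` is returned as `nil` rather than triggering the fallback block, since the key is present.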
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
  
  module DerivativeRodeo
- VERSION = '0.5.2'
+ VERSION = '0.5.3'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: derivative-rodeo
  version: !ruby/object:Gem::Version
- version: 0.5.2
+ version: 0.5.3
  platform: ruby
  authors:
  - Rob Kaufman
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-11-15 00:00:00.000000000 Z
+ date: 2023-12-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -337,7 +337,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.6
+ rubygems_version: 3.3.7
  signing_key:
  specification_version: 4
  summary: An ETL Ecosystem for Derivative Processing.