derivative-rodeo 0.5.2 → 0.5.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ab1b83bcaef257fb1bf6759a69fef10c5f51753d7361b3ea42453134efc65ee7
- data.tar.gz: 726e7838955d8a1380ad18b24663ae0212d1eceae5c47f789050bda61f1460e1
+ metadata.gz: 21db0b8b76cb0eadec2dab8848e7a7ac6240895eee96b5a247ed627fc682ac35
+ data.tar.gz: ab68faec3b4c8775f9243ae53090bb89191e4ea57c6ef2e8e8f40c1d4fa73f16
  SHA512:
- metadata.gz: d35baa8c3020a5dadfe020cf091721a7132e034c671323c26cf89ad21923b4240bb1f1e6c4fc1675f0573f86e8d16eb9d67e17d08ba55d7e0cccc056ff0affce
- data.tar.gz: a5806370c9a0f797fb6f837536b47be9acbf8ef5a98f1d1c04f1886554d9834d5c97447ad3e701d1051391110f384a83128f2e8d7909ebaab4f6ba58e51ee7b0
+ metadata.gz: 6605936719a33bd7f8bbb6d945b1aa1d87e6b08a53206890af462668a975f519f5a29235e963404d1c159d8486af2db1800015da30be93a9493938e4d69c8c09
+ data.tar.gz: 3ad8cfa0dd71d4167439277c958ccca3682dc87c095afcbcbeb45afe2cf399c70b19d887fec2faaf763fc54a1d2d643957c72124768db8efad76f67c63624449
data/README.md CHANGED
@@ -17,6 +17,7 @@
  - [Registered Generators](#registered-generators)
  - [Storage Locations](#storage-locations)
  - [Supported Storage Locations](#supported-storage-locations)
+ - [Templates](#templates)
  - [Development](#development)
  - [Logging in Test Environment](#logging-in-test-environment)
  - [Contributing](#contributing)
@@ -34,9 +35,11 @@ The `DerivativeRodeo` "moves" files from one storage location (e.g. *input*) to

  ## Process Life Cycle

- In the case of a *input* storage location, we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
+ In the case of a *input* storage location (e.g. `input_location`), we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.

- In the case of a *output* storage location, we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+ In the case of a *output* storage location (e.g. `output_location`), we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+
+ There is also the concept of the *pre\_processed* storage location; when the *pre\_processed* storage location exists for the given input, copy that *pre\_processed* file to the *output* location. And skip running the derivative generator on the *input* storage location. In other words, if we've already done the derivation elsewhere, use that.

  During the generator's process, we need to have a working copy of both the *input* and *output* file. This is done by creating a temporary file.

@@ -224,6 +227,24 @@ Storage locations follow a [URI pattern](https://en.wikipedia.org/wiki/Uniform_R
  - `s3://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Storage Service">S3</abbr> storage system
  - `sqs://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Queue Service">SQS</abbr>

+ #### Templates
+
+ Throughout the code you'll see reference to the following concepts:
+
+ - `input_location_template`
+ - `output_location_template`
+ - `preprocessed_location_template`
+
+ In [Process Life Cycle](#process-life-cycle) we discussed the `input_location`, `output_location`, and `preprocessed_location`. The concept of the template provides a flexibility in mapping a location to another location
+
+ Examples of mapping one file path to another are:
+
+ - I want to copy `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`.
+ - I want to transform `file:///tmp/GUID/file.jpg` to `file:///tmp/GUID/file.hocr`; that is run OCR on an image and write a `.hocr` file.
+ - I want to use the `file:///tmp/GUID/file.hocr` to generate a `file:///tmp/GUID/file.coordinates.json`; that is convert the HOCR file to a coordinates.json file.
+
+ See [DerivativeRodeo::Service::ConvertUriViaTemplateService](./lib/derivative_rodeo/services/convert_uri_via_template_service.rb) for more details.
+
  ## Development

  - Checkout the repository: `git clone https://github.com/scientist-softserv/derivative_rodeo`
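The path mappings listed in the new Templates section can be illustrated with a small standalone sketch. This is not the gem's implementation: `expand_template` is a hypothetical helper, the `{{ basename }}` placeholder handling is an assumption (here it is everything before the first dot), and only the `dir_parts` and `filename` patterns are modeled on the regular expressions visible elsewhere in this diff.

```ruby
# Hypothetical helper for illustration only; the gem's own logic lives in
# Services::ConvertUriViaTemplateService and may differ.
DIR_PARTS = /\{\{\s*dir_parts\[(?<left>-?\d+)\.\.(?<right>-?\d+)\]\s*\}\}/
FILENAME  = /\{\{\s*filename\s*\}\}/
BASENAME  = /\{\{\s*basename\s*\}\}/ # assumed placeholder semantics

def expand_template(from_uri:, template:, separator: "/")
  uri, _query = from_uri.split("?")          # drop any query string
  _scheme, path = uri.split("://")           # "file", "/tmp/GUID/file.jpg"
  parts = path.split(separator)
  filename = parts.last                      # "file.jpg"
  dir_parts = parts[0..-2]                   # everything before the filename
  basename = filename.split(".").first       # assumption: text before the first dot

  template
    .gsub(DIR_PARTS) do
      match = Regexp.last_match
      dir_parts[match[:left].to_i..match[:right].to_i].join(separator)
    end
    .gsub(FILENAME, filename)
    .gsub(BASENAME, basename)
end

# Copy a remote image to a local path, keeping only the GUID directory.
puts expand_template(
  from_uri: "https://hello.com/world/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ filename }}"
)

# Map the local image to the location of its hOCR derivative.
puts expand_template(
  from_uri: "file:///tmp/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"
)
```

The two calls correspond to the first two README examples: a copy that preserves the filename, and a transform that swaps the extension.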
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'

  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -187,6 +187,7 @@ module DerivativeRodeo
  #
  # @see Generators::HocrGenerator
  # @see Generators::PdfSplitGenerator
+ # @see Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from
  def with_each_requisite_location_and_tmp_file_path
  input_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
@@ -65,7 +65,8 @@ module DerivativeRodeo
  #
  # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
  def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
- mono_location_template = output_location_template.gsub(self.class.output_extension, builder.output_extension)
+ mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+
  requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
  requisite_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
@@ -107,6 +108,32 @@ module DerivativeRodeo
  # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
  run(cmd)
  end
+
+ ##
+ # A mixin for generators that rely on hocr files.
+ #
+ # @see #with_each_requisite_location_and_tmp_file_path
+ module RequiresExistingFile
+ ##
+ # @param builder [Class, #generated_files]
+ #
+ # When a generator depends on a hocr file, this method will ensure that we have the requisite
+ # hocr file.
+ #
+ # @yieldparam file [StorageLocations::BaseLocation]
+ # @yieldparam tmp_path [String]
+ #
+ # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
+ def with_each_requisite_location_and_tmp_file_path(builder: HocrGenerator)
+ prereq_output_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+ requisite_files ||= builder.new(input_uris: input_uris, output_location_template: prereq_output_location_template).generated_files
+ requisite_files.each do |input_location|
+ input_location.with_existing_tmp_path do |tmp_file_path|
+ yield(input_location, tmp_file_path)
+ end
+ end
+ end
+ end
  end
  end
  end
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'

  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -1,5 +1,6 @@
  # frozen_string_literal: true

+ require_relative 'hocr_generator'
  module DerivativeRodeo
  module Generators
  ##
@@ -9,6 +10,8 @@ module DerivativeRodeo
  class WordCoordinatesGenerator < BaseGenerator
  self.output_extension = "coordinates.json"

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -7,6 +7,7 @@ module DerivativeRodeo
  # A service to convert an array of :from_uris to :to_uris via a :template.
  #
  # @see .call
+ # @see .coerce_pre_requisite_template_from
  class ConvertUriViaTemplateService
  DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
  FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze
@@ -16,6 +17,15 @@ module DerivativeRodeo
  SCHEME_FOR_URI_REGEXP = %r{^(?<from_scheme>[^:]+)://}.freeze
  attr_accessor :from_uri, :template, :adapter, :separator, :uri, :from_scheme, :path, :parts, :dir_parts, :filename, :basename, :extension, :template_without_query, :template_query

+ ##
+ # @!group Class Attributes
+ #
+ # @!attribute separator [r|w]
+ # @return [String] the directory seperator character; default: "/"
+ class_attribute :separator, default: '/', instance_accessor: false
+ # @!endgroup Class Attributes
+ ##
+
  ##
  # Convert the given :from_uris to a different list of uris based on the given :template.
  #
@@ -32,7 +42,7 @@ module DerivativeRodeo
  # @param from_uri [String] Of the form "scheme://dir/parts/basename.extension"
  # @param template [String] Another URI that may contain path_parts or scheme template values.
  # @param adapter [StorageLocations::Location]
- # @param separator [String]
+ # @param options [Hash<Symbol, Object>]
  #
  # @return [String]
  #
@@ -46,16 +56,31 @@ module DerivativeRodeo
  # from_uris: ["file:///path1/A/file.pdf", "aws:///path2/B/file.pdf"],
  # template: "file:///dest1/{{dir_parts[-1..-1]}}/{{ filename }}")
  # => ["file:///dest1/A/file.pdf", "aws:///dest1/B/file.pdf"]
- def self.call(from_uri:, template:, adapter: nil, separator: "/", **options)
- new(from_uri: from_uri, template: template, adapter: adapter, separator: separator, **options).call
+ def self.call(from_uri:, template:, adapter: nil, **options)
+ new(from_uri: from_uri, template: template, adapter: adapter, **options).call
+ end
+
+ ##
+ # There are generators that have requisite files necessary for processing.
+ #
+ # For example, before we run `tesseract` on an image, we would like to make sure it is
+ # monochrome. Hence the interplay between the {Generators::HocrGenerator} and the
+ # {Generators::MonochromeGenerator}.
+ #
+ # @param template [String]
+ # @return [String]
+ #
+ # @see Generators::BaseGenerator#with_each_requisite_location_and_tmp_file_path
+ def self.coerce_pre_requisite_template_from(template:)
+ template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
  end

  # rubocop:disable Metrics/MethodLength
- def initialize(from_uri:, template:, adapter: nil, separator: "/", **options)
+ def initialize(from_uri:, template:, adapter: nil, **options)
  @from_uri = from_uri
  @template = template
  @adapter = adapter
- @separator = separator
+ @separator = options.fetch(:separator) { self.class.separator }

  @uri, _query = from_uri.split("?")
  @from_scheme, @path = uri.split("://")
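To make the effect of `coerce_pre_requisite_template_from` concrete, here is a standalone sketch that reuses the one-line expression from the hunk above. It takes `separator` as an explicit keyword rather than reading the class attribute, and `resolve_separator` is a hypothetical stand-in for the new `options.fetch(:separator) { ... }` fallback in the initializer. How `{{ basename }}` and `{{ extension }}` are later expanded is not shown in this diff, so the sketch stops at the rewritten template string.

```ruby
# Standalone sketch; mirrors the expression added in this diff, but as a
# top-level method with an explicit separator (an assumption for illustration).
def coerce_pre_requisite_template_from(template:, separator: "/")
  # Drop the final path segment of the template...
  parent = template.split(separator)[0..-2].join(separator)
  # ...and replace it with generic basename/extension placeholders.
  "#{parent}#{separator}{{ basename }}{{ extension }}"
end

# Hypothetical stand-in for the initializer's fallback: use the caller's
# :separator option when given, otherwise the class-level default.
def resolve_separator(class_default: "/", **options)
  options.fetch(:separator) { class_default }
end

puts coerce_pre_requisite_template_from(
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"
)
puts resolve_separator                    # falls back to the default
puts resolve_separator(separator: "\\")   # caller-supplied value wins
```

Compared with the removed `gsub` on the output extension, this approach discards the whole final segment of the output template, so it does not depend on the output extension appearing literally in the template.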
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module DerivativeRodeo
- VERSION = '0.5.2'
+ VERSION = '0.5.3'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: derivative-rodeo
  version: !ruby/object:Gem::Version
- version: 0.5.2
+ version: 0.5.3
  platform: ruby
  authors:
  - Rob Kaufman
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-11-15 00:00:00.000000000 Z
+ date: 2023-12-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -337,7 +337,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.6
+ rubygems_version: 3.3.7
  signing_key:
  specification_version: 4
  summary: An ETL Ecosystem for Derivative Processing.