derivative-rodeo 0.5.1 → 0.5.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: d2dc233be659f66709737ebdebc8ac6a69802f74bce6ac450e5eabb383ea4e54
- data.tar.gz: 7a72ce66e69f827374b6f8e72767aca12638c71a9e5b2c63be09bbca08c12640
+ metadata.gz: 21db0b8b76cb0eadec2dab8848e7a7ac6240895eee96b5a247ed627fc682ac35
+ data.tar.gz: ab68faec3b4c8775f9243ae53090bb89191e4ea57c6ef2e8e8f40c1d4fa73f16
  SHA512:
- metadata.gz: de5ff8fc29943f5dc52b49c7741ae4a742793439dcca97ea94259d044cc15870dd19ced0911d9305901757e838a8ecb401eb81744c4f0529eef9607534d5dcb8
- data.tar.gz: bec393ace49d0e6689a07d335017987fa6e6edf9aae632159b87150dc2d4a0720441d64f6c6ce48ceb7c4498df3b6ada40b010cd0d439cb76cb071bd65c2c19d
+ metadata.gz: 6605936719a33bd7f8bbb6d945b1aa1d87e6b08a53206890af462668a975f519f5a29235e963404d1c159d8486af2db1800015da30be93a9493938e4d69c8c09
+ data.tar.gz: 3ad8cfa0dd71d4167439277c958ccca3682dc87c095afcbcbeb45afe2cf399c70b19d887fec2faaf763fc54a1d2d643957c72124768db8efad76f67c63624449
data/README.md CHANGED
@@ -17,6 +17,7 @@
  - [Registered Generators](#registered-generators)
  - [Storage Locations](#storage-locations)
  - [Supported Storage Locations](#supported-storage-locations)
+ - [Templates](#templates)
  - [Development](#development)
  - [Logging in Test Environment](#logging-in-test-environment)
  - [Contributing](#contributing)
@@ -34,9 +35,11 @@ The `DerivativeRodeo` "moves" files from one storage location (e.g. *input*) to

  ## Process Life Cycle

- In the case of a *input* storage location, we expect that the underlying file pointed at by the *input* storage location exists. After all we can't move what we don't have.
+ In the case of an *input* storage location (e.g. `input_location`), we expect that the underlying file pointed at by the *input* storage location exists. After all, we can't move what we don't have.

- In the case of a *output* storage location, we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+ In the case of an *output* storage location (e.g. `output_location`), we expect that the underlying file will exist after the generator has completed. The *output* storage location *could* already exist or we might need to generate the file for the *output* location.
+
+ There is also the concept of the *pre\_processed* storage location: when the *pre\_processed* storage location exists for the given input, copy that *pre\_processed* file to the *output* location and skip running the derivative generator on the *input* storage location. In other words, if we've already done the derivation elsewhere, use that.

  During the generator's process, we need to have a working copy of both the *input* and *output* file. This is done by creating a temporary file.
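
As a rough illustration of the life cycle described in this hunk, the sketch below models storage locations as plain local file paths. The method name `derive_file`, its keyword arguments, the returned symbols, and the order of the checks are all assumptions made for illustration; none of them are the gem's API.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of derivative-rodeo): locations are plain
# local paths, and the block stands in for a derivative generator.
def derive_file(input_path:, output_path:, pre_processed_path: nil)
  # The output may already exist; in that case there is nothing to do.
  return :output_already_exists if File.exist?(output_path)

  # If the derivation was already done elsewhere, copy it and skip the generator.
  if pre_processed_path && File.exist?(pre_processed_path)
    FileUtils.cp(pre_processed_path, output_path)
    return :copied_pre_processed
  end

  # We can't move what we don't have.
  raise ArgumentError, "missing input: #{input_path}" unless File.exist?(input_path)

  # Otherwise run the generator (the block) against the input.
  File.write(output_path, yield(File.read(input_path)))
  :generated
end

Dir.mktmpdir do |dir|
  input = File.join(dir, "page.txt")
  File.write(input, "hello")
  output = File.join(dir, "page.out")

  puts derive_file(input_path: input, output_path: output) { |text| text.upcase }
  puts File.read(output)
end
```

Whether an existing output should win over a pre-processed copy is a design choice this sketch makes arbitrarily; the README text does not specify the order.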
@@ -224,6 +227,24 @@ Storage locations follow a [URI pattern](https://en.wikipedia.org/wiki/Uniform_R
  - `s3://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Storage Service">S3</abbr> storage system
  - `sqs://` :: <abbr title="Amazon Web Service">AWS</abbr>’s <abbr title="Simple Queue Service">SQS</abbr>

+ #### Templates
+
+ Throughout the code you'll see references to the following concepts:
+
+ - `input_location_template`
+ - `output_location_template`
+ - `preprocessed_location_template`
+
+ In [Process Life Cycle](#process-life-cycle) we discussed the `input_location`, `output_location`, and `preprocessed_location`. The concept of the template provides flexibility in mapping one location to another.
+
+ Examples of mapping one file path to another are:
+
+ - I want to copy `https://hello.com/world/GUID/file.jpg` to `file:///tmp/GUID/file.jpg`.
+ - I want to transform `file:///tmp/GUID/file.jpg` to `file:///tmp/GUID/file.hocr`; that is, run OCR on an image and write a `.hocr` file.
+ - I want to use the `file:///tmp/GUID/file.hocr` to generate a `file:///tmp/GUID/file.coordinates.json`; that is, convert the HOCR file to a coordinates.json file.
+
+ See [DerivativeRodeo::Services::ConvertUriViaTemplateService](./lib/derivative_rodeo/services/convert_uri_via_template_service.rb) for more details.
+
  ## Development

  - Checkout the repository: `git clone https://github.com/scientist-softserv/derivative_rodeo`
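
The first two mappings listed in the Templates hunk above can be sketched in plain Ruby. This is a minimal stand-in, not the gem's `ConvertUriViaTemplateService`: the method `apply_template` and the constant names are hypothetical, and only the `{{ dir_parts[a..b] }}`, `{{ filename }}`, and `{{ basename }}` tokens (which appear elsewhere in this diff) are handled.

```ruby
# Hypothetical token patterns, modelled on the regexps shown in the diff.
DIR_PARTS_TOKEN = /\{\{\s*dir_parts\[(?<left>-?\d+)\.\.(?<right>-?\d+)\]\s*\}\}/.freeze
FILENAME_TOKEN = /\{\{\s*filename\s*\}\}/.freeze
BASENAME_TOKEN = /\{\{\s*basename\s*\}\}/.freeze

# Hypothetical helper: expand a template against a source URI.
def apply_template(from_uri:, template:, separator: "/")
  _scheme, path = from_uri.split("://", 2)
  parts = path.split(separator).reject(&:empty?)
  filename = parts.last                      # e.g. "file.jpg"
  dir_parts = parts[0..-2]                   # everything before the filename
  basename = File.basename(filename, ".*")   # e.g. "file"

  template
    .gsub(DIR_PARTS_TOKEN) do
      match = Regexp.last_match
      dir_parts[match[:left].to_i..match[:right].to_i].join(separator)
    end
    .gsub(FILENAME_TOKEN, filename)
    .gsub(BASENAME_TOKEN, basename)
end

# Copy a remote image to a local path, keeping the GUID directory.
puts apply_template(
  from_uri: "https://hello.com/world/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ filename }}"
)

# Map the local image to the location of its .hocr derivative.
puts apply_template(
  from_uri: "file:///tmp/GUID/file.jpg",
  template: "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"
)
```

Under these assumptions the two calls print `file:///tmp/GUID/file.jpg` and `file:///tmp/GUID/file.hocr`, matching the first two README examples.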
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'

  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -187,6 +187,7 @@ module DerivativeRodeo
  #
  # @see Generators::HocrGenerator
  # @see Generators::PdfSplitGenerator
+ # @see Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from
  def with_each_requisite_location_and_tmp_file_path
  input_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
@@ -65,7 +65,8 @@ module DerivativeRodeo
  #
  # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
  def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
- mono_location_template = output_location_template.gsub(self.class.output_extension, builder.output_extension)
+ mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+
  requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
  requisite_files.each do |input_location|
  input_location.with_existing_tmp_path do |tmp_file_path|
@@ -107,6 +108,32 @@ module DerivativeRodeo
  # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
  run(cmd)
  end
+
+ ##
+ # A mixin for generators that rely on hocr files.
+ #
+ # @see #with_each_requisite_location_and_tmp_file_path
+ module RequiresExistingFile
+ ##
+ # @param builder [Class, #generated_files]
+ #
+ # When a generator depends on a hocr file, this method will ensure that we have the requisite
+ # hocr file.
+ #
+ # @yieldparam file [StorageLocations::BaseLocation]
+ # @yieldparam tmp_path [String]
+ #
+ # @see BaseGenerator#with_each_requisite_location_and_tmp_file_path for further discussion
+ def with_each_requisite_location_and_tmp_file_path(builder: HocrGenerator)
+ prereq_output_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
+ requisite_files ||= builder.new(input_uris: input_uris, output_location_template: prereq_output_location_template).generated_files
+ requisite_files.each do |input_location|
+ input_location.with_existing_tmp_path do |tmp_file_path|
+ yield(input_location, tmp_file_path)
+ end
+ end
+ end
+ end
  end
  end
  end
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require_relative '../services/extract_word_coordinates_from_hocr_sgml_service'
+ require_relative 'hocr_generator'

  module DerivativeRodeo
  module Generators
@@ -13,6 +14,8 @@ module DerivativeRodeo

  class_attribute :service, default: Services::ExtractWordCoordinatesFromHocrSgmlService

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -1,5 +1,6 @@
  # frozen_string_literal: true

+ require_relative 'hocr_generator'
  module DerivativeRodeo
  module Generators
  ##
@@ -9,6 +10,8 @@ module DerivativeRodeo
  class WordCoordinatesGenerator < BaseGenerator
  self.output_extension = "coordinates.json"

+ include HocrGenerator::RequiresExistingFile
+
  ##
  # @param output_location [StorageLocations::BaseLocation]
  # @param input_tmp_file_path [String] the location of the file that we can use for processing.
@@ -7,6 +7,7 @@ module DerivativeRodeo
  # A service to convert an array of :from_uris to :to_uris via a :template.
  #
  # @see .call
+ # @see .coerce_pre_requisite_template_from
  class ConvertUriViaTemplateService
  DIR_PARTS_REPLACEMENT_REGEXP = %r{\{\{\s*dir_parts\[(?<left>\-?\d+)\.\.(?<right>\-?\d+)\]\s*\}\}}.freeze
  FILENAME_REPLACEMENT_REGEXP = %r{\{\{\s*filename\s*\}\}}.freeze
@@ -16,6 +17,15 @@ module DerivativeRodeo
  SCHEME_FOR_URI_REGEXP = %r{^(?<from_scheme>[^:]+)://}.freeze
  attr_accessor :from_uri, :template, :adapter, :separator, :uri, :from_scheme, :path, :parts, :dir_parts, :filename, :basename, :extension, :template_without_query, :template_query

+ ##
+ # @!group Class Attributes
+ #
+ # @!attribute separator [r|w]
+ # @return [String] the directory separator character; default: "/"
+ class_attribute :separator, default: '/', instance_accessor: false
+ # @!endgroup Class Attributes
+ ##
+
  ##
  # Convert the given :from_uris to a different list of uris based on the given :template.
  #
@@ -32,7 +42,7 @@ module DerivativeRodeo
  # @param from_uri [String] Of the form "scheme://dir/parts/basename.extension"
  # @param template [String] Another URI that may contain path_parts or scheme template values.
  # @param adapter [StorageLocations::Location]
- # @param separator [String]
+ # @param options [Hash<Symbol, Object>]
  #
  # @return [String]
  #
@@ -46,16 +56,31 @@ module DerivativeRodeo
  # from_uris: ["file:///path1/A/file.pdf", "aws:///path2/B/file.pdf"],
  # template: "file:///dest1/{{dir_parts[-1..-1]}}/{{ filename }}")
  # => ["file:///dest1/A/file.pdf", "aws:///dest1/B/file.pdf"]
- def self.call(from_uri:, template:, adapter: nil, separator: "/", **options)
- new(from_uri: from_uri, template: template, adapter: adapter, separator: separator, **options).call
+ def self.call(from_uri:, template:, adapter: nil, **options)
+ new(from_uri: from_uri, template: template, adapter: adapter, **options).call
+ end
+
+ ##
+ # There are generators that have requisite files necessary for processing.
+ #
+ # For example, before we run `tesseract` on an image, we would like to make sure it is
+ # monochrome. Hence the interplay between the {Generators::HocrGenerator} and the
+ # {Generators::MonochromeGenerator}.
+ #
+ # @param template [String]
+ # @return [String]
+ #
+ # @see Generators::BaseGenerator#with_each_requisite_location_and_tmp_file_path
+ def self.coerce_pre_requisite_template_from(template:)
+ template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
  end

  # rubocop:disable Metrics/MethodLength
- def initialize(from_uri:, template:, adapter: nil, separator: "/", **options)
+ def initialize(from_uri:, template:, adapter: nil, **options)
  @from_uri = from_uri
  @template = template
  @adapter = adapter
- @separator = separator
+ @separator = options.fetch(:separator) { self.class.separator }

  @uri, _query = from_uri.split("?")
  @from_scheme, @path = uri.split("://")
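
To see what the new `coerce_pre_requisite_template_from` does to a template, the sketch below copies its one-line body from this hunk into a standalone method. Passing `separator` as a keyword argument is an assumption made so the snippet runs without the gem (the diff reads it from a class attribute), and the sample template is illustrative.

```ruby
# Standalone copy of the one-line method body shown in the diff. The
# separator keyword is an assumption; the gem uses a class attribute.
def coerce_pre_requisite_template_from(template:, separator: "/")
  # Drop the final path segment of the template, then append a segment that
  # preserves the source file's own basename and extension.
  template.split(separator)[0..-2].join(separator) + "#{separator}{{ basename }}{{ extension }}"
end

# Illustrative template for a .hocr output location.
hocr_template = "file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}.hocr"

puts coerce_pre_requisite_template_from(template: hocr_template)
# prints: file:///tmp/{{dir_parts[-1..-1]}}/{{ basename }}{{ extension }}
```

Compared with the removed `gsub` on the output extension, this version swaps out the whole final segment of the template, so the result no longer depends on where the extension text happens to appear in the template string.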
@@ -46,7 +46,7 @@ module DerivativeRodeo
  delegate :config, to: DerivativeRodeo
  end

- delegate :logger, to: DerivativeRodeo
+ delegate :config, :logger, to: DerivativeRodeo

  ##
  # @param location_name [String]
@@ -123,16 +123,14 @@ module DerivativeRodeo
  ##
  # @param file_uri [String] a URI to the file's location; this is **not** a templated URI (as
  # described in {DerivativeRodeo::Services::ConvertUriViaTemplateService}
- # @param config [DerivativeRodeo::Configuration]
- def initialize(file_uri, config: DerivativeRodeo.config)
+ def initialize(file_uri)
  @file_uri = file_uri
- @config = config
  end

  attr_accessor :tmp_file_path
  private :tmp_file_path=, :tmp_file_path

- attr_reader :config, :file_uri
+ attr_reader :file_uri

  ##
  # @param auto_write_file [Boolean] Provided as a testing helper method.
@@ -20,7 +20,7 @@ module DerivativeRodeo
  end
  end

- delegate :config, to: DerivativeRodeo
+ delegate :logger, to: DerivativeRodeo

  def with_existing_tmp_path(&block)
  with_tmp_path(lambda { |file_path, tmp_file_path, exist|
@@ -44,9 +44,9 @@ module DerivativeRodeo
  #
  # @return [String]
  def get(url)
- HTTParty.get(url, logger: config.logger)
+ HTTParty.get(url, logger: logger)
  rescue => e
- config.logger.error(%(#{e.message}\n#{e.backtrace.join("\n")}))
+ logger.error(%(#{e.message}\n#{e.backtrace.join("\n")}))
  raise e
  end

@@ -55,11 +55,24 @@ module DerivativeRodeo
  # @return [FalseClass] when the URL's head request is not successful or we've exhausted our
  # remaining redirects.
  def exist?
- HTTParty.head(file_uri, logger: config.logger)
+ HTTParty.head(file_uri, logger: logger)
  rescue => e
- config.logger.error(%(#{e.message}\n#{e.backtrace.join("\n")}))
+ logger.error(%(#{e.message}\n#{e.backtrace.join("\n")}))
  false
  end
+
+ ##
+ # @param tail_regexp [Regexp] the file pattern that we're looking to find; but due to the
+ # nature of this location adapter, it won't matter.
+ # @return [Array] always returns an empty array.
+ #
+ # @see S3Location#matching_locations_in_file_dir
+ # @see FileLocation#matching_locations_in_file_dir
+ def matching_locations_in_file_dir(tail_regexp:)
+ logger.info("#{self.class}##{__method__} for file_uri: #{file_uri.inspect}, tail_regexp: #{tail_regexp} will always return an empty array. This is the nature of the #{self.class} location.")
+
+ []
+ end
  end
  end
  end
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module DerivativeRodeo
- VERSION = '0.5.1'
+ VERSION = '0.5.3'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: derivative-rodeo
  version: !ruby/object:Gem::Version
- version: 0.5.1
+ version: 0.5.3
  platform: ruby
  authors:
  - Rob Kaufman
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-11-06 00:00:00.000000000 Z
+ date: 2023-12-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport