format_parser 0.25.1 → 0.25.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 87adbfef15c2281ab6a13f151b857409f0ffad0ecc5270d9d0bbc5cebe207cdb
- data.tar.gz: 332e3c4efd4ae01b3cf47c669debba7cc1aee5c264bef503720f632f6d801054
+ metadata.gz: dbc915cd212f2207b6f54cc4b567ea342a0bfd76af1e5e3db23330a6cc879e27
+ data.tar.gz: 0a623b8303cbf40a58d348b4e1adcda2d6fc0170b411fba529f9d3ae53533282
  SHA512:
- metadata.gz: 231d768d4b69b2c2f29bcb861888a7fbb0f4242eec5c9313d6428c9053b4fb4e7d20b8615d731a695c278ac59d3bf07b4977d217892aa1fce8fa7adc9d415efa
- data.tar.gz: 732fae92e71f6a25fa98b7d88fa70b82404f00c234a6debd641e20ad8ffd9e45c78270f20adb8ad5593cd815e128a2a409ec639351c985b79763d4bfd821fec9
+ metadata.gz: 0ed65c67f508264c0c6c12e88287266ef2bb9346c7e44934ac2e8ac97001de8f5e7f605bdd497ee6f95a45bcf4ad7f6b69021d25739e85debafaecfca76153ce
+ data.tar.gz: 4b250cd8f41bbba735e5d90a4a829c4fea699f04dd190a37499853ed48475aecd19f6d32360f081474355a4f5f9c562e1ca78e8c7c8aba91e5fb9e0ca4930299
@@ -2,8 +2,9 @@ rvm:
  - 2.2.10
  - 2.3.8
  - 2.4.9
- - 2.5.7
- - 2.6.5
+ - 2.5.8
+ - 2.6.6
+ - 2.7.2
  - jruby
  sudo: false
  cache: bundler
@@ -1,3 +1,19 @@
+ ## 0.25.6
+ * Fix FormatParser.parse (with `results: :first`) to be deterministic
+
+ ## 0.25.5
+ * DPX: Fix DPXParser to support images without aspect ratio
+
+ ## 0.25.4
+ * MP3: Fix MP3Parser to return nil for TIFF files
+ * Add support to ruby 2.7
+
+ ## 0.25.3
+ * MP3: Fix parser to not skip the first bytes if it's not an ID3 header
+
+ ## 0.25.2
+ * Hotfix Moov parser
+
  ## 0.25.1
  * MOV: Fix error "negative length"
  * MOV: Fix reading dimensions in multi-track files
@@ -83,32 +83,59 @@ of software. Ideally, this file is going to be something you have produced yours
  and you are permitted to share under the MIT license provisions.
 
  When writing a parser, please try to ensure it returns a usable result as soon as possible,
- or no result as soon as possible (once you know the file is not fit for your specific parser).
+ or `nil` as soon as possible (once you know the file is not fit for your specific parser).
  Bear in mind that we enforce read budgets per-parser, so you will not be allowed to perform
  too many reads, or perform reads which are too large.
 
- In order to create new parsers, it is recommended to make a well-named class with an instance method `call`.
+ In order to create new parsers, make a well-named class with an instance method `call`,
+ and to register a single instance of that class as the parser - so that only one object needs to be stored
+ in memory when parsing multiple inputs. In that case your object must be **thread-safe and stateless** - this
+ is really important since FormatParser is thread-safe and multiple parsing procedures may be in progress
+ concurrently against the same parser object. You can also create a Proc if your parser is fairly trivial.
 
- `call` accepts the IO-ish object as an argument, parses data that it reads from it,
- and then returns the metadata for the file (if it could recover any) or `nil` if it couldn't. All files pass
- through all parsers by default, so if you are dealing with a file that is not "your" format - return `nil` from
- your method or `break` your Proc as early as possible. A blank `return` works fine too.
+ If it will be difficult to have your parser thread-safe you can register your class itself as
+ the parser and define the `self.call` method to parse using a fresh instance every time, allowing
+ object-level state:
+
+ ```ruby
+ class MyParser
+ def self.call(io)
+ new.call(io)
+ end
+
+ def call(io)
+ @state = ...
+ end
+ ```
+
+ `call` accepts a single argument - an IO-ish object which is guaranteed to respond to the same methods as the
+ ones defined in `IOConstraint` - that is, it is a strict subset of a standard Ruby IO object. *All reads from
+ this IO object are guaranteed to be returned in binary encoding.* The IO will be at offset of 0 when your parsing
+ proc receives it and there will be no concurrent calls to that object until your proc returns.
 
- The IO will at the minimum support the subset of the IO API defined in `IOConstraint`
+ Your parsing procedure may read from this IO object, and should return either a `Result`-like object with
+ the file metadata (if it could recover any) or `nil` if it couldn't. All files pass
+ through all parsers by default, so if you are dealing with a file that is not "your" format - return `nil` from
+ your method or `break` your Proc as early as possible. A blank `return` works fine too as it actually returns `nil`.
 
- Your parser has to be registered using `FormatParser.register_parser` with the information on the formats
- and file natures it provides.
+ Your parser then needs to be registered using `FormatParser.register_parser` with the information on the formats
+ and file natures it provides. This allows FormatParser to skip your parser if, say, the user only want to parse for
+ `:image` nature files but your parser parses `:audio`.
 
- Down below you can find the most basic parser implementation:
+ Down below you can find the most basic parser implementation which parses an imaginary `IMGA` file format:
 
  ```ruby
  MyParser = ->(io) {
- # ... do some parsing with `io`
+ # ... Read the magic bytes from the start of IO - the IO is
+ # guaranteed to be fed to you at offset 0, start-of-file.
  magic_bytes = io.read(4)
+
  # breaking the block returns `nil` to the caller signaling "no match"
  break if magic_bytes != 'IMGA'
 
+ # Our file format stores the width and height as 2 32-bit unsigned integers
  parsed_witdh, parsed_height = io.read(8).unpack('VV')
+
  # ...and return the FileInformation::Image object with the metadata.
  FormatParser::Image.new(
  format: :imga,
@@ -135,8 +162,8 @@ class MyParser
  # ... do some parsing with `io`
  # The instance will be discarded after parsing, so using instance variables
  # is permitted - they are not shared between calls to `call`
- @magic_bytes = io.read(4)
- break if @magic_bytes != 'IMGA'
+ magic_bytes = io.read(4)
+ break if magic_bytes != 'IMGA'
  parsed_witdh, parsed_height = io.read(8).unpack('VV')
  FormatParser::Image.new(
  format: :imga,
@@ -145,23 +172,33 @@ class MyParser
  )
  end
 
- FormatParser.register_parser self, natures: :image, formats: :bmp
+ # Note that we register an instance of the class, not the class. It is the
+ # instance that responds to `call()` and we can do this because our object
+ # is stateless.
+ FormatParser.register_parser new, natures: :image, formats: :bmp
  end
  ```
 
- ### Calling convention for preparing parsers
+ If your parser supports file types which have a known filename extension, you can add a method to it called `likely_match?`,
+ add this method on the object you register itself. For example, for the ZIP parser we use:
+
+ ```ruby
+ def likely_match?(filename)
+ filename =~ /\.(zip|docx|keynote|numbers|pptx|xlsx)$/i
+ end
+ ```
 
- A parser that gets registered using `register_parser` must be either:
+ If your parser matches the filename it is going to be applied *earlier*, saving time. Since most FormatParser users are
+ likely to only want the first result of the parsing, the sooner your parser gets applied - the sooner you can return the result,
+ avoiding unnecessary reads.
 
- 1) An object that can be `call()`-ed itself, with an argument that conforms to `IOConstraint`
- 2) An object that responds to `new` and returns something that can be `call()`-ed with with an argument that conforms to `IOConstraint`.
+ ### Calling convention for preparing parsers
 
- The second opton is recommended for most cases.
+ A parser that gets registered using `register_parser` must be an object that can be `call()`-ed, with an argument that conforms to `IOConstraint`
 
  FormatParser is made to be used in threaded environments, and if you use instance variables
- you need your parser to be isolated from it's siblings in other threads - therefore you can pass
- a Class on registration to have your parser instantiated for each `call()`, anew.
-
+ you need your parser to be isolated from it's siblings in other threads - create a copy for one-off use inside
+ your `call` method.
 
  ## Pull requests
 
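Annotation: the class-level `self.call` pattern that the README hunk above describes can be sketched in runnable form as follows. This is only an illustration under stated assumptions: `StatefulImgaParser` is a hypothetical name, the returned Hash stands in for the `FormatParser::Image` a real parser would build, and `IMGA` is the imaginary format from the README text.

```ruby
require 'stringio'

# Hypothetical parser, not part of FormatParser. The class itself is the
# callable: every parse gets a fresh instance, so instance variables are
# never shared between threads.
class StatefulImgaParser
  def self.call(io)
    new.call(io)
  end

  def call(io)
    @magic_bytes = io.read(4)
    # A blank `return` yields nil, which signals "not my format".
    return unless @magic_bytes == 'IMGA'

    dimensions = io.read(8)
    return unless dimensions && dimensions.bytesize == 8

    # Two 32-bit little-endian unsigned integers: width, then height.
    @width, @height = dimensions.unpack('VV')
    { format: :imga, width_px: @width, height_px: @height }
  end
end

sample = StringIO.new('IMGA'.b + [640, 480].pack('VV'))
result = StatefulImgaParser.call(sample)
```

Under the new calling convention such a class would be registered as-is, because the class object responds to `call`; a stateless parser would instead register a single instance.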
@@ -36,7 +36,7 @@ module FormatParser
  # Register a parser object to be used to perform file format detection. Each parser FormatParser
  # provides out of the box registers itself using this method.
  #
- # @param callable_or_responding_to_new[#call, #new] an object that either responds to #new or to #call
+ # @param callable_parser[#call] an object that responds to #call for parsing an IO
  # @param formats[Array<Symbol>] file formats that the parser provides
  # @param natures[Array<Symbol>] file natures that the parser provides
  # @param priority[Integer] whether the parser has to be applied first or later. Parsers that offer the safest
@@ -45,39 +45,41 @@ module FormatParser
  # with a lower priority value will be applied first, and if a single result is requested, will also return
  # first.
  # @return void
- def self.register_parser(callable_or_responding_to_new, formats:, natures:, priority: LEAST_PRIORITY)
+ def self.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY)
  parser_provided_formats = Array(formats)
  parser_provided_natures = Array(natures)
  PARSER_MUX.synchronize do
- @parsers ||= Set.new
- @parsers << callable_or_responding_to_new
+ # It can't be a Set because the method `parsers_for` depends on the order
+ # that the parsers were added.
+ @parsers ||= []
+ @parsers << callable_parser unless @parsers.include?(callable_parser)
  @parsers_per_nature ||= {}
  parser_provided_natures.each do |provided_nature|
  @parsers_per_nature[provided_nature] ||= Set.new
- @parsers_per_nature[provided_nature] << callable_or_responding_to_new
+ @parsers_per_nature[provided_nature] << callable_parser
  end
  @parsers_per_format ||= {}
  parser_provided_formats.each do |provided_format|
  @parsers_per_format[provided_format] ||= Set.new
- @parsers_per_format[provided_format] << callable_or_responding_to_new
+ @parsers_per_format[provided_format] << callable_parser
  end
  @parser_priorities ||= {}
- @parser_priorities[callable_or_responding_to_new] = priority
+ @parser_priorities[callable_parser] = priority
  end
  end
 
  # Deregister a parser object (makes FormatParser forget this parser existed). Is mostly used in
  # tests, but can also be used to forcibly disable some formats completely.
  #
- # @param callable_or_responding_to_new[#call, #new] an object that either responds to #new or to #call
+ # @param callable_parser[#==] an object that is identity-equal to any other registered parser
  # @return void
- def self.deregister_parser(callable_or_responding_to_new)
+ def self.deregister_parser(callable_parser)
  # Used only in tests
  PARSER_MUX.synchronize do
- (@parsers || []).delete(callable_or_responding_to_new)
- (@parsers_per_nature || {}).values.map { |e| e.delete(callable_or_responding_to_new) }
- (@parsers_per_format || {}).values.map { |e| e.delete(callable_or_responding_to_new) }
- (@parser_priorities || {}).delete(callable_or_responding_to_new)
+ (@parsers || []).delete(callable_parser)
+ (@parsers_per_nature || {}).values.map { |e| e.delete(callable_parser) }
+ (@parsers_per_format || {}).values.map { |e| e.delete(callable_parser) }
+ (@parser_priorities || {}).delete(callable_parser)
  end
  end
 
@@ -255,7 +257,19 @@ module FormatParser
  # Order the parsers according to their priority value. The ones having a lower
  # value will sort higher and will be applied sooner
  parsers_in_order_of_priority = parsers.to_a.sort do |parser_a, parser_b|
- @parser_priorities[parser_a] <=> @parser_priorities[parser_b]
+ if @parser_priorities[parser_a] != @parser_priorities[parser_b]
+ @parser_priorities[parser_a] <=> @parser_priorities[parser_b]
+ else
+ # Some parsers have the same priority and we want them to be always sorted
+ # in the same way, to not change the result of FormatParser.parse(results: :first).
+ # When this changes, it can generate flaky tests or event different
+ # results in different environments, which can be hard to understand why.
+ # There is also no guarantee in the order that the elements are added in
+ # @@parser_priorities
+ # So, to have always the same order, we sort by the order that the parsers
+ # were registered if the priorities are the same.
+ @parsers.index(parser_a) <=> @parsers.index(parser_b)
+ end
  end
 
  # If there is one parser that is more likely to match, place it first
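Annotation: the two registry changes above (an ordered `Array` with a duplicate check, and a tie-break on registration position) combine to make the parser order reproducible. A minimal sketch of that idea follows. `TinyRegistry` and its method names are hypothetical, symbols stand in for parser objects, and `sort_by` with a composite key is used here as an equivalent formulation rather than the gem's own `sort` block.

```ruby
# Hypothetical miniature registry; not FormatParser's actual implementation.
class TinyRegistry
  def initialize
    @parsers = []     # an Array keeps registration order and supports #index
    @priorities = {}
  end

  def register(parser, priority:)
    # Skip duplicates so that re-registering never moves a parser.
    @parsers << parser unless @parsers.include?(parser)
    @priorities[parser] = priority
  end

  # Lower priority value first; equal priorities fall back to registration order.
  def in_order(candidates)
    candidates.sort_by { |parser| [@priorities.fetch(parser), @parsers.index(parser)] }
  end
end

registry = TinyRegistry.new
registry.register(:gif, priority: 0)
registry.register(:jpeg, priority: 0)
registry.register(:cr2, priority: 99)
registry.register(:png, priority: 0)

ordered = registry.in_order([:cr2, :png, :jpeg, :gif])
```

Because every sort key is unique, the result does not depend on the order in which the candidates are handed to `in_order`, which is the property the 0.25.6 changelog entry asks for.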
@@ -1,3 +1,3 @@
  module FormatParser
- VERSION = '0.25.1'
+ VERSION = '0.25.6'
  end
@@ -35,18 +35,23 @@ class FormatParser::DPXParser
  w = dpx_structure.fetch(:image).fetch(:pixels_per_line)
  h = dpx_structure.fetch(:image).fetch(:lines_per_element)
 
+ display_w = w
+ display_h = h
+
  pixel_aspect_w = dpx_structure.fetch(:orientation).fetch(:horizontal_pixel_aspect)
  pixel_aspect_h = dpx_structure.fetch(:orientation).fetch(:vertical_pixel_aspect)
- pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
 
- image_aspect = w / h.to_f * pixel_aspect
+ # Find display height and width based on aspect only if the file structure has pixel aspects
+ if pixel_aspect_h != 0 && pixel_aspect_w != 0
+ pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
 
- display_w = w
- display_h = h
- if image_aspect > 1
- display_h = (display_w / image_aspect).round
- else
- display_w = (display_h * image_aspect).round
+ image_aspect = w / h.to_f * pixel_aspect
+
+ if image_aspect > 1
+ display_h = (display_w / image_aspect).round
+ else
+ display_w = (display_h * image_aspect).round
+ end
  end
 
  FormatParser::Image.new(
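Annotation: the guarded computation in the hunk above can be isolated into a standalone function for experimentation. The sketch below mirrors the added lines; `display_dimensions` is a hypothetical helper name, not a method of `DPXParser`. Presumably the motivation is that a zero aspect value turns the float division into `NaN` or `Infinity`, and calling `round` on such a value raises `FloatDomainError` in Ruby.

```ruby
# Hypothetical helper mirroring the guarded logic; returns [display_w, display_h].
def display_dimensions(w, h, pixel_aspect_w, pixel_aspect_h)
  display_w = w
  display_h = h

  # Only correct for pixel aspect when the header actually carries one.
  if pixel_aspect_h != 0 && pixel_aspect_w != 0
    pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
    image_aspect = w / h.to_f * pixel_aspect

    if image_aspect > 1
      display_h = (display_w / image_aspect).round
    else
      display_w = (display_h * image_aspect).round
    end
  end

  [display_w, display_h]
end

without_aspect = display_dimensions(1920, 1080, 0, 0)
anamorphic = display_dimensions(960, 1080, 2, 1)
```

With no aspect information the stored pixel dimensions pass through unchanged; with a 2:1 pixel aspect the 960x1080 raster is treated as a 16:9 picture and the display height is reduced accordingly.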
@@ -68,7 +68,7 @@ class FormatParser::MOOVParser::Decoder
  def find_video_trak_atom(atoms)
  trak_atoms = find_atoms_by_path(atoms, ['moov', 'trak'])
 
- return [] if trak_atoms.empty?
+ return if trak_atoms.empty?
 
  trak_atoms.find do |trak_atom|
  hdlr_atom = find_first_atom_by_path([trak_atom], 'trak', 'mdia', 'hdlr')
@@ -29,6 +29,10 @@ class FormatParser::MP3Parser
  ZIP_LOCAL_ENTRY_SIGNATURE = "PK\x03\x04\x14\x00".b
  PNG_HEADER_BYTES = [137, 80, 78, 71, 13, 10, 26, 10].pack('C*')
 
+ MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
+ MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
+ TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE]
+
  # Wraps the Tag object returned by ID3Tag in such
  # a way that a usable JSON representation gets
  # returned
@@ -68,11 +72,16 @@ class FormatParser::MP3Parser
  return if header.start_with?(ZIP_LOCAL_ENTRY_SIGNATURE)
  return if header.start_with?(PNG_HEADER_BYTES)
 
+ io.seek(0)
+ return if TIFF_HEADER_BYTES.include?(safe_read(io, 4))
+
  # Read all the ID3 tags (or at least attempt to)
  io.seek(0)
  id3v1 = ID3Extraction.attempt_id3_v1_extraction(io)
  tags = [id3v1, ID3Extraction.attempt_id3_v2_extraction(io)].compact
 
+ io.seek(0) if tags.empty?
+
  # Compute how many bytes are occupied by the actual MPEG frames
  ignore_bytes_at_tail = id3v1 ? 128 : 0
  ignore_bytes_at_head = io.pos
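Annotation: the TIFF rejection added to `MP3Parser` is a plain magic-bytes comparison against the two byte-order variants of the TIFF header. A self-contained sketch of that check follows. The constants are copied from the hunk above, while `tiff_header?` is a hypothetical helper and `StringIO#read` stands in for the gem's `safe_read`.

```ruby
require 'stringio'

MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4') # "II*\0", little-endian TIFF
MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4') # "MM\0*", big-endian TIFF
TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE]

# Hypothetical helper: rewinds the IO and compares its first four bytes.
def tiff_header?(io)
  io.seek(0)
  TIFF_HEADER_BYTES.include?(io.read(4))
end

little_endian_tiff = StringIO.new("II*\x00\x08\x00\x00\x00".b)
id3_tagged_mp3 = StringIO.new("ID3\x03\x00\x00\x00\x00".b)

looks_like_tiff = tiff_header?(little_endian_tiff)
looks_like_mp3_tag = tiff_header?(id3_tagged_mp3)
```

Any caller that continues parsing after such a check has to seek back to offset 0 itself, which is what the `io.seek(0)` context line after the added check does.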
@@ -4,6 +4,7 @@ class FormatParser::TIFFParser
 
  MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
  MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
+ HEADER_BYTES = [MAGIC_LE, MAGIC_BE]
 
  def likely_match?(filename)
  filename =~ /\.tiff?$/i
@@ -12,7 +13,7 @@ class FormatParser::TIFFParser
  def call(io)
  io = FormatParser::IOConstraint.new(io)
 
- return unless [MAGIC_LE, MAGIC_BE].include?(safe_read(io, 4))
+ return unless HEADER_BYTES.include?(safe_read(io, 4))
  io.seek(io.pos + 2) # Skip over the offset of the IFD, EXIFR will re-read it anyway
  return if cr2?(io)
 
@@ -173,6 +173,26 @@ describe FormatParser do
  prioritized_parsers = FormatParser.parsers_for([:archive, :document, :image, :audio], [:tif, :jpg, :zip, :docx, :mp3, :aiff], 'a-file.zip')
  expect(prioritized_parsers.first).to be_kind_of(FormatParser::ZIPParser)
  end
+
+ it 'sorts the parsers by priority and name' do
+ parsers = FormatParser.parsers_for(
+ [:audio, :image],
+ [:cr2, :dpx, :fdx, :flac, :gif, :jpg, :mov, :mp4, :m4a, :mp3, :mpg, :mpeg, :ogg, :png, :tif, :wav]
+ )
+
+ expect(parsers.map { |parser| parser.class.name }).to eq([
+ 'FormatParser::GIFParser',
+ 'Class',
+ 'FormatParser::PNGParser',
+ 'FormatParser::CR2Parser',
+ 'FormatParser::DPXParser',
+ 'FormatParser::FLACParser',
+ 'FormatParser::MP3Parser',
+ 'FormatParser::OggParser',
+ 'FormatParser::TIFFParser',
+ 'FormatParser::WAVParser'
+ ])
+ end
  end
 
  describe '.register_parser and .deregister_parser' do
@@ -166,6 +166,18 @@ describe FormatParser::MP3Parser do
  expect(parsed.artist). to eq('wetransfer')
  end
 
+ it 'does not skip the first bytes if it is not a id3 tag header' do
+ fpath = fixtures_dir + '/MP3/no_id3_tags.mp3'
+
+ parsed = subject.call(File.open(fpath, 'rb'))
+
+ expect(parsed).not_to be_nil
+
+ expect(parsed.nature).to eq(:audio)
+ expect(parsed.format).to eq(:mp3)
+ expect(parsed.audio_sample_rate_hz).to eq(44100)
+ end
+
  describe '#as_json' do
  it 'converts all hash keys to string when stringify_keys: true' do
  fpath = fixtures_dir + '/MP3/Cassy.mp3'
@@ -193,4 +205,12 @@ describe FormatParser::MP3Parser do
  ).to eq([ID3Tag::Tag])
  end
  end
+
+ it 'does not recognize TIFF files as MP3' do
+ fpath = fixtures_dir + '/TIFF/test2.tif'
+
+ parsed = subject.call(File.open(fpath, 'rb'))
+
+ expect(parsed).to be_nil
+ end
  end
metadata CHANGED
@@ -1,15 +1,15 @@
  --- !ruby/object:Gem::Specification
  name: format_parser
  version: !ruby/object:Gem::Version
- version: 0.25.1
+ version: 0.25.6
  platform: ruby
  authors:
  - Noah Berman
  - Julik Tarkhanov
- autorequire:
+ autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-10-02 00:00:00.000000000 Z
+ date: 2020-12-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: ks
@@ -277,7 +277,7 @@ licenses:
  - MIT (Hippocratic)
  metadata:
  allowed_push_host: https://rubygems.org
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -292,8 +292,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.1.4
+ signing_key:
  specification_version: 4
  summary: A library for efficient parsing of file metadata
  test_files: []