format_parser 0.25.1 → 0.25.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 87adbfef15c2281ab6a13f151b857409f0ffad0ecc5270d9d0bbc5cebe207cdb
- data.tar.gz: 332e3c4efd4ae01b3cf47c669debba7cc1aee5c264bef503720f632f6d801054
+ metadata.gz: dbc915cd212f2207b6f54cc4b567ea342a0bfd76af1e5e3db23330a6cc879e27
+ data.tar.gz: 0a623b8303cbf40a58d348b4e1adcda2d6fc0170b411fba529f9d3ae53533282
  SHA512:
- metadata.gz: 231d768d4b69b2c2f29bcb861888a7fbb0f4242eec5c9313d6428c9053b4fb4e7d20b8615d731a695c278ac59d3bf07b4977d217892aa1fce8fa7adc9d415efa
- data.tar.gz: 732fae92e71f6a25fa98b7d88fa70b82404f00c234a6debd641e20ad8ffd9e45c78270f20adb8ad5593cd815e128a2a409ec639351c985b79763d4bfd821fec9
+ metadata.gz: 0ed65c67f508264c0c6c12e88287266ef2bb9346c7e44934ac2e8ac97001de8f5e7f605bdd497ee6f95a45bcf4ad7f6b69021d25739e85debafaecfca76153ce
+ data.tar.gz: 4b250cd8f41bbba735e5d90a4a829c4fea699f04dd190a37499853ed48475aecd19f6d32360f081474355a4f5f9c562e1ca78e8c7c8aba91e5fb9e0ca4930299
@@ -2,8 +2,9 @@ rvm:
  - 2.2.10
  - 2.3.8
  - 2.4.9
- - 2.5.7
- - 2.6.5
+ - 2.5.8
+ - 2.6.6
+ - 2.7.2
  - jruby
  sudo: false
  cache: bundler
@@ -1,3 +1,19 @@
+ ## 0.25.6
+ * Fix FormatParser.parse (with `results: :first`) to be deterministic
+
+ ## 0.25.5
+ * DPX: Fix DPXParser to support images without aspect ratio
+
+ ## 0.25.4
+ * MP3: Fix MP3Parser to return nil for TIFF files
+ * Add support to ruby 2.7
+
+ ## 0.25.3
+ * MP3: Fix parser to not skip the first bytes if it's not an ID3 header
+
+ ## 0.25.2
+ * Hotfix Moov parser
+
  ## 0.25.1
  * MOV: Fix error "negative length"
  * MOV: Fix reading dimensions in multi-track files
@@ -83,32 +83,59 @@ of software. Ideally, this file is going to be something you have produced yours
  and you are permitted to share under the MIT license provisions.

  When writing a parser, please try to ensure it returns a usable result as soon as possible,
- or no result as soon as possible (once you know the file is not fit for your specific parser).
+ or `nil` as soon as possible (once you know the file is not fit for your specific parser).
  Bear in mind that we enforce read budgets per-parser, so you will not be allowed to perform
  too many reads, or perform reads which are too large.

- In order to create new parsers, it is recommended to make a well-named class with an instance method `call`.
+ In order to create new parsers, make a well-named class with an instance method `call`,
+ and to register a single instance of that class as the parser - so that only one object needs to be stored
+ in memory when parsing multiple inputs. In that case your object must be **thread-safe and stateless** - this
+ is really important since FormatParser is thread-safe and multiple parsing procedures may be in progress
+ concurrently against the same parser object. You can also create a Proc if your parser is fairly trivial.

- `call` accepts the IO-ish object as an argument, parses data that it reads from it,
- and then returns the metadata for the file (if it could recover any) or `nil` if it couldn't. All files pass
- through all parsers by default, so if you are dealing with a file that is not "your" format - return `nil` from
- your method or `break` your Proc as early as possible. A blank `return` works fine too.
+ If it will be difficult to have your parser thread-safe you can register your class itself as
+ the parser and define the `self.call` method to parse using a fresh instance every time, allowing
+ object-level state:
+
+ ```ruby
+ class MyParser
+ def self.call(io)
+ new.call(io)
+ end
+
+ def call(io)
+ @state = ...
+ end
+ ```
+
+ `call` accepts a single argument - an IO-ish object which is guaranteed to respond to the same methods as the
+ ones defined in `IOConstraint` - that is, it is a strict subset of a standard Ruby IO object. *All reads from
+ this IO object are guaranteed to be returned in binary encoding.* The IO will be at offset of 0 when your parsing
+ proc receives it and there will be no concurrent calls to that object until your proc returns.

- The IO will at the minimum support the subset of the IO API defined in `IOConstraint`
+ Your parsing procedure may read from this IO object, and should return either a `Result`-like object with
+ the file metadata (if it could recover any) or `nil` if it couldn't. All files pass
+ through all parsers by default, so if you are dealing with a file that is not "your" format - return `nil` from
+ your method or `break` your Proc as early as possible. A blank `return` works fine too as it actually returns `nil`.

- Your parser has to be registered using `FormatParser.register_parser` with the information on the formats
- and file natures it provides.
+ Your parser then needs to be registered using `FormatParser.register_parser` with the information on the formats
+ and file natures it provides. This allows FormatParser to skip your parser if, say, the user only want to parse for
+ `:image` nature files but your parser parses `:audio`.

- Down below you can find the most basic parser implementation:
+ Down below you can find the most basic parser implementation which parses an imaginary `IMGA` file format:

  ```ruby
  MyParser = ->(io) {
- # ... do some parsing with `io`
+ # ... Read the magic bytes from the start of IO - the IO is
+ # guaranteed to be fed to you at offset 0, start-of-file.
  magic_bytes = io.read(4)
+
  # breaking the block returns `nil` to the caller signaling "no match"
  break if magic_bytes != 'IMGA'

+ # Our file format stores the width and height as 2 32-bit unsigned integers
  parsed_witdh, parsed_height = io.read(8).unpack('VV')
+
  # ...and return the FileInformation::Image object with the metadata.
  FormatParser::Image.new(
  format: :imga,
@@ -135,8 +162,8 @@ class MyParser
  # ... do some parsing with `io`
  # The instance will be discarded after parsing, so using instance variables
  # is permitted - they are not shared between calls to `call`
- @magic_bytes = io.read(4)
- break if @magic_bytes != 'IMGA'
+ magic_bytes = io.read(4)
+ break if magic_bytes != 'IMGA'
  parsed_witdh, parsed_height = io.read(8).unpack('VV')
  FormatParser::Image.new(
  format: :imga,
@@ -145,23 +172,33 @@ class MyParser
  )
  end

- FormatParser.register_parser self, natures: :image, formats: :bmp
+ # Note that we register an instance of the class, not the class. It is the
+ # instance that responds to `call()` and we can do this because our object
+ # is stateless.
+ FormatParser.register_parser new, natures: :image, formats: :bmp
  end
  ```

- ### Calling convention for preparing parsers
+ If your parser supports file types which have a known filename extension, you can add a method to it called `likely_match?`,
+ add this method on the object you register itself. For example, for the ZIP parser we use:
+
+ ```ruby
+ def likely_match?(filename)
+ filename =~ /\.(zip|docx|keynote|numbers|pptx|xlsx)$/i
+ end
+ ```

- A parser that gets registered using `register_parser` must be either:
+ If your parser matches the filename it is going to be applied *earlier*, saving time. Since most FormatParser users are
+ likely to only want the first result of the parsing, the sooner your parser gets applied - the sooner you can return the result,
+ avoiding unnecessary reads.

- 1) An object that can be `call()`-ed itself, with an argument that conforms to `IOConstraint`
- 2) An object that responds to `new` and returns something that can be `call()`-ed with with an argument that conforms to `IOConstraint`.
+ ### Calling convention for preparing parsers

- The second opton is recommended for most cases.
+ A parser that gets registered using `register_parser` must be an object that can be `call()`-ed, with an argument that conforms to `IOConstraint`

  FormatParser is made to be used in threaded environments, and if you use instance variables
- you need your parser to be isolated from it's siblings in other threads - therefore you can pass
- a Class on registration to have your parser instantiated for each `call()`, anew.
-
+ you need your parser to be isolated from it's siblings in other threads - create a copy for one-off use inside
+ your `call` method.

  ## Pull requests
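Reviewer note: the updated contributing text describes three shapes of parser that all satisfy the new "must respond to `call`" convention. The following is a minimal sketch of those shapes outside of FormatParser, assuming a plain `StringIO` in place of the constrained IO and a `:matched` symbol in place of a real result object. The names `LambdaParser`, `StatelessParser`, `StatefulParser` and `CALLABLES` are hypothetical and exist only for this illustration.

```ruby
require 'stringio'

# 1) A lambda: `break` leaves the lambda and yields nil ("no match").
LambdaParser = ->(io) {
  break if io.read(4) != 'IMGA'
  :matched
}

# 2) A stateless class: a single shared instance can be registered,
#    because `call` keeps everything in local variables.
class StatelessParser
  def call(io)
    return if io.read(4) != 'IMGA'
    :matched
  end
end

# 3) A class that needs per-parse state: the class itself is the callable
#    and delegates every call to a fresh instance.
class StatefulParser
  def self.call(io)
    new.call(io)
  end

  def call(io)
    @magic = io.read(4) # instance state lives only for this one parse
    return if @magic != 'IMGA'
    :matched
  end
end

# Each entry responds to `call`, which is all the new convention asks for.
CALLABLES = [LambdaParser, StatelessParser.new, StatefulParser].freeze
```

In this sketch the third entry is the class object, not an instance, which is the pattern the new text suggests when thread-safety of a shared instance is hard to guarantee.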
 
@@ -36,7 +36,7 @@ module FormatParser
  # Register a parser object to be used to perform file format detection. Each parser FormatParser
  # provides out of the box registers itself using this method.
  #
- # @param callable_or_responding_to_new[#call, #new] an object that either responds to #new or to #call
+ # @param callable_parser[#call] an object that responds to #call for parsing an IO
  # @param formats[Array<Symbol>] file formats that the parser provides
  # @param natures[Array<Symbol>] file natures that the parser provides
  # @param priority[Integer] whether the parser has to be applied first or later. Parsers that offer the safest
@@ -45,39 +45,41 @@ module FormatParser
  # with a lower priority value will be applied first, and if a single result is requested, will also return
  # first.
  # @return void
- def self.register_parser(callable_or_responding_to_new, formats:, natures:, priority: LEAST_PRIORITY)
+ def self.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY)
  parser_provided_formats = Array(formats)
  parser_provided_natures = Array(natures)
  PARSER_MUX.synchronize do
- @parsers ||= Set.new
- @parsers << callable_or_responding_to_new
+ # It can't be a Set because the method `parsers_for` depends on the order
+ # that the parsers were added.
+ @parsers ||= []
+ @parsers << callable_parser unless @parsers.include?(callable_parser)
  @parsers_per_nature ||= {}
  parser_provided_natures.each do |provided_nature|
  @parsers_per_nature[provided_nature] ||= Set.new
- @parsers_per_nature[provided_nature] << callable_or_responding_to_new
+ @parsers_per_nature[provided_nature] << callable_parser
  end
  @parsers_per_format ||= {}
  parser_provided_formats.each do |provided_format|
  @parsers_per_format[provided_format] ||= Set.new
- @parsers_per_format[provided_format] << callable_or_responding_to_new
+ @parsers_per_format[provided_format] << callable_parser
  end
  @parser_priorities ||= {}
- @parser_priorities[callable_or_responding_to_new] = priority
+ @parser_priorities[callable_parser] = priority
  end
  end

  # Deregister a parser object (makes FormatParser forget this parser existed). Is mostly used in
  # tests, but can also be used to forcibly disable some formats completely.
  #
- # @param callable_or_responding_to_new[#call, #new] an object that either responds to #new or to #call
+ # @param callable_parser[#==] an object that is identity-equal to any other registered parser
  # @return void
- def self.deregister_parser(callable_or_responding_to_new)
+ def self.deregister_parser(callable_parser)
  # Used only in tests
  PARSER_MUX.synchronize do
- (@parsers || []).delete(callable_or_responding_to_new)
- (@parsers_per_nature || {}).values.map { |e| e.delete(callable_or_responding_to_new) }
- (@parsers_per_format || {}).values.map { |e| e.delete(callable_or_responding_to_new) }
- (@parser_priorities || {}).delete(callable_or_responding_to_new)
+ (@parsers || []).delete(callable_parser)
+ (@parsers_per_nature || {}).values.map { |e| e.delete(callable_parser) }
+ (@parsers_per_format || {}).values.map { |e| e.delete(callable_parser) }
+ (@parser_priorities || {}).delete(callable_parser)
  end
  end

@@ -255,7 +257,19 @@ module FormatParser
  # Order the parsers according to their priority value. The ones having a lower
  # value will sort higher and will be applied sooner
  parsers_in_order_of_priority = parsers.to_a.sort do |parser_a, parser_b|
- @parser_priorities[parser_a] <=> @parser_priorities[parser_b]
+ if @parser_priorities[parser_a] != @parser_priorities[parser_b]
+ @parser_priorities[parser_a] <=> @parser_priorities[parser_b]
+ else
+ # Some parsers have the same priority and we want them to be always sorted
+ # in the same way, to not change the result of FormatParser.parse(results: :first).
+ # When this changes, it can generate flaky tests or event different
+ # results in different environments, which can be hard to understand why.
+ # There is also no guarantee in the order that the elements are added in
+ # @@parser_priorities
+ # So, to have always the same order, we sort by the order that the parsers
+ # were registered if the priorities are the same.
+ @parsers.index(parser_a) <=> @parsers.index(parser_b)
+ end
  end

  # If there is one parser that is more likely to match, place it first
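Reviewer note: the tie-break added in this hunk can be looked at in isolation. Below is a minimal sketch, assuming symbols stand in for parser objects. The method `stable_priority_order` and its arguments are hypothetical names chosen for this illustration and are not part of FormatParser.

```ruby
# Sort by priority first; when two entries share a priority, fall back to
# the position at which each one was registered. Ruby's `sort` is not
# guaranteed to be stable, so without the fallback the relative order of
# equal-priority entries may differ between runs or environments.
def stable_priority_order(parsers, registered, priorities)
  parsers.sort do |a, b|
    if priorities[a] != priorities[b]
      priorities[a] <=> priorities[b]
    else
      registered.index(a) <=> registered.index(b)
    end
  end
end

# Hypothetical data: registration order and priority per parser.
registered = [:gif, :jpeg, :png, :cr2]
priorities = { gif: 0, jpeg: 0, png: 1, cr2: 1 }

# The input order does not matter; the output is always the same.
p stable_priority_order([:cr2, :png, :jpeg, :gif], registered, priorities)
```

The same ordering could arguably be written as `sort_by { |x| [priorities[x], registered.index(x)] }`, since Ruby compares arrays element by element; the comparator form above simply mirrors the hunk.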
@@ -1,3 +1,3 @@
  module FormatParser
- VERSION = '0.25.1'
+ VERSION = '0.25.6'
  end
@@ -35,18 +35,23 @@ class FormatParser::DPXParser
  w = dpx_structure.fetch(:image).fetch(:pixels_per_line)
  h = dpx_structure.fetch(:image).fetch(:lines_per_element)

+ display_w = w
+ display_h = h
+
  pixel_aspect_w = dpx_structure.fetch(:orientation).fetch(:horizontal_pixel_aspect)
  pixel_aspect_h = dpx_structure.fetch(:orientation).fetch(:vertical_pixel_aspect)
- pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f

- image_aspect = w / h.to_f * pixel_aspect
+ # Find display height and width based on aspect only if the file structure has pixel aspects
+ if pixel_aspect_h != 0 && pixel_aspect_w != 0
+ pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f

- display_w = w
- display_h = h
- if image_aspect > 1
- display_h = (display_w / image_aspect).round
- else
- display_w = (display_h * image_aspect).round
+ image_aspect = w / h.to_f * pixel_aspect
+
+ if image_aspect > 1
+ display_h = (display_w / image_aspect).round
+ else
+ display_w = (display_h * image_aspect).round
+ end
  end

  FormatParser::Image.new(
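Reviewer note: the guarded computation in this hunk can be exercised as a standalone function. This is a minimal sketch, assuming plain integers in place of the values fetched from `dpx_structure`. The method name `display_dimensions` is hypothetical. The motivation stated in the comments is an inference from the changelog entry ("images without aspect ratio"), not something the diff spells out.

```ruby
# Returns [display_width, display_height] for an image of w x h pixels.
# When either pixel aspect component is zero, the stored dimensions are
# returned unchanged. Presumably this avoids the 0 / 0.0 => NaN case, where
# calling `round` on the result would raise FloatDomainError.
def display_dimensions(w, h, pixel_aspect_w, pixel_aspect_h)
  display_w = w
  display_h = h

  if pixel_aspect_h != 0 && pixel_aspect_w != 0
    pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
    image_aspect = w / h.to_f * pixel_aspect

    if image_aspect > 1
      # Landscape: keep the width, derive the height.
      display_h = (display_w / image_aspect).round
    else
      # Portrait or square: keep the height, derive the width.
      display_w = (display_h * image_aspect).round
    end
  end

  [display_w, display_h]
end

p display_dimensions(1920, 1080, 0, 0)  # no pixel aspect stored in the file
p display_dimensions(720, 576, 16, 15)  # non-square pixels
```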
@@ -68,7 +68,7 @@ class FormatParser::MOOVParser::Decoder
  def find_video_trak_atom(atoms)
  trak_atoms = find_atoms_by_path(atoms, ['moov', 'trak'])

- return [] if trak_atoms.empty?
+ return if trak_atoms.empty?

  trak_atoms.find do |trak_atom|
  hdlr_atom = find_first_atom_by_path([trak_atom], 'trak', 'mdia', 'hdlr')
@@ -29,6 +29,10 @@ class FormatParser::MP3Parser
  ZIP_LOCAL_ENTRY_SIGNATURE = "PK\x03\x04\x14\x00".b
  PNG_HEADER_BYTES = [137, 80, 78, 71, 13, 10, 26, 10].pack('C*')

+ MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
+ MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
+ TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE]
+
  # Wraps the Tag object returned by ID3Tag in such
  # a way that a usable JSON representation gets
  # returned
@@ -68,11 +72,16 @@ class FormatParser::MP3Parser
  return if header.start_with?(ZIP_LOCAL_ENTRY_SIGNATURE)
  return if header.start_with?(PNG_HEADER_BYTES)

+ io.seek(0)
+ return if TIFF_HEADER_BYTES.include?(safe_read(io, 4))
+
  # Read all the ID3 tags (or at least attempt to)
  io.seek(0)
  id3v1 = ID3Extraction.attempt_id3_v1_extraction(io)
  tags = [id3v1, ID3Extraction.attempt_id3_v2_extraction(io)].compact

+ io.seek(0) if tags.empty?
+
  # Compute how many bytes are occupied by the actual MPEG frames
  ignore_bytes_at_tail = id3v1 ? 128 : 0
  ignore_bytes_at_head = io.pos
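Reviewer note: the TIFF rejection added to the MP3 parser comes down to comparing the first four bytes against the two TIFF byte-order signatures. Below is a minimal sketch of that check, assuming a `StringIO` in place of the constrained IO and a plain `read` in place of `safe_read`. The method name `tiff_header?` is hypothetical. The constants reuse the byte values shown in the hunk.

```ruby
require 'stringio'

# "II*\0" marks a little-endian TIFF, "MM\0*" a big-endian one.
MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE].freeze

# Reads the first 4 bytes and rewinds, so a later parsing step
# still sees the IO at offset 0.
def tiff_header?(io)
  io.seek(0)
  head = io.read(4)
  io.seek(0)
  TIFF_HEADER_BYTES.include?(head)
end

p tiff_header?(StringIO.new("II*\x00 rest of file".b))
p tiff_header?(StringIO.new("ID3\x03 rest of file".b))
```

A parser that wants to bail out early would return `nil` when this check is true, which is what the added `return if ...` line in the hunk does.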
@@ -4,6 +4,7 @@ class FormatParser::TIFFParser

  MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
  MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
+ HEADER_BYTES = [MAGIC_LE, MAGIC_BE]

  def likely_match?(filename)
  filename =~ /\.tiff?$/i
@@ -12,7 +13,7 @@ class FormatParser::TIFFParser
  def call(io)
  io = FormatParser::IOConstraint.new(io)

- return unless [MAGIC_LE, MAGIC_BE].include?(safe_read(io, 4))
+ return unless HEADER_BYTES.include?(safe_read(io, 4))
  io.seek(io.pos + 2) # Skip over the offset of the IFD, EXIFR will re-read it anyway
  return if cr2?(io)

@@ -173,6 +173,26 @@ describe FormatParser do
  prioritized_parsers = FormatParser.parsers_for([:archive, :document, :image, :audio], [:tif, :jpg, :zip, :docx, :mp3, :aiff], 'a-file.zip')
  expect(prioritized_parsers.first).to be_kind_of(FormatParser::ZIPParser)
  end
+
+ it 'sorts the parsers by priority and name' do
+ parsers = FormatParser.parsers_for(
+ [:audio, :image],
+ [:cr2, :dpx, :fdx, :flac, :gif, :jpg, :mov, :mp4, :m4a, :mp3, :mpg, :mpeg, :ogg, :png, :tif, :wav]
+ )
+
+ expect(parsers.map { |parser| parser.class.name }).to eq([
+ 'FormatParser::GIFParser',
+ 'Class',
+ 'FormatParser::PNGParser',
+ 'FormatParser::CR2Parser',
+ 'FormatParser::DPXParser',
+ 'FormatParser::FLACParser',
+ 'FormatParser::MP3Parser',
+ 'FormatParser::OggParser',
+ 'FormatParser::TIFFParser',
+ 'FormatParser::WAVParser'
+ ])
+ end
  end

  describe '.register_parser and .deregister_parser' do
@@ -166,6 +166,18 @@ describe FormatParser::MP3Parser do
  expect(parsed.artist). to eq('wetransfer')
  end

+ it 'does not skip the first bytes if it is not a id3 tag header' do
+ fpath = fixtures_dir + '/MP3/no_id3_tags.mp3'
+
+ parsed = subject.call(File.open(fpath, 'rb'))
+
+ expect(parsed).not_to be_nil
+
+ expect(parsed.nature).to eq(:audio)
+ expect(parsed.format).to eq(:mp3)
+ expect(parsed.audio_sample_rate_hz).to eq(44100)
+ end
+
  describe '#as_json' do
  it 'converts all hash keys to string when stringify_keys: true' do
  fpath = fixtures_dir + '/MP3/Cassy.mp3'
@@ -193,4 +205,12 @@ describe FormatParser::MP3Parser do
  ).to eq([ID3Tag::Tag])
  end
  end
+
+ it 'does not recognize TIFF files as MP3' do
+ fpath = fixtures_dir + '/TIFF/test2.tif'
+
+ parsed = subject.call(File.open(fpath, 'rb'))
+
+ expect(parsed).to be_nil
+ end
  end
metadata CHANGED
@@ -1,15 +1,15 @@
  --- !ruby/object:Gem::Specification
  name: format_parser
  version: !ruby/object:Gem::Version
- version: 0.25.1
+ version: 0.25.6
  platform: ruby
  authors:
  - Noah Berman
  - Julik Tarkhanov
- autorequire:
+ autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-10-02 00:00:00.000000000 Z
+ date: 2020-12-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: ks
@@ -277,7 +277,7 @@ licenses:
  - MIT (Hippocratic)
  metadata:
  allowed_push_host: https://rubygems.org
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -292,8 +292,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.1.4
+ signing_key:
  specification_version: 4
  summary: A library for efficient parsing of file metadata
  test_files: []