format_parser 0.25.1 → 0.25.6
- checksums.yaml +4 -4
- data/.travis.yml +3 -2
- data/CHANGELOG.md +16 -0
- data/CONTRIBUTING.md +59 -22
- data/lib/format_parser.rb +28 -14
- data/lib/format_parser/version.rb +1 -1
- data/lib/parsers/dpx_parser.rb +13 -8
- data/lib/parsers/moov_parser/decoder.rb +1 -1
- data/lib/parsers/mp3_parser.rb +9 -0
- data/lib/parsers/tiff_parser.rb +2 -1
- data/spec/format_parser_spec.rb +20 -0
- data/spec/parsers/mp3_parser_spec.rb +20 -0
- metadata +6 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: dbc915cd212f2207b6f54cc4b567ea342a0bfd76af1e5e3db23330a6cc879e27
|
4
|
+
data.tar.gz: 0a623b8303cbf40a58d348b4e1adcda2d6fc0170b411fba529f9d3ae53533282
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 0ed65c67f508264c0c6c12e88287266ef2bb9346c7e44934ac2e8ac97001de8f5e7f605bdd497ee6f95a45bcf4ad7f6b69021d25739e85debafaecfca76153ce
|
7
|
+
data.tar.gz: 4b250cd8f41bbba735e5d90a4a829c4fea699f04dd190a37499853ed48475aecd19f6d32360f081474355a4f5f9c562e1ca78e8c7c8aba91e5fb9e0ca4930299
|
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,19 @@
|
|
1
|
+
## 0.25.6
|
2
|
+
* Fix FormatParser.parse (with `results: :first`) to be deterministic
|
3
|
+
|
4
|
+
## 0.25.5
|
5
|
+
* DPX: Fix DPXParser to support images without aspect ratio
|
6
|
+
|
7
|
+
## 0.25.4
|
8
|
+
* MP3: Fix MP3Parser to return nil for TIFF files
|
9
|
+
* Add support for Ruby 2.7
|
10
|
+
|
11
|
+
## 0.25.3
|
12
|
+
* MP3: Fix parser to not skip the first bytes if it's not an ID3 header
|
13
|
+
|
14
|
+
## 0.25.2
|
15
|
+
* Hotfix Moov parser
|
16
|
+
|
1
17
|
## 0.25.1
|
2
18
|
* MOV: Fix error "negative length"
|
3
19
|
* MOV: Fix reading dimensions in multi-track files
|
data/CONTRIBUTING.md
CHANGED
@@ -83,32 +83,59 @@ of software. Ideally, this file is going to be something you have produced yours
|
|
83
83
|
and you are permitted to share under the MIT license provisions.
|
84
84
|
|
85
85
|
When writing a parser, please try to ensure it returns a usable result as soon as possible,
|
86
|
-
or
|
86
|
+
or `nil` as soon as possible (once you know the file is not fit for your specific parser).
|
87
87
|
Bear in mind that we enforce read budgets per-parser, so you will not be allowed to perform
|
88
88
|
too many reads, or perform reads which are too large.
|
89
89
|
|
90
|
-
In order to create new parsers,
|
90
|
+
In order to create new parsers, make a well-named class with an instance method `call`,
|
91
|
+
and to register a single instance of that class as the parser - so that only one object needs to be stored
|
92
|
+
in memory when parsing multiple inputs. In that case your object must be **thread-safe and stateless** - this
|
93
|
+
is really important since FormatParser is thread-safe and multiple parsing procedures may be in progress
|
94
|
+
concurrently against the same parser object. You can also create a Proc if your parser is fairly trivial.
|
91
95
|
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
+
If it will be difficult to have your parser thread-safe you can register your class itself as
|
97
|
+
the parser and define the `self.call` method to parse using a fresh instance every time, allowing
|
98
|
+
object-level state:
|
99
|
+
|
100
|
+
```ruby
|
101
|
+
class MyParser
|
102
|
+
def self.call(io)
|
103
|
+
new.call(io)
|
104
|
+
end
|
105
|
+
|
106
|
+
def call(io)
|
107
|
+
@state = ...
|
108
|
+
end
|
109
|
+
```
|
110
|
+
|
111
|
+
`call` accepts a single argument - an IO-ish object which is guaranteed to respond to the same methods as the
|
112
|
+
ones defined in `IOConstraint` - that is, it is a strict subset of a standard Ruby IO object. *All reads from
|
113
|
+
this IO object are guaranteed to be returned in binary encoding.* The IO will be at offset of 0 when your parsing
|
114
|
+
proc receives it and there will be no concurrent calls to that object until your proc returns.
|
96
115
|
|
97
|
-
|
116
|
+
Your parsing procedure may read from this IO object, and should return either a `Result`-like object with
|
117
|
+
the file metadata (if it could recover any) or `nil` if it couldn't. All files pass
|
118
|
+
through all parsers by default, so if you are dealing with a file that is not "your" format - return `nil` from
|
119
|
+
your method or `break` your Proc as early as possible. A blank `return` works fine too as it actually returns `nil`.
|
98
120
|
|
99
|
-
Your parser
|
100
|
-
and file natures it provides.
|
121
|
+
Your parser then needs to be registered using `FormatParser.register_parser` with the information on the formats
|
122
|
+
and file natures it provides. This allows FormatParser to skip your parser if, say, the user only wants to parse for
|
123
|
+
`:image` nature files but your parser parses `:audio`.
|
101
124
|
|
102
|
-
Down below you can find the most basic parser implementation:
|
125
|
+
Down below you can find the most basic parser implementation which parses an imaginary `IMGA` file format:
|
103
126
|
|
104
127
|
```ruby
|
105
128
|
MyParser = ->(io) {
|
106
|
-
# ...
|
129
|
+
# ... Read the magic bytes from the start of IO - the IO is
|
130
|
+
# guaranteed to be fed to you at offset 0, start-of-file.
|
107
131
|
magic_bytes = io.read(4)
|
132
|
+
|
108
133
|
# breaking the block returns `nil` to the caller signaling "no match"
|
109
134
|
break if magic_bytes != 'IMGA'
|
110
135
|
|
136
|
+
# Our file format stores the width and height as 2 32-bit unsigned integers
|
111
137
|
parsed_witdh, parsed_height = io.read(8).unpack('VV')
|
138
|
+
|
112
139
|
# ...and return the FileInformation::Image object with the metadata.
|
113
140
|
FormatParser::Image.new(
|
114
141
|
format: :imga,
|
@@ -135,8 +162,8 @@ class MyParser
|
|
135
162
|
# ... do some parsing with `io`
|
136
163
|
# The instance will be discarded after parsing, so using instance variables
|
137
164
|
# is permitted - they are not shared between calls to `call`
|
138
|
-
|
139
|
-
break if
|
165
|
+
magic_bytes = io.read(4)
|
166
|
+
break if magic_bytes != 'IMGA'
|
140
167
|
parsed_witdh, parsed_height = io.read(8).unpack('VV')
|
141
168
|
FormatParser::Image.new(
|
142
169
|
format: :imga,
|
@@ -145,23 +172,33 @@ class MyParser
|
|
145
172
|
)
|
146
173
|
end
|
147
174
|
|
148
|
-
|
175
|
+
# Note that we register an instance of the class, not the class. It is the
|
176
|
+
# instance that responds to `call()` and we can do this because our object
|
177
|
+
# is stateless.
|
178
|
+
FormatParser.register_parser new, natures: :image, formats: :bmp
|
149
179
|
end
|
150
180
|
```
|
151
181
|
|
152
|
-
|
182
|
+
If your parser supports file types which have a known filename extension, you can add a method called `likely_match?` to it.
|
183
|
+
Define this method on the object you register. For example, for the ZIP parser we use:
|
184
|
+
|
185
|
+
```ruby
|
186
|
+
def likely_match?(filename)
|
187
|
+
filename =~ /\.(zip|docx|keynote|numbers|pptx|xlsx)$/i
|
188
|
+
end
|
189
|
+
```
|
153
190
|
|
154
|
-
|
191
|
+
If your parser matches the filename it is going to be applied *earlier*, saving time. Since most FormatParser users are
|
192
|
+
likely to only want the first result of the parsing, the sooner your parser gets applied - the sooner you can return the result,
|
193
|
+
avoiding unnecessary reads.
|
155
194
|
|
156
|
-
|
157
|
-
2) An object that responds to `new` and returns something that can be `call()`-ed with with an argument that conforms to `IOConstraint`.
|
195
|
+
### Calling convention for preparing parsers
|
158
196
|
|
159
|
-
|
197
|
+
A parser that gets registered using `register_parser` must be an object that can be `call()`-ed, with an argument that conforms to `IOConstraint`
|
160
198
|
|
161
199
|
FormatParser is made to be used in threaded environments, and if you use instance variables
|
162
|
-
you need your parser to be isolated from it's siblings in other threads -
|
163
|
-
|
164
|
-
|
200
|
+
you need your parser to be isolated from its siblings in other threads - create a copy for one-off use inside
|
201
|
+
your `call` method.
|
165
202
|
|
166
203
|
## Pull requests
|
167
204
|
|
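The calling convention described in the CONTRIBUTING.md changes above can be sketched without loading the gem. This is only an illustration under stated assumptions: `ImgaParser` is a hypothetical class, and a plain Hash stands in for `FormatParser::Image`. Only `call`, `likely_match?` and the imaginary `IMGA` layout (4 magic bytes followed by two 32-bit unsigned integers) come from the text.

```ruby
require 'stringio'

# Hypothetical stateless parser for the imaginary IMGA format.
# It keeps no instance variables, so one instance could be shared between threads.
class ImgaParser
  MAGIC = 'IMGA'.b

  # Lets the caller try this parser earlier when the filename looks right.
  def likely_match?(filename)
    filename =~ /\.imga$/i
  end

  # Returns a Hash of metadata (a stand-in for FormatParser::Image),
  # or nil to signal "no match".
  def call(io)
    magic_bytes = io.read(4)
    return unless magic_bytes == MAGIC

    dimensions = io.read(8)
    return if dimensions.nil? || dimensions.bytesize < 8

    # Width and height are two 32-bit little-endian unsigned integers
    width, height = dimensions.unpack('VV')
    { format: :imga, width_px: width, height_px: height }
  end
end

parser = ImgaParser.new
p parser.call(StringIO.new('IMGA'.b + [640, 480].pack('VV')))
p parser.call(StringIO.new('RIFFxxxxxxxx'.b))
```

With the real gem, an instance like `parser` is what would be handed to `FormatParser.register_parser` together with `natures:` and `formats:`, as in the hunk above.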
data/lib/format_parser.rb
CHANGED
@@ -36,7 +36,7 @@ module FormatParser
|
|
36
36
|
# Register a parser object to be used to perform file format detection. Each parser FormatParser
|
37
37
|
# provides out of the box registers itself using this method.
|
38
38
|
#
|
39
|
-
# @param
|
39
|
+
# @param callable_parser[#call] an object that responds to #call for parsing an IO
|
40
40
|
# @param formats[Array<Symbol>] file formats that the parser provides
|
41
41
|
# @param natures[Array<Symbol>] file natures that the parser provides
|
42
42
|
# @param priority[Integer] whether the parser has to be applied first or later. Parsers that offer the safest
|
@@ -45,39 +45,41 @@ module FormatParser
|
|
45
45
|
# with a lower priority value will be applied first, and if a single result is requested, will also return
|
46
46
|
# first.
|
47
47
|
# @return void
|
48
|
-
def self.register_parser(
|
48
|
+
def self.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY)
|
49
49
|
parser_provided_formats = Array(formats)
|
50
50
|
parser_provided_natures = Array(natures)
|
51
51
|
PARSER_MUX.synchronize do
|
52
|
-
|
53
|
-
|
52
|
+
# It can't be a Set because the method `parsers_for` depends on the order
|
53
|
+
# that the parsers were added.
|
54
|
+
@parsers ||= []
|
55
|
+
@parsers << callable_parser unless @parsers.include?(callable_parser)
|
54
56
|
@parsers_per_nature ||= {}
|
55
57
|
parser_provided_natures.each do |provided_nature|
|
56
58
|
@parsers_per_nature[provided_nature] ||= Set.new
|
57
|
-
@parsers_per_nature[provided_nature] <<
|
59
|
+
@parsers_per_nature[provided_nature] << callable_parser
|
58
60
|
end
|
59
61
|
@parsers_per_format ||= {}
|
60
62
|
parser_provided_formats.each do |provided_format|
|
61
63
|
@parsers_per_format[provided_format] ||= Set.new
|
62
|
-
@parsers_per_format[provided_format] <<
|
64
|
+
@parsers_per_format[provided_format] << callable_parser
|
63
65
|
end
|
64
66
|
@parser_priorities ||= {}
|
65
|
-
@parser_priorities[
|
67
|
+
@parser_priorities[callable_parser] = priority
|
66
68
|
end
|
67
69
|
end
|
68
70
|
|
69
71
|
# Deregister a parser object (makes FormatParser forget this parser existed). Is mostly used in
|
70
72
|
# tests, but can also be used to forcibly disable some formats completely.
|
71
73
|
#
|
72
|
-
# @param
|
74
|
+
# @param callable_parser[#==] an object that is identity-equal to any other registered parser
|
73
75
|
# @return void
|
74
|
-
def self.deregister_parser(
|
76
|
+
def self.deregister_parser(callable_parser)
|
75
77
|
# Used only in tests
|
76
78
|
PARSER_MUX.synchronize do
|
77
|
-
(@parsers || []).delete(
|
78
|
-
(@parsers_per_nature || {}).values.map { |e| e.delete(
|
79
|
-
(@parsers_per_format || {}).values.map { |e| e.delete(
|
80
|
-
(@parser_priorities || {}).delete(
|
79
|
+
(@parsers || []).delete(callable_parser)
|
80
|
+
(@parsers_per_nature || {}).values.map { |e| e.delete(callable_parser) }
|
81
|
+
(@parsers_per_format || {}).values.map { |e| e.delete(callable_parser) }
|
82
|
+
(@parser_priorities || {}).delete(callable_parser)
|
81
83
|
end
|
82
84
|
end
|
83
85
|
|
@@ -255,7 +257,19 @@ module FormatParser
|
|
255
257
|
# Order the parsers according to their priority value. The ones having a lower
|
256
258
|
# value will sort higher and will be applied sooner
|
257
259
|
parsers_in_order_of_priority = parsers.to_a.sort do |parser_a, parser_b|
|
258
|
-
@parser_priorities[parser_a]
|
260
|
+
if @parser_priorities[parser_a] != @parser_priorities[parser_b]
|
261
|
+
@parser_priorities[parser_a] <=> @parser_priorities[parser_b]
|
262
|
+
else
|
263
|
+
# Some parsers have the same priority and we want them to be always sorted
|
264
|
+
# in the same way, to not change the result of FormatParser.parse(results: :first).
|
265
|
+
# When this changes, it can cause flaky tests or even different
|
266
|
+
# results in different environments, which can be hard to debug.
|
267
|
+
# There is also no guarantee in the order that the elements are added in
|
268
|
+
# @parser_priorities
|
269
|
+
# So, to have always the same order, we sort by the order that the parsers
|
270
|
+
# were registered if the priorities are the same.
|
271
|
+
@parsers.index(parser_a) <=> @parsers.index(parser_b)
|
272
|
+
end
|
259
273
|
end
|
260
274
|
|
261
275
|
# If there is one parser that is more likely to match, place it first
|
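The tie-breaking comparison in the hunk above can be tried in isolation. The sketch below uses hypothetical stand-ins (`registered`, `priorities` and `sort_deterministically` are not part of the gem): strings play the role of parser objects, and the registration order is an ordinary Array, as `@parsers` is after this change.

```ruby
# Hypothetical stand-ins: parser names in registration order, and their priorities.
registered = %w[gif jpeg png cr2 dpx]
priorities = { 'gif' => 0, 'jpeg' => 0, 'png' => 1, 'cr2' => 2, 'dpx' => 2 }

# Sort by priority first; break ties by position in the registration order,
# so the result no longer depends on the order of the input collection.
def sort_deterministically(parsers, priorities, registration_order)
  parsers.sort do |parser_a, parser_b|
    if priorities[parser_a] != priorities[parser_b]
      priorities[parser_a] <=> priorities[parser_b]
    else
      registration_order.index(parser_a) <=> registration_order.index(parser_b)
    end
  end
end

# Whatever order the input arrives in, the output is the same.
p sort_deterministically(registered.shuffle, priorities, registered)
```

Because priority plus registration index forms a total order, every run prints `["gif", "jpeg", "png", "cr2", "dpx"]`. An equivalent and arguably more compact form would be `sort_by { |parser| [priorities[parser], registration_order.index(parser)] }`.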
data/lib/parsers/dpx_parser.rb
CHANGED
@@ -35,18 +35,23 @@ class FormatParser::DPXParser
|
|
35
35
|
w = dpx_structure.fetch(:image).fetch(:pixels_per_line)
|
36
36
|
h = dpx_structure.fetch(:image).fetch(:lines_per_element)
|
37
37
|
|
38
|
+
display_w = w
|
39
|
+
display_h = h
|
40
|
+
|
38
41
|
pixel_aspect_w = dpx_structure.fetch(:orientation).fetch(:horizontal_pixel_aspect)
|
39
42
|
pixel_aspect_h = dpx_structure.fetch(:orientation).fetch(:vertical_pixel_aspect)
|
40
|
-
pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
|
41
43
|
|
42
|
-
|
44
|
+
# Find display height and width based on aspect only if the file structure has pixel aspects
|
45
|
+
if pixel_aspect_h != 0 && pixel_aspect_w != 0
|
46
|
+
pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
|
43
47
|
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
49
|
-
|
48
|
+
image_aspect = w / h.to_f * pixel_aspect
|
49
|
+
|
50
|
+
if image_aspect > 1
|
51
|
+
display_h = (display_w / image_aspect).round
|
52
|
+
else
|
53
|
+
display_w = (display_h * image_aspect).round
|
54
|
+
end
|
50
55
|
end
|
51
56
|
|
52
57
|
FormatParser::Image.new(
|
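The display-size arithmetic from the DPX hunk above can be checked on its own. This is a sketch, with `display_dimensions` as a hypothetical helper name; the body follows the added lines. The guard matters because dividing by a zero aspect value in Ruby float arithmetic yields `NaN` or `Infinity`, and calling `round` on either raises `FloatDomainError`.

```ruby
# Hypothetical helper mirroring the arithmetic in the hunk above.
def display_dimensions(w, h, pixel_aspect_w, pixel_aspect_h)
  display_w = w
  display_h = h

  # Only adjust when the file actually carries both pixel aspect values
  if pixel_aspect_h != 0 && pixel_aspect_w != 0
    pixel_aspect = pixel_aspect_w / pixel_aspect_h.to_f
    image_aspect = w / h.to_f * pixel_aspect

    if image_aspect > 1
      display_h = (display_w / image_aspect).round
    else
      display_w = (display_h * image_aspect).round
    end
  end

  [display_w, display_h]
end

p display_dimensions(1920, 1080, 0, 0)  # no aspect stored: stored size is kept
p display_dimensions(1920, 1080, 1, 1)  # square pixels
p display_dimensions(720, 576, 16, 15)  # non-square pixels
```

The three sample inputs are made up for illustration and are not taken from the gem's fixtures.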
data/lib/parsers/moov_parser/decoder.rb
CHANGED
@@ -68,7 +68,7 @@ class FormatParser::MOOVParser::Decoder
|
|
68
68
|
def find_video_trak_atom(atoms)
|
69
69
|
trak_atoms = find_atoms_by_path(atoms, ['moov', 'trak'])
|
70
70
|
|
71
|
-
return
|
71
|
+
return if trak_atoms.empty?
|
72
72
|
|
73
73
|
trak_atoms.find do |trak_atom|
|
74
74
|
hdlr_atom = find_first_atom_by_path([trak_atom], 'trak', 'mdia', 'hdlr')
|
data/lib/parsers/mp3_parser.rb
CHANGED
@@ -29,6 +29,10 @@ class FormatParser::MP3Parser
|
|
29
29
|
ZIP_LOCAL_ENTRY_SIGNATURE = "PK\x03\x04\x14\x00".b
|
30
30
|
PNG_HEADER_BYTES = [137, 80, 78, 71, 13, 10, 26, 10].pack('C*')
|
31
31
|
|
32
|
+
MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
|
33
|
+
MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
|
34
|
+
TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE]
|
35
|
+
|
32
36
|
# Wraps the Tag object returned by ID3Tag in such
|
33
37
|
# a way that a usable JSON representation gets
|
34
38
|
# returned
|
@@ -68,11 +72,16 @@ class FormatParser::MP3Parser
|
|
68
72
|
return if header.start_with?(ZIP_LOCAL_ENTRY_SIGNATURE)
|
69
73
|
return if header.start_with?(PNG_HEADER_BYTES)
|
70
74
|
|
75
|
+
io.seek(0)
|
76
|
+
return if TIFF_HEADER_BYTES.include?(safe_read(io, 4))
|
77
|
+
|
71
78
|
# Read all the ID3 tags (or at least attempt to)
|
72
79
|
io.seek(0)
|
73
80
|
id3v1 = ID3Extraction.attempt_id3_v1_extraction(io)
|
74
81
|
tags = [id3v1, ID3Extraction.attempt_id3_v2_extraction(io)].compact
|
75
82
|
|
83
|
+
io.seek(0) if tags.empty?
|
84
|
+
|
76
85
|
# Compute how many bytes are occupied by the actual MPEG frames
|
77
86
|
ignore_bytes_at_tail = id3v1 ? 128 : 0
|
78
87
|
ignore_bytes_at_head = io.pos
|
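The TIFF early-exit added to the MP3 parser above amounts to comparing the first four bytes against the two TIFF byte-order signatures. A minimal sketch follows, assuming a hypothetical `tiff_header?` helper and using plain `io.read` in place of the gem's `safe_read`:

```ruby
require 'stringio'

MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4') # "II*\0", little-endian TIFF
MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4') # "MM\0*", big-endian TIFF
TIFF_HEADER_BYTES = [MAGIC_LE, MAGIC_BE].freeze

# Hypothetical helper: true when the IO starts with either TIFF signature.
def tiff_header?(io)
  io.seek(0)
  TIFF_HEADER_BYTES.include?(io.read(4))
end

p tiff_header?(StringIO.new("II*\x00rest-of-file".b))
p tiff_header?(StringIO.new("MM\x00*rest-of-file".b))
p tiff_header?(StringIO.new("ID3\x03\x00".b))
```

Both the packed constants and the bytes read from the IO are binary strings, so `include?` compares them byte for byte. An empty input makes `io.read(4)` return `nil`, which simply does not match.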
data/lib/parsers/tiff_parser.rb
CHANGED
@@ -4,6 +4,7 @@ class FormatParser::TIFFParser
|
|
4
4
|
|
5
5
|
MAGIC_LE = [0x49, 0x49, 0x2A, 0x0].pack('C4')
|
6
6
|
MAGIC_BE = [0x4D, 0x4D, 0x0, 0x2A].pack('C4')
|
7
|
+
HEADER_BYTES = [MAGIC_LE, MAGIC_BE]
|
7
8
|
|
8
9
|
def likely_match?(filename)
|
9
10
|
filename =~ /\.tiff?$/i
|
@@ -12,7 +13,7 @@ class FormatParser::TIFFParser
|
|
12
13
|
def call(io)
|
13
14
|
io = FormatParser::IOConstraint.new(io)
|
14
15
|
|
15
|
-
return unless
|
16
|
+
return unless HEADER_BYTES.include?(safe_read(io, 4))
|
16
17
|
io.seek(io.pos + 2) # Skip over the offset of the IFD, EXIFR will re-read it anyway
|
17
18
|
return if cr2?(io)
|
18
19
|
|
data/spec/format_parser_spec.rb
CHANGED
@@ -173,6 +173,26 @@ describe FormatParser do
|
|
173
173
|
prioritized_parsers = FormatParser.parsers_for([:archive, :document, :image, :audio], [:tif, :jpg, :zip, :docx, :mp3, :aiff], 'a-file.zip')
|
174
174
|
expect(prioritized_parsers.first).to be_kind_of(FormatParser::ZIPParser)
|
175
175
|
end
|
176
|
+
|
177
|
+
it 'sorts the parsers by priority and name' do
|
178
|
+
parsers = FormatParser.parsers_for(
|
179
|
+
[:audio, :image],
|
180
|
+
[:cr2, :dpx, :fdx, :flac, :gif, :jpg, :mov, :mp4, :m4a, :mp3, :mpg, :mpeg, :ogg, :png, :tif, :wav]
|
181
|
+
)
|
182
|
+
|
183
|
+
expect(parsers.map { |parser| parser.class.name }).to eq([
|
184
|
+
'FormatParser::GIFParser',
|
185
|
+
'Class',
|
186
|
+
'FormatParser::PNGParser',
|
187
|
+
'FormatParser::CR2Parser',
|
188
|
+
'FormatParser::DPXParser',
|
189
|
+
'FormatParser::FLACParser',
|
190
|
+
'FormatParser::MP3Parser',
|
191
|
+
'FormatParser::OggParser',
|
192
|
+
'FormatParser::TIFFParser',
|
193
|
+
'FormatParser::WAVParser'
|
194
|
+
])
|
195
|
+
end
|
176
196
|
end
|
177
197
|
|
178
198
|
describe '.register_parser and .deregister_parser' do
|
data/spec/parsers/mp3_parser_spec.rb
CHANGED
@@ -166,6 +166,18 @@ describe FormatParser::MP3Parser do
|
|
166
166
|
expect(parsed.artist). to eq('wetransfer')
|
167
167
|
end
|
168
168
|
|
169
|
+
it 'does not skip the first bytes if it is not a id3 tag header' do
|
170
|
+
fpath = fixtures_dir + '/MP3/no_id3_tags.mp3'
|
171
|
+
|
172
|
+
parsed = subject.call(File.open(fpath, 'rb'))
|
173
|
+
|
174
|
+
expect(parsed).not_to be_nil
|
175
|
+
|
176
|
+
expect(parsed.nature).to eq(:audio)
|
177
|
+
expect(parsed.format).to eq(:mp3)
|
178
|
+
expect(parsed.audio_sample_rate_hz).to eq(44100)
|
179
|
+
end
|
180
|
+
|
169
181
|
describe '#as_json' do
|
170
182
|
it 'converts all hash keys to string when stringify_keys: true' do
|
171
183
|
fpath = fixtures_dir + '/MP3/Cassy.mp3'
|
@@ -193,4 +205,12 @@ describe FormatParser::MP3Parser do
|
|
193
205
|
).to eq([ID3Tag::Tag])
|
194
206
|
end
|
195
207
|
end
|
208
|
+
|
209
|
+
it 'does not recognize TIFF files as MP3' do
|
210
|
+
fpath = fixtures_dir + '/TIFF/test2.tif'
|
211
|
+
|
212
|
+
parsed = subject.call(File.open(fpath, 'rb'))
|
213
|
+
|
214
|
+
expect(parsed).to be_nil
|
215
|
+
end
|
196
216
|
end
|
metadata
CHANGED
@@ -1,15 +1,15 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: format_parser
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.25.
|
4
|
+
version: 0.25.6
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Noah Berman
|
8
8
|
- Julik Tarkhanov
|
9
|
-
autorequire:
|
9
|
+
autorequire:
|
10
10
|
bindir: exe
|
11
11
|
cert_chain: []
|
12
|
-
date: 2020-
|
12
|
+
date: 2020-12-17 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: ks
|
@@ -277,7 +277,7 @@ licenses:
|
|
277
277
|
- MIT (Hippocratic)
|
278
278
|
metadata:
|
279
279
|
allowed_push_host: https://rubygems.org
|
280
|
-
post_install_message:
|
280
|
+
post_install_message:
|
281
281
|
rdoc_options: []
|
282
282
|
require_paths:
|
283
283
|
- lib
|
@@ -292,8 +292,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
292
292
|
- !ruby/object:Gem::Version
|
293
293
|
version: '0'
|
294
294
|
requirements: []
|
295
|
-
rubygems_version: 3.1.
|
296
|
-
signing_key:
|
295
|
+
rubygems_version: 3.1.4
|
296
|
+
signing_key:
|
297
297
|
specification_version: 4
|
298
298
|
summary: A library for efficient parsing of file metadata
|
299
299
|
test_files: []
|