format_parser 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ba98fc34f902ba77ee29c4e96c69ef5e583c86ac
4
+ data.tar.gz: 757999ef4e7d25fe268637f7b26843e41da7e342
5
+ SHA512:
6
+ metadata.gz: 12dd8fe78b7b58d24e8e15098c1c825ceb0877f33b393b8757178a0d15ca569098c0830e157f069a45b649baef837d774020eac047792d9c27cbea8cc56a731b
7
+ data.tar.gz: 130256991758f0a7c6fd399bfc8ab50e6418dd7a8bdbe73580da4e47ceb1ca3ac337a020d9b93f423c656258725704dc56a9f02da640a16cc0831ea72811353f
data/.gitignore ADDED
@@ -0,0 +1,13 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.gem
11
+
12
+ # rspec failure tracking
13
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format progress
2
+ --order rand
3
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,11 @@
1
+ rvm:
2
+ - 2.3.0
3
+ - 2.4.2
4
+ - jruby-9.0
5
+ sudo: false
6
+ cache: bundler
7
+ matrix:
8
+ allow_failures:
9
+ - rvm: jruby-9.0
10
+ script:
11
+ - bundle exec rspec
@@ -0,0 +1,46 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
6
+
7
+ ## Our Standards
8
+
9
+ Examples of behavior that contributes to creating a positive environment include:
10
+
11
+ * Using welcoming and inclusive language
12
+ * Being respectful of differing viewpoints and experiences
13
+ * Gracefully accepting constructive criticism
14
+ * Focusing on what is best for the community
15
+ * Showing empathy towards other community members
16
+
17
+ Examples of unacceptable behavior by participants include:
18
+
19
+ * The use of sexualized language or imagery and unwelcome sexual attention or advances
20
+ * Trolling, insulting/derogatory comments, and personal or political attacks
21
+ * Public or private harassment
22
+ * Publishing others' private information, such as a physical or electronic address, without explicit permission
23
+ * Other conduct which could reasonably be considered inappropriate in a professional setting
24
+
25
+ ## Our Responsibilities
26
+
27
+ Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
28
+
29
+ Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
30
+
31
+ ## Scope
32
+
33
+ This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
34
+
35
+ ## Enforcement
36
+
37
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at julik@wetransfer.com and/or noah@wetransfer.com. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
38
+
39
+ Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
40
+
41
+ ## Attribution
42
+
43
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]
44
+
45
+ [homepage]: http://contributor-covenant.org
46
+ [version]: http://contributor-covenant.org/version/1/4/
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,157 @@
1
+ # Contributing to format_parser
2
+
3
+ Please take a moment to review this document in order to make the contribution
4
+ process easy and effective for everyone involved.
5
+
6
+ Following these guidelines helps to communicate that you respect the time of
7
+ the developers managing and developing this open source project. In return,
8
+ they should reciprocate that respect in addressing your issue or assessing
9
+ patches and features.
10
+
11
+ ## What do I need to know to help?
12
+
13
+ If you are already familiar with the [Ruby Programming Language](https://www.ruby-lang.org/) you can start contributing code right away, otherwise look for issues labeled with *documentation* or *good first issue* to get started.
14
+
15
+ If you are interested in contributing code and would like to learn more about the technologies that we use, check out the (non-exhaustive) list below. You can also get in touch with us via an issue or email to julik@wetransfer.com and/or noah@wetransfer.com to get additional information.
16
+
17
+ - [ruby](https://ruby-doc.org)
18
+ - [rspec](http://rspec.info/) (for testing)
19
+
20
+ # How do I make a contribution?
21
+
22
+ ## Using the issue tracker
23
+
24
+ The issue tracker is the preferred channel for [bug reports](#bug-reports),
25
+ [feature requests](#feature-requests) and [submitting pull
26
+ requests](#pull-requests), but please respect the following restrictions:
27
+
28
+ * Please **do not** derail or troll issues. Keep the discussion on topic and respect the opinions of others. Adhere to the principles set out in the [Code of Conduct](https://github.com/WeTransfer/format_parser/blob/master/CODE_OF_CONDUCT.md).
29
+
30
+ ## Bug reports
31
+
32
+ A bug is a _demonstrable problem_ that is caused by code in the repository.
33
+
34
+ Good bug reports are extremely helpful. Thank you!
35
+
36
+ Guidelines for bug reports:
37
+
38
+ 1. **Use the GitHub issue search** – check if the issue has already been
39
+ reported.
40
+
41
+ 2. **Check if the issue has been fixed** – try to reproduce it using the
42
+ latest `master` branch in the repository.
43
+
44
+ 3. **Isolate the problem** – create a [reduced test
45
+ case](http://css-tricks.com/reduced-test-cases/) and a live example.
46
+
47
+ A good bug report shouldn't leave others needing to chase you up for more
48
+ information. Please try to be as detailed as possible in your report. What is
49
+ your environment? What steps will reproduce the issue? What tool(s) or OS will
50
+ experience the problem? What would you expect to be the outcome? All these
51
+ details will help people to fix any potential bugs.
52
+
53
+ Example:
54
+
55
+ > Short and descriptive example bug report title
56
+ >
57
+ > A summary of the issue and the OS environment in which it occurs. If
58
+ > suitable, include the steps required to reproduce the bug.
59
+ >
60
+ > 1. This is the first step
61
+ > 2. This is the second step
62
+ > 3. Further steps, etc.
63
+ >
64
+ > `<url>` - a link to the reduced test case, if possible. Feel free to use a [Gist](https://gist.github.com).
65
+ >
66
+ > Any other information you want to share that is relevant to the issue being
67
+ > reported. This might include the lines of code that you have identified as
68
+ > causing the bug, and potential solutions (and your opinions on their
69
+ > merits).
70
+
71
+ ## Feature requests
72
+
73
+ Feature requests are welcome. But take a moment to find out whether your idea
74
+ fits with the scope and aims of the project. It's up to *you* to make a strong
75
+ case to convince the project's developers of the merits of this feature. Please
76
+ provide as much detail and context as possible.
77
+
78
+ ## So, you want to contribute a new parser
79
+
80
+ That's awesome! Please do take care to add example files that fit your parser use case.
81
+ Make sure that the file you are adding is licensed for use within an MIT-licensed piece
82
+ of software. Ideally, this file is going to be something you have produced yourself
83
+ and you are permitted to share under the MIT license provisions.
84
+
85
+ When writing a parser, please try to ensure it returns a usable result as soon as possible,
86
+ or no result as soon as possible (once you know the file is not fit for your specific parser).
87
+ Bear in mind that we enforce read budgets per-parser, so you will not be allowed to perform
88
+ too many reads, or perform reads which are too large.
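
As a rough illustration of the "usable result or no result, as soon as possible" guideline, here is a minimal sketch of a parser that needs exactly one small read. The class name `PngHeaderSketch`, its `call(io)` signature and the returned hash are assumptions made for this example; they are not the gem's actual parser interface. The only thing taken from the PNG format itself is the layout of the first 24 bytes (8-byte signature, then the `IHDR` chunk with big-endian width and height).

```ruby
require 'stringio'

# Hypothetical parser sketch: the name, the `call` signature and the result
# hash are illustrative assumptions, not the gem's parser interface.
class PngHeaderSketch
  PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

  def call(io)
    io.seek(0)
    # One read covers the signature (8 bytes), the IHDR chunk length and
    # type (4 + 4 bytes) and the width and height fields (4 + 4 bytes).
    header = io.read(24)

    # Bail out immediately when the input cannot be a PNG
    return nil unless header && header.bytesize == 24
    return nil unless header.byteslice(0, 8) == PNG_SIGNATURE
    return nil unless header.byteslice(12, 4) == "IHDR"

    width, height = header.byteslice(16, 8).unpack("N2")
    { file_nature: :image, file_format: :PNG, width_px: width, height_px: height }
  end
end

# Build a fake 24-byte PNG header to exercise the sketch
fake_png = PngHeaderSketch::PNG_SIGNATURE + [13].pack("N") + "IHDR" + [320, 240].pack("N2")
info = PngHeaderSketch.new.call(StringIO.new(fake_png))
info[:width_px]  #=> 320
info[:height_px] #=> 240

# A non-PNG input is rejected after the same single read
PngHeaderSketch.new.call(StringIO.new("GIF89a" + "\x00" * 32)) #=> nil
```

A parser shaped like this stays well within any reasonable read budget, because it never reads more than the fixed-size header and gives up as soon as the signature does not match.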
89
+
90
+ ## Pull requests
91
+
92
+ Good pull requests (patches, improvements, new features) are a fantastic
93
+ help. They should remain focused in scope and avoid containing unrelated
94
+ commits.
95
+
96
+ **Please ask first** before embarking on any significant pull request (e.g.
97
+ implementing features, refactoring code, porting to a different language),
98
+ otherwise you risk spending a lot of time working on something that the
99
+ project's developers might not want to merge into the project.
100
+
101
+ Please adhere to the coding conventions used throughout the project (indentation,
102
+ accurate comments, etc.) and any other requirements (such as test coverage).
103
+
104
+ The test suite can be run with `bundle exec rspec`.
105
+
106
+ Follow this process if you'd like your work considered for inclusion in the
107
+ project:
108
+
109
+ 1. [Fork](http://help.github.com/fork-a-repo/) the project, clone your fork,
110
+ and configure the remotes:
111
+
112
+ ```bash
113
+ # Clone your fork of the repo into the current directory
114
+ git clone git@github.com:<your-username>/format_parser.git
115
+ # Navigate to the newly cloned directory
116
+ cd format_parser
117
+ # Assign the original repo to a remote called "upstream"
118
+ git remote add upstream git@github.com:WeTransfer/format_parser.git
119
+ ```
120
+
121
+ 2. If you cloned a while ago, get the latest changes from upstream:
122
+
123
+ ```bash
124
+ git checkout <dev-branch>
125
+ git pull upstream <dev-branch>
126
+ ```
127
+
128
+ 3. Create a new topic branch (off the main project development branch) to
129
+ contain your feature, change, or fix:
130
+
131
+ ```bash
132
+ git checkout -b <topic-branch-name>
133
+ ```
134
+
135
+ 4. Commit your changes in logical chunks and/or squash them for readability and
136
+ conciseness. Check out [this post](https://chris.beams.io/posts/git-commit/) or
137
+ [this other post](http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html) for some tips re: writing good commit messages.
138
+
139
+ 5. Locally merge (or rebase) the upstream development branch into your topic branch:
140
+
141
+ ```bash
142
+ git pull [--rebase] upstream <dev-branch>
143
+ ```
144
+
145
+ 6. Push your topic branch up to your fork:
146
+
147
+ ```bash
148
+ git push origin <topic-branch-name>
149
+ ```
150
+
151
+ 7. [Open a Pull Request](https://help.github.com/articles/using-pull-requests/)
152
+ with a clear title and description.
153
+
154
+ **IMPORTANT**: By submitting a patch, you agree to allow the project owner to
155
+ license your work under the same license as that used by the project, which you
156
+ can see by clicking [here](https://github.com/WeTransfer/format_parser/blob/master/LICENSE.txt).
157
+ This provision also applies to the test files you include with the changed code as fixtures.
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Gem dependencies specified in the gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2017 WeTransfer
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,53 @@
1
+ # format_parser
2
+
3
+ `format_parser` is a Ruby library for prying open video, image, document, and audio files.
4
+ It includes a number of parser modules that try to recover metadata useful for post-processing and layout while reading the absolute
5
+ minimum amount of data possible.
6
+
7
+ `format_parser` is inspired by [imagesize](https://rubygems.org/gem/imagesize), [fastimage](https://github.com/sdsykes/fastimage)
8
+ and [dimensions](https://github.com/sstephenson/dimensions), borrowing from them where appropriate.
9
+
10
+ ## Basic usage
11
+
12
+ Pass an IO object that responds to `read` and `seek` to `FormatParser`.
13
+
14
+ ```ruby
15
+ file_info = FormatParser.parse(File.open("myimage.jpg", "rb"))
16
+ file_info.file_nature #=> :image
17
+ file_info.file_format #=> :JPG
18
+ file_info.width_px #=> 320
19
+ file_info.height_px #=> 240
20
+ file_info.orientation #=> :top_left
21
+ ```
22
+ If nothing is detected, the result will be `nil`.
23
+
24
+ ## Design rationale
25
+
26
+ We need to recover metadata from various file types, and we need to do so satisfying the following constraints:
27
+
28
+ * The data in those files can be malicious and/or incomplete, so we need to be failsafe
29
+ * The data will be fetched from a remote location, so we want to acquire it with as few HTTP requests as possible
30
+ and with fetches being sufficiently small - the number of HTTP requests being of greater concern due to the
31
+ fact that we rely on AWS, and data transfer is much cheaper than per-request fees.
32
+ * The data can be recognized ambiguously and match more than one format definition (like TIFF sections of camera RAW)
33
+ * The number of supported formats is only ever going to increase, not decrease
34
+ * The library is likely to be used in multiple consumer applications
35
+ * The information necessary is a small subset of the overall metadata available in the file
36
+
37
+ Therefore we adopt the following approaches:
38
+
39
+ * Modular parsers per file format, with some degree of code sharing between them (but not too much). Adding new formats
40
+ should be low-friction, and testing these format parsers should be possible in isolation
41
+ * Modular and configurable IO stack that supports limiting reads/loops from the source entity.
42
+ The IO stack is isolated from the parsers, meaning parsers do not need to care about things
43
+ like fetches using `Range:` headers, GZIP compression and the like
44
+ * A caching system that allows us to ideally fetch once, and only once, and as little as possible - but still accommodate formats
45
+ that have the important information at the end of the file or might need information from the middle of the file
46
+ * Minimal dependencies, and if dependencies are to be used they should be very stable and low-level
47
+ * Where possible, use small subsets of full-feature format parsers since we only care about a small subset of the data
48
+ * Avoid using C libraries which are likely to contain buffer overflows/underflows - we stay memory safe
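
To make the idea of an IO layer that limits reads more concrete, here is a minimal sketch of a wrapper that enforces a budget on both the number of reads and the number of bytes requested. `ReadLimiter`, `BudgetExceeded` and the `max_reads:` / `max_bytes:` parameters are hypothetical names chosen for this example; the library's own limiting layer may look quite different.

```ruby
require 'stringio'

# Hypothetical sketch: ReadLimiter and its keyword arguments are illustrative
# assumptions, not classes or options provided by the gem.
class ReadLimiter
  class BudgetExceeded < StandardError; end

  attr_reader :reads, :bytes

  def initialize(io, max_reads:, max_bytes:)
    @io = io
    @max_reads = max_reads
    @max_bytes = max_bytes
    @reads = 0
    @bytes = 0
  end

  def seek(to)
    @io.seek(to)
  end

  def read(n_bytes)
    # Account for the request before touching the underlying IO, so an
    # over-budget read never reaches the (possibly remote) source.
    @reads += 1
    @bytes += n_bytes
    raise BudgetExceeded, "too many reads (#{@reads} > #{@max_reads})" if @reads > @max_reads
    raise BudgetExceeded, "too many bytes requested (#{@bytes} > #{@max_bytes})" if @bytes > @max_bytes
    @io.read(n_bytes)
  end
end

limited = ReadLimiter.new(StringIO.new("x" * 1024), max_reads: 2, max_bytes: 512)
limited.read(100)
limited.read(100)
begin
  limited.read(100) # third read exceeds max_reads
rescue ReadLimiter::BudgetExceeded => e
  warn "parser stopped: #{e.message}"
end
```

Because the wrapper only needs `read` and `seek`, a parser can be handed a limited IO without knowing that a budget is being enforced.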
49
+
50
+ ## Fixture Sources
51
+
52
+ - MIT licensed fixture files from the FastImage and Dimensions projects
53
+ - fixture.aiff was created by one of the project maintainers and is MIT licensed
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ require 'bundler/gem_tasks'
2
+ require 'rspec/core/rake_task'
3
+ require 'yard'
4
+
5
+ YARD::Rake::YardocTask.new(:doc) do |t|
6
+ # The dash has to be between the two to "divide" the source files and
7
+ # miscellaneous documentation files that contain no code
8
+ t.files = ['lib/**/*.rb', '-', 'LICENSE.txt', 'IMPLEMENTATION_DETAILS.md']
9
+ end
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+ task default: :spec
@@ -0,0 +1,43 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'format_parser/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "format_parser"
8
+ spec.version = FormatParser::VERSION
9
+ spec.authors = ['Noah Berman', 'Julik Tarkhanov']
10
+ spec.email = ['noah@noahberman.org', 'me@julik.nl']
11
+ spec.licenses = ['MIT']
12
+ spec.summary = "A library for efficient parsing of file metadata"
13
+ spec.description = "A Ruby library for prying open files you can convert to a previewable format, such as video, image and audio files. It includes
14
+ a number of parser modules that try to recover metadata useful for post-processing and layout while reading the absolute
15
+ minimum amount of data possible."
16
+ spec.homepage = "https://github.com/WeTransfer/format_parser"
17
+ spec.license = "MIT"
18
+
19
+ # Restrict pushes of this gem to RubyGems.org via 'allowed_push_host'. To push elsewhere, change
20
+ # the host below, or delete this section to allow pushing to any host.
21
+ if spec.respond_to?(:metadata)
22
+ spec.metadata['allowed_push_host'] = "https://rubygems.org"
23
+ else
24
+ raise "RubyGems 2.0 or newer is required to protect against public gem pushes."
25
+ end
26
+
27
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
28
+ # Make sure large fixture files are not packaged with the gem every time
29
+ f.match(%r{^spec/fixtures/})
30
+ end
31
+ spec.bindir = "exe"
32
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
33
+ spec.require_paths = ["lib"]
34
+
35
+ spec.add_dependency 'exifr', '~> 1.0'
36
+ spec.add_dependency 'faraday', '~> 0.13'
37
+
38
+ spec.add_development_dependency 'rspec', '~> 3.0'
39
+ spec.add_development_dependency 'rake', '~> 12'
40
+ spec.add_development_dependency 'simplecov', '~> 0.15'
41
+ spec.add_development_dependency 'pry', '~> 0.11'
42
+ spec.add_development_dependency 'yard', '~> 0.9'
43
+ end
data/lib/care.rb ADDED
@@ -0,0 +1,123 @@
1
+ # Care (Caching Reader) makes it more efficient to feed a
2
+ # possibly remote IO to parsers that tend to read (and skip)
3
+ # in very small increments. This way, with a remote source that
4
+ # is only available via HTTP, for example, we can have fewer
5
+ # fetches and have them return more data for one fetch
6
+ class Care
7
+ DEFAULT_PAGE_SIZE = 16 * 1024
8
+
9
+ class IOWrapper
10
+ def initialize(io, cache=Cache.new(DEFAULT_PAGE_SIZE))
11
+ @io, @cache = io, cache
12
+ @pos = 0
13
+ end
14
+
15
+ def seek(to)
16
+ @pos = to
17
+ end
18
+
19
+ def read(n_bytes)
20
+ read = @cache.byteslice(@io, @pos, n_bytes)
21
+ return nil unless read && !read.empty?
22
+ @pos += read.bytesize
23
+ read
24
+ end
25
+
26
+ def clear
27
+ @cache.clear
28
+ end
29
+
30
+ def close
31
+ clear
32
+ @io.close if @io.respond_to?(:close)
33
+ end
34
+ end
35
+
36
+ # Stores cached pages of data from the given IO as strings.
37
+ # Pages are sized to be `page_size` or less (for the last page).
38
+ class Cache
39
+ def initialize(page_size = DEFAULT_PAGE_SIZE)
40
+ @page_size = page_size.to_i
41
+ raise ArgumentError, "The page size must be a positive Integer" unless @page_size > 0
42
+ @pages = {}
43
+ @lowest_known_empty_page = nil
44
+ end
45
+
46
+ # Returns the maximum possible byte string that can be
47
+ # recovered from the given `io` at the given offset.
48
+ # If the IO has been exhausted, `nil` will be returned
49
+ # instead. Will use the cached pages where available,
50
+ # or fetch pages where necessary
51
+ def byteslice(io, at, n_bytes)
52
+ if n_bytes < 1
53
+ raise ArgumentError, "The number of bytes to fetch must be a positive Integer"
54
+ end
55
+
56
+ first_page = at / @page_size
57
+ last_page = (at + n_bytes) / @page_size
58
+
59
+ relevant_pages = (first_page..last_page).map{|i| hydrate_page(io, i) }
60
+
61
+ # Create one string combining all the pages which are relevant for
62
+ # us - it is much easier to address that string instead of piecing
63
+ # the output together page by page, and joining arrays of strings
64
+ # is supposed to be optimized.
65
+ slab = if relevant_pages.length > 1
66
+ # If our read overlaps multiple pages, we do have to join them, this is
67
+ # the general case
68
+ relevant_pages.join
69
+ else # We only have one page
70
+ # Optimize a little. If we only have one page that we need to read from
71
+ # - which is likely going to be the case *often* we can avoid allocating
72
+ # a new string for the joined pages and juse use the only page
73
+ # directly as the slab. Since it might contain a `nil` and we do
74
+ # not join (which casts nils to strings) we take care of that too
75
+ relevant_pages.first || ''
76
+ end
77
+
78
+ offset_in_slab = at % @page_size
79
+ slice = slab.byteslice(offset_in_slab, n_bytes)
80
+
81
+ # Returning an empty string from read() is very confusing for the caller,
82
+ # and no builtins do this - if we are at EOF we should return nil
83
+ if slice && !slice.empty?
84
+ slice
85
+ else
86
+ nil
87
+ end
88
+ end
89
+
90
+ def clear
91
+ @pages.clear
92
+ end
93
+
94
+ def hydrate_page(io, page_i)
95
+ # Avoid trying to read the page if we know there is no content to fill it
96
+ # in the underlying IO
97
+ if @lowest_known_empty_page && page_i >= @lowest_known_empty_page
98
+ return nil
99
+ end
100
+
101
+ @pages[page_i] ||= read_page(io, page_i)
102
+ end
103
+
104
+ def read_page(io, page_i)
105
+ io.seek(page_i * @page_size)
106
+ read_result = io.read(@page_size)
107
+
108
+ if read_result.nil?
109
+ # If the read went past the end of the IO the read result will be nil,
110
+ # so we know our IO is exhausted here
111
+ if @lowest_known_empty_page.nil? || @lowest_known_empty_page > page_i
112
+ @lowest_known_empty_page = page_i
113
+ end
114
+ elsif read_result.bytesize < @page_size
115
+ # If we read less than we initially wanted we know there are no pages
116
+ # to read following this one, so we can also optimize
117
+ @lowest_known_empty_page = page_i + 1
118
+ end
119
+
120
+ read_result
121
+ end
122
+ end
123
+ end
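
The effect of page caching can be seen in a small standalone sketch. This is a simplified variant of the paging idea, not the code from `care.rb`: `TinyPageCache` and `CountingIO` are hypothetical names, the cache holds its IO directly, and the end-of-file bookkeeping (`@lowest_known_empty_page`) is left out. A deliberately tiny page size of 16 bytes is assumed so the page arithmetic is easy to follow.

```ruby
require 'stringio'

# Counts calls to the underlying IO so the effect of paging is visible.
# (Hypothetical helper for this example only.)
class CountingIO
  attr_reader :read_calls

  def initialize(io)
    @io = io
    @read_calls = 0
  end

  def seek(to)
    @io.seek(to)
  end

  def read(n_bytes)
    @read_calls += 1
    @io.read(n_bytes)
  end
end

# Simplified page cache sketch, without EOF tracking.
class TinyPageCache
  def initialize(io, page_size)
    @io = io
    @page_size = page_size
    @pages = {}
  end

  def byteslice(at, n_bytes)
    first_page = at / @page_size
    # Page index of the last byte touched by this read
    last_page = (at + n_bytes - 1) / @page_size
    slab = (first_page..last_page).map { |i| page(i) }.join
    slice = slab.byteslice(at % @page_size, n_bytes)
    slice unless slice.nil? || slice.empty?
  end

  private

  # Fetch a page once and remember it, including a nil result at EOF
  def page(i)
    @pages.fetch(i) do
      @io.seek(i * @page_size)
      @pages[i] = @io.read(@page_size)
    end
  end
end

source = CountingIO.new(StringIO.new("0123456789abcdef" * 4)) # 64 bytes
cache = TinyPageCache.new(source, 16)

cache.byteslice(0, 4)  #=> "0123" (fetches page 0)
cache.byteslice(4, 4)  #=> "4567" (served from the cached page 0)
source.read_calls      #=> 1
cache.byteslice(14, 4) #=> "ef01" (spans pages 0 and 1, fetches page 1)
source.read_calls      #=> 2
cache.byteslice(64, 4) #=> nil    (past the end of the source)
```

Three small reads within the first two pages cost only two reads on the underlying IO, which is the property that matters when each read on the source is an HTTP request.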