zip_kit 6.0.0
- checksums.yaml +7 -0
- data/.codeclimate.yml +7 -0
- data/.document +5 -0
- data/.github/workflows/ci.yml +29 -0
- data/.gitignore +61 -0
- data/.rspec +1 -0
- data/.standard.yml +8 -0
- data/.yardopts +1 -0
- data/CHANGELOG.md +255 -0
- data/CODE_OF_CONDUCT.md +46 -0
- data/CONTRIBUTING.md +153 -0
- data/Gemfile +4 -0
- data/IMPLEMENTATION_DETAILS.md +97 -0
- data/LICENSE.txt +20 -0
- data/README.md +234 -0
- data/Rakefile +21 -0
- data/bench/buffered_crc32_bench.rb +109 -0
- data/examples/archive_size_estimate.rb +15 -0
- data/examples/config.ru +7 -0
- data/examples/deferred_write.rb +58 -0
- data/examples/parallel_compression_with_block_deflate.rb +86 -0
- data/examples/rack_application.rb +63 -0
- data/examples/s3_upload.rb +23 -0
- data/lib/zip_kit/block_deflate.rb +130 -0
- data/lib/zip_kit/block_write.rb +47 -0
- data/lib/zip_kit/file_reader/inflating_reader.rb +36 -0
- data/lib/zip_kit/file_reader/stored_reader.rb +35 -0
- data/lib/zip_kit/file_reader.rb +740 -0
- data/lib/zip_kit/null_writer.rb +12 -0
- data/lib/zip_kit/output_enumerator.rb +150 -0
- data/lib/zip_kit/path_set.rb +163 -0
- data/lib/zip_kit/rack_chunked_body.rb +32 -0
- data/lib/zip_kit/rack_tempfile_body.rb +61 -0
- data/lib/zip_kit/rails_streaming.rb +37 -0
- data/lib/zip_kit/remote_io.rb +114 -0
- data/lib/zip_kit/remote_uncap.rb +22 -0
- data/lib/zip_kit/size_estimator.rb +84 -0
- data/lib/zip_kit/stream_crc32.rb +60 -0
- data/lib/zip_kit/streamer/deflated_writer.rb +45 -0
- data/lib/zip_kit/streamer/entry.rb +37 -0
- data/lib/zip_kit/streamer/filler.rb +9 -0
- data/lib/zip_kit/streamer/heuristic.rb +68 -0
- data/lib/zip_kit/streamer/stored_writer.rb +39 -0
- data/lib/zip_kit/streamer/writable.rb +36 -0
- data/lib/zip_kit/streamer.rb +614 -0
- data/lib/zip_kit/uniquify_filename.rb +39 -0
- data/lib/zip_kit/version.rb +5 -0
- data/lib/zip_kit/write_and_tell.rb +40 -0
- data/lib/zip_kit/write_buffer.rb +71 -0
- data/lib/zip_kit/write_shovel.rb +22 -0
- data/lib/zip_kit/zip_writer.rb +436 -0
- data/lib/zip_kit.rb +24 -0
- data/zip_kit.gemspec +41 -0
- metadata +335 -0
data/IMPLEMENTATION_DETAILS.md
ADDED
@@ -0,0 +1,97 @@
# Implementation details

The ZipKit streaming implementation is designed around the following requirements:

* Only ahead-writes (no IO seek or rewind)
* Automatic switching to Zip64 as the files get written (no IO seeks), but not requiring Zip64 support if the archive can do without
* Make use of the fact that CRC32 checksums and the sizes of the files (compressed _and_ uncompressed) are known upfront

It strives to be compatible with the following unzip programs _at the minimum:_

* OSX - builtin ArchiveUtility (except the Zip64 support when files larger than 4GB are in the archive)
* OSX - The Unarchiver, at least 3.10.1
* Windows 7 - built-in Explorer zip browser (except for Unicode filenames, which it just doesn't support)
* Windows 7 - 7Zip 9.20

Below is the list of _specific_ decisions taken when writing the implementation, with an explanation for each.
We specifically _omit_ a number of things that we could do, but that are not necessary to satisfy our objectives.
The omissions are _intentional_, since we do not want to have things of which we _assume_ they work, or have things
that work only for one obscure unarchiver in one obscure case (like WinRAR with Chinese filenames).

## Data descriptors (postfix CRC32/file sizes)

Data descriptors permit you to generate "postfix" ZIP files: you write the local file header without having to
know the CRC32 and the file size upfront, then write the compressed file data, and only then - once you know what your
CRC32 and your compressed and uncompressed sizes are - write them into a data descriptor that follows the file data.

The streamer has optional support for data descriptors. Their use can apparently [be problematic](https://github.com/thejoshwolfe/yazl/issues/13)
with the 7Zip version that we want to support, but in our tests everything worked fine.

For more info see https://github.com/thejoshwolfe/yazl#general-purpose-bit-flag

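For illustration, this is roughly how a postfix write looks from the Streamer API side. Writing a file through a
block means the sizes and the CRC32 are not known upfront, so they are emitted after the file body (`out_socket`
and `generate_report` are hypothetical stand-ins, not part of zip_kit):

```ruby
ZipKit::Streamer.open(out_socket) do |zip|
  # No size or CRC32 is passed in here - the Streamer computes them
  # as the bytes pass through, and writes them after the file data.
  zip.write_deflated_file("report.csv") do |sink|
    sink << generate_report
  end
end
```
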
## Zip64 support

Zip64 support switches on _by itself_, automatically, when _any_ of the following conditions is met
(see the sketch after this list):

* The start of the central directory lies beyond the 4GB limit
* The ZIP archive has more than 65535 files added to it
* Any entry is present whose compressed _or_ uncompressed size is above 4GB

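A sketch of that check, with assumed names (the constants and the method here are illustrative, not zip_kit internals):

```ruby
FOUR_GB = 0xFFFFFFFF # largest value a 4-byte ZIP size/offset field can hold
MAX_ENTRIES = 0xFFFF # largest value of the 2-byte entry count field

def zip64_required?(central_directory_offset, entries)
  central_directory_offset > FOUR_GB ||
    entries.length > MAX_ENTRIES ||
    entries.any? { |e| e.compressed_size > FOUR_GB || e.uncompressed_size > FOUR_GB }
end
```
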
When writing out local file headers, the Zip64 extra field (and the related changes to the standard fields) is
_only_ emitted if one of the file sizes is larger than 4GB. Otherwise the Zip64 extra will _only_ be
written in the central directory entry, but not in the local file header.

This has to do with the fact that otherwise we would write Zip64 extra fields for all local file headers,
regardless of whether the file actually requires Zip64 or not. That might impede some older tools from reading
the archive, which is a problem you don't want to have if your archive otherwise fits perfectly below all
the Zip64 thresholds.

To be compatible with the Windows 7 built-in tools, the Zip64 extra field _must_ be written as _the first_ extra
field; any other extra fields should come after.

## International filename support and the Info-ZIP extra field

If a diacritic-containing character (such as å) does fit into the DOS-437
codepage, it should be encodable as such. This would, in theory, let older Windows tools
decode the filename correctly. However, this kills the filename decoding for the OSX builtin
archive utility (it assumes the filename to be UTF-8, regardless). So if we allow filenames
to be encoded in DOS-437, we _potentially_ have support in Windows, but we upset everyone on Mac.
If we just use UTF-8 and set the right EFS bit in the general purpose flags, we upset Windows users,
because most of the Windows unarchive tools (at least the builtin ones) do not give a flying eff
about the EFS support bit being set.

Additionally, if we use The Unarchiver on OSX (which is our recommended unpacker for large files),
it will (very rightfully) ask us how we should decode each filename that does not have the EFS bit
but does contain something non-ASCII-decodable. This is horrible UX for users.

So, basically, we have 2 choices for filenames containing diacritics (for bona-fide UTF-8 you do not
even get those choices, you _have_ to use UTF-8):

* Make life easier for Windows users by setting stuff to DOS, not care about the standard, _and_ make
most Mac users upset
* Make life easy for Mac users and conform to the standard, and tell Windows users to get a _decent_
ZIP unarchiving tool.

We are going with option 2, and this is well-thought-out. Trust me. If you want the crazytown
filename encoding scheme that is described here http://stackoverflow.com/questions/13261347
you can try this:

`[Encoding::CP437, Encoding::ISO_8859_1, Encoding::UTF_8]`

We don't want no such thing, and sorry Windows users, you are going to need a decent unarchiver
that honors the standard. Alas, alas.

Additionally, the tests with the unarchivers we _do_ support have shown that including the Info-ZIP
extra field does not actually help any of them recognize the file name correctly. And the use of
those fields for the UTF-8 filename, per spec, tells us we should not set the EFS bit - which ruins
the unarchiving for all other solutions. Like any other, this decision may be changed in the future.

There are some interesting notes about the Info-ZIP/EFS combination here:
https://commons.apache.org/proper/commons-compress/zip.html

## Directory support

ZIP offers the possibility to store empty directories (folders). The directories that contain files, however, get
created automatically at unarchive time. If you store a file called, say, `docs/item.doc`, then the unarchiver will
automatically create the `docs` directory if it doesn't exist already. So you need to use the directory creation
methods only if you do not have any files in those directories.
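
A minimal sketch of storing such an empty directory, assuming the Streamer's `add_empty_directory` method
(`out` stands in for your destination IO):

```ruby
ZipKit::Streamer.open(out) do |zip|
  # Only needed for directories that will contain no files at all
  zip.add_empty_directory(dirname: "logs/empty")
end
```
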
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2024 Julik Tarkhanov

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,234 @@
# zip_kit

Allows streaming, non-rewinding ZIP file output from Ruby.

`zip_kit` is a successor to and continuation of [zip_tricks](https://github.com/WeTransfer/zip_tricks), which
was inspired by [zipline](https://github.com/fringd/zipline). I am grateful to WeTransfer for allowing me
to develop zip_tricks and for sharing it with the community.

Allows you to write a ZIP archive out to a `File`, `Socket`, `String` or `Array` without having to rewind it at any
point. Usable for creating very large ZIP archives for immediate sending out to clients, or for writing
large ZIP archives without memory inflation.

The original gem (zip_tricks) handled all the zipping needs (millions of ZIP files generated per day)
for a large file transfer service, so we are pretty confident it is widely compatible with a large number
of unarchiving end-user applications and is well tested.

## Requirements

Ruby 2.6+ syntax support is required, as well as a working zlib (all available to JRuby as well).

## Diving in: send some large CSV reports from Rails

The easiest way is to include the `ZipKit::RailsStreaming` module into your
controller. You will then have a `zip_kit_stream` method available which accepts a block:

```ruby
class ZipsController < ActionController::Base
  include ZipKit::RailsStreaming

  def download
    zip_kit_stream do |zip|
      zip.write_file('report1.csv') do |sink|
        CSV(sink) do |csv_write|
          csv_write << Person.column_names
          Person.all.find_each do |person|
            csv_write << person.attributes.values
          end
        end
      end
      zip.write_file('report2.csv') do |sink|
        ...
      end
    end
  end
end
```

The `write_file` method will use some heuristics to determine whether your output file would benefit
from compression, and pick the appropriate storage mode for the file accordingly.

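If you would rather pick the storage mode yourself instead of relying on the heuristic, the Streamer also
exposes explicit methods for it - a short sketch (the file names here are made up):

```ruby
# Store verbatim - good for data that is already compressed
zip.write_stored_file('already-compressed.mp4') do |sink|
  IO.copy_stream('mov.mp4', sink)
end

# Always run the bytes through the deflater - good for compressible text
zip.write_deflated_file('huge-log.txt') do |sink|
  IO.copy_stream('log.txt', sink)
end
```
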
If you want some more conveniences you can also use [zipline](https://github.com/fringd/zipline), which
will automatically process and stream attachments (Carrierwave, Shrine, ActiveStorage) and remote objects
via HTTP.

`RailsStreaming` will *not* use [ActionController::Live](https://api.rubyonrails.org/classes/ActionController/Live.html)
and the ZIP output will run in the same thread as your main request. Your testing flows (be it minitest or
RSpec) should work normally with controller actions returning ZIPs.

## Writing into other streaming destinations

Any object that accepts bytes via either `<<` or `write` methods can be a write destination. For example, here
is how to upload a sizeable ZIP to S3 - the SDK will happily chop your upload into multipart upload parts:

```ruby
bucket = Aws::S3::Bucket.new("mybucket")
obj = bucket.object("big.zip")
obj.upload_stream do |write_stream|
  ZipKit::Streamer.open(write_stream) do |zip|
    zip.write_file("large.csv") do |sink|
      CSV(sink) do |csv|
        csv << ["Line", "Item"]
        20_000.times do |n|
          csv << [n, "Item number #{n}"]
        end
      end
    end
  end
end
```

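The `<<`/`write` duck typing means even a plain Ruby object can serve as the destination. A tiny sketch -
this `ByteCounter` is not part of zip_kit, just an illustration of the contract:

```ruby
class ByteCounter
  attr_reader :bytes_written

  def initialize
    @bytes_written = 0
  end

  def <<(bytes)
    @bytes_written += bytes.bytesize
    self # << must return self so that writes can be chained
  end
end

counter = ByteCounter.new
ZipKit::Streamer.open(counter) do |zip|
  zip.write_file("hello.txt") { |sink| sink << "Hello, ZIP!" }
end
puts counter.bytes_written # => the total size of the generated archive
```
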
## Writing through an intermediary object

Any object that writes using either `<<` or `write` can write into a `sink`. For example, you can do streaming
output with [builder](https://github.com/jimweirich/builder#project-builder):

```ruby
zip.write_file('report1.csv') do |sink|
  builder = Builder::XmlMarkup.new(target: sink, indent: 2)
  builder.people do
    Person.all.find_each do |person|
      builder.person(name: person.name)
    end
  end
end
```

and this output will be compressed and output into the ZIP file on the fly. zip_kit composes with any
Ruby code that streams its output into a destination.

## Create a ZIP file without size estimation, compress on-the-fly during writes

The basic use case is compressing on the fly. Some data will be buffered by the Zlib deflater, but
memory inflation is going to be very constrained. Data will be written to the destination at fairly regular
intervals. Deflate compression will work best for things like text files.

```ruby
out = my_tempfile # can also be a socket
ZipKit::Streamer.open(out) do |zip|
  zip.write_file('mov.mp4.txt') do |sink|
    File.open('mov.mp4', 'rb') { |source| IO.copy_stream(source, sink) }
  end
  zip.write_file('long-novel.txt') do |sink|
    File.open('novel.txt', 'rb') { |source| IO.copy_stream(source, sink) }
  end
end
```

Unfortunately, with this approach it is impossible to compute the size of the ZIP file being output,
since you do not know how large the compressed data segments are going to be.

## Send a ZIP from a Rack response

zip_kit provides an `OutputEnumerator` object which will yield the binary chunks piece
by piece, and apply some amount of buffering as well. Make sure to also wrap your `OutputEnumerator` in a chunker
by calling `#to_chunked` on it. Return it to your webserver and you will have your ZIP streamed!
The block that you give to the `OutputEnumerator` receives the {ZipKit::Streamer} object and will only
start executing once your response body starts getting iterated over - when actually sending
the response to the client (unless you are using a buffering Rack webserver, such as Webrick).

```ruby
body = ZipKit::OutputEnumerator.new do |zip|
  zip.write_file('mov.mp4') do |sink|
    File.open('mov.mp4', 'rb') { |source| IO.copy_stream(source, sink) }
  end
  zip.write_file('long-novel.txt') do |sink|
    File.open('novel.txt', 'rb') { |source| IO.copy_stream(source, sink) }
  end
end

headers, streaming_body = body.to_rack_response_headers_and_body(env)
[200, headers, streaming_body]
```

## Send a ZIP file of known size, with correct headers

Use the `SizeEstimator` to compute the correct size of the resulting archive.

```ruby
# Precompute the Content-Length ahead of time
bytesize = ZipKit::SizeEstimator.estimate do |z|
  z.add_stored_entry(filename: 'myfile1.bin', size: 9090821)
  z.add_stored_entry(filename: 'myfile2.bin', size: 458678)
end

# Prepare the response body. The block will only be called when the response starts to be written.
zip_body = ZipKit::OutputEnumerator.new do |zip|
  zip.add_stored_entry(filename: "myfile1.bin", size: 9090821, crc32: 12485)
  zip << read_file('myfile1.bin')
  zip.add_stored_entry(filename: "myfile2.bin", size: 458678, crc32: 89568)
  zip << read_file('myfile2.bin')
end

headers, streaming_body = zip_body.to_rack_response_headers_and_body(env, content_length: bytesize)
[200, headers, streaming_body]
```

## Writing ZIP files using the Streamer bypass

You do not have to "feed" all the contents of the files you put in the archive through the Streamer object.
If the write destination for your use case is a `Socket` (say, you are writing using Rack hijack) and you know
the metadata of the file upfront (the CRC32 of the uncompressed file and the sizes), you can write directly
to that socket using some accelerated writing technique, and only use the Streamer to write out the ZIP metadata.

```ruby
# io has to be an object that supports #<< or #write()
ZipKit::Streamer.open(io) do |zip|
  # raw_file is written "as is" (STORED mode).
  # Write the local file header first..
  zip.add_stored_entry(filename: "first-file.bin", size: raw_file.size, crc32: raw_file_crc32)

  # Adjust the ZIP offsets within the Streamer
  zip.simulate_write(raw_file.size)

  # ...and then send the actual file contents bypassing the Streamer interface
  io.sendfile(raw_file)
end
```

## Other usage examples

Check out the `examples/` directory at the root of the project. This will give you a good idea
of the various use cases the library supports.

### Computing the CRC32 value of a large file

`StreamCRC32` computes the CRC32 checksum of an IO in a streaming fashion.
It is slightly more convenient for the purpose than using the raw Zlib library functions.

```ruby
crc = ZipKit::StreamCRC32.new
crc << next_chunk_of_data
...

crc.to_i # Returns the actual CRC32 value computed so far
...
# Append a known CRC32 value that has been computed previously
crc.append(precomputed_crc32, size_of_the_blob_computed_from)
```

You can also compute the CRC32 for an entire IO object if it responds to `#eof?`:

```ruby
crc = ZipKit::StreamCRC32.from_io(file) # Returns an Integer
```

### Reading ZIP files

The library contains a reader module; play with it to see what is possible. It is not a complete ZIP reader,
but it was designed for a specific purpose (highly-parallel unpacking of remotely stored ZIP files), and
as such it performs its function quite well. Please beware of the security implications of using ZIP readers
that have not been formally verified (ours hasn't been).

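A quick taste of the reader, as a sketch - `read_zip_structure` is the zip_tricks-era entry point, and I am
assuming here that it is carried over unchanged into zip_kit:

```ruby
File.open("some.zip", "rb") do |io|
  entries = ZipKit::FileReader.read_zip_structure(io: io)
  entries.each do |entry|
    puts "#{entry.filename}: #{entry.uncompressed_size} bytes"
  end
end
```
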
## Contributing to zip_kit

* Check out the latest `main` to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
* Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
* Fork the project.
* Start a feature/bugfix branch.
* Commit and push until you are happy with your contribution.
* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.

## Copyright

Copyright (c) 2024 Julik Tarkhanov. See LICENSE.txt for further details.
data/Rakefile
ADDED
@@ -0,0 +1,21 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "rspec/core/rake_task"
require "yard"
require "rubocop/rake_task"
require "standard/rake"

task :format do
  `bundle exec standardrb --fix-unsafely`
  `bundle exec magic_frozen_string_literal ./lib`
end

YARD::Rake::YardocTask.new(:doc) do |t|
  # The dash has to be between the two to "divide" the source files and
  # miscellaneous documentation files that contain no code
  t.files = ["lib/**/*.rb", "-", "LICENSE.txt", "IMPLEMENTATION_DETAILS.md"]
end

RSpec::Core::RakeTask.new(:spec)
task default: [:spec, :standard]
data/bench/buffered_crc32_bench.rb
ADDED
@@ -0,0 +1,109 @@
require "bundler"
Bundler.setup

require "benchmark"
require "benchmark/ips"
require_relative "../lib/zip_kit"

n_bytes = 5 * 1024 * 1024
r = Random.new
bytes = (0...n_bytes).map { r.bytes(1) }
buffer_sizes = [
  1,
  256,
  512,
  1024,
  8 * 1024,
  16 * 1024,
  32 * 1024,
  64 * 1024,
  128 * 1024,
  256 * 1024,
  512 * 1024,
  1024 * 1024,
  2 * 1024 * 1024
]

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)
  buffer_sizes.each do |buf_size|
    x.report "Single-byte <<-writes of #{n_bytes} using a #{buf_size} byte buffer" do
      crc = ZipKit::WriteBuffer.new(ZipKit::StreamCRC32.new, buf_size)
      bytes.each { |b| crc << b }
      crc.to_i
    end
  end
  x.compare!
end

__END__

Warming up --------------------------------------
Single-byte <<-writes of 5242880 using a 1 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 256 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 512 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 1024 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 8192 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 16384 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 32768 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 65536 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 131072 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 262144 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 524288 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 1048576 byte buffer
                         1.000 i/100ms
Single-byte <<-writes of 5242880 using a 2097152 byte buffer
                         1.000 i/100ms
Calculating -------------------------------------
Single-byte <<-writes of 5242880 using a 1 byte buffer
                          0.054 (± 0.0%) i/s - 1.000 in 18.383019s
Single-byte <<-writes of 5242880 using a 256 byte buffer
                          0.121 (± 0.0%) i/s - 1.000 in 8.286061s
Single-byte <<-writes of 5242880 using a 512 byte buffer
                          0.124 (± 0.0%) i/s - 1.000 in 8.038112s
Single-byte <<-writes of 5242880 using a 1024 byte buffer
                          0.128 (± 0.0%) i/s - 1.000 in 7.828562s
Single-byte <<-writes of 5242880 using a 8192 byte buffer
                          0.123 (± 0.0%) i/s - 1.000 in 8.121586s
Single-byte <<-writes of 5242880 using a 16384 byte buffer
                          0.127 (± 0.0%) i/s - 1.000 in 7.872240s
Single-byte <<-writes of 5242880 using a 32768 byte buffer
                          0.126 (± 0.0%) i/s - 1.000 in 7.911816s
Single-byte <<-writes of 5242880 using a 65536 byte buffer
                          0.126 (± 0.0%) i/s - 1.000 in 7.917318s
Single-byte <<-writes of 5242880 using a 131072 byte buffer
                          0.127 (± 0.0%) i/s - 1.000 in 7.897223s
Single-byte <<-writes of 5242880 using a 262144 byte buffer
                          0.130 (± 0.0%) i/s - 1.000 in 7.675608s
Single-byte <<-writes of 5242880 using a 524288 byte buffer
                          0.130 (± 0.0%) i/s - 1.000 in 7.679886s
Single-byte <<-writes of 5242880 using a 1048576 byte buffer
                          0.128 (± 0.0%) i/s - 1.000 in 7.788439s
Single-byte <<-writes of 5242880 using a 2097152 byte buffer
                          0.128 (± 0.0%) i/s - 1.000 in 7.797839s

Comparison:
Single-byte <<-writes of 5242880 using a 262144 byte buffer: 0.1 i/s
Single-byte <<-writes of 5242880 using a 524288 byte buffer: 0.1 i/s - 1.00x slower
Single-byte <<-writes of 5242880 using a 1048576 byte buffer: 0.1 i/s - 1.01x slower
Single-byte <<-writes of 5242880 using a 2097152 byte buffer: 0.1 i/s - 1.02x slower
Single-byte <<-writes of 5242880 using a 1024 byte buffer: 0.1 i/s - 1.02x slower
Single-byte <<-writes of 5242880 using a 16384 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 131072 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 32768 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 65536 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 512 byte buffer: 0.1 i/s - 1.05x slower
Single-byte <<-writes of 5242880 using a 8192 byte buffer: 0.1 i/s - 1.06x slower
Single-byte <<-writes of 5242880 using a 256 byte buffer: 0.1 i/s - 1.08x slower
Single-byte <<-writes of 5242880 using a 1 byte buffer: 0.1 i/s - 2.39x slower
data/examples/archive_size_estimate.rb
ADDED
@@ -0,0 +1,15 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"

# Predict how large a ZIP file is going to be without having access to
# the actual file contents, but using just the filenames (influences the
# file size) and the size of the files
zip_archive_size_in_bytes = ZipKit::SizeEstimator.estimate { |zip|
  # Pretend we are going to make a ZIP file which contains a few
  # MP4 files (those do not compress all too well)
  zip.add_stored_entry(filename: "MOV_1234.MP4", size: 898_090)
  zip.add_stored_entry(filename: "MOV_1235.MP4", size: 7_855_126)
}

puts zip_archive_size_in_bytes #=> 8_753_467
data/examples/deferred_write.rb
ADDED
@@ -0,0 +1,58 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"

# Using deferred writes (when you want to "pull" from a Streamer)
# is also possible with ZipKit.
#
# The OutputEnumerator class instead of Streamer is very useful for this
# particular purpose. It does not start the archiving immediately,
# but waits instead until you start pulling data out of it.
#
# Let's make an OutputEnumerator that writes a few files with random content. Note that when you create
# that body it does not immediately write the ZIP:
iterable = ZipKit::Streamer.output_enum { |zip|
  (1..5).each do |i|
    zip.write_stored_file("random_%04d.bin" % i) do |sink|
      warn "Starting on file #{i}...\n"
      sink << Random.new.bytes(1024)
    end
  end
}

warn "\n\nOutput using #each"

# Now we can treat the iterable as any Ruby enumerable object, since
# it supports #each yielding every binary string output by the Streamer.
# Only when we start using each() will the ZIP start generating. Just using
# each() like we do here runs the archiving procedure to completion. See how
# the output of the block within OutputEnumerator is interspersed with the stuff
# being yielded to each():
iterable.each do |_binary_string|
  $stderr << "."
end

warn "\n\nOutput Enumerator returned from #each"

# We have now output the entire archive, so using each() again
# will restart the block we gave it. For example, we can use
# an Enumerator - via enum_for - to "take" chunks of output when
# we find it necessary:
enum = iterable.each
15.times do
  _bin_str = enum.next # Obtain the subsequent chunk of the ZIP
  $stderr << "*"
end

# ... or a Fiber

warn "\n\nOutput using a Fiber"
fib = Fiber.new {
  iterable.each do |binary_string|
    $stderr << "•"
    _next_iteration = Fiber.yield(binary_string)
  end
}
15.times do
  fib.resume # Process the subsequent chunk of the ZIP
end
data/examples/parallel_compression_with_block_deflate.rb
ADDED
@@ -0,0 +1,86 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"
require "tempfile"

# This shows how to perform compression in parallel (a-la pigz, but in a less
# advanced fashion since the compression tables are not shared - to
# minimize shared state).
#
# When using this approach, compressing a large file can be performed as a
# map-reduce operation.
# First you prepare all the data per part of your (potentially very large) file,
# and then you use the reduce task to combine that data into one linear zip.
# In this example we will generate threads and collect their return values in
# the order the threads were launched, which guarantees a consistent reduce.
#
# So, let each thread generate a part of the file, and also
# compute the CRC32 of it. The thread will compress its own part
# as well, in an independent deflate segment - the threads do not share
# anything. You could also multiplex this over multiple processes or
# even machines.
threads = (0..12).map {
  Thread.new do
    source_tempfile = Tempfile.new "t"
    source_tempfile.binmode

    # Fill the part with random content
    12.times { source_tempfile << Random.new.bytes(1 * 1024 * 1024) }
    source_tempfile.rewind

    # Compute the CRC32 of the source file
    part_crc = ZipKit::StreamCRC32.from_io(source_tempfile)
    source_tempfile.rewind

    # Create a compressed part
    compressed_tempfile = Tempfile.new("tc")
    compressed_tempfile.binmode
    ZipKit::BlockDeflate.deflate_in_blocks(source_tempfile,
      compressed_tempfile)

    # Capture the part size before discarding the source tempfile
    source_part_size = source_tempfile.size
    source_tempfile.close!

    # The data that the splicing process needs.
    [compressed_tempfile, part_crc, source_part_size]
  end
}

# Threads return us a tuple with [compressed_tempfile, source_part_crc,
# source_part_size]
compressed_tempfiles_and_crc_of_parts = threads.map(&:join).map(&:value)

# Now we need to compute the CRC32 of the _entire_ file, and it has to be
# the CRC32 of the _source_ file (uncompressed), not of the compressed variant.
# Handily, we know the CRC32 and the size of every part, so we can combine them.
entire_file_crc = ZipKit::StreamCRC32.new
compressed_tempfiles_and_crc_of_parts.each do |_, source_part_crc, source_part_size|
  entire_file_crc.append(source_part_crc, source_part_size)
end

# We need to append the terminator bytes to the end of the last part.
last_compressed_part = compressed_tempfiles_and_crc_of_parts[-1][0]
ZipKit::BlockDeflate.write_terminator(last_compressed_part)

# and we need to know how big the deflated segment of the ZIP is going to be, in total.
# To figure that out we just sum the sizes of the files
compressed_part_files = compressed_tempfiles_and_crc_of_parts.map(&:first)
size_of_deflated_segment = compressed_part_files.map(&:size).inject(&:+)
size_of_uncompressed_file = compressed_tempfiles_and_crc_of_parts.map { |e| e[2] }.inject(&:+)

# And now we can create a ZIP with our compressed file in its entirety.
# We use a File as a destination here, but you can also use a socket or a
# non-rewindable IO. ZipKit never needs to rewind your output, since it is
# made for streaming.
output = File.open("zip_created_in_parallel.zip", "wb")

ZipKit::Streamer.open(output) do |zip|
  zip.add_deflated_entry(filename: "parallel.bin",
    uncompressed_size: size_of_uncompressed_file,
    crc32: entire_file_crc.to_i,
    compressed_size: size_of_deflated_segment)
  compressed_part_files.each do |part_file|
    part_file.rewind
    while (blob = part_file.read(2048))
      zip << blob
    end
  end
end