zip_kit 6.0.0

Files changed (54)
  1. checksums.yaml +7 -0
  2. data/.codeclimate.yml +7 -0
  3. data/.document +5 -0
  4. data/.github/workflows/ci.yml +29 -0
  5. data/.gitignore +61 -0
  6. data/.rspec +1 -0
  7. data/.standard.yml +8 -0
  8. data/.yardopts +1 -0
  9. data/CHANGELOG.md +255 -0
  10. data/CODE_OF_CONDUCT.md +46 -0
  11. data/CONTRIBUTING.md +153 -0
  12. data/Gemfile +4 -0
  13. data/IMPLEMENTATION_DETAILS.md +97 -0
  14. data/LICENSE.txt +20 -0
  15. data/README.md +234 -0
  16. data/Rakefile +21 -0
  17. data/bench/buffered_crc32_bench.rb +109 -0
  18. data/examples/archive_size_estimate.rb +15 -0
  19. data/examples/config.ru +7 -0
  20. data/examples/deferred_write.rb +58 -0
  21. data/examples/parallel_compression_with_block_deflate.rb +86 -0
  22. data/examples/rack_application.rb +63 -0
  23. data/examples/s3_upload.rb +23 -0
  24. data/lib/zip_kit/block_deflate.rb +130 -0
  25. data/lib/zip_kit/block_write.rb +47 -0
  26. data/lib/zip_kit/file_reader/inflating_reader.rb +36 -0
  27. data/lib/zip_kit/file_reader/stored_reader.rb +35 -0
  28. data/lib/zip_kit/file_reader.rb +740 -0
  29. data/lib/zip_kit/null_writer.rb +12 -0
  30. data/lib/zip_kit/output_enumerator.rb +150 -0
  31. data/lib/zip_kit/path_set.rb +163 -0
  32. data/lib/zip_kit/rack_chunked_body.rb +32 -0
  33. data/lib/zip_kit/rack_tempfile_body.rb +61 -0
  34. data/lib/zip_kit/rails_streaming.rb +37 -0
  35. data/lib/zip_kit/remote_io.rb +114 -0
  36. data/lib/zip_kit/remote_uncap.rb +22 -0
  37. data/lib/zip_kit/size_estimator.rb +84 -0
  38. data/lib/zip_kit/stream_crc32.rb +60 -0
  39. data/lib/zip_kit/streamer/deflated_writer.rb +45 -0
  40. data/lib/zip_kit/streamer/entry.rb +37 -0
  41. data/lib/zip_kit/streamer/filler.rb +9 -0
  42. data/lib/zip_kit/streamer/heuristic.rb +68 -0
  43. data/lib/zip_kit/streamer/stored_writer.rb +39 -0
  44. data/lib/zip_kit/streamer/writable.rb +36 -0
  45. data/lib/zip_kit/streamer.rb +614 -0
  46. data/lib/zip_kit/uniquify_filename.rb +39 -0
  47. data/lib/zip_kit/version.rb +5 -0
  48. data/lib/zip_kit/write_and_tell.rb +40 -0
  49. data/lib/zip_kit/write_buffer.rb +71 -0
  50. data/lib/zip_kit/write_shovel.rb +22 -0
  51. data/lib/zip_kit/zip_writer.rb +436 -0
  52. data/lib/zip_kit.rb +24 -0
  53. data/zip_kit.gemspec +41 -0
  54. metadata +335 -0
data/IMPLEMENTATION_DETAILS.md ADDED
@@ -0,0 +1,97 @@
# Implementation details

The ZipKit streaming implementation is designed around the following requirements:

* Only ahead-writes (no IO seeks or rewinds)
* Automatic switching to Zip64 as the files get written (no IO seeks), while not requiring Zip64 support if the archive can do without it
* Make use of the fact that CRC32 checksums and the sizes of the files (compressed _and_ uncompressed) are known upfront

It strives to be compatible with the following unzip programs _at a minimum:_

* OSX - the built-in Archive Utility (except for Zip64 support when files larger than 4GB are in the archive)
* OSX - The Unarchiver, at least 3.10.1
* Windows 7 - the built-in Explorer ZIP browser (except for Unicode filenames, which it simply does not support)
* Windows 7 - 7Zip 9.20

Below is the list of _specific_ decisions taken when writing the implementation, with an explanation for each.
We deliberately _omit_ a number of things that we could do but that are not necessary to satisfy our objectives.
The omissions are _intentional_, since we do not want to ship things we merely _assume_ to work, or things
that work only for one obscure unarchiver in one obscure case (like WinRAR with Chinese filenames).

## Data descriptors (postfix CRC32/file sizes)

Data descriptors permit you to generate "postfix" ZIP files: you write the local file header without having to
know the CRC32 and the file size upfront, then write the compressed file data, and only then - once you know what your
CRC32 and your compressed and uncompressed sizes are - write them into a data descriptor that follows the file data.

The Streamer has optional support for data descriptors. Their use can apparently [be problematic](https://github.com/thejoshwolfe/yazl/issues/13)
with the 7Zip version that we want to support, but in our tests everything worked fine.

For more info see https://github.com/thejoshwolfe/yazl#general-purpose-bit-flag

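Schematically, such a "postfix" entry is laid out as follows (bit 3 of the general-purpose flags announces that the real CRC32 and sizes follow the file data):

```
[local file header: CRC32 and sizes set to zero, general-purpose flag bit 3 set]
[file data, stored or deflated]
[data descriptor: optional signature, CRC32, compressed size, uncompressed size]
```
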
## Zip64 support

Zip64 support switches on _by itself_, automatically, when _any_ of the following conditions is met (see the sketch after this list):

* The start of the central directory lies beyond the 4GB limit
* The ZIP archive has more than 65535 files added to it
* Any entry is present whose compressed _or_ uncompressed size is above 4GB

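For illustration, here is a minimal sketch of those cutoffs. This is illustrative only, not ZipKit's actual internals:

```ruby
# Illustrative sketch of the Zip64 thresholds - not ZipKit's actual code.
FOUR_BYTE_MAX_UINT = 0xFFFFFFFF # anything above this does not fit in 4 bytes
TWO_BYTE_MAX_UINT = 0xFFFF # anything above this does not fit in 2 bytes

def zip64_required?(central_directory_offset:, entries:)
  central_directory_offset > FOUR_BYTE_MAX_UINT ||
    entries.length > TWO_BYTE_MAX_UINT ||
    entries.any? { |e| e.compressed_size > FOUR_BYTE_MAX_UINT || e.uncompressed_size > FOUR_BYTE_MAX_UINT }
end
```
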
When writing out local file headers, the Zip64 extra field (and the related changes to the standard header fields) is
_only_ written if one of the file sizes exceeds 4GB. Otherwise the Zip64 extra field will _only_ be
written in the central directory entry, and not in the local file header.

The reason is that we would otherwise write Zip64 extra fields into all local file headers,
regardless of whether the file actually requires Zip64 or not. That might prevent some older tools from reading
the archive, which is a problem you don't want to have if your archive otherwise fits perfectly below all
the Zip64 thresholds.

To be compatible with the Windows 7 built-in tools, the Zip64 extra field _must_ be written as _the first_ extra
field; any other extra fields should come after it.

## International filename support and the Info-ZIP extra field

If a diacritic-containing character (such as å) fits into the DOS-437
codepage, it could in principle be encoded as such. This would, in theory, let older Windows tools
decode the filename correctly. However, this breaks the filename decoding for the OSX built-in
Archive Utility (it assumes the filename to be UTF-8, regardless). So if we allow filenames
to be encoded in DOS-437, we _potentially_ have support in Windows, but we upset everyone on Mac.
If we just use UTF-8 and set the right EFS bit in the general purpose flags, we upset Windows users,
because most of the Windows unarchiving tools (at least the built-in ones) do not give a flying eff
about the EFS bit being set.

Additionally, if we use The Unarchiver on OSX (which is our recommended unpacker for large files),
it will (very rightfully) ask us how we should decode each filename that does not have the EFS bit
but does contain something non-ASCII-decodable. This is horrible UX for users.

So, basically, we have two choices for filenames containing diacritics (for bona-fide UTF-8 you do not
even get those choices, you _have_ to use UTF-8):

* Make life easier for Windows users by setting stuff to DOS, not care about the standard, _and_ make
most Mac users upset
* Make life easy for Mac users and conform to the standard, and tell Windows users to get a _decent_
ZIP unarchiving tool.

We are going with option 2, and this is well-thought-out. Trust me. If you want the crazytown
filename encoding scheme that is described here http://stackoverflow.com/questions/13261347
you can try this:

    [Encoding::CP437, Encoding::ISO_8859_1, Encoding::UTF_8]

We don't want no such thing, and sorry Windows users, you are going to need a decent unarchiver
that honors the standard. Alas, alas.

Additionally, tests with the unarchivers we _do_ support have shown that including the Info-ZIP
extra field does not actually help any of them recognize the filename correctly. And the use of
that field for the UTF-8 filename, per spec, tells us we should not set the EFS bit - which ruins
the unarchiving for all the other solutions. Like any other, this decision may be changed in the future.

There are some interesting notes about the Info-ZIP/EFS combination here:
https://commons.apache.org/proper/commons-compress/zip.html

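The policy described above boils down to a simple rule. Here is a minimal sketch of it - illustrative only, not ZipKit's actual internals:

```ruby
# Illustrative sketch of the filename policy - not ZipKit's actual code.
def general_purpose_flags_for_filename(filename)
  utf8_name = filename.encode(Encoding::UTF_8)
  # Bit 11 (0x0800) is the EFS bit: "this filename is encoded in UTF-8"
  utf8_name.ascii_only? ? 0x0000 : 0x0800
end
```
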
## Directory support

ZIP offers the possibility to store empty directories (folders). Directories that contain files, however, get
created automatically at unarchive time. If you store a file called, say, `docs/item.doc`, then the unarchiver will
automatically create the `docs` directory if it doesn't exist already. So you only need to use the directory creation
methods for directories that do not contain any files.

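For illustration, a short sketch (assuming the Streamer's `add_empty_directory(dirname:)` method):

```ruby
ZipKit::Streamer.open(out) do |zip|
  # "docs/" will be created implicitly when this entry gets unarchived
  zip.write_file("docs/item.doc") { |sink| sink << File.binread("item.doc") }
  # Only a genuinely empty directory needs an explicit entry
  zip.add_empty_directory(dirname: "logs")
end
```
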
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2024 Julik Tarkhanov

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,234 @@
# zip_kit

Allows streaming, non-rewinding ZIP file output from Ruby.

`zip_kit` is a successor to and continuation of [zip_tricks](https://github.com/WeTransfer/zip_tricks), which
was inspired by [zipline](https://github.com/fringd/zipline). I am grateful to WeTransfer for allowing me
to develop zip_tricks and for sharing it with the community.

Allows you to write a ZIP archive out to a `File`, `Socket`, `String` or `Array` without having to rewind it at any
point. Usable for creating very large ZIP archives for immediate streaming out to clients, or for writing
large ZIP archives without memory inflation.

The original gem (zip_tricks) handled all the zipping needs (millions of ZIP files generated per day)
for a large file transfer service, so we are pretty confident it is widely compatible with a large number
of unarchiving end-user applications and is well tested.

## Requirements

Ruby 2.6+ syntax support is required, as well as a working zlib (both available on JRuby as well).

## Diving in: send some large CSV reports from Rails

The easiest way is to include the `ZipKit::RailsStreaming` module into your
controller. You will then have a `zip_kit_stream` method available which accepts a block:

```ruby
class ZipsController < ActionController::Base
  include ZipKit::RailsStreaming

  def download
    zip_kit_stream do |zip|
      zip.write_file('report1.csv') do |sink|
        CSV(sink) do |csv_write|
          csv_write << Person.column_names
          Person.all.find_each do |person|
            csv_write << person.attributes.values
          end
        end
      end
      zip.write_file('report2.csv') do |sink|
        ...
      end
    end
  end
end
```

The `write_file` method will use some heuristics to determine whether your output file would benefit
from compression, and pick the appropriate storage mode for the file accordingly.

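If you already know which storage mode suits a particular file, you can skip the heuristic and pick it yourself. The Streamer also exposes `write_stored_file` and `write_deflated_file` with the same block interface (`write_stored_file` is used by the bundled examples; treat this as a sketch):

```ruby
zip_kit_stream do |zip|
  # Media files are usually compressed already - store them as is
  zip.write_stored_file('movie.mp4') do |sink|
    File.open('movie.mp4', 'rb') { |source| IO.copy_stream(source, sink) }
  end
  # Text compresses well - force deflate
  zip.write_deflated_file('report.csv') do |sink|
    File.open('report.csv', 'rb') { |source| IO.copy_stream(source, sink) }
  end
end
```
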
If you want some more conveniences you can also use [zipline](https://github.com/fringd/zipline), which
will automatically process and stream attachments (CarrierWave, Shrine, ActiveStorage) and remote objects
via HTTP.

`RailsStreaming` will *not* use [ActionController::Live](https://api.rubyonrails.org/classes/ActionController/Live.html)
and the ZIP output will run in the same thread as your main request. Your testing flows (be it minitest or
RSpec) should work normally with controller actions returning ZIPs.

## Writing into other streaming destinations

Any object that accepts bytes via either `<<` or `write` methods can be a write destination. For example, here
is how to upload a sizeable ZIP to S3 - the SDK will happily chop your upload into multipart upload parts:

```ruby
bucket = Aws::S3::Bucket.new("mybucket")
obj = bucket.object("big.zip")
obj.upload_stream do |write_stream|
  ZipKit::Streamer.open(write_stream) do |zip|
    zip.write_file("large.csv") do |sink|
      CSV(sink) do |csv|
        csv << ["Line", "Item"]
        20_000.times do |n|
          csv << [n, "Item number #{n}"]
        end
      end
    end
  end
end
```

## Writing through an intermediary object

Any object that writes using either `<<` or `write` can write into a `sink`. For example, you can do streaming
output with [builder](https://github.com/jimweirich/builder#project-builder):

```ruby
zip.write_file('report1.csv') do |sink|
  builder = Builder::XmlMarkup.new(target: sink, indent: 2)
  builder.people do
    Person.all.find_each do |person|
      builder.person(name: person.name)
    end
  end
end
```

and this output will be compressed and written into the ZIP file on the fly. zip_kit composes with any
Ruby code that streams its output into a destination.

## Create a ZIP file without size estimation, compress on-the-fly during writes

The basic use case is compressing on the fly. Some data will be buffered by the Zlib deflater, but
memory inflation is going to be very constrained. Data will be written to the destination at fairly regular
intervals. Deflate compression will work best for things like text files.

```ruby
out = my_tempfile # can also be a socket
ZipKit::Streamer.open(out) do |zip|
  zip.write_file('mov.mp4.txt') do |sink|
    File.open('mov.mp4', 'rb') { |source| IO.copy_stream(source, sink) }
  end
  zip.write_file('long-novel.txt') do |sink|
    File.open('novel.txt', 'rb') { |source| IO.copy_stream(source, sink) }
  end
end
```

Unfortunately, with this approach it is impossible to compute the size of the ZIP file being output,
since you do not know how large the compressed data segments are going to be.

## Send a ZIP from a Rack response

zip_kit provides an `OutputEnumerator` object which will yield the binary chunks piece
by piece, and apply some amount of buffering as well. Make sure to also wrap your `OutputEnumerator` in a chunker
by calling `#to_chunked` on it. Return it to your webserver and you will have your ZIP streamed!
The block that you give to the `OutputEnumerator` receives the {ZipKit::Streamer} object and will only
start executing once your response body starts getting iterated over - that is, when the response is actually
being sent to the client (unless you are using a buffering Rack webserver, such as WEBrick).

```ruby
body = ZipKit::OutputEnumerator.new do |zip|
  zip.write_file('mov.mp4') do |sink|
    File.open('mov.mp4', 'rb') { |source| IO.copy_stream(source, sink) }
  end
  zip.write_file('long-novel.txt') do |sink|
    File.open('novel.txt', 'rb') { |source| IO.copy_stream(source, sink) }
  end
end

headers, streaming_body = body.to_rack_response_headers_and_body(env)
[200, headers, streaming_body]
```

## Send a ZIP file of known size, with correct headers

Use the `SizeEstimator` to compute the correct size of the resulting archive.

```ruby
# Precompute the Content-Length ahead of time
bytesize = ZipKit::SizeEstimator.estimate do |z|
  z.add_stored_entry(filename: 'myfile1.bin', size: 9090821)
  z.add_stored_entry(filename: 'myfile2.bin', size: 458678)
end

# Prepare the response body. The block will only be called when the response starts to be written.
zip_body = ZipKit::OutputEnumerator.new do |zip|
  zip.add_stored_entry(filename: "myfile1.bin", size: 9090821, crc32: 12485)
  zip << read_file('myfile1.bin')
  zip.add_stored_entry(filename: "myfile2.bin", size: 458678, crc32: 89568)
  zip << read_file('myfile2.bin')
end

headers, streaming_body = zip_body.to_rack_response_headers_and_body(env, content_length: bytesize)
[200, headers, streaming_body]
```

## Writing ZIP files using the Streamer bypass

You do not have to "feed" all the contents of the files you put in the archive through the Streamer object.
If the write destination for your use case is a `Socket` (say, you are writing using Rack hijack) and you know
the metadata of the file upfront (the CRC32 of the uncompressed file and the sizes), you can write directly
to that socket using some accelerated writing technique, and only use the Streamer to write out the ZIP metadata.

```ruby
# io has to be an object that supports #<< or #write()
ZipKit::Streamer.open(io) do |zip|
  # my_temp_file is written "as is" (STORED mode).
  # Write the local file header first...
  zip.add_stored_entry(filename: "first-file.bin", size: my_temp_file.size, crc32: my_temp_file_crc32)

  # Adjust the ZIP offsets within the Streamer
  zip.simulate_write(my_temp_file.size)

  # ...and then send the actual file contents bypassing the Streamer interface
  io.sendfile(my_temp_file)
end
```

## Other usage examples

Check out the `examples/` directory at the root of the project. This will give you a good idea
of the various use cases the library supports.

### Computing the CRC32 value of a large file

`StreamCRC32` computes the CRC32 checksum of an IO in a streaming fashion.
It is slightly more convenient for the purpose than using the raw Zlib library functions.

```ruby
crc = ZipKit::StreamCRC32.new
crc << next_chunk_of_data
...

crc.to_i # Returns the actual CRC32 value computed so far
...
# Append a known CRC32 value that has been computed previously
crc.append(precomputed_crc32, size_of_the_blob_computed_from)
```

You can also compute the CRC32 for an entire IO object if it responds to `#eof?`:

```ruby
crc = ZipKit::StreamCRC32.from_io(file) # Returns an Integer
```

### Reading ZIP files

The library contains a reader module; play with it to see what is possible. It is not a complete ZIP reader,
but it was designed for a specific purpose (highly-parallel unpacking of remotely stored ZIP files), and
as such it performs its function quite well. Please beware of the security implications of using ZIP readers
that have not been formally verified (ours hasn't been).

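As a starting point, here is a brief sketch of listing the entries of an archive. It assumes the `FileReader.read_zip_structure(io:)` entry point and the entry attributes carried over from zip_tricks - check the `FileReader` documentation before relying on them:

```ruby
# Assumes the zip_tricks-era FileReader API - verify against the docs.
File.open("big.zip", "rb") do |io|
  entries = ZipKit::FileReader.read_zip_structure(io: io)
  entries.each do |entry|
    puts "#{entry.filename}: #{entry.compressed_size} bytes compressed"
  end
end
```
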
## Contributing to zip_kit

* Check out the latest `main` to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
* Check out the issue tracker to make sure someone hasn't already requested and/or contributed it.
* Fork the project.
* Start a feature/bugfix branch.
* Commit and push until you are happy with your contribution.
* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or if a change is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.

## Copyright

Copyright (c) 2024 Julik Tarkhanov. See LICENSE.txt for further details.
data/Rakefile ADDED
@@ -0,0 +1,21 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "rspec/core/rake_task"
require "yard"
require "rubocop/rake_task"
require "standard/rake"

task :format do
  `bundle exec standardrb --fix-unsafely`
  `bundle exec magic_frozen_string_literal ./lib`
end

YARD::Rake::YardocTask.new(:doc) do |t|
  # The dash has to be between the two to "divide" the source files and
  # the miscellaneous documentation files that contain no code
  t.files = ["lib/**/*.rb", "-", "LICENSE.txt", "IMPLEMENTATION_DETAILS.md"]
end

RSpec::Core::RakeTask.new(:spec)
task default: [:spec, :standard]
data/bench/buffered_crc32_bench.rb ADDED
@@ -0,0 +1,109 @@
require "bundler"
Bundler.setup

require "benchmark"
require "benchmark/ips"
require_relative "../lib/zip_kit"

n_bytes = 5 * 1024 * 1024
r = Random.new
bytes = (0...n_bytes).map { r.bytes(1) }
buffer_sizes = [
  1,
  256,
  512,
  1024,
  8 * 1024,
  16 * 1024,
  32 * 1024,
  64 * 1024,
  128 * 1024,
  256 * 1024,
  512 * 1024,
  1024 * 1024,
  2 * 1024 * 1024
]

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)
  buffer_sizes.each do |buf_size|
    x.report "Single-byte <<-writes of #{n_bytes} using a #{buf_size} byte buffer" do
      crc = ZipKit::WriteBuffer.new(ZipKit::StreamCRC32.new, buf_size)
      bytes.each { |b| crc << b }
      crc.to_i
    end
  end
  x.compare!
end

__END__

Warming up --------------------------------------
Single-byte <<-writes of 5242880 using a 1 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 256 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 512 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 1024 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 8192 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 16384 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 32768 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 65536 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 131072 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 262144 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 524288 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 1048576 byte buffer
     1.000 i/100ms
Single-byte <<-writes of 5242880 using a 2097152 byte buffer
     1.000 i/100ms
Calculating -------------------------------------
Single-byte <<-writes of 5242880 using a 1 byte buffer
     0.054 (± 0.0%) i/s - 1.000 in 18.383019s
Single-byte <<-writes of 5242880 using a 256 byte buffer
     0.121 (± 0.0%) i/s - 1.000 in 8.286061s
Single-byte <<-writes of 5242880 using a 512 byte buffer
     0.124 (± 0.0%) i/s - 1.000 in 8.038112s
Single-byte <<-writes of 5242880 using a 1024 byte buffer
     0.128 (± 0.0%) i/s - 1.000 in 7.828562s
Single-byte <<-writes of 5242880 using a 8192 byte buffer
     0.123 (± 0.0%) i/s - 1.000 in 8.121586s
Single-byte <<-writes of 5242880 using a 16384 byte buffer
     0.127 (± 0.0%) i/s - 1.000 in 7.872240s
Single-byte <<-writes of 5242880 using a 32768 byte buffer
     0.126 (± 0.0%) i/s - 1.000 in 7.911816s
Single-byte <<-writes of 5242880 using a 65536 byte buffer
     0.126 (± 0.0%) i/s - 1.000 in 7.917318s
Single-byte <<-writes of 5242880 using a 131072 byte buffer
     0.127 (± 0.0%) i/s - 1.000 in 7.897223s
Single-byte <<-writes of 5242880 using a 262144 byte buffer
     0.130 (± 0.0%) i/s - 1.000 in 7.675608s
Single-byte <<-writes of 5242880 using a 524288 byte buffer
     0.130 (± 0.0%) i/s - 1.000 in 7.679886s
Single-byte <<-writes of 5242880 using a 1048576 byte buffer
     0.128 (± 0.0%) i/s - 1.000 in 7.788439s
Single-byte <<-writes of 5242880 using a 2097152 byte buffer
     0.128 (± 0.0%) i/s - 1.000 in 7.797839s

Comparison:
Single-byte <<-writes of 5242880 using a 262144 byte buffer: 0.1 i/s
Single-byte <<-writes of 5242880 using a 524288 byte buffer: 0.1 i/s - 1.00x slower
Single-byte <<-writes of 5242880 using a 1048576 byte buffer: 0.1 i/s - 1.01x slower
Single-byte <<-writes of 5242880 using a 2097152 byte buffer: 0.1 i/s - 1.02x slower
Single-byte <<-writes of 5242880 using a 1024 byte buffer: 0.1 i/s - 1.02x slower
Single-byte <<-writes of 5242880 using a 16384 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 131072 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 32768 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 65536 byte buffer: 0.1 i/s - 1.03x slower
Single-byte <<-writes of 5242880 using a 512 byte buffer: 0.1 i/s - 1.05x slower
Single-byte <<-writes of 5242880 using a 8192 byte buffer: 0.1 i/s - 1.06x slower
Single-byte <<-writes of 5242880 using a 256 byte buffer: 0.1 i/s - 1.08x slower
Single-byte <<-writes of 5242880 using a 1 byte buffer: 0.1 i/s - 2.39x slower
data/examples/archive_size_estimate.rb ADDED
@@ -0,0 +1,15 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"

# Predict how large a ZIP file is going to be without having access to
# the actual file contents, using just the filenames (the filename length
# influences the archive size) and the sizes of the files
zip_archive_size_in_bytes = ZipKit::SizeEstimator.estimate { |zip|
  # Pretend we are going to make a ZIP file which contains a few
  # MP4 files (those do not compress all too well)
  zip.add_stored_entry(filename: "MOV_1234.MP4", size: 898_090)
  zip.add_stored_entry(filename: "MOV_1235.MP4", size: 7_855_126)
}

puts zip_archive_size_in_bytes #=> 8_753_467
data/examples/config.ru ADDED
@@ -0,0 +1,7 @@
# frozen_string_literal: true

require File.dirname(__FILE__) + "/rack_application.rb"

# Demonstrates a Rack app that can offer a ZIP download composed
# at runtime (see rack_application.rb)
run ZipDownload.new
data/examples/deferred_write.rb ADDED
@@ -0,0 +1,58 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"

# Using deferred writes (when you want to "pull" from a Streamer)
# is also possible with ZipKit.
#
# The OutputEnumerator class (used instead of the Streamer directly) is very
# useful for this particular purpose. It does not start the archiving
# immediately, but waits instead until you start pulling data out of it.
#
# Let's make an OutputEnumerator that writes a few files with random content.
# Note that when you create that body it does not immediately write the ZIP:
iterable = ZipKit::Streamer.output_enum { |zip|
  (1..5).each do |i|
    zip.write_stored_file("random_%04d.bin" % i) do |sink|
      warn "Starting on file #{i}...\n"
      sink << Random.new.bytes(1024)
    end
  end
}

warn "\n\nOutput using #each"

# Now we can treat the iterable as any Ruby enumerable object, since
# it supports #each yielding every binary string output by the Streamer.
# Only when we start using each() will the ZIP start generating. Just using
# each() like we do here runs the archiving procedure to completion. See how
# the output of the block within the OutputEnumerator is interspersed with
# the stuff being yielded to each():
iterable.each do |_binary_string|
  $stderr << "."
end

warn "\n\nOutput Enumerator returned from #each"

# We have now output the entire archive, so using each() again
# will restart the block we gave it. For example, we can use
# an Enumerator - via enum_for - to "take" chunks of output when
# we find it necessary:
enum = iterable.each
15.times do
  _bin_str = enum.next # Obtain the subsequent chunk of the ZIP
  $stderr << "*"
end

# ... or a Fiber

warn "\n\nOutput using a Fiber"
fib = Fiber.new {
  iterable.each do |binary_string|
    $stderr << "•"
    _next_iteration = Fiber.yield(binary_string)
  end
}
15.times do
  fib.resume # Process the subsequent chunk of the ZIP
end
data/examples/parallel_compression_with_block_deflate.rb ADDED
@@ -0,0 +1,86 @@
# frozen_string_literal: true

require_relative "../lib/zip_kit"
require "tempfile"

# This shows how to perform compression in parallel (a-la pigz, but in a less
# advanced fashion since the compression tables are not shared - to
# minimize shared state).
#
# When using this approach, compressing a large file can be performed as a
# map-reduce operation.
# First you prepare all the data per part of your (potentially very large) file,
# and then you use the reduce task to combine that data into one linear zip.
# In this example we will generate threads and collect their return values in
# the order the threads were launched, which guarantees a consistent reduce.
#
# So, let each thread generate a part of the file, and also
# compute the CRC32 of it. The thread will compress its own part
# as well, in an independent deflate segment - the threads do not share
# anything. You could also multiplex this over multiple processes or
# even machines.
threads = (0..12).map {
  Thread.new do
    source_tempfile = Tempfile.new "t"
    source_tempfile.binmode

    # Fill the part with random content
    12.times { source_tempfile << Random.new.bytes(1 * 1024 * 1024) }
    source_tempfile.rewind

    # Compute the CRC32 of the source file
    part_crc = ZipKit::StreamCRC32.from_io(source_tempfile)
    source_tempfile.rewind

    # Create a compressed part
    compressed_tempfile = Tempfile.new("tc")
    compressed_tempfile.binmode
    ZipKit::BlockDeflate.deflate_in_blocks(source_tempfile,
      compressed_tempfile)

    # Capture the source part size before discarding the source tempfile
    source_size = source_tempfile.size
    source_tempfile.close!
    # The data that the splicing process needs.
    [compressed_tempfile, part_crc, source_size]
  end
}

# Threads return us a tuple of
# [compressed_tempfile, source_part_crc, source_part_size]
compressed_tempfiles_and_crc_of_parts = threads.map(&:join).map(&:value)

# Now we need to compute the CRC32 of the _entire_ file, and it has to be
# the CRC32 of the _source_ file (uncompressed), not of the compressed variant.
# Handily, we know the CRC32 and size of every part, and CRC32 values can be
# combined - which is exactly what StreamCRC32#append does.
entire_file_crc = ZipKit::StreamCRC32.new
compressed_tempfiles_and_crc_of_parts.each do |_, source_part_crc, source_part_size|
  entire_file_crc.append(source_part_crc, source_part_size)
end

# We need to append the terminator bytes to the end of the last part.
last_compressed_part = compressed_tempfiles_and_crc_of_parts[-1][0]
ZipKit::BlockDeflate.write_terminator(last_compressed_part)

# ...and we need to know how big the deflated segment of the ZIP is going to be, in total.
# To figure that out we just sum the sizes of the files
compressed_part_files = compressed_tempfiles_and_crc_of_parts.map(&:first)
size_of_deflated_segment = compressed_part_files.map(&:size).inject(&:+)
size_of_uncompressed_file = compressed_tempfiles_and_crc_of_parts.map { |e| e[2] }.inject(&:+)

# And now we can create a ZIP with our compressed file in its entirety.
# We use a File as a destination here, but you can also use a socket or a
# non-rewindable IO. ZipKit never needs to rewind your output, since it is
# made for streaming.
output = File.open("zip_created_in_parallel.zip", "wb")

ZipKit::Streamer.open(output) do |zip|
  zip.add_deflated_entry("parallel.bin",
    size_of_uncompressed_file,
    entire_file_crc.to_i,
    size_of_deflated_segment)
  compressed_part_files.each do |part_file|
    part_file.rewind
    while (blob = part_file.read(2048))
      zip << blob
    end
  end
end