iostreams 1.1.1 → 1.3.2
- checksums.yaml +4 -4
- data/README.md +7 -426
- data/lib/io_streams/builder.rb +9 -6
- data/lib/io_streams/deprecated.rb +1 -1
- data/lib/io_streams/encode/reader.rb +1 -1
- data/lib/io_streams/encode/writer.rb +1 -1
- data/lib/io_streams/errors.rb +10 -0
- data/lib/io_streams/io_streams.rb +1 -1
- data/lib/io_streams/line/reader.rb +2 -2
- data/lib/io_streams/paths/file.rb +4 -4
- data/lib/io_streams/paths/http.rb +7 -3
- data/lib/io_streams/paths/s3.rb +1 -1
- data/lib/io_streams/paths/sftp.rb +16 -4
- data/lib/io_streams/pgp.rb +81 -138
- data/lib/io_streams/pgp/writer.rb +30 -5
- data/lib/io_streams/record/reader.rb +33 -8
- data/lib/io_streams/record/writer.rb +33 -8
- data/lib/io_streams/row/reader.rb +4 -4
- data/lib/io_streams/row/writer.rb +4 -4
- data/lib/io_streams/stream.rb +21 -10
- data/lib/io_streams/tabular.rb +34 -3
- data/lib/io_streams/tabular/header.rb +14 -12
- data/lib/io_streams/tabular/parser/fixed.rb +135 -25
- data/lib/io_streams/utils.rb +5 -4
- data/lib/io_streams/version.rb +1 -1
- data/lib/io_streams/zip/reader.rb +1 -1
- data/test/io_streams_test.rb +49 -0
- data/test/paths/file_test.rb +1 -1
- data/test/paths/s3_test.rb +2 -2
- data/test/paths/sftp_test.rb +4 -4
- data/test/pgp_test.rb +54 -4
- data/test/pgp_writer_test.rb +26 -0
- data/test/tabular_test.rb +55 -26
- data/test/test_helper.rb +5 -1
- metadata +2 -3
- data/lib/io_streams/utils/reliable_http.rb +0 -98
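Each entry above follows the diffstat convention `path +lines_added -lines_removed`. A small Ruby sketch to total such a list (a hypothetical helper for reading this page, not part of the iostreams gem):

```ruby
# Sum the "+added -removed" tallies in a diffstat-style file list.
def diffstat_totals(lines)
  added = removed = 0
  lines.each do |line|
    # Lines look like: "data/lib/io_streams/pgp.rb +81 -138"
    next unless (match = line.match(/\+(\d+)\s+-(\d+)\s*\z/))

    added   += match[1].to_i
    removed += match[2].to_i
  end
  [added, removed]
end

added, removed = diffstat_totals(
  ["data/README.md +7 -426", "data/lib/io_streams/pgp.rb +81 -138"]
)
puts "+#{added} -#{removed}" # → "+88 -564"
```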
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5b743c145f120e461d09eed5b9c7476bacbdaa18453200e1ccc57ec9eef2239c
+  data.tar.gz: 7c54a49f344e454ef64a1e19e47f2c665b46e675eda34c03e1604668a6543de4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 27d4a3875009d1902285a9e01b67032e74e1eac6d295de3400096b01c663de8b14cb594e4ffd8f04154d23cf7cfd4a696dbc33b9651fd200808ff087c00c0dc2
+  data.tar.gz: 92656ad55f37e5ffb3b4588cc93d6d9c988ce31705f888d4e045d86ffcf5047f4185a8de30bc6c262a48ae7f048433edd00f827d155939db7caa858e5e495bf0
data/README.md CHANGED
@@ -1,437 +1,18 @@
-#
+# IOStreams
 [![Gem Version](https://img.shields.io/gem/v/iostreams.svg)](https://rubygems.org/gems/iostreams) [![Build Status](https://travis-ci.org/rocketjob/iostreams.svg?branch=master)](https://travis-ci.org/rocketjob/iostreams) [![Downloads](https://img.shields.io/gem/dt/iostreams.svg)](https://rubygems.org/gems/iostreams) [![License](https://img.shields.io/badge/license-Apache%202.0-brightgreen.svg)](http://opensource.org/licenses/Apache-2.0) ![](https://img.shields.io/badge/status-Production%20Ready-blue.svg) [![Gitter chat](https://img.shields.io/badge/IRC%20(gitter)-Support-brightgreen.svg)](https://gitter.im/rocketjob/support)
 
-
+IOStreams is an incredibly powerful streaming library that makes changes to file formats, compression, encryption,
+or storage mechanism transparent to the application.
 
 ## Project Status
 
-Production Ready,
+Production Ready, heavily used in production environments, many as part of Rocket Job.
 
-##
+## Documentation
 
-
+Start with the [IOStreams tutorial](https://iostreams.rocketjob.io/tutorial) to get a great introduction to IOStreams.
 
-
-* Gzip
-* BZip2
-* PGP (Requires GnuPG)
-* Xlsx (Reading)
-* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
-Supported sources and/or targets:
-
-* File
-* HTTP (Read only)
-* AWS S3
-* SFTP
-
-Supported file formats:
-
-* CSV
-* Fixed width formats
-* JSON
-* PSV
-
-## Quick examples
-
-Read an entire file into memory:
-
-```ruby
-IOStreams.path('example.txt').read
-```
-
-Decompress an entire gzip file into memory:
-
-```ruby
-IOStreams.path('example.gz').read
-```
-
-Read and decompress the first file in a zip file into memory:
-
-```ruby
-IOStreams.path('example.zip').read
-```
-
-Read a file one line at a time
-
-```ruby
-IOStreams.path('example.txt').each do |line|
-  puts line
-end
-```
-
-Read a CSV file one line at a time, returning each line as an array:
-
-```ruby
-IOStreams.path('example.csv').each(:array) do |array|
-  p array
-end
-```
-
-Read a CSV file a record at a time, returning each line as a hash.
-The first line of the file is assumed to be the header line:
-
-```ruby
-IOStreams.path('example.csv').each(:hash) do |hash|
-  p hash
-end
-```
-
-Read a file using an http get,
-decompressing the named file in the zip file,
-returning each records from the named file as a hash:
-
-```ruby
-IOStreams.
-  path("https://www5.fdic.gov/idasp/Offices2.zip").
-  option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-  reader(:hash) do |stream|
-  p stream.read
-end
-```
-
-Read the file without unzipping and streaming the first file in the zip:
-
-```ruby
-IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-```
-
-
-## Introduction
-
-If all files were small, they could just be loaded into memory in their entirety. With the
-advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
-them into memory is not feasible.
-
-In linux it is common to use pipes to stream data between processes.
-For example:
-
-```
-# Count the number of lines in a file that has been compressed with gzip
-cat abc.gz | gunzip -c | wc -l
-```
-
-For large files it is critical to be able to read and write these files as streams. Ruby has support
-for reading and writing files using streams, but has no built-in way of passing one stream through
-another to support for example compressing the data, encrypting it and then finally writing the result
-to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
-together several streams, `iostreams` attempts to offer similar features for Ruby.
-
-```ruby
-# Read a compressed file:
-IOStreams.path("hello.gz").reader do |reader|
-  data = reader.read(1024)
-  puts "Read: #{data}"
-end
-```
-
-The true power of streams is shown when many streams are chained together to achieve the end
-result, without holding the entire file in memory, or ideally without needing to create
-any temporary files to process the stream.
-
-```ruby
-# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
-IOStreams.path("hello.gz.enc").writer do |writer|
-  writer.write("Hello World")
-  writer.write("and some more")
-end
-```
-
-The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
-or even gigabytes.
-
-By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
-to the data being read or written. For example:
-* `hello.zip` => Compressed using Zip
-* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
-* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
-The objective is that all of these streaming processes are performed used streaming
-so that only the current portion of the file is loaded into memory as it moves
-through the entire file.
-Where possible each stream never goes to disk, which for example could expose
-un-encrypted data.
-
-## Examples
-
-While decompressing the file, display 128 characters at a time from the file.
-
-~~~ruby
-require "iostreams"
-IOStreams.path("abc.csv").reader do |io|
-  while (data = io.read(128))
-    p data
-  end
-end
-~~~
-
-While decompressing the file, display one line at a time from the file.
-
-~~~ruby
-IOStreams.path("abc.csv").each do |line|
-  puts line
-end
-~~~
-
-While decompressing the file, display each row from the csv file as an array.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:array) do |array|
-  p array
-end
-~~~
-
-While decompressing the file, display each record from the csv file as a hash.
-The first line is assumed to be the header row.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:hash) do |hash|
-  p hash
-end
-~~~
-
-Write data while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer do |io|
-  io.write("This")
-  io.write(" is ")
-  io.write(" one line\n")
-end
-~~~
-
-Write a line at a time while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:line) do |file|
-  file << "these"
-  file << "are"
-  file << "all"
-  file << "separate"
-  file << "lines"
-end
-~~~
-
-Write an array (row) at a time while compressing the file.
-Each array is converted to csv before being compressed with zip.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:array) do |io|
-  io << %w[name address zip_code]
-  io << %w[Jack There 1234]
-  io << ["Joe", "Over There somewhere", 1234]
-end
-~~~
-
-Write a hash (record) at a time while compressing the file.
-Each hash is converted to csv before being compressed with zip.
-The header row is extracted from the first hash supplied.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-~~~
-
-Write to a string IO for testing, supplying the filename so that the streams can be determined.
-
-~~~ruby
-io = StringIO.new
-IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-puts io.string
-~~~
-
-Read a CSV file and write the output to an encrypted file in JSON format.
-
-~~~ruby
-IOStreams.path("sample.json.enc").writer(:hash) do |output|
-  IOStreams.path("sample.csv").each(:hash) do |record|
-    output << record
-  end
-end
-~~~
-
-## Copying between files
-
-Stream based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
-Encrypt the contents of the file `sample.json` and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with pgp and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
-~~~
-
-Decrypt the file `abc.csv.enc` and write it to `xyz.csv`.
-
-~~~ruby
-input = IOStreams.path("abc.csv.enc")
-IOStreams.path("xyz.csv").copy_from(input)
-~~~
-
-Decrypt file `ABC` that was encrypted with Symmetric Encryption,
-PGP encrypt the output file and write it to `xyz.csv.pgp` using the pgp key that was imported for `a@a.com`.
-
-~~~ruby
-input = IOStreams.path("ABC").stream(:enc)
-IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
-~~~
-
-To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
-~~~ruby
-input = IOStreams.path("sample.json.zip")
-IOStreams.path("sample.copy").copy_from(input, convert: false)
-~~~
-
-## Philosopy
-
-IOStreams can be used to work against a single stream. it's real capability becomes apparent when chaining together
-multiple streams to process data, without loading entire files into memory.
-
-#### Linux Pipes
-
-Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
-Example: count the number of lines in a compressed file:
-
-    gunzip -c hello.csv.gz | wc -l
-
-The file `hello.csv.gz` is uncompressed and returned to standard output, which in turn is piped into the standard
-input for `wc -l`, which counts the number of lines in the uncompressed data.
-
-As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
-can start counting lines of uncompressed data, without waiting until the entire file is decompressed.
-The uncompressed contents of the file are not written to disk before passing to `wc -l` and the file is not loaded
-into memory before passing to `wc -l`.
-
-In this way extremely large files can be processed with very little memory being used.
-
-#### Push Model
-
-In the Linux pipes example above this would be considered a "push model" where each task in the list pushes
-its output to the input of the next task.
-
-A major challenge or disadvantage with the push model is that buffering would need to occur between tasks since
-each task could complete at very different speeds. To prevent large memory usage the standard output from a previous
-task would have to be blocked to try and make it slow down.
-
-#### Pull Model
-
-Another approach with multiple tasks that need to process a single stream, is to move to a "pull model" where the
-task at the end of the list pulls a block from a previous task when it is ready to process it.
-
-#### IOStreams
-
-IOStreams uses the pull model when reading data, where each stream performs a read against the previous stream
-when it is ready for more data.
-
-When writing to an output stream, IOStreams uses the push model, where each block of data that is ready to be written
-is pushed to the task/stream in the list. The write push only returns once it has traversed all the way down to
-the final task / stream in the list, this avoids complex buffering issues between each task / stream in the list.
-
-Example: Implementing in Ruby: `gunzip -c hello.csv.gz | wc -l`
-
-~~~ruby
-line_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1}
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure which stream
-to start with:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1}
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-  lines.each { line_count += 1}
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-It can be simplified even further using `#each`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").each { line_count += 1}
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-The benefit in all of the above cases is that the file can be any arbitrary size and only one block of the file
-is held in memory at any time.
-
-#### Chaining
-
-In the above example only 2 streams were used. Streams can be nested as deep as necessary to process data.
-
-Example, search for all occurrences of the word apple, cleansing the input data stream of non printable characters
-and converting to valid US ASCII.
-
-~~~ruby
-apple_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Encode::Reader.open(input,
-                                 encoding: "US-ASCII",
-                                 encode_replace: "",
-                                 encode_cleaner: :printable) do |cleansed|
-    IOStreams::Line::Reader.open(cleansed) do |lines|
-      lines.each { |line| apple_count += line.scan("apple").count}
-    end
-  end
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-Let IOStreams perform the above stream chaining automatically under the covers:
-
-~~~ruby
-apple_count = 0
-IOStreams.path("hello.csv.gz").
-  option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-  each do |line|
-  apple_count += line.scan("apple").count
-end
-
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-## Notes
-
-* Due to the nature of Zip, both its Reader and Writer methods will create
-  a temp file when reading from or writing to a stream.
-  Recommended to use Gzip over Zip since it can be streamed without requiring temp files.
-* Zip becomes exponentially slower with very large files, especially files
-  that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
+Next, checkout the remaining [IOStreams documentation](https://iostreams.rocketjob.io/)
 
 ## Versioning
 