iostreams 1.1.1 → 1.3.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 1a141ad8f92a6c1387ad487fc8cf545bcc010ebc862967ffce8f79f71b7ba2d6
- data.tar.gz: 0210b8d23390ddff6fe31133c24b88b2beb7e5e0f54d6b9df962632dc369a9ec
+ metadata.gz: 5b743c145f120e461d09eed5b9c7476bacbdaa18453200e1ccc57ec9eef2239c
+ data.tar.gz: 7c54a49f344e454ef64a1e19e47f2c665b46e675eda34c03e1604668a6543de4
  SHA512:
- metadata.gz: 810573a43277573365e946205465a4a1cc87c65cf517450dc6f363f5bb7168e66a25cd2d891e346d09c4b96aa9166f369fd2bc9dc21ec2112d87bce270467ff2
- data.tar.gz: 2304155b27a897f270263283bdf499855a81f80885b9103b6c116f135904a42d73bfb29705a8156f3bde973b7a899eba9b123073e7b140a2e502b2203869ad1f
+ metadata.gz: 27d4a3875009d1902285a9e01b67032e74e1eac6d295de3400096b01c663de8b14cb594e4ffd8f04154d23cf7cfd4a696dbc33b9651fd200808ff087c00c0dc2
+ data.tar.gz: 92656ad55f37e5ffb3b4588cc93d6d9c988ce31705f888d4e045d86ffcf5047f4185a8de30bc6c262a48ae7f048433edd00f827d155939db7caa858e5e495bf0
data/README.md CHANGED
@@ -1,437 +1,18 @@
- # iostreams
+ # IOStreams
  [![Gem Version](https://img.shields.io/gem/v/iostreams.svg)](https://rubygems.org/gems/iostreams) [![Build Status](https://travis-ci.org/rocketjob/iostreams.svg?branch=master)](https://travis-ci.org/rocketjob/iostreams) [![Downloads](https://img.shields.io/gem/dt/iostreams.svg)](https://rubygems.org/gems/iostreams) [![License](https://img.shields.io/badge/license-Apache%202.0-brightgreen.svg)](http://opensource.org/licenses/Apache-2.0) ![](https://img.shields.io/badge/status-Production%20Ready-blue.svg) [![Gitter chat](https://img.shields.io/badge/IRC%20(gitter)-Support-brightgreen.svg)](https://gitter.im/rocketjob/support)

- Input and Output streaming for Ruby.
+ IOStreams is an incredibly powerful streaming library that makes changes to file formats, compression, encryption,
+ or storage mechanism transparent to the application.

  ## Project Status

- Production Ready, but API is subject to breaking changes until V1 is released.
+ Production Ready, heavily used in production environments, many as part of Rocket Job.

- ## Features
+ ## Documentation

- Supported streams:
+ Start with the [IOStreams tutorial](https://iostreams.rocketjob.io/tutorial) to get a great introduction to IOStreams.

- * Zip
- * Gzip
- * BZip2
- * PGP (Requires GnuPG)
- * Xlsx (Reading)
- * Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
- Supported sources and/or targets:
-
- * File
- * HTTP (Read only)
- * AWS S3
- * SFTP
-
- Supported file formats:
-
- * CSV
- * Fixed width formats
- * JSON
- * PSV
-
- ## Quick examples
-
- Read an entire file into memory:
-
- ```ruby
- IOStreams.path('example.txt').read
- ```
-
- Decompress an entire gzip file into memory:
-
- ```ruby
- IOStreams.path('example.gz').read
- ```
-
- Read and decompress the first file in a zip file into memory:
-
- ```ruby
- IOStreams.path('example.zip').read
- ```
-
- Read a file one line at a time:
-
- ```ruby
- IOStreams.path('example.txt').each do |line|
-   puts line
- end
- ```
-
- Read a CSV file one line at a time, returning each line as an array:
-
- ```ruby
- IOStreams.path('example.csv').each(:array) do |array|
-   p array
- end
- ```
-
- Read a CSV file a record at a time, returning each line as a hash.
- The first line of the file is assumed to be the header line:
-
- ```ruby
- IOStreams.path('example.csv').each(:hash) do |hash|
-   p hash
- end
- ```
-
- Read a file using an HTTP GET,
- decompressing the named file in the zip file,
- returning each record from the named file as a hash:
-
- ```ruby
- IOStreams.
-   path("https://www5.fdic.gov/idasp/Offices2.zip").
-   option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-   reader(:hash) do |stream|
-     p stream.read
-   end
- ```
-
- Read the file as-is, without unzipping or extracting the first file in the zip:
-
- ```ruby
- IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader { |file| puts file.read }
- ```
-
-
- ## Introduction
-
- If all files were small, they could just be loaded into memory in their entirety. With the
- advent of very large files, often several gigabytes or even terabytes in size, loading
- them into memory is not feasible.
-
- In Linux it is common to use pipes to stream data between processes.
- For example:
-
- ```
- # Count the number of lines in a file that has been compressed with gzip
- cat abc.gz | gunzip -c | wc -l
- ```
-
- For large files it is critical to be able to read and write these files as streams. Ruby has support
- for reading and writing files using streams, but has no built-in way of passing one stream through
- another to support, for example, compressing the data, encrypting it, and then finally writing the result
- to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
- together several streams; `iostreams` attempts to offer similar features for Ruby.
-
- ```ruby
- # Read a compressed file:
- IOStreams.path("hello.gz").reader do |reader|
-   data = reader.read(1024)
-   puts "Read: #{data}"
- end
- ```
-
- The true power of streams is shown when many streams are chained together to achieve the end
- result, without holding the entire file in memory, or ideally without needing to create
- any temporary files to process the stream.
-
- ```ruby
- # Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
- IOStreams.path("hello.gz.enc").writer do |writer|
-   writer.write("Hello World")
-   writer.write("and some more")
- end
- ```
-
- The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
- or even gigabytes.
-
- By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
- to the data being read or written. For example:
- * `hello.zip` => Compressed using Zip
- * `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
- * `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
- The objective is that all of these processes are performed using streaming,
- so that only the current portion of the file is loaded into memory as it moves
- through the entire file.
- Where possible the stream contents never go to disk, which could, for example, expose
- unencrypted data.
-
- ## Examples
-
- While decompressing the file, display 128 characters at a time from the file.
-
- ~~~ruby
- require "iostreams"
- IOStreams.path("abc.csv").reader do |io|
-   while (data = io.read(128))
-     p data
-   end
- end
- ~~~
-
- While decompressing the file, display one line at a time from the file.
-
- ~~~ruby
- IOStreams.path("abc.csv").each do |line|
-   puts line
- end
- ~~~
-
- While decompressing the file, display each row from the CSV file as an array.
-
- ~~~ruby
- IOStreams.path("abc.csv").each(:array) do |array|
-   p array
- end
- ~~~
-
- While decompressing the file, display each record from the CSV file as a hash.
- The first line is assumed to be the header row.
-
- ~~~ruby
- IOStreams.path("abc.csv").each(:hash) do |hash|
-   p hash
- end
- ~~~
-
- Write data while compressing the file.
-
- ~~~ruby
- IOStreams.path("abc.csv").writer do |io|
-   io.write("This")
-   io.write(" is ")
-   io.write(" one line\n")
- end
- ~~~
-
- Write a line at a time while compressing the file.
-
- ~~~ruby
- IOStreams.path("abc.csv").writer(:line) do |file|
-   file << "these"
-   file << "are"
-   file << "all"
-   file << "separate"
-   file << "lines"
- end
- ~~~
-
- Write an array (row) at a time while compressing the file.
- Each array is converted to CSV before being compressed with zip.
-
- ~~~ruby
- IOStreams.path("abc.csv").writer(:array) do |io|
-   io << %w[name address zip_code]
-   io << %w[Jack There 1234]
-   io << ["Joe", "Over There somewhere", 1234]
- end
- ~~~
-
- Write a hash (record) at a time while compressing the file.
- Each hash is converted to CSV before being compressed with zip.
- The header row is extracted from the first hash supplied.
-
- ~~~ruby
- IOStreams.path("abc.csv").writer(:hash) do |stream|
-   stream << {name: "Jack", address: "There", zip_code: 1234}
-   stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
- end
- ~~~
-
- Write to a string IO for testing, supplying the file name so that the streams can be determined.
-
- ~~~ruby
- io = StringIO.new
- IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-   stream << {name: "Jack", address: "There", zip_code: 1234}
-   stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
- end
- puts io.string
- ~~~
-
- Read a CSV file and write the output to an encrypted file in JSON format.
-
- ~~~ruby
- IOStreams.path("sample.json.enc").writer(:hash) do |output|
-   IOStreams.path("sample.csv").each(:hash) do |record|
-     output << record
-   end
- end
- ~~~
-
- ## Copying between files
-
- Stream-based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
- Encrypt the contents of the file `sample.json` and write to `sample.json.enc`:
-
- ~~~ruby
- input = IOStreams.path("sample.json")
- IOStreams.path("sample.json.enc").copy_from(input)
- ~~~
-
- Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`:
-
- ~~~ruby
- input = IOStreams.path("sample.json")
- IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
- ~~~
-
- Encrypt and compress the contents of the file `sample.json` with PGP and write to `sample.json.pgp`:
-
- ~~~ruby
- input = IOStreams.path("sample.json")
- IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
- ~~~
-
- Decrypt the file `abc.csv.enc` and write it to `xyz.csv`:
-
- ~~~ruby
- input = IOStreams.path("abc.csv.enc")
- IOStreams.path("xyz.csv").copy_from(input)
- ~~~
-
- Decrypt the file `ABC` that was encrypted with Symmetric Encryption,
- PGP encrypt the output file, and write it to `xyz.csv.pgp` using the PGP key that was imported for `a@a.com`:
-
- ~~~ruby
- input = IOStreams.path("ABC").stream(:enc)
- IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
- ~~~
-
- To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
- ~~~ruby
- input = IOStreams.path("sample.json.zip")
- IOStreams.path("sample.copy").copy_from(input, convert: false)
- ~~~
-
- ## Philosophy
-
- IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
- multiple streams to process data, without loading entire files into memory.
-
- #### Linux Pipes
-
- Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
- Example: count the number of lines in a compressed file:
-
-     gunzip -c hello.csv.gz | wc -l
-
- The file `hello.csv.gz` is decompressed and written to standard output, which in turn is piped into the standard
- input of `wc -l`, which counts the number of lines in the uncompressed data.
-
- As each block of data is returned from `gunzip` it is immediately passed into `wc`, so that it
- can start counting lines of uncompressed data without waiting until the entire file is decompressed.
- The uncompressed contents of the file are neither written to disk nor loaded
- into memory in their entirety before being passed to `wc -l`.
-
- In this way extremely large files can be processed using very little memory.
-
- #### Push Model
-
- The Linux pipes example above would be considered a "push model", where each task in the list pushes
- its output to the input of the next task.
-
- A major disadvantage of the push model is that buffering needs to occur between tasks, since
- each task can complete at a very different speed. To prevent large memory usage, the standard output of a faster
- upstream task has to be blocked to force it to slow down.
-
- #### Pull Model
-
- Another approach when multiple tasks need to process a single stream is to move to a "pull model", where the
- task at the end of the list pulls a block from a previous task when it is ready to process it.
-
- #### IOStreams
-
- IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
- when it is ready for more data.
-
- When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
- is pushed to the next task/stream in the list. The write only returns once it has traversed all the way down to
- the final task/stream in the list, which avoids complex buffering issues between each task/stream.
-
- Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
-
- ~~~ruby
- line_count = 0
- IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-   IOStreams::Line::Reader.open(input) do |lines|
-     lines.each { line_count += 1 }
-   end
- end
- puts "hello.csv.gz contains #{line_count} lines"
- ~~~
-
- Since IOStreams can autodetect file types based on the file extension, the reader can figure out which stream
- to start with:
-
- ~~~ruby
- line_count = 0
- IOStreams.path("hello.csv.gz").reader do |input|
-   IOStreams::Line::Reader.open(input) do |lines|
-     lines.each { line_count += 1 }
-   end
- end
- puts "hello.csv.gz contains #{line_count} lines"
- ~~~
-
- Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-
- ~~~ruby
- line_count = 0
- IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-   lines.each { line_count += 1 }
- end
- puts "hello.csv.gz contains #{line_count} lines"
- ~~~
-
- It can be simplified even further using `#each`:
-
- ~~~ruby
- line_count = 0
- IOStreams.path("hello.csv.gz").each { line_count += 1 }
- puts "hello.csv.gz contains #{line_count} lines"
- ~~~
-
- The benefit in all of the above cases is that the file can be of any arbitrary size and only one block of the file
- is held in memory at any time.
-
- #### Chaining
-
- In the above example only 2 streams were used. Streams can be nested as deeply as necessary to process data.
-
- Example: search for all occurrences of the word "apple", cleansing the input data stream of non-printable
- characters and converting it to valid US-ASCII:
-
- ~~~ruby
- apple_count = 0
- IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-   IOStreams::Encode::Reader.open(input,
-                                  encoding: "US-ASCII",
-                                  encode_replace: "",
-                                  encode_cleaner: :printable) do |cleansed|
-     IOStreams::Line::Reader.open(cleansed) do |lines|
-       lines.each { |line| apple_count += line.scan("apple").count }
-     end
-   end
- end
-
- puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
- ~~~
-
- Let IOStreams perform the above stream chaining automatically under the covers:
-
- ~~~ruby
- apple_count = 0
- IOStreams.path("hello.csv.gz").
-   option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-   each do |line|
-     apple_count += line.scan("apple").count
-   end
-
- puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
- ~~~
-
- ## Notes
-
- * Due to the nature of Zip, both its Reader and Writer methods will create
-   a temp file when reading from or writing to a stream.
-   It is recommended to use GZip over Zip, since GZip can be streamed without requiring temp files.
- * Zip becomes exponentially slower with very large files, especially files
-   that exceed 4GB when uncompressed. GZip is highly recommended for large files.
+ Next, check out the remaining [IOStreams documentation](https://iostreams.rocketjob.io/)

  ## Versioning