iostreams 1.1.1 → 1.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +7 -426
- data/lib/io_streams/builder.rb +9 -6
- data/lib/io_streams/deprecated.rb +1 -1
- data/lib/io_streams/encode/reader.rb +1 -1
- data/lib/io_streams/encode/writer.rb +1 -1
- data/lib/io_streams/errors.rb +10 -0
- data/lib/io_streams/io_streams.rb +1 -1
- data/lib/io_streams/line/reader.rb +2 -2
- data/lib/io_streams/paths/file.rb +4 -4
- data/lib/io_streams/paths/http.rb +7 -3
- data/lib/io_streams/paths/s3.rb +1 -1
- data/lib/io_streams/paths/sftp.rb +16 -4
- data/lib/io_streams/pgp.rb +81 -138
- data/lib/io_streams/pgp/writer.rb +30 -5
- data/lib/io_streams/record/reader.rb +33 -8
- data/lib/io_streams/record/writer.rb +33 -8
- data/lib/io_streams/row/reader.rb +4 -4
- data/lib/io_streams/row/writer.rb +4 -4
- data/lib/io_streams/stream.rb +21 -10
- data/lib/io_streams/tabular.rb +34 -3
- data/lib/io_streams/tabular/header.rb +14 -12
- data/lib/io_streams/tabular/parser/fixed.rb +135 -25
- data/lib/io_streams/utils.rb +5 -4
- data/lib/io_streams/version.rb +1 -1
- data/lib/io_streams/zip/reader.rb +1 -1
- data/test/io_streams_test.rb +49 -0
- data/test/paths/file_test.rb +1 -1
- data/test/paths/s3_test.rb +2 -2
- data/test/paths/sftp_test.rb +4 -4
- data/test/pgp_test.rb +54 -4
- data/test/pgp_writer_test.rb +26 -0
- data/test/tabular_test.rb +55 -26
- data/test/test_helper.rb +5 -1
- metadata +2 -3
- data/lib/io_streams/utils/reliable_http.rb +0 -98
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 5b743c145f120e461d09eed5b9c7476bacbdaa18453200e1ccc57ec9eef2239c
+data.tar.gz: 7c54a49f344e454ef64a1e19e47f2c665b46e675eda34c03e1604668a6543de4
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 27d4a3875009d1902285a9e01b67032e74e1eac6d295de3400096b01c663de8b14cb594e4ffd8f04154d23cf7cfd4a696dbc33b9651fd200808ff087c00c0dc2
+data.tar.gz: 92656ad55f37e5ffb3b4588cc93d6d9c988ce31705f888d4e045d86ffcf5047f4185a8de30bc6c262a48ae7f048433edd00f827d155939db7caa858e5e495bf0
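The SHA256 and SHA512 entries in checksums.yaml are hex digests of the gem's `metadata.gz` and `data.tar.gz` members, so they can be recomputed and compared. The following is a minimal sketch, not part of iostreams or RubyGems: the helper name `checksum_mismatches` is hypothetical, and it assumes the two members have already been extracted from the `.gem` archive and read into memory as strings.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: returns [algorithm, file_name] pairs whose recomputed
# digest differs from the value recorded in the checksums.yaml text.
# `files` maps a member name (e.g. "data.tar.gz") to its raw bytes.
def checksum_mismatches(checksums_yaml, files)
  digests = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  mismatches = []
  YAML.safe_load(checksums_yaml).each do |algorithm, entries|
    digest_class = digests.fetch(algorithm)
    entries.each do |file_name, expected|
      actual = digest_class.hexdigest(files.fetch(file_name))
      mismatches << [algorithm, file_name] unless actual == expected
    end
  end
  mismatches
end
```

An empty result means every recorded digest matched; any pair returned names the algorithm and member that differ.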
data/README.md
CHANGED
@@ -1,437 +1,18 @@
-#
+# IOStreams
 [](https://rubygems.org/gems/iostreams) [](https://travis-ci.org/rocketjob/iostreams) [](https://rubygems.org/gems/iostreams) [](http://opensource.org/licenses/Apache-2.0)  [-Support-brightgreen.svg)](https://gitter.im/rocketjob/support)

-
+IOStreams is an incredibly powerful streaming library that makes changes to file formats, compression, encryption,
+or storage mechanism transparent to the application.

 ## Project Status

-Production Ready,
+Production Ready, heavily used in production environments, many as part of Rocket Job.

-##
+## Documentation

-
+Start with the [IOStreams tutorial](https://iostreams.rocketjob.io/tutorial) to get a great introduction to IOStreams.

-
-* Gzip
-* BZip2
-* PGP (Requires GnuPG)
-* Xlsx (Reading)
-* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
-Supported sources and/or targets:
-
-* File
-* HTTP (Read only)
-* AWS S3
-* SFTP
-
-Supported file formats:
-
-* CSV
-* Fixed width formats
-* JSON
-* PSV
-
-## Quick examples
-
-Read an entire file into memory:
-
-```ruby
-IOStreams.path('example.txt').read
-```
-
-Decompress an entire gzip file into memory:
-
-```ruby
-IOStreams.path('example.gz').read
-```
-
-Read and decompress the first file in a zip file into memory:
-
-```ruby
-IOStreams.path('example.zip').read
-```
-
-Read a file one line at a time
-
-```ruby
-IOStreams.path('example.txt').each do |line|
-puts line
-end
-```
-
-Read a CSV file one line at a time, returning each line as an array:
-
-```ruby
-IOStreams.path('example.csv').each(:array) do |array|
-p array
-end
-```
-
-Read a CSV file a record at a time, returning each line as a hash.
-The first line of the file is assumed to be the header line:
-
-```ruby
-IOStreams.path('example.csv').each(:hash) do |hash|
-p hash
-end
-```
-
-Read a file using an http get,
-decompressing the named file in the zip file,
-returning each records from the named file as a hash:
-
-```ruby
-IOStreams.
-path("https://www5.fdic.gov/idasp/Offices2.zip").
-option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-reader(:hash) do |stream|
-p stream.read
-end
-```
-
-Read the file without unzipping and streaming the first file in the zip:
-
-```ruby
-IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-```
-
-
-## Introduction
-
-If all files were small, they could just be loaded into memory in their entirety. With the
-advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
-them into memory is not feasible.
-
-In linux it is common to use pipes to stream data between processes.
-For example:
-
-```
-# Count the number of lines in a file that has been compressed with gzip
-cat abc.gz | gunzip -c | wc -l
-```
-
-For large files it is critical to be able to read and write these files as streams. Ruby has support
-for reading and writing files using streams, but has no built-in way of passing one stream through
-another to support for example compressing the data, encrypting it and then finally writing the result
-to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
-together several streams, `iostreams` attempts to offer similar features for Ruby.
-
-```ruby
-# Read a compressed file:
-IOStreams.path("hello.gz").reader do |reader|
-data = reader.read(1024)
-puts "Read: #{data}"
-end
-```
-
-The true power of streams is shown when many streams are chained together to achieve the end
-result, without holding the entire file in memory, or ideally without needing to create
-any temporary files to process the stream.
-
-```ruby
-# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
-IOStreams.path("hello.gz.enc").writer do |writer|
-writer.write("Hello World")
-writer.write("and some more")
-end
-```
-
-The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
-or even gigabytes.
-
-By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
-to the data being read or written. For example:
-* `hello.zip` => Compressed using Zip
-* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
-* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
-The objective is that all of these streaming processes are performed used streaming
-so that only the current portion of the file is loaded into memory as it moves
-through the entire file.
-Where possible each stream never goes to disk, which for example could expose
-un-encrypted data.
-
-## Examples
-
-While decompressing the file, display 128 characters at a time from the file.
-
-~~~ruby
-require "iostreams"
-IOStreams.path("abc.csv").reader do |io|
-while (data = io.read(128))
-p data
-end
-end
-~~~
-
-While decompressing the file, display one line at a time from the file.
-
-~~~ruby
-IOStreams.path("abc.csv").each do |line|
-puts line
-end
-~~~
-
-While decompressing the file, display each row from the csv file as an array.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:array) do |array|
-p array
-end
-~~~
-
-While decompressing the file, display each record from the csv file as a hash.
-The first line is assumed to be the header row.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:hash) do |hash|
-p hash
-end
-~~~
-
-Write data while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer do |io|
-io.write("This")
-io.write(" is ")
-io.write(" one line\n")
-end
-~~~
-
-Write a line at a time while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:line) do |file|
-file << "these"
-file << "are"
-file << "all"
-file << "separate"
-file << "lines"
-end
-~~~
-
-Write an array (row) at a time while compressing the file.
-Each array is converted to csv before being compressed with zip.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:array) do |io|
-io << %w[name address zip_code]
-io << %w[Jack There 1234]
-io << ["Joe", "Over There somewhere", 1234]
-end
-~~~
-
-Write a hash (record) at a time while compressing the file.
-Each hash is converted to csv before being compressed with zip.
-The header row is extracted from the first hash supplied.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:hash) do |stream|
-stream << {name: "Jack", address: "There", zip_code: 1234}
-stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-~~~
-
-Write to a string IO for testing, supplying the filename so that the streams can be determined.
-
-~~~ruby
-io = StringIO.new
-IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-stream << {name: "Jack", address: "There", zip_code: 1234}
-stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-puts io.string
-~~~
-
-Read a CSV file and write the output to an encrypted file in JSON format.
-
-~~~ruby
-IOStreams.path("sample.json.enc").writer(:hash) do |output|
-IOStreams.path("sample.csv").each(:hash) do |record|
-output << record
-end
-end
-~~~
-
-## Copying between files
-
-Stream based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
-Encrypt the contents of the file `sample.json` and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with pgp and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
-~~~
-
-Decrypt the file `abc.csv.enc` and write it to `xyz.csv`.
-
-~~~ruby
-input = IOStreams.path("abc.csv.enc")
-IOStreams.path("xyz.csv").copy_from(input)
-~~~
-
-Decrypt file `ABC` that was encrypted with Symmetric Encryption,
-PGP encrypt the output file and write it to `xyz.csv.pgp` using the pgp key that was imported for `a@a.com`.
-
-~~~ruby
-input = IOStreams.path("ABC").stream(:enc)
-IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
-~~~
-
-To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
-~~~ruby
-input = IOStreams.path("sample.json.zip")
-IOStreams.path("sample.copy").copy_from(input, convert: false)
-~~~
-
-## Philosopy
-
-IOStreams can be used to work against a single stream. it's real capability becomes apparent when chaining together
-multiple streams to process data, without loading entire files into memory.
-
-#### Linux Pipes
-
-Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
-Example: count the number of lines in a compressed file:
-
-gunzip -c hello.csv.gz | wc -l
-
-The file `hello.csv.gz` is uncompressed and returned to standard output, which in turn is piped into the standard
-input for `wc -l`, which counts the number of lines in the uncompressed data.
-
-As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
-can start counting lines of uncompressed data, without waiting until the entire file is decompressed.
-The uncompressed contents of the file are not written to disk before passing to `wc -l` and the file is not loaded
-into memory before passing to `wc -l`.
-
-In this way extremely large files can be processed with very little memory being used.
-
-#### Push Model
-
-In the Linux pipes example above this would be considered a "push model" where each task in the list pushes
-its output to the input of the next task.
-
-A major challenge or disadvantage with the push model is that buffering would need to occur between tasks since
-each task could complete at very different speeds. To prevent large memory usage the standard output from a previous
-task would have to be blocked to try and make it slow down.
-
-#### Pull Model
-
-Another approach with multiple tasks that need to process a single stream, is to move to a "pull model" where the
-task at the end of the list pulls a block from a previous task when it is ready to process it.
-
-#### IOStreams
-
-IOStreams uses the pull model when reading data, where each stream performs a read against the previous stream
-when it is ready for more data.
-
-When writing to an output stream, IOStreams uses the push model, where each block of data that is ready to be written
-is pushed to the task/stream in the list. The write push only returns once it has traversed all the way down to
-the final task / stream in the list, this avoids complex buffering issues between each task / stream in the list.
-
-Example: Implementing in Ruby: `gunzip -c hello.csv.gz | wc -l`
-
-~~~ruby
-line_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-IOStreams::Line::Reader.open(input) do |lines|
-lines.each { line_count += 1}
-end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure which stream
-to start with:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader do |input|
-IOStreams::Line::Reader.open(input) do |lines|
-lines.each { line_count += 1}
-end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-lines.each { line_count += 1}
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-It can be simplified even further using `#each`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").each { line_count += 1}
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-The benefit in all of the above cases is that the file can be any arbitrary size and only one block of the file
-is held in memory at any time.
-
-#### Chaining
-
-In the above example only 2 streams were used. Streams can be nested as deep as necessary to process data.
-
-Example, search for all occurrences of the word apple, cleansing the input data stream of non printable characters
-and converting to valid US ASCII.
-
-~~~ruby
-apple_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-IOStreams::Encode::Reader.open(input,
-encoding: "US-ASCII",
-encode_replace: "",
-encode_cleaner: :printable) do |cleansed|
-IOStreams::Line::Reader.open(cleansed) do |lines|
-lines.each { |line| apple_count += line.scan("apple").count}
-end
-end
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-Let IOStreams perform the above stream chaining automatically under the covers:
-
-~~~ruby
-apple_count = 0
-IOStreams.path("hello.csv.gz").
-option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-each do |line|
-apple_count += line.scan("apple").count
-end
-
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-## Notes
-
-* Due to the nature of Zip, both its Reader and Writer methods will create
-a temp file when reading from or writing to a stream.
-Recommended to use Gzip over Zip since it can be streamed without requiring temp files.
-* Zip becomes exponentially slower with very large files, especially files
-that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
+Next, checkout the remaining [IOStreams documentation](https://iostreams.rocketjob.io/)

 ## Versioning
