iostreams 0.16.1 → 0.16.2
- checksums.yaml +4 -4
- data/README.md +123 -0
- data/lib/io_streams/tabular/parser/csv.rb +28 -4
- data/lib/io_streams/version.rb +1 -1
- metadata +3 -4
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+  data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+  data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
data/README.md CHANGED

@@ -236,6 +236,129 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
   target_options: [pgp: {email_recipient: 'a@a.com'})
 ~~~
 
+## Philosophy
+
+IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
+multiple streams to process data, without loading entire files into memory.
+
+#### Linux Pipes
+
+Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
+
+Example: count the number of lines in a compressed file:
+
+    gunzip -c hello.csv.gz | wc -l
+
+The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+input for `wc -l`, which counts the number of lines in the uncompressed data.
+
+As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
+can start counting lines of uncompressed data without waiting for the entire file to be decompressed.
+The uncompressed contents of the file are neither written to disk nor loaded
+into memory before being passed to `wc -l`.
+
+In this way extremely large files can be processed while using very little memory.
+
+#### Push Model
+
+The Linux pipes example above is a "push model": each task in the chain pushes
+its output to the input of the next task.
+
+A major disadvantage of the push model is that buffering must occur between tasks, since
+each task can complete at very different speeds. To prevent large memory usage, the standard output of a faster
+task has to be blocked to force it to slow down.
+
+#### Pull Model
+
+Another approach, when multiple tasks need to process a single stream, is a "pull model": the
+task at the end of the chain pulls a block from the previous task when it is ready to process it.
+
+#### IOStreams
+
+IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+when it is ready for more data.
+
+When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+is pushed to the next task/stream in the chain. The write push only returns once it has traversed all the way down to
+the final task/stream in the chain, which avoids complex buffering issues between each task/stream.
+
+Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+~~~ruby
+line_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+to start with:
+
+~~~ruby
+line_count = 0
+IOStreams.reader("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since we know we want a line reader, this can be simplified using `IOStreams.line_reader`:
+
+~~~ruby
+line_count = 0
+IOStreams.line_reader("hello.csv.gz") do |lines|
+  lines.each { line_count += 1 }
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+It can be simplified even further using `IOStreams.each_line`:
+
+~~~ruby
+line_count = 0
+IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+The benefit in all of the above cases is that the file can be of any arbitrary size, yet only one block of the file
+is held in memory at any time.
+
+#### Chaining
+
+In the above examples only 2 streams were used. Streams can be nested as deeply as necessary to process data.
+
+Example: search for all occurrences of the word "apple", cleansing the input data stream of non-printable characters
+and converting it to valid US-ASCII:
+
+~~~ruby
+apple_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Encode::Reader.open(input,
+                                 encoding:       'US-ASCII',
+                                 encode_replace: '',
+                                 encode_cleaner: :printable) do |cleansed|
+    IOStreams::Line::Reader.open(cleansed) do |lines|
+      lines.each { |line| apple_count += line.scan('apple').count }
+    end
+  end
+end
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
+Let IOStreams perform the above stream chaining automatically under the covers:
+
+~~~ruby
+apple_count = 0
+IOStreams.each_line("hello.csv.gz",
+                    encoding:       'US-ASCII',
+                    encode_replace: '',
+                    encode_cleaner: :printable) do |line|
+  apple_count += line.scan('apple').count
+end
+
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
 ## Notes
 
 * Due to the nature of Zip, both its Reader and Writer methods will create
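The pull model added to the README above can be sketched in plain Ruby with an `Enumerator` over the standard library's `Zlib::GzipReader` — a hypothetical illustration of the concept, not code from the gem:

```ruby
require "zlib"
require "stringio"

# Build a small gzip-compressed payload in memory for the demo.
raw        = "line one\nline two\nline three\n"
compressed = StringIO.new
gz         = Zlib::GzipWriter.new(compressed)
gz.write(raw)
gz.close

# Stage 1: decompress lazily; a line is produced only when requested.
reader = Zlib::GzipReader.new(StringIO.new(compressed.string))
lines  = Enumerator.new { |y| reader.each_line { |line| y << line } }

# Stage 2 (the consumer) pulls one line at a time, so the entire
# uncompressed payload is never held in memory at once.
line_count = lines.count
puts "payload contains #{line_count} lines"
```

Here the consumer drives the pipeline: the `Enumerator` block only runs as values are requested, which mirrors how each IOStreams reader pulls from the stream before it.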
data/lib/io_streams/tabular/parser/csv.rb CHANGED

@@ -1,3 +1,4 @@
+require 'csv'
 module IOStreams
   class Tabular
     module Parser
@@ -5,7 +6,7 @@ module IOStreams
         attr_reader :csv_parser
 
         def initialize
-          @csv_parser = Utility::CSVRow.new
+          @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
         end
 
         # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams
 
           raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams
 
           raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Return the supplied array as a single line CSV string.
         def render(row, header)
           array = header.to_array(row)
-
+          render_array(array)
         end
 
+        private
+
+        if RUBY_VERSION.to_f >= 2.6
+          # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+          # but at least it works on Ruby 2.6 and above.
+          def parse_line(line)
+            return if IOStreams.blank?(line)
+
+            CSV.parse_line(line)
+          end
+
+          def render_array(array)
+            CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+          end
+        else
+          def parse_line(line)
+            csv_parser.parse(line)
+          end
+
+          def render_array(array)
+            csv_parser.to_csv(array)
+          end
+        end
       end
     end
   end
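The Ruby 2.6+ branch above falls back to the standard library's `CSV` class. A quick sketch of what those two stdlib calls do, independent of the gem:

```ruby
require "csv"

# CSV.parse_line parses a single CSV row into an array of fields,
# honoring quoting of embedded commas.
fields = CSV.parse_line('a,"b,c",d')

# CSV.generate_line with row_sep: '' renders one row without a trailing
# newline, matching render_array in the Ruby 2.6+ branch.
line = CSV.generate_line(fields, encoding: 'UTF-8', row_sep: '')

puts fields.inspect  # ["a", "b,c", "d"]
puts line            # a,"b,c",d
```

Only the field containing a comma is re-quoted on output, so a parse/render round trip preserves the original row.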
data/lib/io_streams/version.rb
CHANGED
metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.16.1
+  version: 0.16.2
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-02-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-
-rubygems_version: 2.7.7
+rubygems_version: 3.0.2
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.