iostreams 0.16.1 → 0.16.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +123 -0
- data/lib/io_streams/tabular/parser/csv.rb +28 -4
- data/lib/io_streams/version.rb +1 -1
- metadata +3 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+  data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+  data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
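The checksums above can be verified against the published archives. A minimal sketch using only Ruby's stdlib; the directory layout and the `checksums_match?` helper name are assumptions about how the unpacked `.gem` looks, not part of the gem itself:

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: given a directory containing the unpacked .gem
# contents (metadata.gz, data.tar.gz) alongside its checksums.yaml,
# confirm that every SHA256 entry matches the file on disk.
def checksums_match?(dir)
  sums = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  sums.fetch('SHA256').all? do |name, expected|
    Digest::SHA256.file(File.join(dir, name)).hexdigest == expected
  end
end
```

Checking the SHA512 section as well only requires swapping in `Digest::SHA512` over the other key.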
data/README.md
CHANGED
@@ -236,6 +236,129 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
                target_options: [pgp: {email_recipient: 'a@a.com'}])
 ~~~
 
+## Philosophy
+
+IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
+multiple streams to process data, without loading entire files into memory.
+
+#### Linux Pipes
+
+Linux has built-in support for streaming using the pipe operator (`|`) to send the output of one process to another.
+
+Example: count the number of lines in a compressed file:
+
+    gunzip -c hello.csv.gz | wc -l
+
+The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+input of `wc -l`, which counts the number of lines in the uncompressed data.
+
+As each block of data is returned from `gunzip` it is immediately passed into `wc`, so that `wc`
+can start counting lines of uncompressed data without waiting until the entire file is decompressed.
+The uncompressed contents of the file are neither written to disk nor loaded
+into memory before being passed to `wc -l`.
+
+In this way extremely large files can be processed using very little memory.
+
+#### Push Model
+
+The Linux pipes example above is a "push model": each task in the chain pushes
+its output to the input of the next task.
+
+A major disadvantage of the push model is that buffering must occur between tasks, since
+each task can complete at a very different speed. To prevent large memory usage, the standard output of a faster
+upstream task has to be blocked to slow it down.
+
+#### Pull Model
+
+Another approach when multiple tasks need to process a single stream is a "pull model", where the
+task at the end of the chain pulls a block from the previous task when it is ready to process it.
+
+#### IOStreams
+
+IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+when it is ready for more data.
+
+When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+is pushed to the next stream in the chain. The write push only returns once it has traversed all the way down to
+the final stream in the chain, which avoids complex buffering issues between streams.
+
+Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+~~~ruby
+line_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+to start with:
+
+~~~ruby
+line_count = 0
+IOStreams.reader("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since we know we want a line reader, it can be simplified using `IOStreams.line_reader`:
+
+~~~ruby
+line_count = 0
+IOStreams.line_reader("hello.csv.gz") do |lines|
+  lines.each { line_count += 1 }
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+It can be simplified even further using `IOStreams.each_line`:
+
+~~~ruby
+line_count = 0
+IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+The benefit in all of the above cases is that the file can be of any arbitrary size, yet only one block of the file
+is held in memory at any time.
+
+#### Chaining
+
+In the above example only 2 streams were used. Streams can be nested as deeply as necessary to process data.
+
+Example: search for all occurrences of the word apple, cleansing the input data stream of non-printable characters
+and converting it to valid US-ASCII.
+
+~~~ruby
+apple_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Encode::Reader.open(input,
+                                 encoding:       'US-ASCII',
+                                 encode_replace: '',
+                                 encode_cleaner: :printable) do |cleansed|
+    IOStreams::Line::Reader.open(cleansed) do |lines|
+      lines.each { |line| apple_count += line.scan('apple').count }
+    end
+  end
+end
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
+Let IOStreams perform the above stream chaining automatically under the covers:
+
+~~~ruby
+apple_count = 0
+IOStreams.each_line("hello.csv.gz",
+                    encoding:       'US-ASCII',
+                    encode_replace: '',
+                    encode_cleaner: :printable) do |line|
+  apple_count += line.scan('apple').count
+end
+
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
 ## Notes
 
 * Due to the nature of Zip, both its Reader and Writer methods will create
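The pull model described in the README additions above can be illustrated with plain stdlib classes. This sketch uses `Zlib::GzipReader` in place of the IOStreams wrappers, so it shows the idea, not the gem's API:

```ruby
require 'zlib'
require 'stringio'

# Build a small gzip "file" in memory so the sketch is self-contained.
compressed = StringIO.new(Zlib.gzip("a,b\n1,2\n3,4\n"))

line_count = 0
# GzipReader pulls a block from the underlying IO only when each_line
# needs more data, so at most one block is held in memory at a time.
Zlib::GzipReader.wrap(compressed) do |gz|
  gz.each_line { line_count += 1 }
end
puts "contains #{line_count} lines" # => contains 3 lines
```

The consumer at the end of the chain drives all the reads, which is exactly why no intermediate buffering is required.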
data/lib/io_streams/tabular/parser/csv.rb
CHANGED
@@ -1,3 +1,4 @@
+require 'csv'
 module IOStreams
   class Tabular
     module Parser
@@ -5,7 +6,7 @@ module IOStreams
         attr_reader :csv_parser
 
         def initialize
-          @csv_parser = Utility::CSVRow.new
+          @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
         end
 
         # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams
 
           raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams
 
           raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Return the supplied array as a single line CSV string.
         def render(row, header)
           array = header.to_array(row)
-
+          render_array(array)
         end
 
+        private
+
+        if RUBY_VERSION.to_f >= 2.6
+          # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+          # but at least it works on Ruby 2.6 and above.
+          def parse_line(line)
+            return if IOStreams.blank?(line)
+
+            CSV.parse_line(line)
+          end
+
+          def render_array(array)
+            CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+          end
+        else
+          def parse_line(line)
+            csv_parser.parse(line)
+          end
+
+          def render_array(array)
+            csv_parser.to_csv(array)
+          end
+        end
       end
     end
   end
data/lib/io_streams/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.16.1
+  version: 0.16.2
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-02-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-
-rubygems_version: 2.7.7
+rubygems_version: 3.0.2
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.