iostreams 0.16.1 → 0.16.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 71dc5d69019340a1bcaf15bb7b79c46d136caee788c78ababcb1adbec55ccaae
-  data.tar.gz: 215a9d22d27fa102529a54b1886d54caad022289cb7bc62f0311e799fd2ead2a
+  metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+  data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
 SHA512:
-  metadata.gz: 534218bfa084bcb43af55192210e08de829044ee316aa4043c9f06b0e97a6b484b4eb666821327f4fb79877383d10819f58570d7194420a00682a6e3e860d437
-  data.tar.gz: 0e8445aabb3cc8ae7a7f36859c12a724e8b4db966e97d45589ddd3ed1cc9fd30e9c2d2c29c321ee35ca2c4487f72004242ee2aa7d8425ac2061a792434e9fc66
+  metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+  data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
data/README.md CHANGED
@@ -236,6 +236,129 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
                 target_options: [pgp: {email_recipient: 'a@a.com'})
 ~~~
 
+## Philosophy
+
+IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining
+multiple streams together to process data without loading entire files into memory.
+
+#### Linux Pipes
+
+Linux has built-in support for streaming, using the `|` (pipe) operator to send the output of one process to another.
+
+Example: count the number of lines in a compressed file:
+
+    gunzip -c hello.csv.gz | wc -l
+
+The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+input of `wc -l`, which counts the number of lines in the uncompressed data.
+
+As each block of data is returned by `gunzip` it is immediately passed to `wc`, which can
+start counting lines of uncompressed data without waiting for the entire file to be decompressed.
+The uncompressed contents of the file are neither written to disk nor loaded into memory before being passed to `wc -l`.
+
+In this way extremely large files can be processed using very little memory.
+
+#### Push Model
+
+The Linux pipes example above is a "push model": each task in the chain pushes
+its output to the input of the next task.
+
+A major disadvantage of the push model is that buffering is needed between tasks, since
+each task can complete at a very different speed. To prevent excessive memory usage, the standard output of the
+previous task has to be blocked to slow it down.
+
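The buffering problem above can be sketched in plain Ruby (this sketch is not part of IOStreams): a `SizedQueue` stands in for the pipe buffer between two tasks, and the producer blocks whenever the consumer falls behind.

```ruby
require 'thread'

# Push model: the producer pushes blocks into a small bounded buffer.
# When the buffer is full, `queue <<` blocks, throttling the producer --
# the same back-pressure a Linux pipe applies to a fast upstream process.
queue = SizedQueue.new(2)

producer = Thread.new do
  5.times { |i| queue << "block #{i}" }
  queue << :eof
end

count = 0
count += 1 while queue.pop != :eof # the consumer drains at its own pace
producer.join

puts count # => 5
```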
+#### Pull Model
+
+Another approach, when multiple tasks need to process a single stream, is the "pull model": the
+task at the end of the chain pulls a block from the previous task whenever it is ready to process it.
+
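The pull model can be sketched with Ruby's lazy enumerators (again, plain Ruby rather than IOStreams itself): nothing upstream runs until the final stage asks for a value.

```ruby
# Pull model: the upstream stage produces a block only when pulled.
produced = 0
upstream = Enumerator.new do |y|
  5.times do |i|
    produced += 1
    y << "block #{i}"
  end
end

# Each stage pulls from the previous one on demand.
pipeline = upstream.lazy.map(&:upcase)

first_two = pipeline.first(2)
puts first_two.inspect # => ["BLOCK 0", "BLOCK 1"]
puts produced          # => 2 -- only two blocks were ever produced
```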
+#### IOStreams
+
+IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+when it is ready for more data.
+
+When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+is pushed to the next stream in the chain. The push only returns once it has traversed all the way down to
+the final stream in the chain, which avoids complex buffering between the streams.
+
+Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+~~~ruby
+line_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+to start with:
+~~~ruby
+line_count = 0
+IOStreams.reader("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since we know we want a line reader, this can be simplified using `IOStreams.line_reader`:
+~~~ruby
+line_count = 0
+IOStreams.line_reader("hello.csv.gz") do |lines|
+  lines.each { line_count += 1 }
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+It can be simplified even further using `IOStreams.each_line`:
+~~~ruby
+line_count = 0
+IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+The benefit in all of the above cases is that the file can be of any arbitrary size, yet only one block of the file
+is held in memory at any time.
+
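The same streaming round trip can be tried with nothing but Ruby's standard library, for comparison with the IOStreams API above (a stdlib-only sketch, not the IOStreams API):

```ruby
require 'zlib'
require 'stringio'

# Write lines through a gzip stream one at a time (push model):
# each `puts` is handed straight to the compressor; the full
# uncompressed content is never accumulated first.
buffer = StringIO.new
Zlib::GzipWriter.wrap(buffer) do |gz|
  3.times { |i| gz.puts "line #{i}" }
end

# Read it back line by line (pull model): each `each_line` step
# pulls just enough compressed bytes to yield the next line.
line_count = 0
Zlib::GzipReader.wrap(StringIO.new(buffer.string)) do |gz|
  gz.each_line { line_count += 1 }
end

puts "counted #{line_count} lines" # => counted 3 lines
```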
+#### Chaining
+
+In the above examples only two streams were used. Streams can be nested as deeply as necessary to process data.
+
+Example: search for all occurrences of the word "apple", cleansing the input data stream of non-printable characters
+and converting it to valid US-ASCII.
+
+~~~ruby
+apple_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Encode::Reader.open(input,
+                                 encoding:       'US-ASCII',
+                                 encode_replace: '',
+                                 encode_cleaner: :printable) do |cleansed|
+    IOStreams::Line::Reader.open(cleansed) do |lines|
+      lines.each { |line| apple_count += line.scan('apple').count }
+    end
+  end
+end
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
+Or let IOStreams perform the above stream chaining automatically under the covers:
+~~~ruby
+apple_count = 0
+IOStreams.each_line("hello.csv.gz",
+                    encoding:       'US-ASCII',
+                    encode_replace: '',
+                    encode_cleaner: :printable) do |line|
+  apple_count += line.scan('apple').count
+end
+
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
 ## Notes
 
 * Due to the nature of Zip, both its Reader and Writer methods will create
@@ -1,3 +1,4 @@
+require 'csv'
 module IOStreams
   class Tabular
     module Parser
@@ -5,7 +6,7 @@ module IOStreams
         attr_reader :csv_parser
 
         def initialize
-          @csv_parser = Utility::CSVRow.new
+          @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
         end
 
         # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams
 
           raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)
 
-          csv_parser.parse(row)
+          parse_line(row)
         end
 
         # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams
 
           raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)
 
-          csv_parser.parse(row)
+          parse_line(row)
         end
 
         # Return the supplied array as a single line CSV string.
         def render(row, header)
           array = header.to_array(row)
-          csv_parser.to_csv(array)
+          render_array(array)
         end
 
+        private
+
+        if RUBY_VERSION.to_f >= 2.6
+          # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+          # but at least it works on Ruby 2.6 and above.
+          def parse_line(line)
+            return if IOStreams.blank?(line)
+
+            CSV.parse_line(line)
+          end
+
+          def render_array(array)
+            CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+          end
+        else
+          def parse_line(line)
+            csv_parser.parse(line)
+          end
+
+          def render_array(array)
+            csv_parser.to_csv(array)
+          end
+        end
       end
     end
   end
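For reference, the stdlib `CSV` calls that the new Ruby 2.6+ code path relies on behave as follows (a quick sketch, runnable on any Ruby with the `csv` library):

```ruby
require 'csv'

# Parse a single CSV line into an array of fields.
row = CSV.parse_line('a,"b,c",d')
puts row.inspect # => ["a", "b,c", "d"]

# Render an array back to a single-line CSV string; row_sep: ''
# suppresses the trailing newline that generate_line adds by default.
line = CSV.generate_line(["a", "b,c", "d"], row_sep: '')
puts line # => a,"b,c",d
```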
@@ -1,3 +1,3 @@
 module IOStreams
-  VERSION = '0.16.1'
+  VERSION = '0.16.2'
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.16.1
+  version: 0.16.2
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-11-26 00:00:00.000000000 Z
+date: 2019-02-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-rubyforge_project:
-rubygems_version: 2.7.7
+rubygems_version: 3.0.2
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.