iostreams 0.16.1 → 0.16.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 71dc5d69019340a1bcaf15bb7b79c46d136caee788c78ababcb1adbec55ccaae
- data.tar.gz: 215a9d22d27fa102529a54b1886d54caad022289cb7bc62f0311e799fd2ead2a
+ metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+ data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
  SHA512:
- metadata.gz: 534218bfa084bcb43af55192210e08de829044ee316aa4043c9f06b0e97a6b484b4eb666821327f4fb79877383d10819f58570d7194420a00682a6e3e860d437
- data.tar.gz: 0e8445aabb3cc8ae7a7f36859c12a724e8b4db966e97d45589ddd3ed1cc9fd30e9c2d2c29c321ee35ca2c4487f72004242ee2aa7d8425ac2061a792434e9fc66
+ metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+ data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
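The checksum section above records SHA-256 and SHA-512 digests for both archives in the gem. A downloaded artifact can be checked against these values with Ruby's stdlib `Digest`; a minimal sketch (the file path is illustrative, not part of this diff):

```ruby
require 'digest'

# Recompute the digests a registry publishes for a release artifact,
# so they can be compared against the values in checksums.yaml.
def checksums(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end
```

Comparing the returned hex strings against the `+` lines above confirms a download matches the published 0.16.2 release.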
data/README.md CHANGED
@@ -236,6 +236,130 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
  target_options: [pgp: {email_recipient: 'a@a.com'}]
  ~~~

+ ## Philosophy
+
+ IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
+ multiple streams to process data, without loading entire files into memory.
+
+ #### Linux Pipes
+
+ Linux has built-in support for streaming using the `|` (pipe) operator to send the output from one process to another.
+
+ Example: count the number of lines in a compressed file:
+
+     gunzip -c hello.csv.gz | wc -l
+
+ The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+ input of `wc -l`, which counts the number of lines in the uncompressed data.
+
+ As each block of data is returned by `gunzip` it is immediately passed into `wc`, which
+ starts counting lines of uncompressed data without waiting for the entire file to be decompressed.
+ The uncompressed contents of the file are neither written to disk nor loaded entirely
+ into memory before being passed to `wc -l`.
+
+ In this way extremely large files can be processed using very little memory.
+
+ #### Push Model
+
+ The Linux pipes example above is a "push model": each task in the list pushes
+ its output to the input of the next task.
+
+ A major disadvantage of the push model is that buffering must occur between tasks, since
+ each task can complete at a very different speed. To prevent excessive memory usage, the standard output of a faster
+ upstream task has to be blocked to slow it down.
+
+ #### Pull Model
+
+ Another approach when multiple tasks need to process a single stream is a "pull model", where the
+ task at the end of the list pulls a block from the previous task when it is ready to process it.
+
+ #### IOStreams
+
+ IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+ when it is ready for more data.
+
+ When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+ is pushed to the next task/stream in the list. The write push only returns once it has traversed all the way down to
+ the final task/stream in the list, which avoids complex buffering issues between the tasks/streams.
+
+ Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+ ~~~ruby
+ line_count = 0
+ IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+   IOStreams::Line::Reader.open(input) do |lines|
+     lines.each { line_count += 1 }
+   end
+ end
+ puts "hello.csv.gz contains #{line_count} lines"
+ ~~~
+
+ Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+ to start with:
+ ~~~ruby
+ line_count = 0
+ IOStreams.reader("hello.csv.gz") do |input|
+   IOStreams::Line::Reader.open(input) do |lines|
+     lines.each { line_count += 1 }
+   end
+ end
+ puts "hello.csv.gz contains #{line_count} lines"
+ ~~~
+
+ Since we know we want a line reader, it can be simplified using `IOStreams.line_reader`:
+ ~~~ruby
+ line_count = 0
+ IOStreams.line_reader("hello.csv.gz") do |lines|
+   lines.each { line_count += 1 }
+ end
+ puts "hello.csv.gz contains #{line_count} lines"
+ ~~~
+
+ It can be simplified even further using `IOStreams.each_line`:
+ ~~~ruby
+ line_count = 0
+ IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+ puts "hello.csv.gz contains #{line_count} lines"
+ ~~~
+
+ The benefit in all of the above cases is that the file can be of arbitrary size, yet only one block of the file
+ is held in memory at any time.
+
+ #### Chaining
+
+ In the above examples only 2 streams were used. Streams can be nested as deeply as necessary to process data.
+
+ Example: search for all occurrences of the word "apple", cleansing the input data stream of non-printable characters
+ and converting it to valid US-ASCII.
+
+ ~~~ruby
+ apple_count = 0
+ IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+   IOStreams::Encode::Reader.open(input,
+                                  encoding:       'US-ASCII',
+                                  encode_replace: '',
+                                  encode_cleaner: :printable) do |cleansed|
+     IOStreams::Line::Reader.open(cleansed) do |lines|
+       lines.each { |line| apple_count += line.scan('apple').count }
+     end
+   end
+ end
+ puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+ ~~~
+
+ Let IOStreams perform the above stream chaining automatically under the covers:
+ ~~~ruby
+ apple_count = 0
+ IOStreams.each_line("hello.csv.gz",
+                     encoding:       'US-ASCII',
+                     encode_replace: '',
+                     encode_cleaner: :printable) do |line|
+   apple_count += line.scan('apple').count
+ end
+
+ puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+ ~~~
+
  ## Notes

  * Due to the nature of Zip, both its Reader and Writer methods will create
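The pull model the new README section describes can be sketched in plain Ruby with an enumerator: the downstream consumer drives the upstream decompressor, so only one chunk is in memory at a time. A sketch using only the stdlib; the method name `each_decompressed_chunk` is illustrative and not an IOStreams API:

```ruby
require 'zlib'
require 'stringio'

# Pull model: the consumer requests each chunk from the producer
# (the gunzip step) only when it is ready for more data.
def each_decompressed_chunk(io, chunk_size: 4096)
  return enum_for(:each_decompressed_chunk, io, chunk_size: chunk_size) unless block_given?

  gz = Zlib::GzipReader.new(io)
  begin
    while (chunk = gz.read(chunk_size))
      yield chunk
    end
  ensure
    gz.close
  end
end

# Equivalent of `gunzip -c hello.csv.gz | wc -l`, driven from the consumer side.
compressed = StringIO.new(Zlib.gzip("a\nb\nc\n"))
line_count = each_decompressed_chunk(compressed).sum { |chunk| chunk.count("\n") }
puts "#{line_count} lines"  # 3 lines
```

Because `gz.read` is only called when the consumer asks for the next chunk, no buffering or back-pressure mechanism is needed between the two stages.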
@@ -1,3 +1,4 @@
+ require 'csv'
  module IOStreams
    class Tabular
      module Parser
@@ -5,7 +6,7 @@ module IOStreams
          attr_reader :csv_parser

          def initialize
-           @csv_parser = Utility::CSVRow.new
+           @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
          end

          # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams

            raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)

-           csv_parser.parse(row)
+           parse_line(row)
          end

          # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams

            raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)

-           csv_parser.parse(row)
+           parse_line(row)
          end

          # Return the supplied array as a single line CSV string.
          def render(row, header)
            array = header.to_array(row)
-           csv_parser.to_csv(array)
+           render_array(array)
          end

+         private
+
+         if RUBY_VERSION.to_f >= 2.6
+           # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+           # but at least it works on Ruby 2.6 and above.
+           def parse_line(line)
+             return if IOStreams.blank?(line)
+
+             CSV.parse_line(line)
+           end
+
+           def render_array(array)
+             CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+           end
+         else
+           def parse_line(line)
+             csv_parser.parse(line)
+           end
+
+           def render_array(array)
+             csv_parser.to_csv(array)
+           end
+         end
        end
      end
    end
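The Ruby >= 2.6 branch above delegates parsing and rendering to the stdlib `CSV` class. Its round-trip behaviour, including the `row_sep: ''` option used to suppress the trailing newline, can be confirmed standalone (a sketch independent of IOStreams):

```ruby
require 'csv'

# Parse one CSV line into an array of fields, as the new parse_line does.
row = CSV.parse_line('name,"last, first",age')
puts row.inspect  # ["name", "last, first", "age"]

# Render the fields back to a single line; row_sep: '' drops the
# trailing newline, matching the new render_array.
line = CSV.generate_line(row, row_sep: '')
puts line  # name,"last, first",age
```

Note that `CSV.generate_line` re-quotes only the field containing a comma, so the round trip reproduces the original line.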
@@ -1,3 +1,3 @@
  module IOStreams
-   VERSION = '0.16.1'
+   VERSION = '0.16.2'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: iostreams
  version: !ruby/object:Gem::Version
-   version: 0.16.1
+   version: 0.16.2
  platform: ruby
  authors:
  - Reid Morrison
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-11-26 00:00:00.000000000 Z
+ date: 2019-02-11 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
    version: '0'
  requirements: []
- rubyforge_project:
- rubygems_version: 2.7.7
+ rubygems_version: 3.0.2
  signing_key:
  specification_version: 4
  summary: Input and Output streaming for Ruby.