iostreams 0.16.1 → 0.16.2
- checksums.yaml +4 -4
- data/README.md +123 -0
- data/lib/io_streams/tabular/parser/csv.rb +28 -4
- data/lib/io_streams/version.rb +1 -1
- metadata +3 -4
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+  data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+  data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
data/README.md CHANGED

@@ -236,6 +236,129 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
   target_options: [pgp: {email_recipient: 'a@a.com'})
 ~~~
 
+## Philosophy
+
+IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
+multiple streams to process data, without loading entire files into memory.
+
+#### Linux Pipes
+
+Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
+
+Example: count the number of lines in a compressed file:
+
+    gunzip -c hello.csv.gz | wc -l
+
+The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+input for `wc -l`, which counts the number of lines in the uncompressed data.
+
+As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
+can start counting lines of uncompressed data without waiting for the entire file to be decompressed.
+The uncompressed contents of the file are neither written to disk nor loaded
+into memory before being passed to `wc -l`.
+
+In this way extremely large files can be processed while using very little memory.
+
+#### Push Model
+
+The Linux pipes example above is a "push model": each task in the chain pushes
+its output to the input of the next task.
+
+A major disadvantage of the push model is that buffering must occur between tasks, since
+each task can complete at very different speeds. To prevent large memory usage, the standard output of a faster
+task has to be blocked to force it to slow down.
+
+#### Pull Model
+
+Another approach, when multiple tasks need to process a single stream, is a "pull model": the
+task at the end of the chain pulls a block from the previous task when it is ready to process it.
+
+#### IOStreams
+
+IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+when it is ready for more data.
+
+When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+is pushed to the next task/stream in the chain. The write push only returns once it has traversed all the way down to
+the final task/stream in the chain, which avoids complex buffering issues between each task/stream.
+
+Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+~~~ruby
+line_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+to start with:
+
+~~~ruby
+line_count = 0
+IOStreams.reader("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since we know we want a line reader, this can be simplified using `IOStreams.line_reader`:
+
+~~~ruby
+line_count = 0
+IOStreams.line_reader("hello.csv.gz") do |lines|
+  lines.each { line_count += 1 }
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+It can be simplified even further using `IOStreams.each_line`:
+
+~~~ruby
+line_count = 0
+IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+The benefit in all of the above cases is that the file can be of any arbitrary size, yet only one block of the file
+is held in memory at any time.
+
+#### Chaining
+
+In the above examples only 2 streams were used. Streams can be nested as deeply as necessary to process data.
+
+Example: search for all occurrences of the word "apple", cleansing the input data stream of non-printable characters
+and converting it to valid US-ASCII:
+
+~~~ruby
+apple_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Encode::Reader.open(input,
+                                 encoding:       'US-ASCII',
+                                 encode_replace: '',
+                                 encode_cleaner: :printable) do |cleansed|
+    IOStreams::Line::Reader.open(cleansed) do |lines|
+      lines.each { |line| apple_count += line.scan('apple').count }
+    end
+  end
+end
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
+Let IOStreams perform the above stream chaining automatically under the covers:
+
+~~~ruby
+apple_count = 0
+IOStreams.each_line("hello.csv.gz",
+                    encoding:       'US-ASCII',
+                    encode_replace: '',
+                    encode_cleaner: :printable) do |line|
+  apple_count += line.scan('apple').count
+end
+
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
 ## Notes
 
 * Due to the nature of Zip, both its Reader and Writer methods will create
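The pull model added to the README above can be sketched in plain Ruby with an `Enumerator` over the standard library's `Zlib::GzipReader` — a hypothetical illustration of the concept, not code from the gem:

```ruby
require "zlib"
require "stringio"

# Build a small gzip-compressed payload in memory for the demo.
raw        = "line one\nline two\nline three\n"
compressed = StringIO.new
gz         = Zlib::GzipWriter.new(compressed)
gz.write(raw)
gz.close

# Stage 1: decompress lazily; a line is produced only when requested.
reader = Zlib::GzipReader.new(StringIO.new(compressed.string))
lines  = Enumerator.new { |y| reader.each_line { |line| y << line } }

# Stage 2 (the consumer) pulls one line at a time, so the entire
# uncompressed payload is never held in memory at once.
line_count = lines.count
puts "payload contains #{line_count} lines"
```

Here the consumer drives the pipeline: the `Enumerator` block only runs as values are requested, which mirrors how each IOStreams reader pulls from the stream before it.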
data/lib/io_streams/tabular/parser/csv.rb CHANGED

@@ -1,3 +1,4 @@
+require 'csv'
 module IOStreams
   class Tabular
     module Parser
@@ -5,7 +6,7 @@ module IOStreams
         attr_reader :csv_parser
 
         def initialize
-          @csv_parser = Utility::CSVRow.new
+          @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
         end
 
         # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams
 
           raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams
 
           raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Return the supplied array as a single line CSV string.
         def render(row, header)
           array = header.to_array(row)
-
+          render_array(array)
         end
 
+        private
+
+        if RUBY_VERSION.to_f >= 2.6
+          # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+          # but at least it works on Ruby 2.6 and above.
+          def parse_line(line)
+            return if IOStreams.blank?(line)
+
+            CSV.parse_line(line)
+          end
+
+          def render_array(array)
+            CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+          end
+        else
+          def parse_line(line)
+            csv_parser.parse(line)
+          end
+
+          def render_array(array)
+            csv_parser.to_csv(array)
+          end
+        end
       end
     end
   end
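The Ruby 2.6+ branch above falls back to the standard library's `CSV` class. A quick sketch of what those two stdlib calls do, independent of the gem:

```ruby
require "csv"

# CSV.parse_line parses a single CSV row into an array of fields,
# honoring quoting of embedded commas.
fields = CSV.parse_line('a,"b,c",d')

# CSV.generate_line with row_sep: '' renders one row without a trailing
# newline, matching render_array in the Ruby 2.6+ branch.
line = CSV.generate_line(fields, encoding: 'UTF-8', row_sep: '')

puts fields.inspect  # ["a", "b,c", "d"]
puts line            # a,"b,c",d
```

Only the field containing a comma is re-quoted on output, so a parse/render round trip preserves the original row.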
data/lib/io_streams/version.rb
CHANGED
metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.16.1
+  version: 0.16.2
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-02-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-
-rubygems_version: 2.7.7
+rubygems_version: 3.0.2
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.