iostreams 0.16.1 → 0.16.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +123 -0
- data/lib/io_streams/tabular/parser/csv.rb +28 -4
- data/lib/io_streams/version.rb +1 -1
- metadata +3 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e9a14b746c83e98c98950f8fe1086e689383e997a2f62688272f419d0a144c36
+  data.tar.gz: 61c6da8d61da48d205f8537bd0a11b0b0cf4a1160ab4e00247f3cea507540072
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9736cb2bacdb162120a9bd6febe300760c34ea70148a62352244d33ab9142dead20cdc600242f8304ba460fcc9f4364dd2fdc8588ab4fda2dd34802e302037d8
+  data.tar.gz: 64a012432f793855f216be83b6522fe8301eff9f02fdc967782fb5488b85b6df630c44f6c4aefb4541b76b9a8f518856af2315b64ff733516c3b73b2451341f4
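The checksums above can be verified against the published archives. A minimal sketch using only Ruby's stdlib; the directory layout and the `checksums_match?` helper name are assumptions about how the unpacked `.gem` looks, not part of the gem itself:

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: given a directory containing the unpacked .gem
# contents (metadata.gz, data.tar.gz) alongside its checksums.yaml,
# confirm that every SHA256 entry matches the file on disk.
def checksums_match?(dir)
  sums = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  sums.fetch('SHA256').all? do |name, expected|
    Digest::SHA256.file(File.join(dir, name)).hexdigest == expected
  end
end
```

Checking the SHA512 section as well only requires swapping in `Digest::SHA512` over the other key.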
data/README.md
CHANGED
@@ -236,6 +236,129 @@ IOStreams.copy('ABC', 'xyz.csv.pgp',
                target_options: [pgp: {email_recipient: 'a@a.com'}])
 ~~~
 
+## Philosophy
+
+IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
+multiple streams to process data, without loading entire files into memory.
+
+#### Linux Pipes
+
+Linux has built-in support for streaming using the pipe operator (`|`) to send the output of one process to another.
+
+Example: count the number of lines in a compressed file:
+
+    gunzip -c hello.csv.gz | wc -l
+
+The file `hello.csv.gz` is uncompressed and written to standard output, which in turn is piped into the standard
+input of `wc -l`, which counts the number of lines in the uncompressed data.
+
+As each block of data is returned from `gunzip` it is immediately passed into `wc`, so that `wc`
+can start counting lines of uncompressed data without waiting until the entire file is decompressed.
+The uncompressed contents of the file are neither written to disk nor loaded
+into memory before being passed to `wc -l`.
+
+In this way extremely large files can be processed using very little memory.
+
+#### Push Model
+
+The Linux pipes example above is a "push model": each task in the chain pushes
+its output to the input of the next task.
+
+A major disadvantage of the push model is that buffering must occur between tasks, since
+each task can complete at a very different speed. To prevent large memory usage, the standard output of a faster
+upstream task has to be blocked to slow it down.
+
+#### Pull Model
+
+Another approach when multiple tasks need to process a single stream is a "pull model", where the
+task at the end of the chain pulls a block from the previous task when it is ready to process it.
+
+#### IOStreams
+
+IOStreams uses the pull model when reading data: each stream performs a read against the previous stream
+when it is ready for more data.
+
+When writing to an output stream, IOStreams uses the push model: each block of data that is ready to be written
+is pushed to the next stream in the chain. The write push only returns once it has traversed all the way down to
+the final stream in the chain, which avoids complex buffering issues between streams.
+
+Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
+
+~~~ruby
+line_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
+to start with:
+
+~~~ruby
+line_count = 0
+IOStreams.reader("hello.csv.gz") do |input|
+  IOStreams::Line::Reader.open(input) do |lines|
+    lines.each { line_count += 1 }
+  end
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+Since we know we want a line reader, it can be simplified using `IOStreams.line_reader`:
+
+~~~ruby
+line_count = 0
+IOStreams.line_reader("hello.csv.gz") do |lines|
+  lines.each { line_count += 1 }
+end
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+It can be simplified even further using `IOStreams.each_line`:
+
+~~~ruby
+line_count = 0
+IOStreams.each_line("hello.csv.gz") { line_count += 1 }
+puts "hello.csv.gz contains #{line_count} lines"
+~~~
+
+The benefit in all of the above cases is that the file can be of any arbitrary size, yet only one block of the file
+is held in memory at any time.
+
+#### Chaining
+
+In the above example only 2 streams were used. Streams can be nested as deeply as necessary to process data.
+
+Example: search for all occurrences of the word apple, cleansing the input data stream of non-printable characters
+and converting it to valid US-ASCII.
+
+~~~ruby
+apple_count = 0
+IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
+  IOStreams::Encode::Reader.open(input,
+                                 encoding:       'US-ASCII',
+                                 encode_replace: '',
+                                 encode_cleaner: :printable) do |cleansed|
+    IOStreams::Line::Reader.open(cleansed) do |lines|
+      lines.each { |line| apple_count += line.scan('apple').count }
+    end
+  end
+end
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
+Let IOStreams perform the above stream chaining automatically under the covers:
+
+~~~ruby
+apple_count = 0
+IOStreams.each_line("hello.csv.gz",
+                    encoding:       'US-ASCII',
+                    encode_replace: '',
+                    encode_cleaner: :printable) do |line|
+  apple_count += line.scan('apple').count
+end
+
+puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
+~~~
+
 ## Notes
 
 * Due to the nature of Zip, both its Reader and Writer methods will create
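The pull model described in the README additions above can be illustrated with plain stdlib classes. This sketch uses `Zlib::GzipReader` in place of the IOStreams wrappers, so it shows the idea, not the gem's API:

```ruby
require 'zlib'
require 'stringio'

# Build a small gzip "file" in memory so the sketch is self-contained.
compressed = StringIO.new(Zlib.gzip("a,b\n1,2\n3,4\n"))

line_count = 0
# GzipReader pulls a block from the underlying IO only when each_line
# needs more data, so at most one block is held in memory at a time.
Zlib::GzipReader.wrap(compressed) do |gz|
  gz.each_line { line_count += 1 }
end
puts "contains #{line_count} lines" # => contains 3 lines
```

The consumer at the end of the chain drives all the reads, which is exactly why no intermediate buffering is required.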
data/lib/io_streams/tabular/parser/csv.rb
CHANGED
@@ -1,3 +1,4 @@
+require 'csv'
 module IOStreams
   class Tabular
     module Parser
@@ -5,7 +6,7 @@ module IOStreams
         attr_reader :csv_parser
 
         def initialize
-          @csv_parser = Utility::CSVRow.new
+          @csv_parser = Utility::CSVRow.new unless RUBY_VERSION.to_f >= 2.6
         end
 
         # Returns [Array<String>] the header row.
@@ -15,7 +16,7 @@ module IOStreams
 
           raise(IOStreams::Errors::InvalidHeader, "Format is :csv. Invalid input header: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Returns [Array] the parsed CSV line
@@ -24,15 +25,38 @@ module IOStreams
 
           raise(IOStreams::Errors::TypeMismatch, "Format is :csv. Invalid input: #{row.class.name}") unless row.is_a?(String)
 
-
+          parse_line(row)
         end
 
         # Return the supplied array as a single line CSV string.
         def render(row, header)
           array = header.to_array(row)
-
+          render_array(array)
         end
 
+        private
+
+        if RUBY_VERSION.to_f >= 2.6
+          # About 10 times slower than the approach used in Ruby 2.5 and earlier,
+          # but at least it works on Ruby 2.6 and above.
+          def parse_line(line)
+            return if IOStreams.blank?(line)
+
+            CSV.parse_line(line)
+          end
+
+          def render_array(array)
+            CSV.generate_line(array, encoding: 'UTF-8', row_sep: '')
+          end
+        else
+          def parse_line(line)
+            csv_parser.parse(line)
+          end
+
+          def render_array(array)
+            csv_parser.to_csv(array)
+          end
+        end
       end
     end
   end
data/lib/io_streams/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.16.1
+  version: 0.16.2
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-02-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -124,8 +124,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-
-rubygems_version: 2.7.7
+rubygems_version: 3.0.2
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.