RubyGems - iostreams - Versions diffs - 0.7.0 → 0.8.0 - Mend

iostreams 0.7.0 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +4 -4
data/README.md +126 -20
data/Rakefile +4 -5
data/lib/io_streams/delimited/reader.rb +121 -0
data/lib/io_streams/io_streams.rb +9 -5
data/lib/io_streams/version.rb +1 -1
data/lib/io_streams/zip/reader.rb +3 -3
data/lib/io_streams/zip/writer.rb +1 -1
data/lib/iostreams.rb +4 -0
data/test/delimited_reader_test.rb +71 -0
metadata +5 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: dbaf3d597b1fc3ad00a9c371f932236332d3a1b4
-  data.tar.gz: ee333efa4fe7737d77dfdaeff353e4811fa4f298
+  metadata.gz: 6bb29de6cb870a20f009e5517ab9aab138bb3c67
+  data.tar.gz: 9fb17cc79cbb54a550834d59abac8d4533d201cd
 SHA512:
-  metadata.gz: 801f73a031c8fbadd38bf7939fa90dd6fca5360fef7c8f77b08c65ed72bdb142deaecccf8c9a3dd2c7f47af4725397635959cdb14a87e4af603967ced96709a2
-  data.tar.gz: 7abef1b3814194b1fa69d05e5c007aaca2b386329fb2d4af3a228dcedccd465501c5a2036f510b7e3602c96ad54b030fe7385dbb096a566e3a26ead686d3f167
+  metadata.gz: 1cb75b18750e30ebe36cc4c0bd1496fc539d1b2976bdd975ee8c186a46d132e99aa85243cd5bb07e844c6cee8c8854e727524e3baeb5b713ba2f76be3c294805
+  data.tar.gz: afc66bc4c5425f1e199f3edb4d078a944e2b7e9ec48968744e1be81619029fc22b7065fabea367a6ea3863ff6c89dafd686617956ca3bee6a23975ed33e3f1cd

data/README.md CHANGED

@@ -1,39 +1,145 @@
-# iostreams
+# iostreams [![Gem Version](https://badge.fury.io/rb/iostreams.svg)](http://badge.fury.io/rb/iostreams) [![Build Status](https://secure.travis-ci.org/rocketjob/iostreams.png?branch=master)](http://travis-ci.org/rocketjob/iostreams) ![](http://ruby-gem-downloads-badge.herokuapp.com/iostreams?type=total)
-Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
+Ruby Input and Output streaming for Ruby
-## Status
+## Project Status
-Alpha - Feedback on the API is welcome. API will change.
+Beta - Feedback on the API is welcome. API is subject to change.
-## Introduction
+## Features
+Currently streaming classes are available for:
+* Zip
+* Gzip
+* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-`iostreams` allows files to be read and written in a streaming fashion to reduce
-memory overhead. It supports reading and writing of Zip, GZip and encrypted files.
+## Introduction
-These streams can be chained together just like piped programs in linux.
-This allows one stream to read the file, another stream to decrypt the file and
-then a third stream to decompress the result.
+If all files were small, they could just be loaded into memory in their entirety. With the
+advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
+them into memory is not feasible.
+In linux it is common to use pipes to stream data between processes.
+For example:
+```
+# Count the number of lines in a file that has been compressed with gzip
+cat abc.gz | gunzip -c | wc -l
+```
+For large files it is critical to be able to read and write these files as streams. Ruby has support
+for reading and writing files using streams, but has no built-in way of passing one stream through
+another to support for example compressing the data, encrypting it and then finally writing the result
+to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
+together several streams, `iostreams` attempts to offer similar features for Ruby.
+```ruby
+# Read a compressed file:
+IOStreams.reader('hello.gz') do |reader|
+  data = reader.read(1024)
+  puts "Read: #{data}"
+end
+```
+The true power of streams is shown when many streams are chained together to achieve the end
+result, without holding the entire file in memory, or ideally without needing to create
+any temporary files to process the stream.
+```ruby
+# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
+IOStreams.writer('hello.gz.enc') do |writer|
+  writer.write('Hello World')
+  writer.write('and some more')
+end
+```
+The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
+or even gigabytes.
+By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
+to the data being read or written. For example:
+* `hello.zip` => Compressed using Zip
+* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
+* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
 The objective is that all of these streaming processes are performed using streaming
-so that only portions of the file are loaded into memory at a time.
+so that only the current portion of the file is loaded into memory as it moves
+through the entire file.
 Where possible each stream never goes to disk, which for example could expose
 un-encrypted data.
+## Architecture
+Streams are chained together by passing the
+Every Reader or Writer is invoked by calling its `.open` method and passing the block
+that must be invoked for the duration of that stream.
+The above block is passed the stream that needs to be encoded/decoded using that
+Reader or Writer every time the `#read` or `#write` method is called on it.
+### Readers
+Each reader stream must implement: `#read`
+### Writer
+Each writer stream must implement: `#write`
+### Optional methods
+The following methods on the stream are useful for both Readers and Writers
+### close
+Close the stream, and cleanup any buffers, etc.
+### closed?
+Has the stream already been closed? Useful, when child streams have already closed the stream
+so that `#close` is not called more than once on a stream.
 ## Notes
 * Due to the nature of Zip, both its Reader and Writer methods will create
   a temp file when reading from or writing to a stream.
   Recommended to use Gzip over Zip since it can be streamed.
-## Meta
-* Code: `git clone git://github.com/rocketjob/iostreams.git`
-* Home: <https://github.com/rocketjob/iostreams>
-* Issues: <http://github.com/rocketjob/iostreams/issues>
-* Gems: <http://rubygems.org/gems/iostreams>
-This project uses [Semantic Versioning](http://semver.org/).
+* Zip becomes exponentially slower with very large files, especially files
+  that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
+## Future
+Below are just some of the streams that are envisaged for `iostreams`:
+* PGP reader and writer
+    * Read and write PGP encrypted files
+* CSV
+    * Read and write CSV data, reading data back as Arrays and writing Arrays as CSV text
+* Delimited Text Stream
+    * Autodetect Windows/Linux line endings and return a line at a time
+* MongoFS
+    * Read and write file streams to and from MongoFS
+For example:
+```ruby
+# Read a CSV file, delimited with Windows line endings, compressed with GZip, and encrypted with PGP:
+IOStreams.reader('hello.csv.gz.pgp', [:csv, :delimited, :gz, :pgp]) do |reader|
+  # Returns an Array at a time
+  reader.each do |row|
+    puts "Read: #{row.inspect}"
+  end
+end
+```
+To completely implement io streaming for Ruby will take a lot more input and thoughts
+from the Ruby community. This gem represents a starting point to get the discussion going.
+By keeping this gem in Beta state and not going V1, we can change the interface as needed
+to implement community feedback.
+## Versioning
+This project adheres to [Semantic Versioning](http://semver.org/).
 ## Author

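The file-name convention described in the README above (`hello.gz.enc` is treated as GZip plus Symmetric Encryption) can be sketched roughly as follows. This is a minimal illustration, not the gem's code: `REGISTERED_EXTENSIONS` and `sketch_streams_for_file_name` are hypothetical names, and the order of the returned symbols is an assumption, since the diff only shows part of `streams_for_file_name`.

```ruby
# Hypothetical stand-in for the gem's extension registry.
REGISTERED_EXTENSIONS = %i[enc gz zip].freeze

# Walk the file name's extensions from right to left and collect every
# one that is registered; stop at the first unknown extension.
def sketch_streams_for_file_name(file_name)
  parts      = file_name.split('.')
  extensions = []
  while (extension = parts.pop)
    break unless REGISTERED_EXTENSIONS.include?(extension.to_sym)
    # Assumption: keep the streams in the order they appear in the name.
    extensions.unshift(extension.to_sym)
  end
  # Fall back to a plain file stream when no extension is recognised.
  extensions.empty? ? [:file] : extensions
end

gz_enc = sketch_streams_for_file_name('hello.gz.enc') # => [:gz, :enc]
plain  = sketch_streams_for_file_name('myfile.csv')   # => [:file]
```
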
data/Rakefile CHANGED

@@ -1,21 +1,20 @@
 require 'rake/clean'
 require 'rake/testtask'
-$LOAD_PATH.unshift File.expand_path("../lib", __FILE__)
-require 'io_streams/version'
+require_relative 'lib/io_streams/version'
 task :gem do
-  system "gem build iostreams.gemspec"
+  system 'gem build iostreams.gemspec'
 end
 task :publish => :gem do
   system "git tag -a v#{IOStreams::VERSION} -m 'Tagging #{IOStreams::VERSION}'"
-  system "git push --tags"
+  system 'git push --tags'
   system "gem push iostreams-#{IOStreams::VERSION}.gem"
   system "rm iostreams-#{IOStreams::VERSION}.gem"
 end
-desc "Run Test Suite"
+desc 'Run Test Suite'
 task :test do
   Rake::TestTask.new(:functional) do |t|
     t.test_files = FileList['test/**/*_test.rb']

data/lib/io_streams/delimited/reader.rb ADDED

@@ -0,0 +1,121 @@
+module IOStreams
+  module Delimited
+    class Reader
+      attr_accessor :delimiter
+      # Read from a file or stream
+      def self.open(file_name_or_io, options={}, &block)
+        if file_name_or_io.respond_to?(:read)
+          block.call(new(file_name_or_io, options))
+        else
+          ::File.open(file_name_or_io, 'rb') do |io|
+            block.call(new(io, options))
+          end
+        end
+      end
+      # Create a delimited UTF8 stream reader from the supplied input streams
+      #
+      # The input stream should be binary with no text conversions performed
+      # since `strip_non_printable` will be applied to the binary stream before
+      # converting to UTF-8
+      #
+      # Parameters
+      #   input_stream
+      #     The input stream that implements #read
+      #
+      #   options
+      #     :delimiter[Symbol|String]
+      #       Line / Record delimiter to use to break the stream up into records
+      #         nil
+      #           Automatically detect line endings and break up by line
+      #           Searches for the first "\r\n" or "\n" and then uses that as the
+      #           delimiter for all subsequent records
+      #         String:
+      #           Any string to break the stream up by
+      #           The records when saved will not include this delimiter
+      #       Default: nil
+      #
+      #     :buffer_size [Integer]
+      #       Maximum size of the buffer into which to read the stream into for
+      #       processing.
+      #       Must be large enough to hold the entire first line and its delimiter(s)
+      #       Default: 65536 ( 64K )
+      #
+      #     :strip_non_printable [true|false]
+      #       Strip all non-printable characters read from the file
+      #       Default: true iff :encoding is UTF8_ENCODING, otherwise false
+      #
+      #     :encoding
+      #       Force encoding to this encoding for all data being read
+      #       Default: UTF8_ENCODING
+      #       Set to nil to disable encoding
+      def initialize(input_stream, options={})
+        @input_stream        = input_stream
+        options              = options.dup
+        @delimiter           = options.delete(:delimiter)
+        @buffer_size         = options.delete(:buffer_size) || 65536
+        @encoding            = options.has_key?(:encoding) ? options.delete(:encoding) : UTF8_ENCODING
+        @strip_non_printable = options.delete(:strip_non_printable)
+        @strip_non_printable = (@encoding == UTF8_ENCODING) if @strip_non_printable.nil?
+        raise ArgumentError.new("Unknown IOStreams::Delimited::Reader#initialize options: #{options.inspect}") if options.size > 0
+        @delimiter.force_encoding(UTF8_ENCODING) if @delimiter && @encoding
+        @buffer = ''
+      end
+      # Yields one line at a time to the supplied block
+      def each_line(&block)
+        partial = nil
+        loop do
+          if read_chunk == 0
+            block.call(partial) if partial
+            return
+          end
+          self.delimiter ||= detect_delimiter
+          end_index      ||= (delimiter.size + 1) * -1
+          @buffer.each_line(delimiter) do |line|
+            if line.end_with?(delimiter)
+              # Strip off delimiter
+              block.call(line[0..end_index])
+              partial = nil
+            else
+              partial = line
+            end
+          end
+          @buffer = partial.nil? ? '' : partial
+        end
+      end
+      ##########################################################################
+      private
+      # Returns [Integer] the number of bytes read into the internal buffer
+      # Returns 0 on EOF
+      def read_chunk
+        chunk = @input_stream.read(@buffer_size)
+        # EOF reached?
+        return 0 unless chunk
+        # Strip out non-printable characters before converting to UTF-8
+        chunk = chunk.scan(/[[:print:]]|\r|\n/).join if @strip_non_printable
+        @buffer << (@encoding ? chunk.force_encoding(@encoding) : chunk)
+        chunk.size
+      end
+      # Auto detect text line delimiter
+      def detect_delimiter
+        if @buffer =~ /\r\n|\n\r|\n/
+          $&
+        elsif @buffer.size <= @buffer_size
+          # Handle one line files that are smaller than the buffer size
+          "\n"
+        end
+      end
+    end
+  end
+end
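
The buffering approach used by the new reader (read a fixed-size chunk, emit every complete record, carry the trailing partial record into the next chunk) can be sketched independently of the gem. The `each_record` helper below is hypothetical and uses only the standard library. Unlike the code in the diff, it waits for a later chunk when no line ending has been seen yet, and it assumes ASCII-compatible delimiters.

```ruby
require 'stringio'

# Hypothetical helper, not part of iostreams: yields one record at a time
# from any object that responds to #read(size).
def each_record(io, delimiter: nil, buffer_size: 65_536)
  buffer = String.new
  while (chunk = io.read(buffer_size))
    buffer << chunk
    # Autodetect the line ending from the first one found in the data.
    delimiter ||= buffer[/\r\n|\n\r|\n/]
    next unless delimiter

    # A limit of -1 keeps trailing empty strings, so the last element is
    # always the (possibly empty) incomplete record.
    records = buffer.split(Regexp.new(Regexp.escape(delimiter)), -1)
    buffer  = records.pop || String.new
    records.each { |record| yield record }
  end
  # Whatever remains at EOF is the final record without a delimiter.
  yield buffer unless buffer.empty?
end

lines = []
each_record(StringIO.new("one\r\ntwo\r\nthree"), buffer_size: 4) { |line| lines << line }
# lines now holds "one", "two" and "three"
```

A small `buffer_size` is used here only to force records and delimiters to straddle chunk boundaries, much as the tests in this release do with `buffer_size: 15`.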

data/lib/io_streams/io_streams.rb CHANGED

@@ -3,6 +3,9 @@ module IOStreams
   # A registry to hold formats for processing files during upload or download
   @@extensions = ThreadSafe::Hash.new
+  UTF8_ENCODING      = Encoding.find('UTF-8').freeze
+  BINARY_ENCODING    = Encoding.find('BINARY').freeze
   # Returns [Array] the formats required to process the file by looking at
   # its extension(s)
   #
@@ -28,9 +31,9 @@ module IOStreams
   #   RocketJob::Formatter::Formats.streams_for_file_name('myfile.csv')
   #   => [ :file ]
   def self.streams_for_file_name(file_name)
-    raise ArgumentError.new("File name cannot be nil") if file_name.nil?
+    raise ArgumentError.new('File name cannot be nil') if file_name.nil?
     raise ArgumentError.new("RocketJob Cannot detect file format when uploading to stream: #{file_name.inspect}") if file_name.respond_to?(:read)
-    parts = file_name.split('.')
+    parts      = file_name.split('.')
     extensions = []
     while extension = parts.pop
       break unless @@extensions[extension.to_sym]
@@ -192,7 +195,7 @@ module IOStreams
   def self.stream(type, file_name_or_io, streams=nil, &block)
     unless streams
       respond_to = type == :reader ? :read : :write
-      streams = file_name_or_io.respond_to?(respond_to) ? [ :file ] : streams_for_file_name(file_name_or_io)
+      streams    = file_name_or_io.respond_to?(respond_to) ? [:file] : streams_for_file_name(file_name_or_io)
     end
     stream_structs = streams_for(type, streams)
     if stream_structs.size == 1
@@ -200,7 +203,7 @@ module IOStreams
       stream_struct.klass.open(file_name_or_io, stream_struct.options, &block)
     else
       # Daisy chain multiple streams together
-      last = stream_structs.inject(block){ |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
+      last = stream_structs.inject(block) { |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
       last.call(file_name_or_io)
     end
   end
@@ -208,7 +211,7 @@ module IOStreams
   # type: :reader or :writer
   def self.streams_for(type, params)
     if params.is_a?(Symbol)
-      [ stream_struct_for_stream(type, params) ]
+      [stream_struct_for_stream(type, params)]
     elsif params.is_a?(Array)
       a = []
       params.each do |stream|
@@ -235,6 +238,7 @@ module IOStreams
   end
   # Register File extensions
+  # @formatter:off
   register_extension(:enc,  SymmetricEncryption::Reader, SymmetricEncryption::Writer) if defined?(SymmetricEncryption)
   register_extension(:file, IOStreams::File::Reader,         IOStreams::File::Writer)
   register_extension(:gz,   IOStreams::Gzip::Reader,         IOStreams::Gzip::Writer)
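
The `inject` call in `IOStreams.stream` above builds a chain of nested lambdas: the last stream in the list opens the underlying file or IO, and the caller's block ends up innermost. A minimal, self-contained sketch of that technique follows. The `Sketch` module, its two reader classes and the `chain` helper are hypothetical and only mimic the `.open(io, options, &block)` convention seen in the diff.

```ruby
require 'stringio'
require 'zlib'

module Sketch
  # Decompresses whatever IO it is given.
  class GzipReader
    def self.open(io, _options = {}, &block)
      gz = Zlib::GzipReader.new(io)
      block.call(gz)
    ensure
      gz&.close
    end
  end

  # Toy transformation stream: upcases everything read through it.
  class UpcaseReader
    def self.open(io, _options = {}, &block)
      block.call(new(io))
    end

    def initialize(io)
      @io = io
    end

    def read(size = nil)
      data = @io.read(size)
      data && data.upcase
    end
  end
end

# Same shape as the inject call in the diff: each step wraps the previous
# callable, so the last class in the list touches the source directly.
def chain(klasses, source, &block)
  last = klasses.inject(block) do |inner, klass|
    ->(io) { klass.open(io, {}, &inner) }
  end
  last.call(source)
end

compressed = StringIO.new(Zlib.gzip('hello world'))
result = chain([Sketch::UpcaseReader, Sketch::GzipReader], compressed) { |io| io.read }
# result is "HELLO WORLD"
```

Because every `.open` returns the value of its block, the result of the caller's block propagates back out through the whole chain.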

data/lib/io_streams/version.rb CHANGED

@@ -1,3 +1,3 @@
 module IOStreams #:nodoc
-  VERSION = "0.7.0"
+  VERSION = '0.8.0'
 end

data/lib/io_streams/zip/reader.rb CHANGED

@@ -13,8 +13,8 @@ module IOStreams
       #     end
       #   end
       def self.open(file_name_or_io, options={}, &block)
-        options       = options.dup
-        buffer_size   = options.delete(:buffer_size) || 65536
+        options     = options.dup
+        buffer_size = options.delete(:buffer_size) || 65536
         raise(ArgumentError, "Unknown IOStreams::Zip::Reader option: #{options.inspect}") if options.size > 0
         # File name supplied
@@ -54,7 +54,7 @@ module IOStreams
         begin
           require 'zip'
         rescue LoadError => exc
-          puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
+          puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
           raise(exc)
         end

data/lib/io_streams/zip/writer.rb CHANGED

@@ -67,7 +67,7 @@ module IOStreams
         begin
           require 'zip'
         rescue LoadError => exc
-          puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
+          puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
           raise(exc)
         end

data/lib/iostreams.rb CHANGED

@@ -12,5 +12,9 @@ module IOStreams
     autoload :Reader,  'io_streams/zip/reader'
     autoload :Writer,  'io_streams/zip/writer'
   end
+  module Delimited
+    autoload :Reader,  'io_streams/delimited/reader'
+    autoload :Writer,  'io_streams/delimited/writer'
+  end
 end
 require 'io_streams/io_streams'

data/test/delimited_reader_test.rb ADDED

@@ -0,0 +1,71 @@
+require_relative 'test_helper'
+# Unit Test for IOStreams::Delimited::Reader
+module Streams
+  class DelimitedReaderTest < Minitest::Test
+    context IOStreams::Delimited::Reader do
+      setup do
+        @file_name = File.join(File.dirname(__FILE__), 'files', 'text.txt')
+        @data      = []
+        File.open(@file_name, 'rt') do |file|
+          while !file.eof?
+            @data << file.readline.strip
+          end
+        end
+      end
+      context '.open' do
+        should 'each_line file' do
+          lines = []
+          IOStreams::Delimited::Reader.open(@file_name) do |io|
+            io.each_line { |line| lines << line }
+          end
+          assert_equal @data, lines
+        end
+        should 'each_line stream' do
+          lines = []
+          File.open(@file_name) do |file|
+            IOStreams::Delimited::Reader.open(file) do |io|
+              io.each_line { |line| lines << line }
+            end
+          end
+          assert_equal @data, lines
+        end
+        ["\r\n", "\n\r", "\n"].each do |delimiter|
+          should "autodetect delimiter: #{delimiter.inspect}" do
+            lines  = []
+            stream = StringIO.new(@data.join(delimiter))
+            IOStreams::Delimited::Reader.open(stream, buffer_size: 15) do |io|
+              io.each_line { |line| lines << line }
+            end
+            assert_equal @data, lines
+          end
+        end
+        ['@', 'BLAH'].each do |delimiter|
+          should "read delimited #{delimiter.inspect}" do
+            lines  = []
+            stream = StringIO.new(@data.join(delimiter))
+            IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter) do |io|
+              io.each_line { |line| lines << line }
+            end
+            assert_equal @data, lines
+          end
+        end
+        should "read binary delimited" do
+          delimiter = "\x01"
+          lines     = []
+          stream    = StringIO.new(@data.join(delimiter))
+          IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter, encoding: IOStreams::BINARY_ENCODING) do |io|
+            io.each_line { |line| lines << line }
+          end
+          assert_equal @data, lines
+        end
+      end
+    end
+  end
+end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.0
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-07-14 00:00:00.000000000 Z
+date: 2015-08-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: symmetric-encryption
@@ -47,6 +47,7 @@ extra_rdoc_files: []
 files:
 - README.md
 - Rakefile
+- lib/io_streams/delimited/reader.rb
 - lib/io_streams/file/reader.rb
 - lib/io_streams/file/writer.rb
 - lib/io_streams/gzip/reader.rb
@@ -56,6 +57,7 @@ files:
 - lib/io_streams/zip/reader.rb
 - lib/io_streams/zip/writer.rb
 - lib/iostreams.rb
+- test/delimited_reader_test.rb
 - test/file_reader_test.rb
 - test/file_writer_test.rb
 - test/files/text.txt
@@ -92,6 +94,7 @@ signing_key:
 specification_version: 4
 summary: Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
 test_files:
+- test/delimited_reader_test.rb
 - test/file_reader_test.rb
 - test/file_writer_test.rb
 - test/files/text.txt