
iostreams 0.7.0 → 0.8.0

Files changed (11)

checksums.yaml +4 -4
data/README.md +126 -20
data/Rakefile +4 -5
data/lib/io_streams/delimited/reader.rb +121 -0
data/lib/io_streams/io_streams.rb +9 -5
data/lib/io_streams/version.rb +1 -1
data/lib/io_streams/zip/reader.rb +3 -3
data/lib/io_streams/zip/writer.rb +1 -1
data/lib/iostreams.rb +4 -0
data/test/delimited_reader_test.rb +71 -0
metadata +5 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: dbaf3d597b1fc3ad00a9c371f932236332d3a1b4
-  data.tar.gz: ee333efa4fe7737d77dfdaeff353e4811fa4f298
+  metadata.gz: 6bb29de6cb870a20f009e5517ab9aab138bb3c67
+  data.tar.gz: 9fb17cc79cbb54a550834d59abac8d4533d201cd
 SHA512:
-  metadata.gz: 801f73a031c8fbadd38bf7939fa90dd6fca5360fef7c8f77b08c65ed72bdb142deaecccf8c9a3dd2c7f47af4725397635959cdb14a87e4af603967ced96709a2
-  data.tar.gz: 7abef1b3814194b1fa69d05e5c007aaca2b386329fb2d4af3a228dcedccd465501c5a2036f510b7e3602c96ad54b030fe7385dbb096a566e3a26ead686d3f167
+  metadata.gz: 1cb75b18750e30ebe36cc4c0bd1496fc539d1b2976bdd975ee8c186a46d132e99aa85243cd5bb07e844c6cee8c8854e727524e3baeb5b713ba2f76be3c294805
+  data.tar.gz: afc66bc4c5425f1e199f3edb4d078a944e2b7e9ec48968744e1be81619029fc22b7065fabea367a6ea3863ff6c89dafd686617956ca3bee6a23975ed33e3f1cd

data/README.md CHANGED

@@ -1,39 +1,145 @@
-# iostreams
+# iostreams [![Gem Version](https://badge.fury.io/rb/iostreams.svg)](http://badge.fury.io/rb/iostreams) [![Build Status](https://secure.travis-ci.org/rocketjob/iostreams.png?branch=master)](http://travis-ci.org/rocketjob/iostreams) ![](http://ruby-gem-downloads-badge.herokuapp.com/iostreams?type=total)
-Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
+Input and Output streaming for Ruby
-## Status
+## Project Status
-Alpha - Feedback on the API is welcome. API will change.
+Beta - Feedback on the API is welcome. API is subject to change.
-## Introduction
+## Features
+Currently streaming classes are available for:
+* Zip
+* Gzip
+* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-`iostreams` allows files to be read and written in a streaming fashion to reduce
-memory overhead. It supports reading and writing of Zip, GZip and encrypted files.
+## Introduction
-These streams can be chained together just like piped programs in linux.
-This allows one stream to read the file, another stream to decrypt the file and
-then a third stream to decompress the result.
+If all files were small, they could just be loaded into memory in their entirety. With the
+advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
+them into memory is not feasible.
+In linux it is common to use pipes to stream data between processes.
+For example:
+```
+# Count the number of lines in a file that has been compressed with gzip
+cat abc.gz | gunzip -c | wc -l
+```
+For large files it is critical to be able to read and write these files as streams. Ruby has support
+for reading and writing files using streams, but has no built-in way of passing one stream through
+another to support for example compressing the data, encrypting it and then finally writing the result
+to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
+together several streams, `iostreams` attempts to offer similar features for Ruby.
+```ruby
+# Read a compressed file:
+IOStreams.reader('hello.gz') do |reader|
+  data = reader.read(1024)
+  puts "Read: #{data}"
+end
+```
+The true power of streams is shown when many streams are chained together to achieve the end
+result, without holding the entire file in memory, or ideally without needing to create
+any temporary files to process the stream.
+```ruby
+# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
+IOStreams.writer('hello.gz.enc') do |writer|
+  writer.write('Hello World')
+  writer.write('and some more')
+end
+```
+The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
+or even gigabytes.
+By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
+to the data being read or written. For example:
+* `hello.zip` => Compressed using Zip
+* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
+* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
 The objective is that all of these streaming processes are performed using streaming
-so that only portions of the file are loaded into memory at a time.
+so that only the current portion of the file is loaded into memory as it moves
+through the entire file.
 Where possible each stream never goes to disk, which for example could expose
 un-encrypted data.
+## Architecture
+Streams are chained together by passing the
+Every Reader or Writer is invoked by calling its `.open` method and passing the block
+that must be invoked for the duration of that stream.
+The above block is passed the stream that needs to be encoded/decoded using that
+Reader or Writer every time the `#read` or `#write` method is called on it.
+### Readers
+Each reader stream must implement: `#read`
+### Writer
+Each writer stream must implement: `#write`
+### Optional methods
+The following methods on the stream are useful for both Readers and Writers
+### close
+Close the stream, and cleanup any buffers, etc.
+### closed?
+Has the stream already been closed? Useful, when child streams have already closed the stream
+so that `#close` is not called more than once on a stream.
 ## Notes
 * Due to the nature of Zip, both its Reader and Writer methods will create
   a temp file when reading from or writing to a stream.
   Recommended to use Gzip over Zip since it can be streamed.
-## Meta
-* Code: `git clone git://github.com/rocketjob/iostreams.git`
-* Home: <https://github.com/rocketjob/iostreams>
-* Issues: <http://github.com/rocketjob/iostreams/issues>
-* Gems: <http://rubygems.org/gems/iostreams>
-This project uses [Semantic Versioning](http://semver.org/).
+* Zip becomes exponentially slower with very large files, especially files
+  that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
+## Future
+Below are just some of the streams that are envisaged for `iostreams`:
+* PGP reader and writer
+    * Read and write PGP encrypted files
+* CSV
+    * Read and write CSV data, reading data back as Arrays and writing Arrays as CSV text
+* Delimited Text Stream
+    * Autodetect Windows/Linux line endings and return a line at a time
+* MongoFS
+    * Read and write file streams to and from MongoFS
+For example:
+```ruby
+# Read a CSV file, delimited with Windows line endings, compressed with GZip, and encrypted with PGP:
+IOStreams.reader('hello.csv.gz.pgp', [:csv, :delimited, :gz, :pgp]) do |reader|
+  # Returns an Array at a time
+  reader.each do |row|
+    puts "Read: #{row.inspect}"
+  end
+end
+```
+To completely implement io streaming for Ruby will take a lot more input and thoughts
+from the Ruby community. This gem represents a starting point to get the discussion going.
+By keeping this gem in Beta state and not going V1, we can change the interface as needed
+to implement community feedback.
+## Versioning
+This project adheres to [Semantic Versioning](http://semver.org/).
 ## Author
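
The README hunk above says that `iostreams` picks the streams to apply by looking at the file name. A minimal sketch of that idea follows, assuming a hypothetical `REGISTERED` list and a hypothetical `detect_streams` helper (neither is part of the gem), and assuming streams are returned in left-to-right extension order, falling back to `[:file]` as in the `streams_for_file_name` doc comment:

```ruby
# Hypothetical registry of known extensions; the gem keeps its own registry
# via register_extension, this constant only stands in for it.
REGISTERED = %i[zip gz enc].freeze

# Hypothetical helper: walk the extensions from right to left and stop at
# the first one that is not registered.
def detect_streams(file_name)
  streams = []
  file_name.split('.').reverse_each do |part|
    break unless REGISTERED.include?(part.to_sym)

    streams.unshift(part.to_sym)
  end
  streams.empty? ? [:file] : streams
end

detect_streams('hello.gz.enc') # → [:gz, :enc]
detect_streams('myfile.csv')   # → [:file]
```

Stopping at the first unknown extension keeps names such as `report.2015.gz` from being misread: only the trailing run of registered extensions is treated as streams.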

data/Rakefile CHANGED

@@ -1,21 +1,20 @@
 require 'rake/clean'
 require 'rake/testtask'
-$LOAD_PATH.unshift File.expand_path("../lib", __FILE__)
-require 'io_streams/version'
+require_relative 'lib/io_streams/version'
 task :gem do
-  system "gem build iostreams.gemspec"
+  system 'gem build iostreams.gemspec'
 end
 task :publish => :gem do
   system "git tag -a v#{IOStreams::VERSION} -m 'Tagging #{IOStreams::VERSION}'"
-  system "git push --tags"
+  system 'git push --tags'
   system "gem push iostreams-#{IOStreams::VERSION}.gem"
   system "rm iostreams-#{IOStreams::VERSION}.gem"
 end
-desc "Run Test Suite"
+desc 'Run Test Suite'
 task :test do
   Rake::TestTask.new(:functional) do |t|
     t.test_files = FileList['test/**/*_test.rb']

data/lib/io_streams/delimited/reader.rb ADDED

@@ -0,0 +1,121 @@
+module IOStreams
+  module Delimited
+    class Reader
+      attr_accessor :delimiter
+      # Read from a file or stream
+      def self.open(file_name_or_io, options={}, &block)
+        if file_name_or_io.respond_to?(:read)
+          block.call(new(file_name_or_io, options))
+        else
+          ::File.open(file_name_or_io, 'rb') do |io|
+            block.call(new(io, options))
+          end
+        end
+      end
+      # Create a delimited UTF8 stream reader from the supplied input stream
+      #
+      # The input stream should be binary with no text conversions performed
+      # since `strip_non_printable` will be applied to the binary stream before
+      # converting to UTF-8
+      #
+      # Parameters
+      #   input_stream
+      #     The input stream that implements #read
+      #
+      #   options
+      #     :delimiter[Symbol|String]
+      #       Line / Record delimiter to use to break the stream up into records
+      #         nil
+      #           Automatically detect line endings and break up by line
+      #           Searches for the first "\r\n" or "\n" and then uses that as the
+      #           delimiter for all subsequent records
+      #         String:
+      #           Any string to break the stream up by
+      #           The records when saved will not include this delimiter
+      #       Default: nil
+      #
+      #     :buffer_size [Integer]
+      #       Maximum size of the buffer into which the stream is read for
+      #       processing.
+      #       Must be large enough to hold the entire first line and its delimiter(s)
+      #       Default: 65536 ( 64K )
+      #
+      #     :strip_non_printable [true|false]
+      #       Strip all non-printable characters read from the file
+      #       Default: true iff :encoding is UTF8_ENCODING, otherwise false
+      #
+      #     :encoding
+      #       Force encoding to this encoding for all data being read
+      #       Default: UTF8_ENCODING
+      #       Set to nil to disable encoding
+      def initialize(input_stream, options={})
+        @input_stream        = input_stream
+        options              = options.dup
+        @delimiter           = options.delete(:delimiter)
+        @buffer_size         = options.delete(:buffer_size) || 65536
+        @encoding            = options.has_key?(:encoding) ? options.delete(:encoding) : UTF8_ENCODING
+        @strip_non_printable = options.delete(:strip_non_printable)
+        @strip_non_printable = (@encoding == UTF8_ENCODING) if @strip_non_printable.nil?
+        raise ArgumentError.new("Unknown IOStreams::Delimited::Reader#initialize options: #{options.inspect}") if options.size > 0
+        @delimiter.force_encoding(UTF8_ENCODING) if @delimiter && @encoding
+        @buffer = ''
+      end
+      # Returns each line, one at a time, to the supplied block
+      def each_line(&block)
+        partial = nil
+        loop do
+          if read_chunk == 0
+            block.call(partial) if partial
+            return
+          end
+          self.delimiter ||= detect_delimiter
+          end_index      ||= (delimiter.size + 1) * -1
+          @buffer.each_line(delimiter) do |line|
+            if line.end_with?(delimiter)
+              # Strip off delimiter
+              block.call(line[0..end_index])
+              partial = nil
+            else
+              partial = line
+            end
+          end
+          @buffer = partial.nil? ? '' : partial
+        end
+      end
+      ##########################################################################
+      private
+      # Returns [Integer] the number of bytes read into the internal buffer
+      # Returns 0 on EOF
+      def read_chunk
+        chunk = @input_stream.read(@buffer_size)
+        # EOF reached?
+        return 0 unless chunk
+        # Strip out non-printable characters before converting to UTF-8
+        chunk = chunk.scan(/[[:print:]]|\r|\n/).join if @strip_non_printable
+        @buffer << (@encoding ? chunk.force_encoding(@encoding) : chunk)
+        chunk.size
+      end
+      # Auto detect text line delimiter
+      def detect_delimiter
+        if @buffer =~ /\r\n|\n\r|\n/
+          $&
+        elsif @buffer.size <= @buffer_size
+          # Handle one line files that are smaller than the buffer size
+          "\n"
+        end
+      end
+    end
+  end
+end
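
The reader added above reads the input in fixed-size chunks and carries any partial record over to the next chunk. That technique can be sketched without the gem as follows. `each_delimited` is a hypothetical helper, not the gem's API; it takes an explicit delimiter and skips the auto-detection, encoding, and non-printable stripping that the real class performs:

```ruby
require 'stringio'

# Hypothetical helper: yields one record at a time while holding at most
# one chunk plus one partial record in memory.
def each_delimited(io, delimiter: "\n", buffer_size: 65_536)
  buffer = String.new # binary buffer; callers may force_encoding on records
  while (chunk = io.read(buffer_size))
    buffer << chunk
    # Emit every complete record currently in the buffer
    while (index = buffer.index(delimiter))
      record = buffer.slice!(0, index + delimiter.size)
      yield record[0, index] # drop the trailing delimiter
    end
  end
  # Whatever remains is a final record without a trailing delimiter
  yield buffer unless buffer.empty?
end

lines = []
source = StringIO.new("one\r\ntwo\r\nthree")
each_delimited(source, delimiter: "\r\n", buffer_size: 4) { |line| lines << line }
lines # → ["one", "two", "three"]
```

The deliberately small `buffer_size` forces the `"\r\n"` delimiter to straddle a chunk boundary, which is the case the carried-over partial record exists to handle.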

data/lib/io_streams/io_streams.rb CHANGED

@@ -3,6 +3,9 @@ module IOStreams
   # A registry to hold formats for processing files during upload or download
   @@extensions = ThreadSafe::Hash.new
+  UTF8_ENCODING      = Encoding.find('UTF-8').freeze
+  BINARY_ENCODING    = Encoding.find('BINARY').freeze
   # Returns [Array] the formats required to process the file by looking at
   # its extension(s)
   #
@@ -28,9 +31,9 @@ module IOStreams
   #   RocketJob::Formatter::Formats.streams_for_file_name('myfile.csv')
   #   => [ :file ]
   def self.streams_for_file_name(file_name)
-    raise ArgumentError.new("File name cannot be nil") if file_name.nil?
+    raise ArgumentError.new('File name cannot be nil') if file_name.nil?
     raise ArgumentError.new("RocketJob Cannot detect file format when uploading to stream: #{file_name.inspect}") if file_name.respond_to?(:read)
-    parts = file_name.split('.')
+    parts      = file_name.split('.')
     extensions = []
     while extension = parts.pop
       break unless @@extensions[extension.to_sym]
@@ -192,7 +195,7 @@ module IOStreams
   def self.stream(type, file_name_or_io, streams=nil, &block)
     unless streams
       respond_to = type == :reader ? :read : :write
-      streams = file_name_or_io.respond_to?(respond_to) ? [ :file ] : streams_for_file_name(file_name_or_io)
+      streams    = file_name_or_io.respond_to?(respond_to) ? [:file] : streams_for_file_name(file_name_or_io)
     end
     stream_structs = streams_for(type, streams)
     if stream_structs.size == 1
@@ -200,7 +203,7 @@ module IOStreams
       stream_struct.klass.open(file_name_or_io, stream_struct.options, &block)
     else
       # Daisy chain multiple streams together
-      last = stream_structs.inject(block){ |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
+      last = stream_structs.inject(block) { |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
       last.call(file_name_or_io)
     end
   end
@@ -208,7 +211,7 @@ module IOStreams
   # type: :reader or :writer
   def self.streams_for(type, params)
     if params.is_a?(Symbol)
-      [ stream_struct_for_stream(type, params) ]
+      [stream_struct_for_stream(type, params)]
     elsif params.is_a?(Array)
       a = []
       params.each do |stream|
@@ -235,6 +238,7 @@ module IOStreams
   end
   # Register File extensions
+  # @formatter:off
   register_extension(:enc,  SymmetricEncryption::Reader, SymmetricEncryption::Writer) if defined?(SymmetricEncryption)
   register_extension(:file, IOStreams::File::Reader,         IOStreams::File::Writer)
   register_extension(:gz,   IOStreams::Gzip::Reader,         IOStreams::Gzip::Writer)
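
The `inject` call in `IOStreams.stream` above builds a chain of lambdas in which each stream class opens its input and hands the wrapped stream to the next one. A self-contained sketch of that pattern follows. `Stage`, `Suffix`, and `chain` are hypothetical names used only for illustration; the suffixes exist purely to make the nesting order visible:

```ruby
require 'stringio'

# Hypothetical stand-in for the gem's stream structs (klass plus options).
Stage = Struct.new(:klass, :options)

# Hypothetical stream class: like the readers in the diff it exposes
# .open(io, options, &block) and yields an object that implements #read.
class Suffix
  def self.open(io, options = {}, &block)
    block.call(new(io, options.fetch(:suffix)))
  end

  def initialize(io, suffix)
    @io     = io
    @suffix = suffix
  end

  # Read from the wrapped stream and tag the data on the way through
  def read(*args)
    data = @io.read(*args)
    data && data + @suffix
  end
end

# Same shape as the inject in the diff: the user's block is the innermost
# callable, and every stage wraps the callable built so far.
def chain(stages, source, &block)
  last = stages.inject(block) do |inner, stage|
    ->(io) { stage.klass.open(io, stage.options, &inner) }
  end
  last.call(source)
end

stages = [
  Stage.new(Suffix, { suffix: '-first' }),
  Stage.new(Suffix, { suffix: '-last' })
]

result = nil
chain(stages, StringIO.new('data'), &->(io) { result = io.read })
result # → "data-last-first"
```

The last stage in the array is opened first and wraps the raw source, so its transformation is applied first. For a list such as `[:gz, :enc]` on a reader this would mean decrypting before decompressing, which matches the order in which the file was written.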

data/lib/io_streams/version.rb CHANGED

@@ -1,3 +1,3 @@
 module IOStreams #:nodoc
-  VERSION = "0.7.0"
+  VERSION = '0.8.0'
 end

data/lib/io_streams/zip/reader.rb CHANGED

@@ -13,8 +13,8 @@ module IOStreams
       #     end
       #   end
       def self.open(file_name_or_io, options={}, &block)
-        options       = options.dup
-        buffer_size   = options.delete(:buffer_size) || 65536
+        options     = options.dup
+        buffer_size = options.delete(:buffer_size) || 65536
         raise(ArgumentError, "Unknown IOStreams::Zip::Reader option: #{options.inspect}") if options.size > 0
         # File name supplied
@@ -54,7 +54,7 @@ module IOStreams
         begin
           require 'zip'
         rescue LoadError => exc
-          puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
+          puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
           raise(exc)
         end

data/lib/io_streams/zip/writer.rb CHANGED

@@ -67,7 +67,7 @@ module IOStreams
         begin
           require 'zip'
         rescue LoadError => exc
-          puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
+          puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
           raise(exc)
         end

data/lib/iostreams.rb CHANGED

@@ -12,5 +12,9 @@ module IOStreams
     autoload :Reader,  'io_streams/zip/reader'
     autoload :Writer,  'io_streams/zip/writer'
   end
+  module Delimited
+    autoload :Reader,  'io_streams/delimited/reader'
+    autoload :Writer,  'io_streams/delimited/writer'
+  end
 end
 require 'io_streams/io_streams'

data/test/delimited_reader_test.rb ADDED

@@ -0,0 +1,71 @@
+require_relative 'test_helper'
+# Unit Test for IOStreams::Delimited::Reader
+module Streams
+  class DelimitedReaderTest < Minitest::Test
+    context IOStreams::Delimited::Reader do
+      setup do
+        @file_name = File.join(File.dirname(__FILE__), 'files', 'text.txt')
+        @data      = []
+        File.open(@file_name, 'rt') do |file|
+          while !file.eof?
+            @data << file.readline.strip
+          end
+        end
+      end
+      context '.open' do
+        should 'each_line file' do
+          lines = []
+          IOStreams::Delimited::Reader.open(@file_name) do |io|
+            io.each_line { |line| lines << line }
+          end
+          assert_equal @data, lines
+        end
+        should 'each_line stream' do
+          lines = []
+          File.open(@file_name) do |file|
+            IOStreams::Delimited::Reader.open(file) do |io|
+              io.each_line { |line| lines << line }
+            end
+          end
+          assert_equal @data, lines
+        end
+        ["\r\n", "\n\r", "\n"].each do |delimiter|
+          should "autodetect delimiter: #{delimiter.inspect}" do
+            lines  = []
+            stream = StringIO.new(@data.join(delimiter))
+            IOStreams::Delimited::Reader.open(stream, buffer_size: 15) do |io|
+              io.each_line { |line| lines << line }
+            end
+            assert_equal @data, lines
+          end
+        end
+        ['@', 'BLAH'].each do |delimiter|
+          should "read delimited #{delimiter.inspect}" do
+            lines  = []
+            stream = StringIO.new(@data.join(delimiter))
+            IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter) do |io|
+              io.each_line { |line| lines << line }
+            end
+            assert_equal @data, lines
+          end
+        end
+        should "read binary delimited" do
+          delimiter = "\x01"
+          lines     = []
+          stream    = StringIO.new(@data.join(delimiter))
+          IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter, encoding: IOStreams::BINARY_ENCODING) do |io|
+            io.each_line { |line| lines << line }
+          end
+          assert_equal @data, lines
+        end
+      end
+    end
+  end
+end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.0
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-07-14 00:00:00.000000000 Z
+date: 2015-08-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: symmetric-encryption
@@ -47,6 +47,7 @@ extra_rdoc_files: []
 files:
 - README.md
 - Rakefile
+- lib/io_streams/delimited/reader.rb
 - lib/io_streams/file/reader.rb
 - lib/io_streams/file/writer.rb
 - lib/io_streams/gzip/reader.rb
@@ -56,6 +57,7 @@ files:
 - lib/io_streams/zip/reader.rb
 - lib/io_streams/zip/writer.rb
 - lib/iostreams.rb
+- test/delimited_reader_test.rb
 - test/file_reader_test.rb
 - test/file_writer_test.rb
 - test/files/text.txt
@@ -92,6 +94,7 @@ signing_key:
 specification_version: 4
 summary: Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
 test_files:
+- test/delimited_reader_test.rb
 - test/file_reader_test.rb
 - test/file_writer_test.rb
 - test/files/text.txt