iostreams 0.7.0 → 0.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: dbaf3d597b1fc3ad00a9c371f932236332d3a1b4
-   data.tar.gz: ee333efa4fe7737d77dfdaeff353e4811fa4f298
+   metadata.gz: 6bb29de6cb870a20f009e5517ab9aab138bb3c67
+   data.tar.gz: 9fb17cc79cbb54a550834d59abac8d4533d201cd
  SHA512:
-   metadata.gz: 801f73a031c8fbadd38bf7939fa90dd6fca5360fef7c8f77b08c65ed72bdb142deaecccf8c9a3dd2c7f47af4725397635959cdb14a87e4af603967ced96709a2
-   data.tar.gz: 7abef1b3814194b1fa69d05e5c007aaca2b386329fb2d4af3a228dcedccd465501c5a2036f510b7e3602c96ad54b030fe7385dbb096a566e3a26ead686d3f167
+   metadata.gz: 1cb75b18750e30ebe36cc4c0bd1496fc539d1b2976bdd975ee8c186a46d132e99aa85243cd5bb07e844c6cee8c8854e727524e3baeb5b713ba2f76be3c294805
+   data.tar.gz: afc66bc4c5425f1e199f3edb4d078a944e2b7e9ec48968744e1be81619029fc22b7065fabea367a6ea3863ff6c89dafd686617956ca3bee6a23975ed33e3f1cd
data/README.md CHANGED
@@ -1,39 +1,145 @@
1
- # iostreams
1
+ # iostreams [![Gem Version](https://badge.fury.io/rb/iostreams.svg)](http://badge.fury.io/rb/iostreams) [![Build Status](https://secure.travis-ci.org/rocketjob/iostreams.png?branch=master)](http://travis-ci.org/rocketjob/iostreams) ![](http://ruby-gem-downloads-badge.herokuapp.com/iostreams?type=total)
2
2
 
3
- Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
3
+ Input and Output streaming for Ruby
4
4
 
5
- ## Status
5
+ ## Project Status
6
6
 
7
- Alpha - Feedback on the API is welcome. API will change.
7
+ Beta - Feedback on the API is welcome. API is subject to change.
8
8
 
9
- ## Introduction
9
+ ## Features
10
+
11
+ Currently streaming classes are available for:
12
+
13
+ * Zip
14
+ * Gzip
15
+ * Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
10
16
 
11
- `iostreams` allows files to be read and written in a streaming fashion to reduce
12
- memory overhead. It supports reading and writing of Zip, GZip and encrypted files.
17
+ ## Introduction
13
18
 
14
- These streams can be chained together just like piped programs in linux.
15
- This allows one stream to read the file, another stream to decrypt the file and
16
- then a third stream to decompress the result.
19
+ If all files were small, they could just be loaded into memory in their entirety. With the
20
+ advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
21
+ them into memory is not feasible.
22
+
23
+ In linux it is common to use pipes to stream data between processes.
24
+ For example:
25
+
26
+ ```
27
+ # Count the number of lines in a file that has been compressed with gzip
28
+ cat abc.gz | gunzip -c | wc -l
29
+ ```
30
+
31
+ For large files it is critical to be able to read and write these files as streams. Ruby has support
32
+ for reading and writing files using streams, but has no built-in way of passing one stream through
33
+ another to support for example compressing the data, encrypting it and then finally writing the result
34
+ to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
35
+ together several streams, `iostreams` attempts to offer similar features for Ruby.
36
+
37
+ ```ruby
38
+ # Read a compressed file:
39
+ IOStreams.reader('hello.gz') do |reader|
40
+ data = reader.read(1024)
41
+ puts "Read: #{data}"
42
+ end
43
+ ```
44
+
45
+ The true power of streams is shown when many streams are chained together to achieve the end
46
+ result, without holding the entire file in memory, or ideally without needing to create
47
+ any temporary files to process the stream.
48
+
49
+ ```ruby
50
+ # Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
51
+ IOStreams.writer('hello.gz.enc') do |writer|
52
+ writer.write('Hello World')
53
+ writer.write('and some more')
54
+ end
55
+ ```
56
+
57
+ The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
58
+ or even gigabytes.
59
+
60
+ By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
61
+ to the data being read or written. For example:
62
+ * `hello.zip` => Compressed using Zip
63
+ * `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
64
+ * `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
17
65
 
18
66
  The objective is that all of these streaming processes are performed using streaming
19
- so that only portions of the file are loaded into memory at a time.
67
+ so that only the current portion of the file is loaded into memory as it moves
68
+ through the entire file.
20
69
  Where possible each stream never goes to disk, which for example could expose
21
70
  un-encrypted data.
22
71
 
72
+ ## Architecture
73
+
74
+ Streams are chained together by passing the
75
+
76
+ Every Reader or Writer is invoked by calling its `.open` method and passing the block
77
+ that must be invoked for the duration of that stream.
78
+
79
+ The above block is passed the stream that needs to be encoded/decoded using that
80
+ Reader or Writer every time the `#read` or `#write` method is called on it.
81
+
82
+ ### Readers
83
+
84
+ Each reader stream must implement: `#read`
85
+
86
+ ### Writers
87
+
88
+ Each writer stream must implement: `#write`
89
+
90
+ ### Optional methods
91
+
92
+ The following methods on the stream are useful for both Readers and Writers
93
+
94
+ ### close
95
+
96
+ Close the stream, and cleanup any buffers, etc.
97
+
98
+ ### closed?
99
+
100
+ Has the stream already been closed? Useful, when child streams have already closed the stream
101
+ so that `#close` is not called more than once on a stream.
102
+
23
103
  ## Notes
24
104
 
25
105
  * Due to the nature of Zip, both its Reader and Writer methods will create
26
106
  a temp file when reading from or writing to a stream.
27
107
  Recommended to use Gzip over Zip since it can be streamed.
28
-
29
- ## Meta
30
-
31
- * Code: `git clone git://github.com/rocketjob/iostreams.git`
32
- * Home: <https://github.com/rocketjob/iostreams>
33
- * Issues: <http://github.com/rocketjob/iostreams/issues>
34
- * Gems: <http://rubygems.org/gems/iostreams>
35
-
36
- This project uses [Semantic Versioning](http://semver.org/).
108
+ * Zip becomes exponentially slower with very large files, especially files
109
+ that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
110
+
111
+ ## Future
112
+
113
+ Below are just some of the streams that are envisaged for `iostreams`:
114
+ * PGP reader and writer
115
+ * Read and write PGP encrypted files
116
+ * CSV
117
+ * Read and write CSV data, reading data back as Arrays and writing Arrays as CSV text
118
+ * Delimited Text Stream
119
+ * Autodetect Windows/Linux line endings and return a line at a time
120
+ * MongoFS
121
+ * Read and write file streams to and from MongoFS
122
+
123
+ For example:
124
+ ```ruby
125
+ # Read a CSV file, delimited with Windows line endings, compressed with GZip, and encrypted with PGP:
126
+ IOStreams.reader('hello.csv.gz.pgp', [:csv, :delimited, :gz, :pgp]) do |reader|
127
+ # Returns an Array at a time
128
+ reader.each do |row|
129
+ puts "Read: #{row.inspect}"
130
+ end
131
+ end
132
+ ```
133
+
134
+ To completely implement io streaming for Ruby will take a lot more input and thoughts
135
+ from the Ruby community. This gem represents a starting point to get the discussion going.
136
+
137
+ By keeping this gem in Beta state and not going V1, we can change the interface as needed
138
+ to implement community feedback.
139
+
140
+ ## Versioning
141
+
142
+ This project adheres to [Semantic Versioning](http://semver.org/).
37
143
 
38
144
  ## Author
39
145
 
data/Rakefile CHANGED
@@ -1,21 +1,20 @@
  require 'rake/clean'
  require 'rake/testtask'

- $LOAD_PATH.unshift File.expand_path("../lib", __FILE__)
- require 'io_streams/version'
+ require_relative 'lib/io_streams/version'

  task :gem do
-   system "gem build iostreams.gemspec"
+   system 'gem build iostreams.gemspec'
  end

  task :publish => :gem do
    system "git tag -a v#{IOStreams::VERSION} -m 'Tagging #{IOStreams::VERSION}'"
-   system "git push --tags"
+   system 'git push --tags'
    system "gem push iostreams-#{IOStreams::VERSION}.gem"
    system "rm iostreams-#{IOStreams::VERSION}.gem"
  end

- desc "Run Test Suite"
+ desc 'Run Test Suite'
  task :test do
    Rake::TestTask.new(:functional) do |t|
      t.test_files = FileList['test/**/*_test.rb']
data/lib/io_streams/delimited/reader.rb ADDED
@@ -0,0 +1,121 @@
1
+ module IOStreams
2
+ module Delimited
3
+ class Reader
4
+ attr_accessor :delimiter
5
+
6
+ # Read from a file or stream
7
+ def self.open(file_name_or_io, options={}, &block)
8
+ if file_name_or_io.respond_to?(:read)
9
+ block.call(new(file_name_or_io, options))
10
+ else
11
+ ::File.open(file_name_or_io, 'rb') do |io|
12
+ block.call(new(io, options))
13
+ end
14
+ end
15
+ end
16
+
17
+ # Create a delimited UTF8 stream reader from the supplied input streams
18
+ #
19
+ # The input stream should be binary with no text conversions performed
20
+ # since `strip_non_printable` will be applied to the binary stream before
21
+ # converting to UTF-8
22
+ #
23
+ # Parameters
24
+ # input_stream
25
+ # The input stream that implements #read
26
+ #
27
+ # options
28
+ # :delimiter[Symbol|String]
29
+ # Line / Record delimiter to use to break the stream up into records
30
+ # nil
31
+ # Automatically detect line endings and break up by line
32
+ # Searches for the first "\r\n" or "\n" and then uses that as the
33
+ # delimiter for all subsequent records
34
+ # String:
35
+ # Any string to break the stream up by
36
+ # The records when saved will not include this delimiter
37
+ # Default: nil
38
+ #
39
+ # :buffer_size [Integer]
40
+ # Maximum size of the buffer into which to read the stream into for
41
+ # processing.
42
+ # Must be large enough to hold the entire first line and its delimiter(s)
43
+ # Default: 65536 ( 64K )
44
+ #
45
+ # :strip_non_printable [true|false]
46
+ # Strip all non-printable characters read from the file
47
+ # Default: true iff :encoding is UTF8_ENCODING, otherwise false
48
+ #
49
+ # :encoding
50
+ # Force encoding to this encoding for all data being read
51
+ # Default: UTF8_ENCODING
52
+ # Set to nil to disable encoding
53
+ def initialize(input_stream, options={})
54
+ @input_stream = input_stream
55
+ options = options.dup
56
+ @delimiter = options.delete(:delimiter)
57
+ @buffer_size = options.delete(:buffer_size) || 65536
58
+ @encoding = options.has_key?(:encoding) ? options.delete(:encoding) : UTF8_ENCODING
59
+ @strip_non_printable = options.delete(:strip_non_printable)
60
+ @strip_non_printable = (@encoding == UTF8_ENCODING) if @strip_non_printable.nil?
61
+ raise ArgumentError.new("Unknown IOStreams::Delimited::Reader#initialize options: #{options.inspect}") if options.size > 0
62
+
63
+ @delimiter.force_encoding(UTF8_ENCODING) if @delimiter && @encoding
64
+ @buffer = ''
65
+ end
66
+
67
+ # Yields one line at a time to the supplied block
68
+ def each_line(&block)
69
+ partial = nil
70
+ loop do
71
+ if read_chunk == 0
72
+ block.call(partial) if partial
73
+ return
74
+ end
75
+
76
+ self.delimiter ||= detect_delimiter
77
+ end_index ||= (delimiter.size + 1) * -1
78
+
79
+ @buffer.each_line(delimiter) do |line|
80
+ if line.end_with?(delimiter)
81
+ # Strip off delimiter
82
+ block.call(line[0..end_index])
83
+ partial = nil
84
+ else
85
+ partial = line
86
+ end
87
+ end
88
+ @buffer = partial.nil? ? '' : partial
89
+ end
90
+ end
91
+
92
+ ##########################################################################
93
+ private
94
+
95
+ # Returns [Integer] the number of bytes read into the internal buffer
96
+ # Returns 0 on EOF
97
+ def read_chunk
98
+ chunk = @input_stream.read(@buffer_size)
99
+ # EOF reached?
100
+ return 0 unless chunk
101
+
102
+ # Strip out non-printable characters before converting to UTF-8
103
+ chunk = chunk.scan(/[[:print:]]|\r|\n/).join if @strip_non_printable
104
+
105
+ @buffer << (@encoding ? chunk.force_encoding(@encoding) : chunk)
106
+ chunk.size
107
+ end
108
+
109
+ # Auto detect text line delimiter
110
+ def detect_delimiter
111
+ if @buffer =~ /\r\n|\n\r|\n/
112
+ $&
113
+ elsif @buffer.size <= @buffer_size
114
+ # Handle one line files that are smaller than the buffer size
115
+ "\n"
116
+ end
117
+ end
118
+
119
+ end
120
+ end
121
+ end
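The buffering approach in `each_line` above (read a fixed-size chunk, emit every complete record, carry the trailing partial record into the next chunk) can be sketched in isolation. The helper `each_record` and its keyword arguments are hypothetical names chosen for this example. It is a simplified illustration rather than the gem's implementation: it skips encoding handling, non-printable stripping and delimiter auto-detection, and it assumes single-byte characters so that a chunk boundary never splits a character.

```ruby
require 'stringio'

# Hypothetical helper, not part of the gem: yields each record from `io`,
# reading at most `buffer_size` bytes at a time.
def each_record(io, delimiter: "\n", buffer_size: 65_536)
  buffer = +''
  while (chunk = io.read(buffer_size))
    buffer << chunk
    # Emit every complete record. Whatever follows the last delimiter stays
    # in the buffer until the next chunk arrives, so a delimiter that
    # straddles two chunks is still found.
    while (index = buffer.index(delimiter))
      record = buffer.slice!(0, index + delimiter.size)
      yield record[0, index]
    end
  end
  # The final record may not be terminated by a delimiter.
  yield buffer unless buffer.empty?
end

records = []
each_record(StringIO.new("one\r\ntwo\r\nthree"), delimiter: "\r\n", buffer_size: 4) do |record|
  records << record
end
p records
```

The deliberately tiny `buffer_size` forces the `"\r\n"` delimiter to be split across reads, which is the case the carried-over partial buffer exists to handle.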
data/lib/io_streams/io_streams.rb CHANGED
@@ -3,6 +3,9 @@ module IOStreams
3
3
  # A registry to hold formats for processing files during upload or download
4
4
  @@extensions = ThreadSafe::Hash.new
5
5
 
6
+ UTF8_ENCODING = Encoding.find('UTF-8').freeze
7
+ BINARY_ENCODING = Encoding.find('BINARY').freeze
8
+
6
9
  # Returns [Array] the formats required to process the file by looking at
7
10
  # its extension(s)
8
11
  #
@@ -28,9 +31,9 @@ module IOStreams
28
31
  # RocketJob::Formatter::Formats.streams_for_file_name('myfile.csv')
29
32
  # => [ :file ]
30
33
  def self.streams_for_file_name(file_name)
31
- raise ArgumentError.new("File name cannot be nil") if file_name.nil?
34
+ raise ArgumentError.new('File name cannot be nil') if file_name.nil?
32
35
  raise ArgumentError.new("RocketJob Cannot detect file format when uploading to stream: #{file_name.inspect}") if file_name.respond_to?(:read)
33
- parts = file_name.split('.')
36
+ parts = file_name.split('.')
34
37
  extensions = []
35
38
  while extension = parts.pop
36
39
  break unless @@extensions[extension.to_sym]
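The hunk above cuts off partway through `streams_for_file_name`, so the following is only a sketch of the general idea: walk the file name's extensions from right to left and stop at the first one that is not registered. The registry `KNOWN_EXTENSIONS`, the method name `detect_streams` and the ordering of the returned array are assumptions made for this example. Only the `[:file]` fallback for `'myfile.csv'` is taken from the documented behaviour.

```ruby
# Hypothetical registry; the gem keeps its own in `@@extensions`.
KNOWN_EXTENSIONS = { gz: true, zip: true, enc: true }.freeze

# Hypothetical stand-in for streams_for_file_name.
def detect_streams(file_name)
  parts = file_name.split('.')
  streams = []
  # Walk extensions from the right, stopping at the first unknown one.
  while (extension = parts.pop)
    break unless KNOWN_EXTENSIONS[extension.to_sym]
    # Assumption: keep the streams in the order they appear in the name.
    streams.unshift(extension.to_sym)
  end
  streams.empty? ? [:file] : streams
end

p detect_streams('hello.csv.gz.enc')
p detect_streams('myfile.csv')
```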
@@ -192,7 +195,7 @@ module IOStreams
192
195
  def self.stream(type, file_name_or_io, streams=nil, &block)
193
196
  unless streams
194
197
  respond_to = type == :reader ? :read : :write
195
- streams = file_name_or_io.respond_to?(respond_to) ? [ :file ] : streams_for_file_name(file_name_or_io)
198
+ streams = file_name_or_io.respond_to?(respond_to) ? [:file] : streams_for_file_name(file_name_or_io)
196
199
  end
197
200
  stream_structs = streams_for(type, streams)
198
201
  if stream_structs.size == 1
@@ -200,7 +203,7 @@ module IOStreams
200
203
  stream_struct.klass.open(file_name_or_io, stream_struct.options, &block)
201
204
  else
202
205
  # Daisy chain multiple streams together
203
- last = stream_structs.inject(block){ |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
206
+ last = stream_structs.inject(block) { |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
204
207
  last.call(file_name_or_io)
205
208
  end
206
209
  end
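The `inject` call that daisy chains the streams is dense, so here is a small standalone sketch of the same folding pattern. `Stage` is a hypothetical stand-in for a registered stream class. Its `open` simply wraps the value it receives in its own name, so the order in which the stages run is visible in the output.

```ruby
# Hypothetical stand-in for a stream class: `open` wraps the value it is
# given and passes the result on to the next stage via the block.
Stage = Struct.new(:name) do
  def open(io, &block)
    block.call("#{name}(#{io})")
  end
end

stages = [Stage.new('gunzip'), Stage.new('decrypt')]
user_block = ->(io) { "block saw #{io}" }

# Fold the stages into a single callable. The first stage in the list ends up
# innermost, so the last stage in the list is the first to touch the input.
chain = stages.inject(user_block) do |inner, stage|
  ->(io) { stage.open(io, &inner) }
end

puts chain.call('file')
```

Each lambda closes over the callable built so far, so calling the outermost one opens the last stage, which opens the one before it, and so on until the caller's block receives the fully wrapped stream.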
@@ -208,7 +211,7 @@ module IOStreams
208
211
  # type: :reader or :writer
209
212
  def self.streams_for(type, params)
210
213
  if params.is_a?(Symbol)
211
- [ stream_struct_for_stream(type, params) ]
214
+ [stream_struct_for_stream(type, params)]
212
215
  elsif params.is_a?(Array)
213
216
  a = []
214
217
  params.each do |stream|
@@ -235,6 +238,7 @@ module IOStreams
235
238
  end
236
239
 
237
240
  # Register File extensions
241
+ # @formatter:off
238
242
  register_extension(:enc, SymmetricEncryption::Reader, SymmetricEncryption::Writer) if defined?(SymmetricEncryption)
239
243
  register_extension(:file, IOStreams::File::Reader, IOStreams::File::Writer)
240
244
  register_extension(:gz, IOStreams::Gzip::Reader, IOStreams::Gzip::Writer)
data/lib/io_streams/version.rb CHANGED
@@ -1,3 +1,3 @@
  module IOStreams #:nodoc
-   VERSION = "0.7.0"
+   VERSION = '0.8.0'
  end
  end
data/lib/io_streams/zip/reader.rb CHANGED
@@ -13,8 +13,8 @@ module IOStreams
13
13
  # end
14
14
  # end
15
15
  def self.open(file_name_or_io, options={}, &block)
16
- options = options.dup
17
- buffer_size = options.delete(:buffer_size) || 65536
16
+ options = options.dup
17
+ buffer_size = options.delete(:buffer_size) || 65536
18
18
  raise(ArgumentError, "Unknown IOStreams::Zip::Reader option: #{options.inspect}") if options.size > 0
19
19
 
20
20
  # File name supplied
@@ -54,7 +54,7 @@ module IOStreams
54
54
  begin
55
55
  require 'zip'
56
56
  rescue LoadError => exc
57
- puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
57
+ puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
58
58
  raise(exc)
59
59
  end
60
60
 
@@ -67,7 +67,7 @@ module IOStreams
67
67
  begin
68
68
  require 'zip'
69
69
  rescue LoadError => exc
70
- puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
70
+ puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
71
71
  raise(exc)
72
72
  end
73
73
 
data/lib/iostreams.rb CHANGED
@@ -12,5 +12,9 @@ module IOStreams
      autoload :Reader, 'io_streams/zip/reader'
      autoload :Writer, 'io_streams/zip/writer'
    end
+   module Delimited
+     autoload :Reader, 'io_streams/delimited/reader'
+     autoload :Writer, 'io_streams/delimited/writer'
+   end
  end
  require 'io_streams/io_streams'
  require 'io_streams/io_streams'
data/test/delimited_reader_test.rb ADDED
@@ -0,0 +1,71 @@
1
+ require_relative 'test_helper'
2
+
3
+ # Unit Test for IOStreams::Delimited::Reader
4
+ module Streams
5
+ class DelimitedReaderTest < Minitest::Test
6
+ context IOStreams::Delimited::Reader do
7
+ setup do
8
+ @file_name = File.join(File.dirname(__FILE__), 'files', 'text.txt')
9
+ @data = []
10
+ File.open(@file_name, 'rt') do |file|
11
+ while !file.eof?
12
+ @data << file.readline.strip
13
+ end
14
+ end
15
+ end
16
+
17
+ context '.open' do
18
+ should 'each_line file' do
19
+ lines = []
20
+ IOStreams::Delimited::Reader.open(@file_name) do |io|
21
+ io.each_line { |line| lines << line }
22
+ end
23
+ assert_equal @data, lines
24
+ end
25
+
26
+ should 'each_line stream' do
27
+ lines = []
28
+ File.open(@file_name) do |file|
29
+ IOStreams::Delimited::Reader.open(file) do |io|
30
+ io.each_line { |line| lines << line }
31
+ end
32
+ end
33
+ assert_equal @data, lines
34
+ end
35
+
36
+ ["\r\n", "\n\r", "\n"].each do |delimiter|
37
+ should "autodetect delimiter: #{delimiter.inspect}" do
38
+ lines = []
39
+ stream = StringIO.new(@data.join(delimiter))
40
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15) do |io|
41
+ io.each_line { |line| lines << line }
42
+ end
43
+ assert_equal @data, lines
44
+ end
45
+ end
46
+
47
+ ['@', 'BLAH'].each do |delimiter|
48
+ should "read delimited #{delimiter.inspect}" do
49
+ lines = []
50
+ stream = StringIO.new(@data.join(delimiter))
51
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter) do |io|
52
+ io.each_line { |line| lines << line }
53
+ end
54
+ assert_equal @data, lines
55
+ end
56
+ end
57
+
58
+ should "read binary delimited" do
59
+ delimiter = "\x01"
60
+ lines = []
61
+ stream = StringIO.new(@data.join(delimiter))
62
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter, encoding: IOStreams::BINARY_ENCODING) do |io|
63
+ io.each_line { |line| lines << line }
64
+ end
65
+ assert_equal @data, lines
66
+ end
67
+ end
68
+
69
+ end
70
+ end
71
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: iostreams
  version: !ruby/object:Gem::Version
-   version: 0.7.0
+   version: 0.8.0
  platform: ruby
  authors:
  - Reid Morrison
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-07-14 00:00:00.000000000 Z
+ date: 2015-08-25 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: symmetric-encryption
@@ -47,6 +47,7 @@ extra_rdoc_files: []
  files:
  - README.md
  - Rakefile
+ - lib/io_streams/delimited/reader.rb
  - lib/io_streams/file/reader.rb
  - lib/io_streams/file/writer.rb
  - lib/io_streams/gzip/reader.rb
@@ -56,6 +57,7 @@ files:
  - lib/io_streams/zip/reader.rb
  - lib/io_streams/zip/writer.rb
  - lib/iostreams.rb
+ - test/delimited_reader_test.rb
  - test/file_reader_test.rb
  - test/file_writer_test.rb
  - test/files/text.txt
@@ -92,6 +94,7 @@ signing_key:
  specification_version: 4
  summary: Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
  test_files:
+ - test/delimited_reader_test.rb
  - test/file_reader_test.rb
  - test/file_writer_test.rb
  - test/files/text.txt