iostreams 0.7.0 → 0.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: dbaf3d597b1fc3ad00a9c371f932236332d3a1b4
4
- data.tar.gz: ee333efa4fe7737d77dfdaeff353e4811fa4f298
3
+ metadata.gz: 6bb29de6cb870a20f009e5517ab9aab138bb3c67
4
+ data.tar.gz: 9fb17cc79cbb54a550834d59abac8d4533d201cd
5
5
  SHA512:
6
- metadata.gz: 801f73a031c8fbadd38bf7939fa90dd6fca5360fef7c8f77b08c65ed72bdb142deaecccf8c9a3dd2c7f47af4725397635959cdb14a87e4af603967ced96709a2
7
- data.tar.gz: 7abef1b3814194b1fa69d05e5c007aaca2b386329fb2d4af3a228dcedccd465501c5a2036f510b7e3602c96ad54b030fe7385dbb096a566e3a26ead686d3f167
6
+ metadata.gz: 1cb75b18750e30ebe36cc4c0bd1496fc539d1b2976bdd975ee8c186a46d132e99aa85243cd5bb07e844c6cee8c8854e727524e3baeb5b713ba2f76be3c294805
7
+ data.tar.gz: afc66bc4c5425f1e199f3edb4d078a944e2b7e9ec48968744e1be81619029fc22b7065fabea367a6ea3863ff6c89dafd686617956ca3bee6a23975ed33e3f1cd
data/README.md CHANGED
@@ -1,39 +1,145 @@
1
- # iostreams
1
+ # iostreams [![Gem Version](https://badge.fury.io/rb/iostreams.svg)](http://badge.fury.io/rb/iostreams) [![Build Status](https://secure.travis-ci.org/rocketjob/iostreams.png?branch=master)](http://travis-ci.org/rocketjob/iostreams) ![](http://ruby-gem-downloads-badge.herokuapp.com/iostreams?type=total)
2
2
 
3
- Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
3
+ Input and Output streaming for Ruby
4
4
 
5
- ## Status
5
+ ## Project Status
6
6
 
7
- Alpha - Feedback on the API is welcome. API will change.
7
+ Beta - Feedback on the API is welcome. API is subject to change.
8
8
 
9
- ## Introduction
9
+ ## Features
10
+
11
+ Currently streaming classes are available for:
12
+
13
+ * Zip
14
+ * Gzip
15
+ * Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
10
16
 
11
- `iostreams` allows files to be read and written in a streaming fashion to reduce
12
- memory overhead. It supports reading and writing of Zip, GZip and encrypted files.
17
+ ## Introduction
13
18
 
14
- These streams can be chained together just like piped programs in linux.
15
- This allows one stream to read the file, another stream to decrypt the file and
16
- then a third stream to decompress the result.
19
+ If all files were small, they could just be loaded into memory in their entirety. With the
20
+ advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
21
+ them into memory is not feasible.
22
+
23
+ In linux it is common to use pipes to stream data between processes.
24
+ For example:
25
+
26
+ ```
27
+ # Count the number of lines in a file that has been compressed with gzip
28
+ cat abc.gz | gunzip -c | wc -l
29
+ ```
30
+
31
+ For large files it is critical to be able to read and write these files as streams. Ruby has support
32
+ for reading and writing files using streams, but has no built-in way of passing one stream through
33
+ another to support for example compressing the data, encrypting it and then finally writing the result
34
+ to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
35
+ together several streams; `iostreams` attempts to offer similar features for Ruby.
36
+
37
+ ```ruby
38
+ # Read a compressed file:
39
+ IOStreams.reader('hello.gz') do |reader|
40
+ data = reader.read(1024)
41
+ puts "Read: #{data}"
42
+ end
43
+ ```
44
+
45
+ The true power of streams is shown when many streams are chained together to achieve the end
46
+ result, without holding the entire file in memory, or ideally without needing to create
47
+ any temporary files to process the stream.
48
+
49
+ ```ruby
50
+ # Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
51
+ IOStreams.writer('hello.gz.enc') do |writer|
52
+ writer.write('Hello World')
53
+ writer.write('and some more')
54
+ end
55
+ ```
56
+
57
+ The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
58
+ or even gigabytes.
59
+
60
+ By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
61
+ to the data being read or written. For example:
62
+ * `hello.zip` => Compressed using Zip
63
+ * `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
64
+ * `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
17
65
 
18
66
  The objective is that all of these processes are performed using streaming
19
- so that only portions of the file are loaded into memory at a time.
67
+ so that only the current portion of the file is loaded into memory as it moves
68
+ through the entire file.
20
69
  Where possible each stream never goes to disk, which for example could expose
21
70
  un-encrypted data.
22
71
 
72
+ ## Architecture
73
+
74
+ Streams are chained together by passing the
75
+
76
+ Every Reader or Writer is invoked by calling its `.open` method and passing the block
77
+ that must be invoked for the duration of that stream.
78
+
79
+ The above block is passed the stream that needs to be encoded/decoded using that
80
+ Reader or Writer every time the `#read` or `#write` method is called on it.
81
+
82
+ ### Readers
83
+
84
+ Each reader stream must implement: `#read`
85
+
86
+ ### Writers
87
+
88
+ Each writer stream must implement: `#write`
89
+
90
+ ### Optional methods
91
+
92
+ The following methods on the stream are useful for both Readers and Writers
93
+
94
+ ### close
95
+
96
+ Close the stream, and cleanup any buffers, etc.
97
+
98
+ ### closed?
99
+
100
+ Has the stream already been closed? Useful, when child streams have already closed the stream
101
+ so that `#close` is not called more than once on a stream.
102
+
23
103
  ## Notes
24
104
 
25
105
  * Due to the nature of Zip, both its Reader and Writer methods will create
26
106
  a temp file when reading from or writing to a stream.
27
107
  Recommended to use Gzip over Zip since it can be streamed.
28
-
29
- ## Meta
30
-
31
- * Code: `git clone git://github.com/rocketjob/iostreams.git`
32
- * Home: <https://github.com/rocketjob/iostreams>
33
- * Issues: <http://github.com/rocketjob/iostreams/issues>
34
- * Gems: <http://rubygems.org/gems/iostreams>
35
-
36
- This project uses [Semantic Versioning](http://semver.org/).
108
+ * Zip becomes exponentially slower with very large files, especially files
109
+ that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
110
+
111
+ ## Future
112
+
113
+ Below are just some of the streams that are envisaged for `iostreams`:
114
+ * PGP reader and writer
115
+ * Read and write PGP encrypted files
116
+ * CSV
117
+ * Read and write CSV data, reading data back as Arrays and writing Arrays as CSV text
118
+ * Delimited Text Stream
119
+ * Autodetect Windows/Linux line endings and return a line at a time
120
+ * MongoFS
121
+ * Read and write file streams to and from MongoFS
122
+
123
+ For example:
124
+ ```ruby
125
+ # Read a CSV file, delimited with Windows line endings, compressed with GZip, and encrypted with PGP:
126
+ IOStreams.reader('hello.csv.gz.pgp', [:csv, :delimited, :gz, :pgp]) do |reader|
127
+ # Returns an Array at a time
128
+ reader.each do |row|
129
+ puts "Read: #{row.inspect}"
130
+ end
131
+ end
132
+ ```
133
+
134
+ To completely implement io streaming for Ruby will take a lot more input and thoughts
135
+ from the Ruby community. This gem represents a starting point to get the discussion going.
136
+
137
+ By keeping this gem in Beta state and not going V1, we can change the interface as needed
138
+ to implement community feedback.
139
+
140
+ ## Versioning
141
+
142
+ This project adheres to [Semantic Versioning](http://semver.org/).
37
143
 
38
144
  ## Author
39
145
 
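The README's shell pipeline (`cat abc.gz | gunzip -c | wc -l`) has a rough plain-Ruby counterpart that uses only the standard library's `Zlib`. The sketch below is not part of `iostreams`; the helper name `count_gzip_lines` and the temporary sample file are illustrative assumptions.

```ruby
require 'zlib'
require 'tempfile'

# Hypothetical helper (not part of iostreams): count the lines in a gzip file
# while decompressing it as a stream, so the whole file is never held in memory.
def count_gzip_lines(file_name)
  count = 0
  Zlib::GzipReader.open(file_name) do |gz|
    gz.each_line { count += 1 }
  end
  count
end

# Build a small gzip file to run the helper against.
SAMPLE_GZ = Tempfile.new(['sample', '.gz'])
SAMPLE_GZ.close
Zlib::GzipWriter.open(SAMPLE_GZ.path) do |gz|
  gz.write("first\nsecond\nthird\n")
end

puts count_gzip_lines(SAMPLE_GZ.path)
```

Only one decompressed line is in memory at a time, which is the property the README is after when it talks about chaining streams.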
data/Rakefile CHANGED
@@ -1,21 +1,20 @@
1
1
  require 'rake/clean'
2
2
  require 'rake/testtask'
3
3
 
4
- $LOAD_PATH.unshift File.expand_path("../lib", __FILE__)
5
- require 'io_streams/version'
4
+ require_relative 'lib/io_streams/version'
6
5
 
7
6
  task :gem do
8
- system "gem build iostreams.gemspec"
7
+ system 'gem build iostreams.gemspec'
9
8
  end
10
9
 
11
10
  task :publish => :gem do
12
11
  system "git tag -a v#{IOStreams::VERSION} -m 'Tagging #{IOStreams::VERSION}'"
13
- system "git push --tags"
12
+ system 'git push --tags'
14
13
  system "gem push iostreams-#{IOStreams::VERSION}.gem"
15
14
  system "rm iostreams-#{IOStreams::VERSION}.gem"
16
15
  end
17
16
 
18
- desc "Run Test Suite"
17
+ desc 'Run Test Suite'
19
18
  task :test do
20
19
  Rake::TestTask.new(:functional) do |t|
21
20
  t.test_files = FileList['test/**/*_test.rb']
data/lib/io_streams/delimited/reader.rb ADDED
@@ -0,0 +1,121 @@
1
+ module IOStreams
2
+ module Delimited
3
+ class Reader
4
+ attr_accessor :delimiter
5
+
6
+ # Read from a file or stream
7
+ def self.open(file_name_or_io, options={}, &block)
8
+ if file_name_or_io.respond_to?(:read)
9
+ block.call(new(file_name_or_io, options))
10
+ else
11
+ ::File.open(file_name_or_io, 'rb') do |io|
12
+ block.call(new(io, options))
13
+ end
14
+ end
15
+ end
16
+
17
+ # Create a delimited UTF8 stream reader from the supplied input streams
18
+ #
19
+ # The input stream should be binary with no text conversions performed
20
+ # since `strip_non_printable` will be applied to the binary stream before
21
+ # converting to UTF-8
22
+ #
23
+ # Parameters
24
+ # input_stream
25
+ # The input stream that implements #read
26
+ #
27
+ # options
28
+ # :delimiter[Symbol|String]
29
+ # Line / Record delimiter to use to break the stream up into records
30
+ # nil
31
+ # Automatically detect line endings and break up by line
32
+ # Searches for the first "\r\n" or "\n" and then uses that as the
33
+ # delimiter for all subsequent records
34
+ # String:
35
+ # Any string to break the stream up by
36
+ # The records when saved will not include this delimiter
37
+ # Default: nil
38
+ #
39
+ # :buffer_size [Integer]
40
+ # Maximum size of the buffer into which to read the stream into for
41
+ # processing.
42
+ # Must be large enough to hold the entire first line and its delimiter(s)
43
+ # Default: 65536 ( 64K )
44
+ #
45
+ # :strip_non_printable [true|false]
46
+ # Strip all non-printable characters read from the file
47
+ # Default: true iff :encoding is UTF8_ENCODING, otherwise false
48
+ #
49
+ # :encoding
50
+ # Force encoding to this encoding for all data being read
51
+ # Default: UTF8_ENCODING
52
+ # Set to nil to disable encoding
53
+ def initialize(input_stream, options={})
54
+ @input_stream = input_stream
55
+ options = options.dup
56
+ @delimiter = options.delete(:delimiter)
57
+ @buffer_size = options.delete(:buffer_size) || 65536
58
+ @encoding = options.has_key?(:encoding) ? options.delete(:encoding) : UTF8_ENCODING
59
+ @strip_non_printable = options.delete(:strip_non_printable)
60
+ @strip_non_printable = (@encoding == UTF8_ENCODING) if @strip_non_printable.nil?
61
+ raise ArgumentError.new("Unknown IOStreams::Delimited::Reader#initialize options: #{options.inspect}") if options.size > 0
62
+
63
+ @delimiter.force_encoding(UTF8_ENCODING) if @delimiter && @encoding
64
+ @buffer = ''
65
+ end
66
+
67
+ # Returns each line, one at a time, to the supplied block
68
+ def each_line(&block)
69
+ partial = nil
70
+ loop do
71
+ if read_chunk == 0
72
+ block.call(partial) if partial
73
+ return
74
+ end
75
+
76
+ self.delimiter ||= detect_delimiter
77
+ end_index ||= (delimiter.size + 1) * -1
78
+
79
+ @buffer.each_line(delimiter) do |line|
80
+ if line.end_with?(delimiter)
81
+ # Strip off delimiter
82
+ block.call(line[0..end_index])
83
+ partial = nil
84
+ else
85
+ partial = line
86
+ end
87
+ end
88
+ @buffer = partial.nil? ? '' : partial
89
+ end
90
+ end
91
+
92
+ ##########################################################################
93
+ private
94
+
95
+ # Returns [Integer] the number of bytes read into the internal buffer
96
+ # Returns 0 on EOF
97
+ def read_chunk
98
+ chunk = @input_stream.read(@buffer_size)
99
+ # EOF reached?
100
+ return 0 unless chunk
101
+
102
+ # Strip out non-printable characters before converting to UTF-8
103
+ chunk = chunk.scan(/[[:print:]]|\r|\n/).join if @strip_non_printable
104
+
105
+ @buffer << (@encoding ? chunk.force_encoding(@encoding) : chunk)
106
+ chunk.size
107
+ end
108
+
109
+ # Auto detect text line delimiter
110
+ def detect_delimiter
111
+ if @buffer =~ /\r\n|\n\r|\n/
112
+ $&
113
+ elsif @buffer.size <= @buffer_size
114
+ # Handle one line files that are smaller than the buffer size
115
+ "\n"
116
+ end
117
+ end
118
+
119
+ end
120
+ end
121
+ end
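The buffering approach in the reader above can be tried without the gem. The following is a minimal, simplified sketch of the same idea (read fixed-size chunks, detect the delimiter once, yield complete records, carry the incomplete tail into the next chunk). The method name `each_delimited_line` is hypothetical, and the sketch omits the encoding and non-printable handling. Like the original, it assumes the first delimiter is visible within one buffer: a `"\n\r"` pair that straddles a chunk boundary would be detected as `"\n"`.

```ruby
require 'stringio'

# Illustrative re-implementation, not the gem's class.
def each_delimited_line(io, buffer_size: 16, delimiter: nil)
  buffer = +''
  while (chunk = io.read(buffer_size))
    buffer << chunk
    # Auto-detect the delimiter the first time one shows up in the buffer
    delimiter ||= buffer[/\r\n|\n\r|\n/]
    next unless delimiter

    records = buffer.split(delimiter, -1)
    # The last element is an incomplete record, or '' if the buffer
    # ended exactly on a delimiter; keep it for the next chunk
    buffer = records.pop || +''
    records.each { |record| yield record }
  end
  # Whatever is left at EOF is the final record (no trailing delimiter)
  yield buffer unless buffer.empty?
end

input = StringIO.new("alpha\r\nbeta\r\ngamma")
each_delimited_line(input, buffer_size: 4) { |line| puts line }
```

The small `buffer_size` forces records and delimiters to span several chunks, which is the same reason the gem's tests use `buffer_size: 15`.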
data/lib/io_streams/io_streams.rb CHANGED
@@ -3,6 +3,9 @@ module IOStreams
3
3
  # A registry to hold formats for processing files during upload or download
4
4
  @@extensions = ThreadSafe::Hash.new
5
5
 
6
+ UTF8_ENCODING = Encoding.find('UTF-8').freeze
7
+ BINARY_ENCODING = Encoding.find('BINARY').freeze
8
+
6
9
  # Returns [Array] the formats required to process the file by looking at
7
10
  # its extension(s)
8
11
  #
@@ -28,9 +31,9 @@ module IOStreams
28
31
  # RocketJob::Formatter::Formats.streams_for_file_name('myfile.csv')
29
32
  # => [ :file ]
30
33
  def self.streams_for_file_name(file_name)
31
- raise ArgumentError.new("File name cannot be nil") if file_name.nil?
34
+ raise ArgumentError.new('File name cannot be nil') if file_name.nil?
32
35
  raise ArgumentError.new("RocketJob Cannot detect file format when uploading to stream: #{file_name.inspect}") if file_name.respond_to?(:read)
33
- parts = file_name.split('.')
36
+ parts = file_name.split('.')
34
37
  extensions = []
35
38
  while extension = parts.pop
36
39
  break unless @@extensions[extension.to_sym]
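The hunk above cuts off partway through `streams_for_file_name`. As a rough illustration of extension-based detection, the sketch below walks the extensions from right to left and stops at the first unregistered one. The registry constant, the method name `detect_streams`, and the order of the returned symbols are assumptions made for this example. Only the `[:file]` fallback is taken from the comment shown in the diff.

```ruby
# Hypothetical registry for illustration; the gem keeps its own in @@extensions
KNOWN_EXTENSIONS = %i[enc gz zip].freeze

# Collect every registered extension, reading the file name right to left,
# and stop at the first part that is not a registered extension.
def detect_streams(file_name)
  parts   = file_name.split('.')
  streams = []
  while (extension = parts.pop)
    break unless KNOWN_EXTENSIONS.include?(extension.to_sym)

    # Assumption: keep the symbols in the order they appear in the name
    streams.unshift(extension.to_sym)
  end
  # Fall back to a plain file stream when no extension is recognised
  streams.empty? ? [:file] : streams
end

p detect_streams('hello.gz.enc')
p detect_streams('myfile.csv')
```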
@@ -192,7 +195,7 @@ module IOStreams
192
195
  def self.stream(type, file_name_or_io, streams=nil, &block)
193
196
  unless streams
194
197
  respond_to = type == :reader ? :read : :write
195
- streams = file_name_or_io.respond_to?(respond_to) ? [ :file ] : streams_for_file_name(file_name_or_io)
198
+ streams = file_name_or_io.respond_to?(respond_to) ? [:file] : streams_for_file_name(file_name_or_io)
196
199
  end
197
200
  stream_structs = streams_for(type, streams)
198
201
  if stream_structs.size == 1
@@ -200,7 +203,7 @@ module IOStreams
200
203
  stream_struct.klass.open(file_name_or_io, stream_struct.options, &block)
201
204
  else
202
205
  # Daisy chain multiple streams together
203
- last = stream_structs.inject(block){ |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
206
+ last = stream_structs.inject(block) { |inner, stream_struct| -> io { stream_struct.klass.open(io, stream_struct.options, &inner) } }
204
207
  last.call(file_name_or_io)
205
208
  end
206
209
  end
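The `inject` call that daisy-chains the streams is dense, so here is a self-contained sketch of the same pattern using a toy reader. `TagReader`, `StreamStruct`, and `chain` are hypothetical names made up for this example; only the shape of the `inject` expression is taken from the diff.

```ruby
require 'stringio'

# Toy stream class: `.open` wraps the io it is given and yields the wrapper.
# `#read` labels whatever the wrapped io returns, making the nesting visible.
class TagReader
  def self.open(io, options = {})
    yield new(io, options)
  end

  def initialize(io, options)
    @io  = io
    @tag = options.fetch(:tag)
  end

  def read(*args)
    data = @io.read(*args)
    data && "#{@tag}(#{data})"
  end
end

StreamStruct = Struct.new(:klass, :options)

# Same shape as the inject in the diff: start from the caller's block and
# wrap it in one lambda per stream; each lambda opens its stream around the
# io it receives and passes the result on to the previously built lambda.
def chain(stream_structs, io, &block)
  last = stream_structs.inject(block) do |inner, stream_struct|
    ->(stream) { stream_struct.klass.open(stream, stream_struct.options, &inner) }
  end
  last.call(io)
end

structs = [
  StreamStruct.new(TagReader, { tag: 'a' }),
  StreamStruct.new(TagReader, { tag: 'b' })
]
puts chain(structs, StringIO.new('x')) { |reader| reader.read }
```

In this sketch the last entry in the list wraps the raw io, and the caller's block receives the reader built from the first entry.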
@@ -208,7 +211,7 @@ module IOStreams
208
211
  # type: :reader or :writer
209
212
  def self.streams_for(type, params)
210
213
  if params.is_a?(Symbol)
211
- [ stream_struct_for_stream(type, params) ]
214
+ [stream_struct_for_stream(type, params)]
212
215
  elsif params.is_a?(Array)
213
216
  a = []
214
217
  params.each do |stream|
@@ -235,6 +238,7 @@ module IOStreams
235
238
  end
236
239
 
237
240
  # Register File extensions
241
+ # @formatter:off
238
242
  register_extension(:enc, SymmetricEncryption::Reader, SymmetricEncryption::Writer) if defined?(SymmetricEncryption)
239
243
  register_extension(:file, IOStreams::File::Reader, IOStreams::File::Writer)
240
244
  register_extension(:gz, IOStreams::Gzip::Reader, IOStreams::Gzip::Writer)
data/lib/io_streams/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module IOStreams #:nodoc
2
- VERSION = "0.7.0"
2
+ VERSION = '0.8.0'
3
3
  end
data/lib/io_streams/zip/reader.rb CHANGED
@@ -13,8 +13,8 @@ module IOStreams
13
13
  # end
14
14
  # end
15
15
  def self.open(file_name_or_io, options={}, &block)
16
- options = options.dup
17
- buffer_size = options.delete(:buffer_size) || 65536
16
+ options = options.dup
17
+ buffer_size = options.delete(:buffer_size) || 65536
18
18
  raise(ArgumentError, "Unknown IOStreams::Zip::Reader option: #{options.inspect}") if options.size > 0
19
19
 
20
20
  # File name supplied
@@ -54,7 +54,7 @@ module IOStreams
54
54
  begin
55
55
  require 'zip'
56
56
  rescue LoadError => exc
57
- puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
57
+ puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
58
58
  raise(exc)
59
59
  end
60
60
 
@@ -67,7 +67,7 @@ module IOStreams
67
67
  begin
68
68
  require 'zip'
69
69
  rescue LoadError => exc
70
- puts "Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI"
70
+ puts 'Please install gem rubyzip so that RocketJob can read Zip files in Ruby MRI'
71
71
  raise(exc)
72
72
  end
73
73
 
data/lib/iostreams.rb CHANGED
@@ -12,5 +12,9 @@ module IOStreams
12
12
  autoload :Reader, 'io_streams/zip/reader'
13
13
  autoload :Writer, 'io_streams/zip/writer'
14
14
  end
15
+ module Delimited
16
+ autoload :Reader, 'io_streams/delimited/reader'
17
+ autoload :Writer, 'io_streams/delimited/writer'
18
+ end
15
19
  end
16
20
  require 'io_streams/io_streams'
data/test/delimited_reader_test.rb ADDED
@@ -0,0 +1,71 @@
1
+ require_relative 'test_helper'
2
+
3
+ # Unit Test for IOStreams::Delimited::Reader
4
+ module Streams
5
+ class DelimitedReaderTest < Minitest::Test
6
+ context IOStreams::Delimited::Reader do
7
+ setup do
8
+ @file_name = File.join(File.dirname(__FILE__), 'files', 'text.txt')
9
+ @data = []
10
+ File.open(@file_name, 'rt') do |file|
11
+ while !file.eof?
12
+ @data << file.readline.strip
13
+ end
14
+ end
15
+ end
16
+
17
+ context '.open' do
18
+ should 'each_line file' do
19
+ lines = []
20
+ IOStreams::Delimited::Reader.open(@file_name) do |io|
21
+ io.each_line { |line| lines << line }
22
+ end
23
+ assert_equal @data, lines
24
+ end
25
+
26
+ should 'each_line stream' do
27
+ lines = []
28
+ File.open(@file_name) do |file|
29
+ IOStreams::Delimited::Reader.open(file) do |io|
30
+ io.each_line { |line| lines << line }
31
+ end
32
+ end
33
+ assert_equal @data, lines
34
+ end
35
+
36
+ ["\r\n", "\n\r", "\n"].each do |delimiter|
37
+ should "autodetect delimiter: #{delimiter.inspect}" do
38
+ lines = []
39
+ stream = StringIO.new(@data.join(delimiter))
40
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15) do |io|
41
+ io.each_line { |line| lines << line }
42
+ end
43
+ assert_equal @data, lines
44
+ end
45
+ end
46
+
47
+ ['@', 'BLAH'].each do |delimiter|
48
+ should "read delimited #{delimiter.inspect}" do
49
+ lines = []
50
+ stream = StringIO.new(@data.join(delimiter))
51
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter) do |io|
52
+ io.each_line { |line| lines << line }
53
+ end
54
+ assert_equal @data, lines
55
+ end
56
+ end
57
+
58
+ should "read binary delimited" do
59
+ delimiter = "\x01"
60
+ lines = []
61
+ stream = StringIO.new(@data.join(delimiter))
62
+ IOStreams::Delimited::Reader.open(stream, buffer_size: 15, delimiter: delimiter, encoding: IOStreams::BINARY_ENCODING) do |io|
63
+ io.each_line { |line| lines << line }
64
+ end
65
+ assert_equal @data, lines
66
+ end
67
+ end
68
+
69
+ end
70
+ end
71
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: iostreams
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.0
4
+ version: 0.8.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Reid Morrison
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-07-14 00:00:00.000000000 Z
11
+ date: 2015-08-25 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: symmetric-encryption
@@ -47,6 +47,7 @@ extra_rdoc_files: []
47
47
  files:
48
48
  - README.md
49
49
  - Rakefile
50
+ - lib/io_streams/delimited/reader.rb
50
51
  - lib/io_streams/file/reader.rb
51
52
  - lib/io_streams/file/writer.rb
52
53
  - lib/io_streams/gzip/reader.rb
@@ -56,6 +57,7 @@ files:
56
57
  - lib/io_streams/zip/reader.rb
57
58
  - lib/io_streams/zip/writer.rb
58
59
  - lib/iostreams.rb
60
+ - test/delimited_reader_test.rb
59
61
  - test/file_reader_test.rb
60
62
  - test/file_writer_test.rb
61
63
  - test/files/text.txt
@@ -92,6 +94,7 @@ signing_key:
92
94
  specification_version: 4
93
95
  summary: Ruby Input and Output streaming with support for Zip, Gzip, and Encryption.
94
96
  test_files:
97
+ - test/delimited_reader_test.rb
95
98
  - test/file_reader_test.rb
96
99
  - test/file_writer_test.rb
97
100
  - test/files/text.txt