iostreams 1.2.0 → 1.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 56c0432a5b7820924d8e7f72df07579366ae09dae03d8abe431e5b8fbc88de2b
-  data.tar.gz: 73d78b153b9bfd079f9d3345f8f56eed7afec70a7fddda14f5478135faedf6d3
+  metadata.gz: 1dad581b0665992975c33f75b23f50964ae1311e025b7a1524fca4004f0ede2b
+  data.tar.gz: 4db01e4d6c2d36ce522df3b323a6e0d9f42de0d1644a282a0cea06479e979289
 SHA512:
-  metadata.gz: 84b123abc4f78428344baa772356cfac922584ed8c03af41701c5bdcd17380283b3592c9989941b1126999a6b3dcef7367b4a0d4bbf67143ab678a24dc798ff6
-  data.tar.gz: 4c35ae431bc0862b47e738b04f0dd88a98661b299845f3124c08f368c3532cf03012504d0ddcd18375295dd8590ea565e5a2483244602dd88f25fca7f7ef1328
+  metadata.gz: 4057a5c484129c60dbc9c84e462026da862900e17b0604b385164210f14814fbae6d065d015ee9171402eb9f793f33ac26c0ee7658f94b8cdeb0724c796cbe63
+  data.tar.gz: 5a84fe37c1eebc775bd84b9903181ff035c325b1233ab64990e586f5b0bd3fd51c21d4f1429f9b0e8ab64733e9b63be5c5e05df7bf026e9d1d8c0cd8a7716417
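
The hunk above simply records the new artifact digests. For anyone who wants to check them locally, here is a minimal sketch, assuming `iostreams-1.2.1.gem` has been fetched into the current directory (e.g. via `gem fetch iostreams -v 1.2.1`); it digests the same two members that `checksums.yaml` covers:

```ruby
require "digest"
require "rubygems/package"

# A .gem file is a tar archive whose members include metadata.gz and
# data.tar.gz; checksums.yaml records a digest for each. Print the
# SHA256 of both members for comparison with the values above.
File.open("iostreams-1.2.1.gem", "rb") do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
      puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
    end
  end
end
```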
data/README.md CHANGED
@@ -5,433 +5,11 @@ Input and Output streaming for Ruby.
 
 ## Project Status
 
-Production Ready.
+Production Ready. Heavily used in production environments, many as part of Rocket Job.
 
-## Features
+## Documentation
 
-Supported streams:
-
-* Zip
-* Gzip
-* BZip2
-* PGP (Requires GnuPG)
-* Xlsx (Reading)
-* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
-Supported sources and/or targets:
-
-* File
-* HTTP (Read only)
-* AWS S3
-* SFTP
-
-Supported file formats:
-
-* CSV
-* Fixed width formats
-* JSON
-* PSV
-
-## Quick examples
-
-Read an entire file into memory:
-
-```ruby
-IOStreams.path('example.txt').read
-```
-
-Decompress an entire gzip file into memory:
-
-```ruby
-IOStreams.path('example.gz').read
-```
-
-Read and decompress the first file in a zip file into memory:
-
-```ruby
-IOStreams.path('example.zip').read
-```
-
-Read a file one line at a time:
-
-```ruby
-IOStreams.path('example.txt').each do |line|
-  puts line
-end
-```
-
-Read a CSV file one line at a time, returning each line as an array:
-
-```ruby
-IOStreams.path('example.csv').each(:array) do |array|
-  p array
-end
-```
-
-Read a CSV file a record at a time, returning each line as a hash.
-The first line of the file is assumed to be the header line:
-
-```ruby
-IOStreams.path('example.csv').each(:hash) do |hash|
-  p hash
-end
-```
-
-Read a file using an http get,
-decompressing the named file in the zip file,
-returning each record from the named file as a hash:
-
-```ruby
-IOStreams.
-  path("https://www5.fdic.gov/idasp/Offices2.zip").
-  option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-  reader(:hash) do |stream|
-    p stream.read
-  end
-```
-
-Read the file without unzipping it and streaming the first file in the zip:
-
-```ruby
-IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader { |file| puts file.read }
-```
-
-## Introduction
-
-If all files were small, they could just be loaded into memory in their entirety. With the
-advent of very large files, often several gigabytes, or even terabytes, in size, loading
-them into memory is not feasible.
-
-In Linux it is common to use pipes to stream data between processes.
-For example:
-
-```
-# Count the number of lines in a file that has been compressed with gzip
-cat abc.gz | gunzip -c | wc -l
-```
-
-For large files it is critical to be able to read and write them as streams. Ruby has support
-for reading and writing files using streams, but has no built-in way of passing one stream through
-another to, for example, compress the data, encrypt it, and then finally write the result
-to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
-together several streams; `iostreams` attempts to offer similar features for Ruby.
-
-```ruby
-# Read a compressed file:
-IOStreams.path("hello.gz").reader do |reader|
-  data = reader.read(1024)
-  puts "Read: #{data}"
-end
-```
-
-The true power of streams is shown when many streams are chained together to achieve the end
-result, without holding the entire file in memory, or ideally without needing to create
-any temporary files to process the stream.
-
-```ruby
-# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
-IOStreams.path("hello.gz.enc").writer do |writer|
-  writer.write("Hello World")
-  writer.write("and some more")
-end
-```
-
-The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
-or even gigabytes.
-
-By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
-to the data being read or written. For example:
-* `hello.zip` => Compressed using Zip
-* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
-* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
-The objective is that all of these processes are performed using streaming,
-so that only the current portion of the file is held in memory as it moves
-through the entire file.
-Where possible each stream never goes to disk, which could, for example, expose
-unencrypted data.
-
-## Examples
-
-While decompressing the file, display 128 characters at a time from the file.
-
-~~~ruby
-require "iostreams"
-IOStreams.path("abc.csv").reader do |io|
-  while (data = io.read(128))
-    p data
-  end
-end
-~~~
-
-While decompressing the file, display one line at a time from the file.
-
-~~~ruby
-IOStreams.path("abc.csv").each do |line|
-  puts line
-end
-~~~
-
-While decompressing the file, display each row from the csv file as an array.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:array) do |array|
-  p array
-end
-~~~
-
-While decompressing the file, display each record from the csv file as a hash.
-The first line is assumed to be the header row.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:hash) do |hash|
-  p hash
-end
-~~~
-
-Write data while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer do |io|
-  io.write("This")
-  io.write(" is ")
-  io.write(" one line\n")
-end
-~~~
-
-Write a line at a time while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:line) do |file|
-  file << "these"
-  file << "are"
-  file << "all"
-  file << "separate"
-  file << "lines"
-end
-~~~
-
-Write an array (row) at a time while compressing the file.
-Each array is converted to csv before being compressed with zip.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:array) do |io|
-  io << %w[name address zip_code]
-  io << %w[Jack There 1234]
-  io << ["Joe", "Over There somewhere", 1234]
-end
-~~~
-
-Write a hash (record) at a time while compressing the file.
-Each hash is converted to csv before being compressed with zip.
-The header row is extracted from the first hash supplied.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-~~~
-
-Write to a string IO for testing, supplying the file name so that the streams can be determined.
-
-~~~ruby
-io = StringIO.new
-IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-puts io.string
-~~~
-
-Read a CSV file and write the output to an encrypted file in JSON format.
-
-~~~ruby
-IOStreams.path("sample.json.enc").writer(:hash) do |output|
-  IOStreams.path("sample.csv").each(:hash) do |record|
-    output << record
-  end
-end
-~~~
-
-## Copying between files
-
-Stream-based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
-Encrypt the contents of the file `sample.json` and write to `sample.json.enc`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with PGP and write to `sample.json.pgp`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
-~~~
-
-Decrypt the file `abc.csv.enc` and write it to `xyz.csv`:
-
-~~~ruby
-input = IOStreams.path("abc.csv.enc")
-IOStreams.path("xyz.csv").copy_from(input)
-~~~
-
-Decrypt the file `ABC` that was encrypted with Symmetric Encryption,
-PGP-encrypt the output, and write it to `xyz.csv.pgp` using the pgp key that was imported for `a@a.com`:
-
-~~~ruby
-input = IOStreams.path("ABC").stream(:enc)
-IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
-~~~
-
-To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
-~~~ruby
-input = IOStreams.path("sample.json.zip")
-IOStreams.path("sample.copy").copy_from(input, convert: false)
-~~~
-
-## Philosophy
-
-IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
-multiple streams to process data, without loading entire files into memory.
-
-#### Linux Pipes
-
-Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
-Example: count the number of lines in a compressed file:
-
-    gunzip -c hello.csv.gz | wc -l
-
-The file `hello.csv.gz` is uncompressed and returned to standard output, which in turn is piped into the standard
-input for `wc -l`, which counts the number of lines in the uncompressed data.
-
-As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
-can start counting lines of uncompressed data, without waiting until the entire file is decompressed.
-The uncompressed contents of the file are not written to disk before passing to `wc -l`, and the file is not loaded
-into memory before passing to `wc -l`.
-
-In this way extremely large files can be processed with very little memory being used.
-
-#### Push Model
-
-The Linux pipes example above would be considered a "push model", where each task in the list pushes
-its output to the input of the next task.
-
-A major challenge or disadvantage with the push model is that buffering would need to occur between tasks, since
-each task can complete at very different speeds. To prevent large memory usage, the standard output from a previous
-task would have to be blocked to try to make it slow down.
-
-#### Pull Model
-
-Another approach when multiple tasks need to process a single stream is to move to a "pull model", where the
-task at the end of the list pulls a block from a previous task when it is ready to process it.
-
-#### IOStreams
-
-IOStreams uses the pull model when reading data, where each stream performs a read against the previous stream
-when it is ready for more data.
-
-When writing to an output stream, IOStreams uses the push model, where each block of data that is ready to be written
-is pushed to the task/stream in the list. The write push only returns once it has traversed all the way down to
-the final task / stream in the list; this avoids complex buffering issues between each task / stream in the list.
-
-Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
-
-~~~ruby
-line_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1 }
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
-to start with:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1 }
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-  lines.each { line_count += 1 }
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-It can be simplified even further using `#each`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").each { line_count += 1 }
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-The benefit in all of the above cases is that the file can be any arbitrary size and only one block of the file
-is held in memory at any time.
-
-#### Chaining
-
-In the above example only 2 streams were used. Streams can be nested as deep as necessary to process data.
-
-Example: search for all occurrences of the word apple, cleansing the input data stream of non-printable characters
-and converting it to valid US-ASCII.
-
-~~~ruby
-apple_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Encode::Reader.open(input,
-                                 encoding:       "US-ASCII",
-                                 encode_replace: "",
-                                 encode_cleaner: :printable) do |cleansed|
-    IOStreams::Line::Reader.open(cleansed) do |lines|
-      lines.each { |line| apple_count += line.scan("apple").count }
-    end
-  end
-end
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-Let IOStreams perform the above stream chaining automatically under the covers:
-
-~~~ruby
-apple_count = 0
-IOStreams.path("hello.csv.gz").
-  option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-  each do |line|
-    apple_count += line.scan("apple").count
-  end
-
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-## Notes
-
-* Due to the nature of Zip, both its Reader and Writer methods will create
-  a temp file when reading from or writing to a stream.
-  It is recommended to use GZip over Zip since GZip can be streamed without requiring temp files.
-* Zip becomes dramatically slower with very large files, especially files
-  that exceed 4GB when uncompressed. GZip is highly recommended for large files.
+[IOStreams Guide](http://rocketjob.github.io/iostreams)
 
 ## Versioning
 
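The quick examples and philosophy sections removed above now live on the documentation site linked in the new "Documentation" section. As a reminder of the style of API they described, a representative sketch reusing a file name from the removed text:

```ruby
require "iostreams"

# Read a gzip-compressed CSV one line at a time; the ".gz" extension
# selects the gunzip stream automatically.
IOStreams.path("hello.csv.gz").each do |line|
  puts line
end
```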
data/lib/io_streams/line/reader.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       LINEFEED_REGEXP = Regexp.compile(/\r\n|\n|\r/).freeze
 
       # Read a line at a time from a stream
-      def self.stream(input_stream, original_file_name: nil, **args)
+      def self.stream(input_stream, **args)
         # Pass-through if already a line reader
         return yield(input_stream) if input_stream.is_a?(self.class)
 
@@ -44,7 +44,7 @@ module IOStreams
       # - Skip "empty" / "blank" lines. RegExp?
       # - Extract header line(s) / first non-comment, non-blank line
      # - Embedded newline support, RegExp? or Proc?
-      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil)
+      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil, original_file_name: nil)
         super(input_stream)
 
         @embedded_within = embedded_within
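
The signature change above moves `original_file_name` off `.stream` and onto the reader itself, so it now arrives via `**args`. A small sketch of the resulting call shape, assuming `.stream` forwards its keyword arguments to `new` (the file name is illustrative):

```ruby
require "iostreams"
require "stringio"

# Wrap an in-memory IO in a line reader. original_file_name now travels
# through **args into Line::Reader.new, which accepts it as a keyword.
io = StringIO.new("name,login\nJack Jones,jjones\n")
IOStreams::Line::Reader.stream(io, original_file_name: "example.csv") do |lines|
  lines.each { |line| puts line }
end
```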
data/lib/io_streams/paths/http.rb CHANGED
@@ -1,5 +1,6 @@
 require "net/http"
 require "uri"
+require "cgi"
 module IOStreams
   module Paths
     class HTTP < IOStreams::Path
data/lib/io_streams/pgp.rb CHANGED
@@ -2,85 +2,9 @@ require "open3"
 module IOStreams
   # Read/Write PGP/GPG file or stream.
   #
-  # Example Setup:
-  #
-  #   1. Install OpenPGP
-  #      Mac OSX (homebrew): `brew install gpg2`
-  #      Redhat Linux: `rpm install gpg2`
-  #
-  #   2. Generate the sender's private and public key:
-  #      IOStreams::Pgp.generate_key(name: 'Sender', email: 'sender@example.org', passphrase: 'sender_passphrase')
-  #
-  #   3. Generate the receiver's private and public key:
-  #      IOStreams::Pgp.generate_key(name: 'Receiver', email: 'receiver@example.org', passphrase: 'receiver_passphrase')
-  #
-  # Example 1:
-  #
-  #   # Generate an encrypted file for a specific recipient and sign it with the sender's credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams::Pgp::Writer.open('secure.gpg', recipient: 'receiver@example.org', signer: 'sender@example.org', signer_passphrase: 'sender_passphrase') do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # The recipient must also have the sender's public key to verify the signature
-  #   IOStreams::Pgp::Reader.open('secure.gpg', passphrase: 'receiver_passphrase') do |stream|
-  #     while !stream.eof?
-  #       p stream.read(10)
-  #       puts
-  #     end
-  #   end
-  #
-  # Example 2:
-  #
-  #   # Default user and passphrase to sign the output file:
-  #   IOStreams::Pgp::Writer.default_signer            = 'sender@example.org'
-  #   IOStreams::Pgp::Writer.default_signer_passphrase = 'sender_passphrase'
-  #
-  #   # Default passphrase for decrypting recipients' files.
-  #   # Note: Usually this would be the sender's passphrase, but in this example
-  #   #       it is decrypting the file intended for the recipient.
-  #   IOStreams::Pgp::Reader.default_passphrase = 'receiver_passphrase'
-  #
-  #   # Generate an encrypted file for a specific recipient and sign it with the sender's credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams.writer('secure.gpg', streams: {pgp: {recipient: 'receiver@example.org'}}) do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # The recipient must also have the sender's public key to verify the signature
-  #   IOStreams.reader('secure.gpg') do |stream|
-  #     while data = stream.read(10)
-  #       p data
-  #     end
-  #   end
-  #
-  # FAQ:
-  # - If you get "not trusted" errors:
-  #     gpg --edit-key sender@example.org
-  #     Select the highest trust level: 5
-  #
-  # Delete test keys:
-  #   IOStreams::Pgp.delete_keys(email: 'sender@example.org', private: true)
-  #   IOStreams::Pgp.delete_keys(email: 'receiver@example.org', private: true)
-  #
   # Limitations
   # - Designed for processing larger files since a process is spawned for each file processed.
   # - For small in memory files or individual emails, use the 'opengpgme' library.
-  #
-  # Compression Performance:
-  #   Running tests on an Early 2015 Macbook Pro Dual Core with Ruby v2.3.1
-  #
-  #   Input file: test.log 3.6GB
-  #     :none:  size: 3.6GB  write: 52s   read: 45s
-  #     :zip:   size: 411MB  write: 75s   read: 31s
-  #     :zlib:  size: 241MB  write: 66s   read: 23s  ( 756KB Memory )
-  #     :bzip2: size: 129MB  write: 430s  read: 130s ( 5MB Memory )
-  #
-  # Notes:
-  # - Tested against gnupg v1.4.21 and v2.0.30
-  # - Does not work yet with gnupg v2.1. Pull Requests welcome.
   module Pgp
     autoload :Reader, "io_streams/pgp/reader"
     autoload :Writer, "io_streams/pgp/writer"
data/lib/io_streams/record/reader.rb CHANGED
@@ -7,7 +7,7 @@ module IOStreams
       # Read a record at a time from a line stream
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a record reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -17,7 +17,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -25,19 +25,44 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#each]
-      #     Anything that returns one line / record at a time when #each is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      # For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+      #
+      #   columns: [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid issues when persisting
+      #       with MongoDB, since it converts symbol keys to strings.
+      #
+      #   allowed_columns: [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied, any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns: [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown: [true|false]
+      #     true:
+      #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader    = line_reader
         @cleanse_header = cleanse_header
       end
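
To make the newly documented options concrete, a hedged sketch built directly on the `.file` signature shown above (file and column names are invented):

```ruby
require "iostreams"

# Read CSV records as hashes. Columns outside allowed_columns are
# cleansed to nil, a missing "login" column raises, and skip_unknown
# drops unexpected columns instead of raising.
IOStreams::Record::Reader.file(
  "users.csv",
  allowed_columns:  %w[name login email],
  required_columns: %w[login],
  skip_unknown:     true
) do |records|
  records.each { |record| p record }
end
```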
data/lib/io_streams/record/writer.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       # Write a record as a Hash at a time to a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a record writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -19,7 +19,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -27,17 +27,42 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#<<]
-      #     Anything that accepts a line / record at a time when #<< is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      # For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+      #
+      #   columns: [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid issues when persisting
+      #       with MongoDB, since it converts symbol keys to strings.
+      #
+      #   allowed_columns: [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied, any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns: [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown: [true|false]
+      #     true:
+      #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
         @line_writer = line_writer
 
         # Render header line when `columns` is supplied.
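
The write side mirrors this; a short sketch (names invented) in which `columns:` renders the header line up front, as the comment at the end of the hunk notes:

```ruby
require "iostreams"

# Write hashes as CSV records; the header row comes from `columns:`
# rather than from the keys of the first record.
IOStreams::Record::Writer.file("users.csv", columns: %w[name login]) do |io|
  io << {"name" => "Jack Jones", "login" => "jjones"}
  io << {"name" => "Jill Smith", "login" => "jsmith"}
end
```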
data/lib/io_streams/row/reader.rb CHANGED
@@ -5,7 +5,7 @@ module IOStreams
      # Read a line as an Array at a time from a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a row reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -15,7 +15,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -29,12 +29,12 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       # For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader    = line_reader
         @cleanse_header = cleanse_header
       end
data/lib/io_streams/row/writer.rb CHANGED
@@ -12,7 +12,7 @@ module IOStreams
       #
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a row writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -22,7 +22,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -36,10 +36,10 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       # For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
         @line_writer = line_writer
 
         # Render header line when `columns` is supplied.
data/lib/io_streams/stream.rb CHANGED
@@ -282,20 +282,20 @@ module IOStreams
     def line_reader(embedded_within: nil, **args)
       embedded_within = '"' if embedded_within.nil? && builder.file_name&.include?(".csv")
 
-      stream_reader { |io| yield IOStreams::Line::Reader.new(io, embedded_within: embedded_within, **args) }
+      stream_reader { |io| yield IOStreams::Line::Reader.new(io, original_file_name: builder.file_name, embedded_within: embedded_within, **args) }
     end
 
     # Iterate over a file / stream returning each line as an array, one at a time.
     def row_reader(delimiter: nil, embedded_within: nil, **args)
       line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Row::Reader.new(io, **args)
+        yield IOStreams::Row::Reader.new(io, original_file_name: builder.file_name, **args)
       end
     end
 
     # Iterate over a file / stream returning each line as a hash, one at a time.
     def record_reader(delimiter: nil, embedded_within: nil, **args)
       line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Record::Reader.new(io, **args)
+        yield IOStreams::Record::Reader.new(io, original_file_name: builder.file_name, **args)
       end
     end
 
@@ -306,19 +306,19 @@ module IOStreams
     def line_writer(**args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Line::Writer)
 
-      writer { |io| IOStreams::Line::Writer.stream(io, **args, &block) }
+      writer { |io| IOStreams::Line::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
     end
 
     def row_writer(delimiter: $/, **args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Row::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
     end
 
     def record_writer(delimiter: $/, **args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Record::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
    end
  end
 end
data/lib/io_streams/tabular.rb CHANGED
@@ -52,7 +52,35 @@ module IOStreams
     #   format: [Symbol]
     #     :csv, :hash, :array, :json, :psv, :fixed
     #
-    # For all other parameters, see Tabular::Header.new
+    #   file_name: [String]
+    #     When `:format` is not supplied the file name can be used to infer the required format.
+    #     Optional. Default: nil
+    #
+    #   format_options: [Hash]
+    #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+    #
+    #   columns: [Array<String>]
+    #     The header columns when the file does not include a header row.
+    #     Note:
+    #       It is recommended to keep all columns as strings to avoid issues when persisting
+    #       with MongoDB, since it converts symbol keys to strings.
+    #
+    #   allowed_columns: [Array<String>]
+    #     List of columns to allow.
+    #     Default: nil ( Allow all columns )
+    #     Note:
+    #       When supplied, any columns that are rejected will be returned in the cleansed columns
+    #       as nil so that they can be ignored during processing.
+    #
+    #   required_columns: [Array<String>]
+    #     List of columns that must be present, otherwise an Exception is raised.
+    #
+    #   skip_unknown: [true|false]
+    #     true:
+    #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+    #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+    #     false:
+    #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
     def initialize(format: nil, file_name: nil, format_options: nil, **args)
       @header = Header.new(**args)
       klass =
data/lib/io_streams/utils.rb CHANGED
@@ -1,4 +1,5 @@
 require "uri"
+require "tmpdir"
 module IOStreams
   module Utils
     MAX_TEMP_FILE_NAME_ATTEMPTS = 5
data/lib/io_streams/version.rb CHANGED
@@ -1,3 +1,3 @@
 module IOStreams
-  VERSION = "1.2.0".freeze
+  VERSION = "1.2.1".freeze
 end
data/test/path_test.rb CHANGED
@@ -1,8 +1,24 @@
 require_relative "test_helper"
+require "json"
 
 module IOStreams
   class PathTest < Minitest::Test
     describe IOStreams do
+      let :records do
+        [
+          {"name" => "Jack Jones", "login" => "jjones"},
+          {"name" => "Jill Smith", "login" => "jsmith"}
+        ]
+      end
+
+      let :expected_json do
+        records.collect(&:to_json).join("\n") + "\n"
+      end
+
+      let :json_file_name do
+        "/tmp/io_streams/abc.json"
+      end
+
       describe ".root" do
         it "return default path" do
           path = ::File.expand_path(::File.join(__dir__, "../tmp/default"))
@@ -60,6 +76,39 @@ module IOStreams
           IOStreams.path("s3://a.xyz")
           assert_equal :s3, path
         end
+
+        it "hash writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:hash) do |io|
+            records.each { |hash| io << hash }
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "hash reader detects json format from file name" do
+          ::File.open(json_file_name, "wb") { |file| file.write(expected_json) }
+          rows = []
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.each(:hash) do |row|
+            rows << row
+          end
+          actual = rows.collect(&:to_json).join("\n") + "\n"
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "array writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:array, columns: %w[name login]) do |io|
+            io << ["Jack Jones", "jjones"]
+            io << ["Jill Smith", "jsmith"]
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
       end
 
       describe ".temp_file" do
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 1.2.0
+  version: 1.2.1
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-29 00:00:00.000000000 Z
+date: 2020-05-19 00:00:00.000000000 Z
 dependencies: []
 description:
 email:
@@ -60,7 +60,6 @@ files:
 - lib/io_streams/tabular/parser/psv.rb
 - lib/io_streams/tabular/utility/csv_row.rb
 - lib/io_streams/utils.rb
-- lib/io_streams/utils/reliable_http.rb
 - lib/io_streams/version.rb
 - lib/io_streams/writer.rb
 - lib/io_streams/xlsx/reader.rb
@@ -131,7 +130,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version: 3.0.8
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.
data/lib/io_streams/utils/reliable_http.rb DELETED
@@ -1,98 +0,0 @@
-require "net/http"
-require "uri"
-module IOStreams
-  module Utils
-    class ReliableHTTP
-      attr_reader :username, :password, :max_redirects, :url
-
-      # Reliable HTTP implementation with support for:
-      # * HTTP redirects
-      # * Basic authentication
-      # * Raises an exception anytime the HTTP call is not successful.
-      # * TODO: Automatic retries with a logarithmic backoff strategy.
-      #
-      # Parameters:
-      #   url: [String]
-      #     URI of the file to download.
-      #     Example:
-      #       https://www5.fdic.gov/idasp/Offices2.zip
-      #       http://hostname/path/file_name
-      #
-      #     Full url showing all the optional elements that can be set via the url:
-      #       https://username:password@hostname/path/file_name
-      #
-      #   username: [String]
-      #     When supplied, basic authentication is used with the username and password.
-      #
-      #   password: [String]
-      #     Password to use with basic authentication when the username is supplied.
-      #
-      #   max_redirects: [Integer]
-      #     Maximum number of http redirects to follow.
-      def initialize(url, username: nil, password: nil, max_redirects: 10)
-        uri = URI.parse(url)
-        unless %w[http https].include?(uri.scheme)
-          raise(ArgumentError, "Invalid URL. Required Format: 'http://<host_name>/<file_name>', or 'https://<host_name>/<file_name>'")
-        end
-
-        @username      = username || uri.user
-        @password      = password || uri.password
-        @max_redirects = max_redirects
-        @url           = url
-      end
-
-      # Read a file using an http get.
-      #
-      # For example:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').reader {|file| puts file.read}
-      #
-      # Read the file without unzipping and streaming the first file in the zip:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-      #
-      # Notes:
-      # * Since Net::HTTP download only supports a push stream, the data is streamed into a tempfile first.
-      def post(&block)
-        handle_redirects(Net::HTTP::Post, url, max_redirects, &block)
-      end
-
-      def get(&block)
-        handle_redirects(Net::HTTP::Get, url, max_redirects, &block)
-      end
-
-      private
-
-      def handle_redirects(request_class, uri, max_redirects, &block)
-        uri    = URI.parse(uri) unless uri.is_a?(URI)
-        result = nil
-        raise(IOStreams::Errors::CommunicationsFailure, "Too many redirects") if max_redirects < 1
-
-        Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
-          request = request_class.new(uri)
-          request.basic_auth(username, password) if username
-
-          http.request(request) do |response|
-            raise(IOStreams::Errors::CommunicationsFailure, "Invalid URL: #{uri}") if response.is_a?(Net::HTTPNotFound)
-
-            if response.is_a?(Net::HTTPUnauthorized)
-              raise(IOStreams::Errors::CommunicationsFailure, "Authorization Required: Invalid :username or :password.")
-            end
-
-            if response.is_a?(Net::HTTPRedirection)
-              new_uri = response["location"]
-              return handle_redirects(request_class, new_uri, max_redirects - 1, &block)
-            end
-
-            unless response.is_a?(Net::HTTPSuccess)
-              raise(IOStreams::Errors::CommunicationsFailure, "Invalid response code: #{response.code}")
-            end
-
-            yield(response) if block_given?
-
-            result = response
-          end
-        end
-        result
-      end
-    end
-  end
-end
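
With `ReliableHTTP` gone from the gem's file list, the path-based HTTP read that its own comments documented remains the supported pattern; a short sketch reusing the URL from those comments:

```ruby
require "iostreams"

# Stream an HTTP(S) resource through the path API. The ".zip" extension
# applies the zip stream automatically; use .stream(:none) to read the
# raw bytes instead.
IOStreams.path("https://www5.fdic.gov/idasp/Offices2.zip").reader do |file|
  puts file.read
end
```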