iostreams 1.2.0 → 1.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 56c0432a5b7820924d8e7f72df07579366ae09dae03d8abe431e5b8fbc88de2b
-  data.tar.gz: 73d78b153b9bfd079f9d3345f8f56eed7afec70a7fddda14f5478135faedf6d3
+  metadata.gz: 1dad581b0665992975c33f75b23f50964ae1311e025b7a1524fca4004f0ede2b
+  data.tar.gz: 4db01e4d6c2d36ce522df3b323a6e0d9f42de0d1644a282a0cea06479e979289
 SHA512:
-  metadata.gz: 84b123abc4f78428344baa772356cfac922584ed8c03af41701c5bdcd17380283b3592c9989941b1126999a6b3dcef7367b4a0d4bbf67143ab678a24dc798ff6
-  data.tar.gz: 4c35ae431bc0862b47e738b04f0dd88a98661b299845f3124c08f368c3532cf03012504d0ddcd18375295dd8590ea565e5a2483244602dd88f25fca7f7ef1328
+  metadata.gz: 4057a5c484129c60dbc9c84e462026da862900e17b0604b385164210f14814fbae6d065d015ee9171402eb9f793f33ac26c0ee7658f94b8cdeb0724c796cbe63
+  data.tar.gz: 5a84fe37c1eebc775bd84b9903181ff035c325b1233ab64990e586f5b0bd3fd51c21d4f1429f9b0e8ab64733e9b63be5c5e05df7bf026e9d1d8c0cd8a7716417
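
The hunk above simply records the new artifact digests. For anyone who wants to check them locally, here is a minimal sketch, assuming `iostreams-1.2.1.gem` has been fetched into the current directory (e.g. via `gem fetch iostreams -v 1.2.1`); it digests the same two members that `checksums.yaml` covers:

```ruby
require "digest"
require "rubygems/package"

# A .gem file is a tar archive whose members include metadata.gz and
# data.tar.gz; checksums.yaml records a digest for each. Print the
# SHA256 of both members for comparison with the values above.
File.open("iostreams-1.2.1.gem", "rb") do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
      puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
    end
  end
end
```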
data/README.md CHANGED
@@ -5,433 +5,11 @@ Input and Output streaming for Ruby.
 
 ## Project Status
 
-Production Ready.
+Production Ready. Heavily used in production environments, many as part of Rocket Job.
 
-## Features
+## Documentation
 
-Supported streams:
-
-* Zip
-* Gzip
-* BZip2
-* PGP (Requires GnuPG)
-* Xlsx (Reading)
-* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
-Supported sources and/or targets:
-
-* File
-* HTTP (Read only)
-* AWS S3
-* SFTP
-
-Supported file formats:
-
-* CSV
-* Fixed width formats
-* JSON
-* PSV
-
-## Quick examples
-
-Read an entire file into memory:
-
-```ruby
-IOStreams.path('example.txt').read
-```
-
-Decompress an entire gzip file into memory:
-
-```ruby
-IOStreams.path('example.gz').read
-```
-
-Read and decompress the first file in a zip file into memory:
-
-```ruby
-IOStreams.path('example.zip').read
-```
-
-Read a file one line at a time:
-
-```ruby
-IOStreams.path('example.txt').each do |line|
-  puts line
-end
-```
-
-Read a CSV file one line at a time, returning each line as an array:
-
-```ruby
-IOStreams.path('example.csv').each(:array) do |array|
-  p array
-end
-```
-
-Read a CSV file a record at a time, returning each line as a hash.
-The first line of the file is assumed to be the header line:
-
-```ruby
-IOStreams.path('example.csv').each(:hash) do |hash|
-  p hash
-end
-```
-
-Read a file using an http get,
-decompressing the named file in the zip file,
-returning each record from the named file as a hash:
-
-```ruby
-IOStreams.
-  path("https://www5.fdic.gov/idasp/Offices2.zip").
-  option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-  reader(:hash) do |stream|
-    p stream.read
-  end
-```
-
-Read the file without unzipping it and streaming the first file in the zip:
-
-```ruby
-IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader { |file| puts file.read }
-```
-
-## Introduction
-
-If all files were small, they could just be loaded into memory in their entirety. With the
-advent of very large files, often several gigabytes, or even terabytes, in size, loading
-them into memory is not feasible.
-
-In Linux it is common to use pipes to stream data between processes.
-For example:
-
-```
-# Count the number of lines in a file that has been compressed with gzip
-cat abc.gz | gunzip -c | wc -l
-```
-
-For large files it is critical to be able to read and write them as streams. Ruby has support
-for reading and writing files using streams, but has no built-in way of passing one stream through
-another to, for example, compress the data, encrypt it, and then finally write the result
-to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
-together several streams; `iostreams` attempts to offer similar features for Ruby.
-
-```ruby
-# Read a compressed file:
-IOStreams.path("hello.gz").reader do |reader|
-  data = reader.read(1024)
-  puts "Read: #{data}"
-end
-```
-
-The true power of streams is shown when many streams are chained together to achieve the end
-result, without holding the entire file in memory, or ideally without needing to create
-any temporary files to process the stream.
-
-```ruby
-# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
-IOStreams.path("hello.gz.enc").writer do |writer|
-  writer.write("Hello World")
-  writer.write("and some more")
-end
-```
-
-The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
-or even gigabytes.
-
-By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
-to the data being read or written. For example:
-* `hello.zip` => Compressed using Zip
-* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
-* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
-The objective is that all of these processes are performed using streaming,
-so that only the current portion of the file is held in memory as it moves
-through the entire file.
-Where possible each stream never goes to disk, which could, for example, expose
-unencrypted data.
-
-## Examples
-
-While decompressing the file, display 128 characters at a time from the file.
-
-~~~ruby
-require "iostreams"
-IOStreams.path("abc.csv").reader do |io|
-  while (data = io.read(128))
-    p data
-  end
-end
-~~~
-
-While decompressing the file, display one line at a time from the file.
-
-~~~ruby
-IOStreams.path("abc.csv").each do |line|
-  puts line
-end
-~~~
-
-While decompressing the file, display each row from the csv file as an array.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:array) do |array|
-  p array
-end
-~~~
-
-While decompressing the file, display each record from the csv file as a hash.
-The first line is assumed to be the header row.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:hash) do |hash|
-  p hash
-end
-~~~
-
-Write data while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer do |io|
-  io.write("This")
-  io.write(" is ")
-  io.write(" one line\n")
-end
-~~~
-
-Write a line at a time while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:line) do |file|
-  file << "these"
-  file << "are"
-  file << "all"
-  file << "separate"
-  file << "lines"
-end
-~~~
-
-Write an array (row) at a time while compressing the file.
-Each array is converted to csv before being compressed with zip.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:array) do |io|
-  io << %w[name address zip_code]
-  io << %w[Jack There 1234]
-  io << ["Joe", "Over There somewhere", 1234]
-end
-~~~
-
-Write a hash (record) at a time while compressing the file.
-Each hash is converted to csv before being compressed with zip.
-The header row is extracted from the first hash supplied.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-~~~
-
-Write to a string IO for testing, supplying the file name so that the streams can be determined.
-
-~~~ruby
-io = StringIO.new
-IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-puts io.string
-~~~
-
-Read a CSV file and write the output to an encrypted file in JSON format.
-
-~~~ruby
-IOStreams.path("sample.json.enc").writer(:hash) do |output|
-  IOStreams.path("sample.csv").each(:hash) do |record|
-    output << record
-  end
-end
-~~~
-
-## Copying between files
-
-Stream-based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
-Encrypt the contents of the file `sample.json` and write to `sample.json.enc`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with PGP and write to `sample.json.pgp`:
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
-~~~
-
-Decrypt the file `abc.csv.enc` and write it to `xyz.csv`:
-
-~~~ruby
-input = IOStreams.path("abc.csv.enc")
-IOStreams.path("xyz.csv").copy_from(input)
-~~~
-
-Decrypt the file `ABC` that was encrypted with Symmetric Encryption,
-PGP-encrypt the output, and write it to `xyz.csv.pgp` using the pgp key that was imported for `a@a.com`:
-
-~~~ruby
-input = IOStreams.path("ABC").stream(:enc)
-IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
-~~~
-
-To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
-~~~ruby
-input = IOStreams.path("sample.json.zip")
-IOStreams.path("sample.copy").copy_from(input, convert: false)
-~~~
-
-## Philosophy
-
-IOStreams can be used to work against a single stream. Its real capability becomes apparent when chaining together
-multiple streams to process data, without loading entire files into memory.
-
-#### Linux Pipes
-
-Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
-Example: count the number of lines in a compressed file:
-
-    gunzip -c hello.csv.gz | wc -l
-
-The file `hello.csv.gz` is uncompressed and returned to standard output, which in turn is piped into the standard
-input for `wc -l`, which counts the number of lines in the uncompressed data.
-
-As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
-can start counting lines of uncompressed data, without waiting until the entire file is decompressed.
-The uncompressed contents of the file are not written to disk before passing to `wc -l`, and the file is not loaded
-into memory before passing to `wc -l`.
-
-In this way extremely large files can be processed with very little memory being used.
-
-#### Push Model
-
-The Linux pipes example above would be considered a "push model", where each task in the list pushes
-its output to the input of the next task.
-
-A major challenge or disadvantage with the push model is that buffering would need to occur between tasks, since
-each task can complete at very different speeds. To prevent large memory usage, the standard output from a previous
-task would have to be blocked to try to make it slow down.
-
-#### Pull Model
-
-Another approach when multiple tasks need to process a single stream is to move to a "pull model", where the
-task at the end of the list pulls a block from a previous task when it is ready to process it.
-
-#### IOStreams
-
-IOStreams uses the pull model when reading data, where each stream performs a read against the previous stream
-when it is ready for more data.
-
-When writing to an output stream, IOStreams uses the push model, where each block of data that is ready to be written
-is pushed to the task/stream in the list. The write push only returns once it has traversed all the way down to
-the final task / stream in the list; this avoids complex buffering issues between each task / stream in the list.
-
-Example: implementing `gunzip -c hello.csv.gz | wc -l` in Ruby:
-
-~~~ruby
-line_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1 }
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure out which stream
-to start with:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1 }
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-  lines.each { line_count += 1 }
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-It can be simplified even further using `#each`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").each { line_count += 1 }
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-The benefit in all of the above cases is that the file can be any arbitrary size and only one block of the file
-is held in memory at any time.
-
-#### Chaining
-
-In the above example only 2 streams were used. Streams can be nested as deep as necessary to process data.
-
-Example: search for all occurrences of the word apple, cleansing the input data stream of non-printable characters
-and converting it to valid US-ASCII.
-
-~~~ruby
-apple_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Encode::Reader.open(input,
-                                 encoding:       "US-ASCII",
-                                 encode_replace: "",
-                                 encode_cleaner: :printable) do |cleansed|
-    IOStreams::Line::Reader.open(cleansed) do |lines|
-      lines.each { |line| apple_count += line.scan("apple").count }
-    end
-  end
-end
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-Let IOStreams perform the above stream chaining automatically under the covers:
-
-~~~ruby
-apple_count = 0
-IOStreams.path("hello.csv.gz").
-  option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-  each do |line|
-    apple_count += line.scan("apple").count
-  end
-
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-## Notes
-
-* Due to the nature of Zip, both its Reader and Writer methods will create
-  a temp file when reading from or writing to a stream.
-  It is recommended to use GZip over Zip since GZip can be streamed without requiring temp files.
-* Zip becomes dramatically slower with very large files, especially files
-  that exceed 4GB when uncompressed. GZip is highly recommended for large files.
+[IOStreams Guide](http://rocketjob.github.io/iostreams)
 
 ## Versioning
 
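The quick examples and philosophy sections removed above now live on the documentation site linked in the new "Documentation" section. As a reminder of the style of API they described, a representative sketch reusing a file name from the removed text:

```ruby
require "iostreams"

# Read a gzip-compressed CSV one line at a time; the ".gz" extension
# selects the gunzip stream automatically.
IOStreams.path("hello.csv.gz").each do |line|
  puts line
end
```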
data/lib/io_streams/line/reader.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       LINEFEED_REGEXP = Regexp.compile(/\r\n|\n|\r/).freeze
 
       # Read a line at a time from a stream
-      def self.stream(input_stream, original_file_name: nil, **args)
+      def self.stream(input_stream, **args)
         # Pass-through if already a line reader
         return yield(input_stream) if input_stream.is_a?(self.class)
 
@@ -44,7 +44,7 @@ module IOStreams
       # - Skip "empty" / "blank" lines. RegExp?
       # - Extract header line(s) / first non-comment, non-blank line
      # - Embedded newline support, RegExp? or Proc?
-      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil)
+      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil, original_file_name: nil)
         super(input_stream)
 
         @embedded_within = embedded_within
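
The signature change above moves `original_file_name` off `.stream` and onto the reader itself, so it now arrives via `**args`. A small sketch of the resulting call shape, assuming `.stream` forwards its keyword arguments to `new` (the file name is illustrative):

```ruby
require "iostreams"
require "stringio"

# Wrap an in-memory IO in a line reader. original_file_name now travels
# through **args into Line::Reader.new, which accepts it as a keyword.
io = StringIO.new("name,login\nJack Jones,jjones\n")
IOStreams::Line::Reader.stream(io, original_file_name: "example.csv") do |lines|
  lines.each { |line| puts line }
end
```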
data/lib/io_streams/paths/http.rb CHANGED
@@ -1,5 +1,6 @@
 require "net/http"
 require "uri"
+require "cgi"
 module IOStreams
   module Paths
     class HTTP < IOStreams::Path
data/lib/io_streams/pgp.rb CHANGED
@@ -2,85 +2,9 @@ require "open3"
 module IOStreams
   # Read/Write PGP/GPG file or stream.
   #
-  # Example Setup:
-  #
-  #   1. Install OpenPGP
-  #      Mac OSX (homebrew): `brew install gpg2`
-  #      Redhat Linux: `rpm install gpg2`
-  #
-  #   2. Generate the sender's private and public key:
-  #      IOStreams::Pgp.generate_key(name: 'Sender', email: 'sender@example.org', passphrase: 'sender_passphrase')
-  #
-  #   3. Generate the receiver's private and public key:
-  #      IOStreams::Pgp.generate_key(name: 'Receiver', email: 'receiver@example.org', passphrase: 'receiver_passphrase')
-  #
-  # Example 1:
-  #
-  #   # Generate an encrypted file for a specific recipient and sign it with the sender's credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams::Pgp::Writer.open('secure.gpg', recipient: 'receiver@example.org', signer: 'sender@example.org', signer_passphrase: 'sender_passphrase') do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # The recipient must also have the sender's public key to verify the signature
-  #   IOStreams::Pgp::Reader.open('secure.gpg', passphrase: 'receiver_passphrase') do |stream|
-  #     while !stream.eof?
-  #       p stream.read(10)
-  #       puts
-  #     end
-  #   end
-  #
-  # Example 2:
-  #
-  #   # Default user and passphrase to sign the output file:
-  #   IOStreams::Pgp::Writer.default_signer            = 'sender@example.org'
-  #   IOStreams::Pgp::Writer.default_signer_passphrase = 'sender_passphrase'
-  #
-  #   # Default passphrase for decrypting recipients' files.
-  #   # Note: Usually this would be the sender's passphrase, but in this example
-  #   #       it is decrypting the file intended for the recipient.
-  #   IOStreams::Pgp::Reader.default_passphrase = 'receiver_passphrase'
-  #
-  #   # Generate an encrypted file for a specific recipient and sign it with the sender's credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams.writer('secure.gpg', streams: {pgp: {recipient: 'receiver@example.org'}}) do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # The recipient must also have the sender's public key to verify the signature
-  #   IOStreams.reader('secure.gpg') do |stream|
-  #     while data = stream.read(10)
-  #       p data
-  #     end
-  #   end
-  #
-  # FAQ:
-  # - If you get "not trusted" errors:
-  #     gpg --edit-key sender@example.org
-  #     Select the highest trust level: 5
-  #
-  # Delete test keys:
-  #   IOStreams::Pgp.delete_keys(email: 'sender@example.org', private: true)
-  #   IOStreams::Pgp.delete_keys(email: 'receiver@example.org', private: true)
-  #
   # Limitations
   # - Designed for processing larger files since a process is spawned for each file processed.
   # - For small in memory files or individual emails, use the 'opengpgme' library.
-  #
-  # Compression Performance:
-  #   Running tests on an Early 2015 Macbook Pro Dual Core with Ruby v2.3.1
-  #
-  #   Input file: test.log 3.6GB
-  #     :none:  size: 3.6GB  write: 52s   read: 45s
-  #     :zip:   size: 411MB  write: 75s   read: 31s
-  #     :zlib:  size: 241MB  write: 66s   read: 23s  ( 756KB Memory )
-  #     :bzip2: size: 129MB  write: 430s  read: 130s ( 5MB Memory )
-  #
-  # Notes:
-  # - Tested against gnupg v1.4.21 and v2.0.30
-  # - Does not work yet with gnupg v2.1. Pull Requests welcome.
   module Pgp
     autoload :Reader, "io_streams/pgp/reader"
     autoload :Writer, "io_streams/pgp/writer"
data/lib/io_streams/record/reader.rb CHANGED
@@ -7,7 +7,7 @@ module IOStreams
       # Read a record at a time from a line stream
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a record reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -17,7 +17,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -25,19 +25,44 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#each]
-      #     Anything that returns one line / record at a time when #each is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      # For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+      #
+      #   columns: [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid issues when persisting
+      #       with MongoDB, since it converts symbol keys to strings.
+      #
+      #   allowed_columns: [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied, any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns: [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown: [true|false]
+      #     true:
+      #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader    = line_reader
         @cleanse_header = cleanse_header
       end
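
To make the newly documented options concrete, a hedged sketch built directly on the `.file` signature shown above (file and column names are invented):

```ruby
require "iostreams"

# Read CSV records as hashes. Columns outside allowed_columns are
# cleansed to nil, a missing "login" column raises, and skip_unknown
# drops unexpected columns instead of raising.
IOStreams::Record::Reader.file(
  "users.csv",
  allowed_columns:  %w[name login email],
  required_columns: %w[login],
  skip_unknown:     true
) do |records|
  records.each { |record| p record }
end
```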
data/lib/io_streams/record/writer.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       # Write a record as a Hash at a time to a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a record writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -19,7 +19,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -27,17 +27,42 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#<<]
-      #     Anything that accepts a line / record at a time when #<< is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      # For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+      #
+      #   columns: [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid issues when persisting
+      #       with MongoDB, since it converts symbol keys to strings.
+      #
+      #   allowed_columns: [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied, any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns: [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown: [true|false]
+      #     true:
+      #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
         @line_writer = line_writer
 
         # Render header line when `columns` is supplied.
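
The write side mirrors this; a short sketch (names invented) in which `columns:` renders the header line up front, as the comment at the end of the hunk notes:

```ruby
require "iostreams"

# Write hashes as CSV records; the header row comes from `columns:`
# rather than from the keys of the first record.
IOStreams::Record::Writer.file("users.csv", columns: %w[name login]) do |io|
  io << {"name" => "Jack Jones", "login" => "jjones"}
  io << {"name" => "Jill Smith", "login" => "jsmith"}
end
```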
data/lib/io_streams/row/reader.rb CHANGED
@@ -5,7 +5,7 @@ module IOStreams
      # Read a line as an Array at a time from a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a row reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -15,7 +15,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -29,12 +29,12 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       # For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader    = line_reader
         @cleanse_header = cleanse_header
       end
data/lib/io_streams/row/writer.rb CHANGED
@@ -12,7 +12,7 @@ module IOStreams
       #
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a row writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -22,7 +22,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -36,10 +36,10 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       # For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
         @line_writer = line_writer
 
         # Render header line when `columns` is supplied.
data/lib/io_streams/stream.rb CHANGED
@@ -282,20 +282,20 @@ module IOStreams
     def line_reader(embedded_within: nil, **args)
       embedded_within = '"' if embedded_within.nil? && builder.file_name&.include?(".csv")
 
-      stream_reader { |io| yield IOStreams::Line::Reader.new(io, embedded_within: embedded_within, **args) }
+      stream_reader { |io| yield IOStreams::Line::Reader.new(io, original_file_name: builder.file_name, embedded_within: embedded_within, **args) }
     end
 
     # Iterate over a file / stream returning each line as an array, one at a time.
     def row_reader(delimiter: nil, embedded_within: nil, **args)
       line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Row::Reader.new(io, **args)
+        yield IOStreams::Row::Reader.new(io, original_file_name: builder.file_name, **args)
       end
     end
 
     # Iterate over a file / stream returning each line as a hash, one at a time.
     def record_reader(delimiter: nil, embedded_within: nil, **args)
       line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Record::Reader.new(io, **args)
+        yield IOStreams::Record::Reader.new(io, original_file_name: builder.file_name, **args)
       end
     end
 
@@ -306,19 +306,19 @@ module IOStreams
     def line_writer(**args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Line::Writer)
 
-      writer { |io| IOStreams::Line::Writer.stream(io, **args, &block) }
+      writer { |io| IOStreams::Line::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
     end
 
     def row_writer(delimiter: $/, **args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Row::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
     end
 
     def record_writer(delimiter: $/, **args, &block)
       return block.call(io_stream) if io_stream&.is_a?(IOStreams::Record::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
    end
  end
 end
data/lib/io_streams/tabular.rb CHANGED
@@ -52,7 +52,35 @@ module IOStreams
     #   format: [Symbol]
     #     :csv, :hash, :array, :json, :psv, :fixed
     #
-    # For all other parameters, see Tabular::Header.new
+    #   file_name: [String]
+    #     When `:format` is not supplied the file name can be used to infer the required format.
+    #     Optional. Default: nil
+    #
+    #   format_options: [Hash]
+    #     Any specialized format-specific options. For example, the `:fixed` format requires the file definition.
+    #
+    #   columns: [Array<String>]
+    #     The header columns when the file does not include a header row.
+    #     Note:
+    #       It is recommended to keep all columns as strings to avoid issues when persisting
+    #       with MongoDB, since it converts symbol keys to strings.
+    #
+    #   allowed_columns: [Array<String>]
+    #     List of columns to allow.
+    #     Default: nil ( Allow all columns )
+    #     Note:
+    #       When supplied, any columns that are rejected will be returned in the cleansed columns
+    #       as nil so that they can be ignored during processing.
+    #
+    #   required_columns: [Array<String>]
+    #     List of columns that must be present, otherwise an Exception is raised.
+    #
+    #   skip_unknown: [true|false]
+    #     true:
+    #       Skip columns not present in `allowed_columns` by cleansing them to nil.
+    #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+    #     false:
+    #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
     def initialize(format: nil, file_name: nil, format_options: nil, **args)
       @header = Header.new(**args)
       klass =
data/lib/io_streams/utils.rb CHANGED
@@ -1,4 +1,5 @@
 require "uri"
+require "tmpdir"
 module IOStreams
   module Utils
     MAX_TEMP_FILE_NAME_ATTEMPTS = 5
data/lib/io_streams/version.rb CHANGED
@@ -1,3 +1,3 @@
 module IOStreams
-  VERSION = "1.2.0".freeze
+  VERSION = "1.2.1".freeze
 end
data/test/path_test.rb CHANGED
@@ -1,8 +1,24 @@
 require_relative "test_helper"
+require "json"
 
 module IOStreams
   class PathTest < Minitest::Test
     describe IOStreams do
+      let :records do
+        [
+          {"name" => "Jack Jones", "login" => "jjones"},
+          {"name" => "Jill Smith", "login" => "jsmith"}
+        ]
+      end
+
+      let :expected_json do
+        records.collect(&:to_json).join("\n") + "\n"
+      end
+
+      let :json_file_name do
+        "/tmp/io_streams/abc.json"
+      end
+
       describe ".root" do
         it "return default path" do
           path = ::File.expand_path(::File.join(__dir__, "../tmp/default"))
@@ -60,6 +76,39 @@ module IOStreams
           IOStreams.path("s3://a.xyz")
           assert_equal :s3, path
         end
+
+        it "hash writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:hash) do |io|
+            records.each { |hash| io << hash }
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "hash reader detects json format from file name" do
+          ::File.open(json_file_name, "wb") { |file| file.write(expected_json) }
+          rows = []
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.each(:hash) do |row|
+            rows << row
+          end
+          actual = rows.collect(&:to_json).join("\n") + "\n"
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "array writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:array, columns: %w[name login]) do |io|
+            io << ["Jack Jones", "jjones"]
+            io << ["Jill Smith", "jsmith"]
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
       end
 
       describe ".temp_file" do
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 1.2.0
+  version: 1.2.1
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-29 00:00:00.000000000 Z
+date: 2020-05-19 00:00:00.000000000 Z
 dependencies: []
 description:
 email:
@@ -60,7 +60,6 @@ files:
 - lib/io_streams/tabular/parser/psv.rb
 - lib/io_streams/tabular/utility/csv_row.rb
 - lib/io_streams/utils.rb
-- lib/io_streams/utils/reliable_http.rb
 - lib/io_streams/version.rb
 - lib/io_streams/writer.rb
 - lib/io_streams/xlsx/reader.rb
@@ -131,7 +130,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version: 3.0.8
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.
data/lib/io_streams/utils/reliable_http.rb DELETED
@@ -1,98 +0,0 @@
-require "net/http"
-require "uri"
-module IOStreams
-  module Utils
-    class ReliableHTTP
-      attr_reader :username, :password, :max_redirects, :url
-
-      # Reliable HTTP implementation with support for:
-      # * HTTP redirects
-      # * Basic authentication
-      # * Raises an exception anytime the HTTP call is not successful.
-      # * TODO: Automatic retries with a logarithmic backoff strategy.
-      #
-      # Parameters:
-      #   url: [String]
-      #     URI of the file to download.
-      #     Example:
-      #       https://www5.fdic.gov/idasp/Offices2.zip
-      #       http://hostname/path/file_name
-      #
-      #     Full url showing all the optional elements that can be set via the url:
-      #       https://username:password@hostname/path/file_name
-      #
-      #   username: [String]
-      #     When supplied, basic authentication is used with the username and password.
-      #
-      #   password: [String]
-      #     Password to use with basic authentication when the username is supplied.
-      #
-      #   max_redirects: [Integer]
-      #     Maximum number of http redirects to follow.
-      def initialize(url, username: nil, password: nil, max_redirects: 10)
-        uri = URI.parse(url)
-        unless %w[http https].include?(uri.scheme)
-          raise(ArgumentError, "Invalid URL. Required Format: 'http://<host_name>/<file_name>', or 'https://<host_name>/<file_name>'")
-        end
-
-        @username      = username || uri.user
-        @password      = password || uri.password
-        @max_redirects = max_redirects
-        @url           = url
-      end
-
-      # Read a file using an http get.
-      #
-      # For example:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').reader {|file| puts file.read}
-      #
-      # Read the file without unzipping and streaming the first file in the zip:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-      #
-      # Notes:
-      # * Since Net::HTTP download only supports a push stream, the data is streamed into a tempfile first.
-      def post(&block)
-        handle_redirects(Net::HTTP::Post, url, max_redirects, &block)
-      end
-
-      def get(&block)
-        handle_redirects(Net::HTTP::Get, url, max_redirects, &block)
-      end
-
-      private
-
-      def handle_redirects(request_class, uri, max_redirects, &block)
-        uri    = URI.parse(uri) unless uri.is_a?(URI)
-        result = nil
-        raise(IOStreams::Errors::CommunicationsFailure, "Too many redirects") if max_redirects < 1
-
-        Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
-          request = request_class.new(uri)
-          request.basic_auth(username, password) if username
-
-          http.request(request) do |response|
-            raise(IOStreams::Errors::CommunicationsFailure, "Invalid URL: #{uri}") if response.is_a?(Net::HTTPNotFound)
-
-            if response.is_a?(Net::HTTPUnauthorized)
-              raise(IOStreams::Errors::CommunicationsFailure, "Authorization Required: Invalid :username or :password.")
-            end
-
-            if response.is_a?(Net::HTTPRedirection)
-              new_uri = response["location"]
-              return handle_redirects(request_class, new_uri, max_redirects - 1, &block)
-            end
-
-            unless response.is_a?(Net::HTTPSuccess)
-              raise(IOStreams::Errors::CommunicationsFailure, "Invalid response code: #{response.code}")
-            end
-
-            yield(response) if block_given?
-
-            result = response
-          end
-        end
-        result
-      end
-    end
-  end
-end
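
With `ReliableHTTP` gone from the gem's file list, the path-based HTTP read that its own comments documented remains the supported pattern; a short sketch reusing the URL from those comments:

```ruby
require "iostreams"

# Stream an HTTP(S) resource through the path API. The ".zip" extension
# applies the zip stream automatically; use .stream(:none) to read the
# raw bytes instead.
IOStreams.path("https://www5.fdic.gov/idasp/Offices2.zip").reader do |file|
  puts file.read
end
```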