iostreams 1.2.0 → 1.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 56c0432a5b7820924d8e7f72df07579366ae09dae03d8abe431e5b8fbc88de2b
-  data.tar.gz: 73d78b153b9bfd079f9d3345f8f56eed7afec70a7fddda14f5478135faedf6d3
+  metadata.gz: 1dad581b0665992975c33f75b23f50964ae1311e025b7a1524fca4004f0ede2b
+  data.tar.gz: 4db01e4d6c2d36ce522df3b323a6e0d9f42de0d1644a282a0cea06479e979289
 SHA512:
-  metadata.gz: 84b123abc4f78428344baa772356cfac922584ed8c03af41701c5bdcd17380283b3592c9989941b1126999a6b3dcef7367b4a0d4bbf67143ab678a24dc798ff6
-  data.tar.gz: 4c35ae431bc0862b47e738b04f0dd88a98661b299845f3124c08f368c3532cf03012504d0ddcd18375295dd8590ea565e5a2483244602dd88f25fca7f7ef1328
+  metadata.gz: 4057a5c484129c60dbc9c84e462026da862900e17b0604b385164210f14814fbae6d065d015ee9171402eb9f793f33ac26c0ee7658f94b8cdeb0724c796cbe63
+  data.tar.gz: 5a84fe37c1eebc775bd84b9903181ff035c325b1233ab64990e586f5b0bd3fd51c21d4f1429f9b0e8ab64733e9b63be5c5e05df7bf026e9d1d8c0cd8a7716417
data/README.md CHANGED
@@ -5,433 +5,11 @@ Input and Output streaming for Ruby.
 
 ## Project Status
 
-Production Ready.
+Production Ready, heavily used in production environments, many as part of Rocket Job.
 
-## Features
+## Documentation
 
-Supported streams:
-
-* Zip
-* Gzip
-* BZip2
-* PGP (Requires GnuPG)
-* Xlsx (Reading)
-* Encryption using [Symmetric Encryption](https://github.com/reidmorrison/symmetric-encryption)
-
-Supported sources and/or targets:
-
-* File
-* HTTP (Read only)
-* AWS S3
-* SFTP
-
-Supported file formats:
-
-* CSV
-* Fixed width formats
-* JSON
-* PSV
-
-## Quick examples
-
-Read an entire file into memory:
-
-```ruby
-IOStreams.path('example.txt').read
-```
-
-Decompress an entire gzip file into memory:
-
-```ruby
-IOStreams.path('example.gz').read
-```
-
-Read and decompress the first file in a zip file into memory:
-
-```ruby
-IOStreams.path('example.zip').read
-```
-
-Read a file one line at a time
-
-```ruby
-IOStreams.path('example.txt').each do |line|
-  puts line
-end
-```
-
-Read a CSV file one line at a time, returning each line as an array:
-
-```ruby
-IOStreams.path('example.csv').each(:array) do |array|
-  p array
-end
-```
-
-Read a CSV file a record at a time, returning each line as a hash.
-The first line of the file is assumed to be the header line:
-
-```ruby
-IOStreams.path('example.csv').each(:hash) do |hash|
-  p hash
-end
-```
-
-Read a file using an http get,
-decompressing the named file in the zip file,
-returning each records from the named file as a hash:
-
-```ruby
-IOStreams.
-  path("https://www5.fdic.gov/idasp/Offices2.zip").
-  option(:zip, entry_file_name: 'OFFICES2_ALL.CSV').
-  reader(:hash) do |stream|
-  p stream.read
-end
-```
-
-Read the file without unzipping and streaming the first file in the zip:
-
-```ruby
-IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-```
-
-
-## Introduction
-
-If all files were small, they could just be loaded into memory in their entirety. With the
-advent of very large files, often into several Gigabytes, or even Terabytes in size, loading
-them into memory is not feasible.
-
-In linux it is common to use pipes to stream data between processes.
-For example:
-
-```
-# Count the number of lines in a file that has been compressed with gzip
-cat abc.gz | gunzip -c | wc -l
-```
-
-For large files it is critical to be able to read and write these files as streams. Ruby has support
-for reading and writing files using streams, but has no built-in way of passing one stream through
-another to support for example compressing the data, encrypting it and then finally writing the result
-to a file. Several streaming implementations exist for languages such as `C++` and `Java` to chain
-together several streams, `iostreams` attempts to offer similar features for Ruby.
-
-```ruby
-# Read a compressed file:
-IOStreams.path("hello.gz").reader do |reader|
-  data = reader.read(1024)
-  puts "Read: #{data}"
-end
-```
-
-The true power of streams is shown when many streams are chained together to achieve the end
-result, without holding the entire file in memory, or ideally without needing to create
-any temporary files to process the stream.
-
-```ruby
-# Create a file that is compressed with GZip and then encrypted with Symmetric Encryption:
-IOStreams.path("hello.gz.enc").writer do |writer|
-  writer.write("Hello World")
-  writer.write("and some more")
-end
-```
-
-The power of the above example applies when the data being written starts to exceed hundreds of megabytes,
-or even gigabytes.
-
-By looking at the file name supplied above, `iostreams` is able to determine which streams to apply
-to the data being read or written. For example:
-* `hello.zip` => Compressed using Zip
-* `hello.zip.enc` => Compressed using Zip and then encrypted using Symmetric Encryption
-* `hello.gz.enc` => Compressed using GZip and then encrypted using Symmetric Encryption
-
-The objective is that all of these streaming processes are performed used streaming
-so that only the current portion of the file is loaded into memory as it moves
-through the entire file.
-Where possible each stream never goes to disk, which for example could expose
-un-encrypted data.
-
-## Examples
-
-While decompressing the file, display 128 characters at a time from the file.
-
-~~~ruby
-require "iostreams"
-IOStreams.path("abc.csv").reader do |io|
-  while (data = io.read(128))
-    p data
-  end
-end
-~~~
-
-While decompressing the file, display one line at a time from the file.
-
-~~~ruby
-IOStreams.path("abc.csv").each do |line|
-  puts line
-end
-~~~
-
-While decompressing the file, display each row from the csv file as an array.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:array) do |array|
-  p array
-end
-~~~
-
-While decompressing the file, display each record from the csv file as a hash.
-The first line is assumed to be the header row.
-
-~~~ruby
-IOStreams.path("abc.csv").each(:hash) do |hash|
-  p hash
-end
-~~~
-
-Write data while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer do |io|
-  io.write("This")
-  io.write(" is ")
-  io.write(" one line\n")
-end
-~~~
-
-Write a line at a time while compressing the file.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:line) do |file|
-  file << "these"
-  file << "are"
-  file << "all"
-  file << "separate"
-  file << "lines"
-end
-~~~
-
-Write an array (row) at a time while compressing the file.
-Each array is converted to csv before being compressed with zip.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:array) do |io|
-  io << %w[name address zip_code]
-  io << %w[Jack There 1234]
-  io << ["Joe", "Over There somewhere", 1234]
-end
-~~~
-
-Write a hash (record) at a time while compressing the file.
-Each hash is converted to csv before being compressed with zip.
-The header row is extracted from the first hash supplied.
-
-~~~ruby
-IOStreams.path("abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-~~~
-
-Write to a string IO for testing, supplying the filename so that the streams can be determined.
-
-~~~ruby
-io = StringIO.new
-IOStreams.stream(io, file_name: "abc.csv").writer(:hash) do |stream|
-  stream << {name: "Jack", address: "There", zip_code: 1234}
-  stream << {name: "Joe", address: "Over There somewhere", zip_code: 1234}
-end
-puts io.string
-~~~
-
-Read a CSV file and write the output to an encrypted file in JSON format.
-
-~~~ruby
-IOStreams.path("sample.json.enc").writer(:hash) do |output|
-  IOStreams.path("sample.csv").each(:hash) do |record|
-    output << record
-  end
-end
-~~~
-
-## Copying between files
-
-Stream based file copying. Changes the file type without changing the file format. For example, compress or encrypt.
-
-Encrypt the contents of the file `sample.json` and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with Symmetric Encryption and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.enc").option(:enc, compress: true).copy_from(input)
-~~~
-
-Encrypt and compress the contents of the file `sample.json` with pgp and write to `sample.json.enc`
-
-~~~ruby
-input = IOStreams.path("sample.json")
-IOStreams.path("sample.json.pgp").option(:pgp, recipient: "sender@example.org").copy_from(input)
-~~~
-
-Decrypt the file `abc.csv.enc` and write it to `xyz.csv`.
-
-~~~ruby
-input = IOStreams.path("abc.csv.enc")
-IOStreams.path("xyz.csv").copy_from(input)
-~~~
-
-Decrypt file `ABC` that was encrypted with Symmetric Encryption,
-PGP encrypt the output file and write it to `xyz.csv.pgp` using the pgp key that was imported for `a@a.com`.
-
-~~~ruby
-input = IOStreams.path("ABC").stream(:enc)
-IOStreams.path("xyz.csv.pgp").option(:pgp, recipient: "a@a.com").copy_from(input)
-~~~
-
-To copy a file _without_ performing any conversions (ignore file extensions), set `convert` to `false`:
-
-~~~ruby
-input = IOStreams.path("sample.json.zip")
-IOStreams.path("sample.copy").copy_from(input, convert: false)
-~~~
-
-## Philosopy
-
-IOStreams can be used to work against a single stream. it's real capability becomes apparent when chaining together
-multiple streams to process data, without loading entire files into memory.
-
-#### Linux Pipes
-
-Linux has built-in support for streaming using the `|` (pipe operator) to send the output from one process to another.
-
-Example: count the number of lines in a compressed file:
-
-    gunzip -c hello.csv.gz | wc -l
-
-The file `hello.csv.gz` is uncompressed and returned to standard output, which in turn is piped into the standard
-input for `wc -l`, which counts the number of lines in the uncompressed data.
-
-As each block of data is returned from `gunzip` it is immediately passed into `wc` so that it
-can start counting lines of uncompressed data, without waiting until the entire file is decompressed.
-The uncompressed contents of the file are not written to disk before passing to `wc -l` and the file is not loaded
-into memory before passing to `wc -l`.
-
-In this way extremely large files can be processed with very little memory being used.
-
-#### Push Model
-
-In the Linux pipes example above this would be considered a "push model" where each task in the list pushes
-its output to the input of the next task.
-
-A major challenge or disadvantage with the push model is that buffering would need to occur between tasks since
-each task could complete at very different speeds. To prevent large memory usage the standard output from a previous
-task would have to be blocked to try and make it slow down.
-
-#### Pull Model
-
-Another approach with multiple tasks that need to process a single stream, is to move to a "pull model" where the
-task at the end of the list pulls a block from a previous task when it is ready to process it.
-
-#### IOStreams
-
-IOStreams uses the pull model when reading data, where each stream performs a read against the previous stream
-when it is ready for more data.
-
-When writing to an output stream, IOStreams uses the push model, where each block of data that is ready to be written
-is pushed to the task/stream in the list. The write push only returns once it has traversed all the way down to
-the final task / stream in the list, this avoids complex buffering issues between each task / stream in the list.
-
-Example: Implementing in Ruby: `gunzip -c hello.csv.gz | wc -l`
-
-~~~ruby
-line_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1}
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since IOStreams can autodetect file types based on the file extension, `IOStreams.reader` can figure which stream
-to start with:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader do |input|
-  IOStreams::Line::Reader.open(input) do |lines|
-    lines.each { line_count += 1}
-  end
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-Since we know we want a line reader, it can be simplified using `#reader(:line)`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").reader(:line) do |lines|
-  lines.each { line_count += 1}
-end
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-It can be simplified even further using `#each`:
-~~~ruby
-line_count = 0
-IOStreams.path("hello.csv.gz").each { line_count += 1}
-puts "hello.csv.gz contains #{line_count} lines"
-~~~
-
-The benefit in all of the above cases is that the file can be any arbitrary size and only one block of the file
-is held in memory at any time.
-
-#### Chaining
-
-In the above example only 2 streams were used. Streams can be nested as deep as necessary to process data.
-
-Example, search for all occurrences of the word apple, cleansing the input data stream of non printable characters
-and converting to valid US ASCII.
-
-~~~ruby
-apple_count = 0
-IOStreams::Gzip::Reader.open("hello.csv.gz") do |input|
-  IOStreams::Encode::Reader.open(input,
-                                 encoding: "US-ASCII",
-                                 encode_replace: "",
-                                 encode_cleaner: :printable) do |cleansed|
-    IOStreams::Line::Reader.open(cleansed) do |lines|
-      lines.each { |line| apple_count += line.scan("apple").count}
-    end
-  end
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-Let IOStreams perform the above stream chaining automatically under the covers:
-
-~~~ruby
-apple_count = 0
-IOStreams.path("hello.csv.gz").
-  option(:encode, encoding: "US-ASCII", replace: "", cleaner: :printable).
-  each do |line|
-  apple_count += line.scan("apple").count
-end
-
-puts "Found the word 'apple' #{apple_count} times in hello.csv.gz"
-~~~
-
-## Notes
-
-* Due to the nature of Zip, both its Reader and Writer methods will create
-  a temp file when reading from or writing to a stream.
-  Recommended to use Gzip over Zip since it can be streamed without requiring temp files.
-* Zip becomes exponentially slower with very large files, especially files
-  that exceed 4GB when uncompressed. Highly recommend using GZip for large files.
+[Semantic Logger Guide](http://rocketjob.github.io/iostreams)
 
 ## Versioning
 
data/lib/io_streams/line/reader.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       LINEFEED_REGEXP = Regexp.compile(/\r\n|\n|\r/).freeze
 
       # Read a line at a time from a stream
-      def self.stream(input_stream, original_file_name: nil, **args)
+      def self.stream(input_stream, **args)
         # Pass-through if already a line reader
         return yield(input_stream) if input_stream.is_a?(self.class)
 
@@ -44,7 +44,7 @@ module IOStreams
       # - Skip "empty" / "blank" lines. RegExp?
       # - Extract header line(s) / first non-comment, non-blank line
      # - Embedded newline support, RegExp? or Proc?
-      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil)
+      def initialize(input_stream, delimiter: nil, buffer_size: 65_536, embedded_within: nil, original_file_name: nil)
        super(input_stream)
 
        @embedded_within = embedded_within
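
The effect of this pair of changes is that `original_file_name` is no longer swallowed by `Line::Reader.stream`; instead `initialize` accepts it, so the tabular streams later in this diff can forward the name uniformly. A minimal sketch of the new constructor signature, assuming a local file with the hypothetical name `hypothetical.csv`:

```ruby
require "iostreams"

# original_file_name is accepted here (this hunk shows no further use of it
# within Line::Reader itself); Row/Record readers rely on it being accepted
# so they can pass the name through without special-casing.
File.open("hypothetical.csv") do |io|
  IOStreams::Line::Reader.new(io, original_file_name: "hypothetical.csv").each do |line|
    puts line
  end
end
```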
data/lib/io_streams/paths/http.rb CHANGED
@@ -1,5 +1,6 @@
 require "net/http"
 require "uri"
+require "cgi"
 module IOStreams
   module Paths
     class HTTP < IOStreams::Path
data/lib/io_streams/pgp.rb CHANGED
@@ -2,85 +2,9 @@ require "open3"
 module IOStreams
   # Read/Write PGP/GPG file or stream.
   #
-  # Example Setup:
-  #
-  #   1. Install OpenPGP
-  #      Mac OSX (homebrew) : `brew install gpg2`
-  #      Redhat Linux: `rpm install gpg2`
-  #
-  #   2. # Generate senders private and public key
-  #      IOStreams::Pgp.generate_key(name: 'Sender', email: 'sender@example.org', passphrase: 'sender_passphrase')
-  #
-  #   3. # Generate receivers private and public key
-  #      IOStreams::Pgp.generate_key(name: 'Receiver', email: 'receiver@example.org', passphrase: 'receiver_passphrase')
-  #
-  # Example 1:
-  #
-  #   # Generate encrypted file for a specific recipient and sign it with senders credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams::Pgp::Writer.open('secure.gpg', recipient: 'receiver@example.org', signer: 'sender@example.org', signer_passphrase: 'sender_passphrase') do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # Recipient must also have the senders public key to verify the signature
-  #   IOStreams::Pgp::Reader.open('secure.gpg', passphrase: 'receiver_passphrase') do |stream|
-  #     while !stream.eof?
-  #       p stream.read(10)
-  #       puts
-  #     end
-  #   end
-  #
-  # Example 2:
-  #
-  #   # Default user and passphrase to sign the output file:
-  #   IOStreams::Pgp::Writer.default_signer = 'sender@example.org'
-  #   IOStreams::Pgp::Writer.default_signer_passphrase = 'sender_passphrase'
-  #
-  #   # Default passphrase for decrypting recipients files.
-  #   # Note: Usually this would be the senders passphrase, but in this example
-  #   #       it is decrypting the file intended for the recipient.
-  #   IOStreams::Pgp::Reader.default_passphrase = 'receiver_passphrase'
-  #
-  #   # Generate encrypted file for a specific recipient and sign it with senders credentials
-  #   data = %w(this is some data that should be encrypted using pgp)
-  #   IOStreams.writer('secure.gpg', streams: {pgp: {recipient: 'receiver@example.org'}}) do |output|
-  #     data.each { |word| output.puts(word) }
-  #   end
-  #
-  #   # Decrypt the file sent to `receiver@example.org` using its private key
-  #   # Recipient must also have the senders public key to verify the signature
-  #   IOStreams.reader('secure.gpg') do |stream|
-  #     while data = stream.read(10)
-  #       p data
-  #     end
-  #   end
-  #
-  # FAQ:
-  # - If you get not trusted errors
-  #     gpg --edit-key sender@example.org
-  #   Select highest level: 5
-  #
-  # Delete test keys:
-  #   IOStreams::Pgp.delete_keys(email: 'sender@example.org', private: true)
-  #   IOStreams::Pgp.delete_keys(email: 'receiver@example.org', private: true)
-  #
   # Limitations
   # - Designed for processing larger files since a process is spawned for each file processed.
   # - For small in memory files or individual emails, use the 'opengpgme' library.
-  #
-  # Compression Performance:
-  #   Running tests on an Early 2015 Macbook Pro Dual Core with Ruby v2.3.1
-  #
-  #   Input file: test.log 3.6GB
-  #     :none:  size: 3.6GB  write:  52s  read:  45s
-  #     :zip:   size: 411MB  write:  75s  read:  31s
-  #     :zlib:  size: 241MB  write:  66s  read:  23s  ( 756KB Memory )
-  #     :bzip2: size: 129MB  write: 430s  read: 130s  ( 5MB Memory )
-  #
-  # Notes:
-  # - Tested against gnupg v1.4.21 and v2.0.30
-  # - Does not work yet with gnupg v2.1. Pull Requests welcome.
   module Pgp
     autoload :Reader, "io_streams/pgp/reader"
     autoload :Writer, "io_streams/pgp/writer"
data/lib/io_streams/record/reader.rb CHANGED
@@ -7,7 +7,7 @@ module IOStreams
       # Read a record at a time from a line stream
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a record reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -17,7 +17,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -25,19 +25,44 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#each]
-      #     Anything that returns one line / record at a time when #each is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      #   For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format specific options. For example, `:fixed` format requires the file definition.
+      #
+      #   columns [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid any issues when persistence
+      #       with MongoDB when it converts symbol keys to strings.
+      #
+      #   allowed_columns [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown [true|false]
+      #     true:
+      #       Skip columns not present in the `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader = line_reader
         @cleanse_header = cleanse_header
       end
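
With `original_file_name` now flowing from `self.file` into `initialize` and on into `IOStreams::Tabular`, the record reader can infer its format from the file name instead of requiring an explicit `:format`. A minimal sketch, assuming a newline-delimited JSON file with the hypothetical name `users.json`:

```ruby
require "iostreams"

# The ".json" extension is enough for the record reader to pick the JSON
# parser; each line is parsed into a Hash.
IOStreams.path("users.json").each(:hash) do |record|
  p record
end
```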
data/lib/io_streams/record/writer.rb CHANGED
@@ -9,7 +9,7 @@ module IOStreams
       # Write a record as a Hash at a time to a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a record writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -19,7 +19,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -27,17 +27,42 @@ module IOStreams
       # Parse a delimited data source.
       #
       # Parameters
-      #   delimited: [#<<]
-      #     Anything that accepts a line / record at a time when #<< is called on it.
-      #
       #   format: [Symbol]
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
-      #   For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      #   file_name: [String]
+      #     When `:format` is not supplied the file name can be used to infer the required format.
+      #     Optional. Default: nil
+      #
+      #   format_options: [Hash]
+      #     Any specialized format specific options. For example, `:fixed` format requires the file definition.
+      #
+      #   columns [Array<String>]
+      #     The header columns when the file does not include a header row.
+      #     Note:
+      #       It is recommended to keep all columns as strings to avoid any issues when persistence
+      #       with MongoDB when it converts symbol keys to strings.
+      #
+      #   allowed_columns [Array<String>]
+      #     List of columns to allow.
+      #     Default: nil ( Allow all columns )
+      #     Note:
+      #       When supplied any columns that are rejected will be returned in the cleansed columns
+      #       as nil so that they can be ignored during processing.
+      #
+      #   required_columns [Array<String>]
+      #     List of columns that must be present, otherwise an Exception is raised.
+      #
+      #   skip_unknown [true|false]
+      #     true:
+      #       Skip columns not present in the `allowed_columns` by cleansing them to nil.
+      #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+      #     false:
+      #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
         @line_writer = line_writer
 
         # Render header line when `columns` is supplied.
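
The writer side mirrors the reader: the target file name now reaches `Tabular`, so writing hashes to a `.json` path renders JSON lines without an explicit `:format`. A minimal sketch matching the behavior exercised by the new tests later in this diff:

```ruby
require "iostreams"

# The ".json" extension selects the JSON renderer for the :hash writer;
# each hash becomes one JSON object per line.
IOStreams.path("users.json").writer(:hash) do |io|
  io << {"name" => "Jack Jones", "login" => "jjones"}
  io << {"name" => "Jill Smith", "login" => "jsmith"}
end
```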
data/lib/io_streams/row/reader.rb CHANGED
@@ -5,7 +5,7 @@ module IOStreams
       # Read a line as an Array at a time from a stream.
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :each
-      def self.stream(line_reader, original_file_name: nil, **args)
+      def self.stream(line_reader, **args)
         # Pass-through if already a row reader
         return yield(line_reader) if line_reader.is_a?(self.class)
 
@@ -15,7 +15,7 @@ module IOStreams
       # When reading from a file also add the line reader stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args)
         IOStreams::Line::Reader.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args)
+          yield new(io, original_file_name: original_file_name, **args)
         end
       end
 
@@ -29,12 +29,12 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       #   For all other parameters, see Tabular::Header.new
-      def initialize(line_reader, cleanse_header: true, **args)
+      def initialize(line_reader, cleanse_header: true, original_file_name: nil, **args)
         unless line_reader.respond_to?(:each)
           raise(ArgumentError, "Stream must be a IOStreams::Line::Reader or implement #each")
         end
 
-        @tabular = IOStreams::Tabular.new(**args)
+        @tabular = IOStreams::Tabular.new(file_name: original_file_name, **args)
         @line_reader = line_reader
         @cleanse_header = cleanse_header
       end
data/lib/io_streams/row/writer.rb CHANGED
@@ -12,7 +12,7 @@ module IOStreams
       #
       # Note:
       # - The supplied stream _must_ already be a line stream, or a stream that responds to :<<
-      def self.stream(line_writer, original_file_name: nil, **args)
+      def self.stream(line_writer, **args)
         # Pass-through if already a row writer
         return yield(line_writer) if line_writer.is_a?(self.class)
 
@@ -22,7 +22,7 @@ module IOStreams
       # When writing to a file also add the line writer stream
       def self.file(file_name, original_file_name: file_name, delimiter: $/, **args, &block)
         IOStreams::Line::Writer.file(file_name, original_file_name: original_file_name, delimiter: delimiter) do |io|
-          yield new(io, **args, &block)
+          yield new(io, original_file_name: original_file_name, **args, &block)
         end
       end
 
@@ -36,10 +36,10 @@ module IOStreams
       #     :csv, :hash, :array, :json, :psv, :fixed
       #
       #   For all other parameters, see Tabular::Header.new
-      def initialize(line_writer, columns: nil, **args)
+      def initialize(line_writer, columns: nil, original_file_name: nil, **args)
         raise(ArgumentError, "Stream must be a IOStreams::Line::Writer or implement #<<") unless line_writer.respond_to?(:<<)
 
-        @tabular = IOStreams::Tabular.new(columns: columns, **args)
+        @tabular = IOStreams::Tabular.new(columns: columns, file_name: original_file_name, **args)
        @line_writer = line_writer
 
        # Render header line when `columns` is supplied.
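
Row (array) streams get the same treatment. When the target is a `.json` path, the supplied `columns` become the keys of each rendered object, as the new `path_test.rb` case further down exercises. A minimal sketch:

```ruby
require "iostreams"

# Each array row is combined with the columns and rendered as one JSON
# object per line, purely from the ".json" extension.
IOStreams.path("users.json").writer(:array, columns: %w[name login]) do |io|
  io << ["Jack Jones", "jjones"]
  io << ["Jill Smith", "jsmith"]
end
```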
data/lib/io_streams/stream.rb CHANGED
@@ -282,20 +282,20 @@ module IOStreams
    def line_reader(embedded_within: nil, **args)
      embedded_within = '"' if embedded_within.nil? && builder.file_name&.include?(".csv")
 
-      stream_reader { |io| yield IOStreams::Line::Reader.new(io, embedded_within: embedded_within, **args) }
+      stream_reader { |io| yield IOStreams::Line::Reader.new(io, original_file_name: builder.file_name, embedded_within: embedded_within, **args) }
    end
 
    # Iterate over a file / stream returning each line as an array, one at a time.
    def row_reader(delimiter: nil, embedded_within: nil, **args)
      line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Row::Reader.new(io, **args)
+        yield IOStreams::Row::Reader.new(io, original_file_name: builder.file_name, **args)
      end
    end
 
    # Iterate over a file / stream returning each line as a hash, one at a time.
    def record_reader(delimiter: nil, embedded_within: nil, **args)
      line_reader(delimiter: delimiter, embedded_within: embedded_within) do |io|
-        yield IOStreams::Record::Reader.new(io, **args)
+        yield IOStreams::Record::Reader.new(io, original_file_name: builder.file_name, **args)
      end
    end
 
@@ -306,19 +306,19 @@ module IOStreams
    def line_writer(**args, &block)
      return block.call(io_stream) if io_stream&.is_a?(IOStreams::Line::Writer)
 
-      writer { |io| IOStreams::Line::Writer.stream(io, **args, &block) }
+      writer { |io| IOStreams::Line::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
    end
 
    def row_writer(delimiter: $/, **args, &block)
      return block.call(io_stream) if io_stream&.is_a?(IOStreams::Row::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Row::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
    end
 
    def record_writer(delimiter: $/, **args, &block)
      return block.call(io_stream) if io_stream&.is_a?(IOStreams::Record::Writer)
 
-      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, **args, &block) }
+      line_writer(delimiter: delimiter) { |io| IOStreams::Record::Writer.stream(io, original_file_name: builder.file_name, **args, &block) }
    end
  end
end
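
`builder.file_name` is whatever name the stream was built with, so even a bare IO object gains format inference once a file name hint is supplied. A minimal sketch combining this with `IOStreams.stream` (the `report.json` name is hypothetical):

```ruby
require "iostreams"
require "stringio"

io = StringIO.new
# The file_name hint flows through line_writer/record_writer into Tabular,
# which picks the JSON renderer from the ".json" extension.
IOStreams.stream(io, file_name: "report.json").writer(:hash) do |stream|
  stream << {"status" => "ok", "count" => 2}
end
puts io.string
```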
data/lib/io_streams/tabular.rb CHANGED
@@ -52,7 +52,35 @@ module IOStreams
    #   format: [Symbol]
    #     :csv, :hash, :array, :json, :psv, :fixed
    #
-    #   For all other parameters, see Tabular::Header.new
+    #   file_name: [String]
+    #     When `:format` is not supplied the file name can be used to infer the required format.
+    #     Optional. Default: nil
+    #
+    #   format_options: [Hash]
+    #     Any specialized format specific options. For example, `:fixed` format requires the file definition.
+    #
+    #   columns [Array<String>]
+    #     The header columns when the file does not include a header row.
+    #     Note:
+    #       It is recommended to keep all columns as strings to avoid any issues when persistence
+    #       with MongoDB when it converts symbol keys to strings.
+    #
+    #   allowed_columns [Array<String>]
+    #     List of columns to allow.
+    #     Default: nil ( Allow all columns )
+    #     Note:
+    #       When supplied any columns that are rejected will be returned in the cleansed columns
+    #       as nil so that they can be ignored during processing.
+    #
+    #   required_columns [Array<String>]
+    #     List of columns that must be present, otherwise an Exception is raised.
+    #
+    #   skip_unknown [true|false]
+    #     true:
+    #       Skip columns not present in the `allowed_columns` by cleansing them to nil.
+    #       #as_hash will skip these additional columns entirely as if they were not in the file at all.
+    #     false:
+    #       Raises Tabular::InvalidHeader when a column is supplied that is not in the whitelist.
    def initialize(format: nil, file_name: nil, format_options: nil, **args)
      @header = Header.new(**args)
      klass =
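
`Tabular.new` already accepted `file_name:`; the upstream changes in this release simply start feeding it. A minimal sketch of the inference in isolation, assuming `Tabular#render` behaves as its use in the Row/Record writers suggests:

```ruby
require "iostreams"

# file_name: lets Tabular infer :json from the extension, no format: needed.
tabular = IOStreams::Tabular.new(file_name: "sample.json")
puts tabular.render({"name" => "Jack Jones", "login" => "jjones"})
```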
data/lib/io_streams/utils.rb CHANGED
@@ -1,4 +1,5 @@
 require "uri"
+require "tmpdir"
 module IOStreams
   module Utils
     MAX_TEMP_FILE_NAME_ATTEMPTS = 5
data/lib/io_streams/version.rb CHANGED
@@ -1,3 +1,3 @@
 module IOStreams
-  VERSION = "1.2.0".freeze
+  VERSION = "1.2.1".freeze
 end
data/test/path_test.rb CHANGED
@@ -1,8 +1,24 @@
 require_relative "test_helper"
+require "json"
 
 module IOStreams
   class PathTest < Minitest::Test
     describe IOStreams do
+      let :records do
+        [
+          {"name" => "Jack Jones", "login" => "jjones"},
+          {"name" => "Jill Smith", "login" => "jsmith"}
+        ]
+      end
+
+      let :expected_json do
+        records.collect(&:to_json).join("\n") + "\n"
+      end
+
+      let :json_file_name do
+        "/tmp/io_streams/abc.json"
+      end
+
       describe ".root" do
         it "return default path" do
           path = ::File.expand_path(::File.join(__dir__, "../tmp/default"))
@@ -60,6 +76,39 @@ module IOStreams
          IOStreams.path("s3://a.xyz")
          assert_equal :s3, path
        end
+
+        it "hash writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:hash) do |io|
+            records.each { |hash| io << hash }
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "hash reader detects json format from file name" do
+          ::File.open(json_file_name, "wb") { |file| file.write(expected_json) }
+          rows = []
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.each(:hash) do |row|
+            rows << row
+          end
+          actual = rows.collect(&:to_json).join("\n") + "\n"
+          path.delete
+          assert_equal expected_json, actual
+        end
+
+        it "array writer detects json format from file name" do
+          path = IOStreams.path("/tmp/io_streams/abc.json")
+          path.writer(:array, columns: %w[name login]) do |io|
+            io << ["Jack Jones", "jjones"]
+            io << ["Jill Smith", "jsmith"]
+          end
+          actual = path.read
+          path.delete
+          assert_equal expected_json, actual
+        end
      end
 
      describe ".temp_file" do
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: iostreams
 version: !ruby/object:Gem::Version
-  version: 1.2.0
+  version: 1.2.1
 platform: ruby
 authors:
 - Reid Morrison
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-29 00:00:00.000000000 Z
+date: 2020-05-19 00:00:00.000000000 Z
 dependencies: []
 description:
 email:
@@ -60,7 +60,6 @@ files:
 - lib/io_streams/tabular/parser/psv.rb
 - lib/io_streams/tabular/utility/csv_row.rb
 - lib/io_streams/utils.rb
-- lib/io_streams/utils/reliable_http.rb
 - lib/io_streams/version.rb
 - lib/io_streams/writer.rb
 - lib/io_streams/xlsx/reader.rb
@@ -131,7 +130,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-rubygems_version: 3.0.8
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Input and Output streaming for Ruby.
data/lib/io_streams/utils/reliable_http.rb DELETED
@@ -1,98 +0,0 @@
-require "net/http"
-require "uri"
-module IOStreams
-  module Utils
-    class ReliableHTTP
-      attr_reader :username, :password, :max_redirects, :url
-
-      # Reliable HTTP implementation with support for:
-      # * HTTP Redirects
-      # * Basic authentication
-      # * Raises an exception anytime the HTTP call is not successful.
-      # * TODO: Automatic retries with a logarithmic backoff strategy.
-      #
-      # Parameters:
-      #   url: [String]
-      #     URI of the file to download.
-      #     Example:
-      #       https://www5.fdic.gov/idasp/Offices2.zip
-      #       http://hostname/path/file_name
-      #
-      #     Full url showing all the optional elements that can be set via the url:
-      #       https://username:password@hostname/path/file_name
-      #
-      #   username: [String]
-      #     When supplied, basic authentication is used with the username and password.
-      #
-      #   password: [String]
-      #     Password to use use with basic authentication when the username is supplied.
-      #
-      #   max_redirects: [Integer]
-      #     Maximum number of http redirects to follow.
-      def initialize(url, username: nil, password: nil, max_redirects: 10)
-        uri = URI.parse(url)
-        unless %w[http https].include?(uri.scheme)
-          raise(ArgumentError, "Invalid URL. Required Format: 'http://<host_name>/<file_name>', or 'https://<host_name>/<file_name>'")
-        end
-
-        @username = username || uri.user
-        @password = password || uri.password
-        @max_redirects = max_redirects
-        @url = url
-      end
-
-      # Read a file using an http get.
-      #
-      # For example:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').reader {|file| puts file.read}
-      #
-      # Read the file without unzipping and streaming the first file in the zip:
-      #   IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').stream(:none).reader {|file| puts file.read}
-      #
-      # Notes:
-      # * Since Net::HTTP download only supports a push stream, the data is streamed into a tempfile first.
-      def post(&block)
-        handle_redirects(Net::HTTP::Post, url, max_redirects, &block)
-      end
-
-      def get(&block)
-        handle_redirects(Net::HTTP::Get, url, max_redirects, &block)
-      end
-
-      private
-
-      def handle_redirects(request_class, uri, max_redirects, &block)
-        uri = URI.parse(uri) unless uri.is_a?(URI)
-        result = nil
-        raise(IOStreams::Errors::CommunicationsFailure, "Too many redirects") if max_redirects < 1
-
-        Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
-          request = request_class.new(uri)
-          request.basic_auth(username, password) if username
-
-          http.request(request) do |response|
-            raise(IOStreams::Errors::CommunicationsFailure, "Invalid URL: #{uri}") if response.is_a?(Net::HTTPNotFound)
-
-            if response.is_a?(Net::HTTPUnauthorized)
-              raise(IOStreams::Errors::CommunicationsFailure, "Authorization Required: Invalid :username or :password.")
-            end
-
-            if response.is_a?(Net::HTTPRedirection)
-              new_uri = response["location"]
-              return handle_redirects(request_class, new_uri, max_redirects: max_redirects - 1, &block)
-            end
-
-            unless response.is_a?(Net::HTTPSuccess)
-              raise(IOStreams::Errors::CommunicationsFailure, "Invalid response code: #{response.code}")
-            end
-
-            yield(response) if block_given?
-
-            result = response
-          end
-        end
-        result
-      end
-    end
-  end
-end
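
`ReliableHTTP` leaves the gem entirely in this release (note the matching removal from the `files:` list in the metadata diff above); HTTP paths continue to be handled by `IOStreams::Paths::HTTP`, which this diff touches only to add `require "cgi"`. The usage pattern from the deleted class's own comments still applies:

```ruby
require "iostreams"

# Stream the remote zip without unzipping it, as documented in the removed class.
IOStreams.path("https://www5.fdic.gov/idasp/Offices2.zip").stream(:none).reader { |file| puts file.read }
```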