micdrop 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 4e2a6458190e5d2bc6537824586b135a8d6a9a0c24ab55d9b4de5db9197b5c88
4
+ data.tar.gz: 869d3c01befdb76e074c339caf9288a3be3734365fe5d3b4b9d6cf0e772cdc91
5
+ SHA512:
6
+ metadata.gz: 182e84bd143229048a783333aabaa4fdbf7ad26020b75c06ae6692e142ce223aad0262c60e3260916df96b1dc5432623b22e9e5ab40c2115fe4cb0b8aee1792d
7
+ data.tar.gz: 6b81a5ef76726c2738e09a8fdb6f117f8e1df2d7c0cb3c699eb0fc65efd7ba3584ebdb16791dcad4d8e9612d94eacd4a7a5860e59972b1bd7112ba858dcdb00b
@@ -0,0 +1,22 @@
1
+ // For format details, see https://aka.ms/devcontainer.json. For config options, see the
2
+ // README at: https://github.com/devcontainers/templates/tree/main/src/ruby
3
+ {
4
+ "name": "Ruby",
5
+ // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
6
+ "image": "mcr.microsoft.com/devcontainers/ruby:1-3.4-bullseye"
7
+
8
+ // Features to add to the dev container. More info: https://containers.dev/features.
9
+ // "features": {},
10
+
11
+ // Use 'forwardPorts' to make a list of ports inside the container available locally.
12
+ // "forwardPorts": [],
13
+
14
+ // Use 'postCreateCommand' to run commands after the container is created.
15
+ // "postCreateCommand": "ruby --version",
16
+
17
+ // Configure tool-specific properties.
18
+ // "customizations": {},
19
+
20
+ // Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
21
+ // "remoteUser": "root"
22
+ }
data/.rubocop.yml ADDED
@@ -0,0 +1,8 @@
1
+ AllCops:
2
+ TargetRubyVersion: 3.1
3
+
4
+ Style/StringLiterals:
5
+ EnforcedStyle: double_quotes
6
+
7
+ Style/StringLiteralsInInterpolation:
8
+ EnforcedStyle: double_quotes
@@ -0,0 +1,17 @@
1
+ {
2
+ // See https://go.microsoft.com/fwlink/?LinkId=733558
3
+ // for the documentation about the tasks.json format
4
+ "version": "2.0.0",
5
+ "tasks": [
6
+ {
7
+ "label": "test",
8
+ "type": "shell",
9
+ "command": "rake test"
10
+ },
11
+ {
12
+ "label": "build",
13
+ "type": "shell",
14
+ "command": "gem build micdrop.gemspec"
15
+ }
16
+ ]
17
+ }
data/README.md ADDED
@@ -0,0 +1,448 @@
1
+ # Micdrop
2
+
3
+ Extensible framework/library for migrating data from one source to another through a declarative interface. It is primarily intended for use in:
4
+
5
+ * Import scripts
6
+ * Export scripts
7
+ * Cross-application data migrations
8
+
9
+ At its core, the library's operation is quite simple: loop over the rows of the source data, perform some transformations, and output the transformed data to the sink.
10
+
11
+ ```ruby
12
+ # Many types can be used as sources and sinks, including CSV files and Sequel Datasets
13
+ source = CSV.read("data_source.csv", headers:true)
14
+ sink = DB[:destination_table]
15
+
16
+ # Migrations define how data from the source is transformed to go into the sink
17
+ Micdrop.migrate source, sink do
18
+ take "Name", put: :name
19
+ take "Birth Date" do
20
+ parse_date "%m/%d/%y"
21
+ format_date "%Y-%m-%d"
22
+ put :dob
23
+ end
24
+ take "Deceased?" do
25
+ parse_boolean
26
+ default false
27
+ put :is_deceased
28
+ end
29
+ end
30
+ ```
31
+
32
+ > **Note:**
33
+ >
34
+ > This is a re-implementation in Ruby of my [previous attempt](https://github.com/dmjohnsson23/micdrop) at this concept in Python. Ruby provides a far superior syntax for this concept than Python. This version does not yet have the full feature set of the Python version, but is still quite usable.
35
+
36
+ ## Terminology
37
+
38
+ * Source: A source of data at the beginning of a pipeline; a sequence of multiple Records.
39
+ * Sink: The final destination in which Records are to be stored after their transformations.
40
+ * Record: A single record in a Source. (For example, a database row.)
41
+ * Item: A sub-component of a Record. (Such as a database column. Items may also be Records themselves if the parent Record represents structured data such as JSON.)
42
+ * Take: Extract a single Item from a Record or Collector
43
+ * Put: Deposit a single item into a Collector
44
+ * Collector: Similar to a Record, but intended to be filled by the migration rather than coming from the source. (A single Collector exists by default, which will be pushed to the Sink. However, you can also use manually-created Collectors as Items to build up hierarchical structures.)
45
+
46
+ For the curious, the library name is partially an abbreviation of the words "Migrate, Import, and Convert".
47
+
48
+ ## Usage
49
+
50
+ Before we can begin a migration, we need a source and a sink.
51
+
52
+ Essentially, any Ruby object which meets the following criteria can be used as a source:
53
+
54
+ * The object responds to `:each`, `:each_with_index`, and/or `:each_pair` (so, any `Enumerable` works)
55
+ * The items yielded by `:each` and friends respond to `:[]`
56
+
57
+ A sink is similar, but has a single criterion: it must respond to `:<<`.
58
+
59
+ By default, the `:<<` method of the sink will receive a hash. However, if another object is needed,
60
+ the sink may optionally implement `:make_collector` to return another object instead. The collector
61
+ must respond to `:[]=` but otherwise may be any object you wish.
62
+
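The contract above can be illustrated without Micdrop itself. The following is a minimal sketch in plain Ruby: `UpcaseKeySink` is a hypothetical class invented for this example, and the loop is only a stand-in for what a migration does with a source and sink, not the library's actual implementation.

```ruby
# Hypothetical sink: anything that responds to `<<` satisfies the contract.
class UpcaseKeySink
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Receives one collected record (a Hash by default) per call.
  def <<(record)
    @rows << record.transform_keys { |key| key.to_s.upcase }
    self
  end
end

# Any enumerable of objects that respond to `[]` can act as a source.
source = [{ a: 1, b: 2 }, { a: 3, b: 4 }]
sink = UpcaseKeySink.new

# Stand-in for the migration loop (an assumption, NOT Micdrop's code):
# read items with `[]`, fill a collector with `[]=`, push it with `<<`.
source.each do |record|
  collector = {}
  collector[:a] = record[:a]
  collector[:b] = record[:b]
  sink << collector
end

p sink.rows
```

Because only `<<` is required, the same shape would fit a sink that writes to a file or a database instead of an in-memory array.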
63
+ ### Simple Migrations
64
+
65
+ Let's begin with the simplest possible migration:
66
+
67
+ ```ruby
68
+ # Many objects can be used as sources. While dedicated source classes exist for more complex
69
+ # use-cases, even a simple array of hashes can be used as a source.
70
+ source = [
71
+ {a:1, b:2},
72
+ {a:3, b:4},
73
+ {a:5, b:6},
74
+ ]
75
+
76
+ # Likewise, a simple array can also be used as a sink, though special classes exist for more
77
+ # complex use-cases.
78
+ sink = []
79
+
80
+ Micdrop.migrate source, sink do # This block is executed for every record in the source
81
+ # If no conversion is needed, you can simply Take items and Put them in the appropriate place
82
+ take :a, put: "A"
83
+ take :b, put: "B"
84
+ end
85
+
86
+ # `sink` now looks like this:
87
+ [
88
+ {"A"=>1, "B"=>2},
89
+ {"A"=>3, "B"=>4},
90
+ {"A"=>5, "B"=>6},
91
+ ]
92
+ ```
93
+
94
+ Of course, data rarely maps so cleanly in the real world; conversion is usually needed. Adding a block to the Take allows you to specify transforms:
95
+
96
+ ```ruby
97
+ source = [
98
+ {a:"Yes", b:"08/07/22", c:"Stuff"},
99
+ {a:"Yes", b:"24/04/24", c:"Things"},
100
+ {a:"No", b:"11/12/21", c:nil},
101
+ ]
102
+ sink = []
103
+
104
+ Micdrop.migrate source, sink do
105
+ take :a, put: "A" do
106
+ parse_boolean
107
+ end
108
+ take :b, put: "B" do
109
+ # We'll parse the date from a day-first string, and then format it in the new format
110
+ parse_date "%d/%m/%y"
111
+ format_date "%Y-%m-%d"
112
+ end
113
+ take :c do
114
+ default "Whatsit"
115
+ # The Put can optionally be specified in the block body rather than as a method parameter
116
+ put "C"
117
+ end
118
+ # Method chaining is also allowed. The previous block could alternatively have been written as:
119
+ take(:c).default("Whatsit").put("C")
120
+ end
121
+
122
+ # `sink` now looks like this:
123
+ [
124
+ {"A"=>true, "B"=>"2022-07-08", "C"=>"Stuff"},
125
+ {"A"=>true, "B"=>"2024-04-24", "C"=>"Things"},
126
+ {"A"=>false, "B"=>"2021-12-11", "C"=>"Whatsit"},
127
+ ]
128
+ ```
129
+
130
+ Each block acts as a pipeline, with each transform being executed sequentially and modifying the value in-place. Your pipelines can be arbitrarily complex, and even include multiple Puts at different stages of the pipeline.
131
+
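To make the parse-then-format step concrete, here is the equivalent two-stage conversion in plain Ruby. This is a sketch that assumes `parse_date` and `format_date` accept the same directives as `Date.strptime` and `Date#strftime`; that mapping is an assumption, not something confirmed by the library.

```ruby
require "date"

raw = "24/04/24" # day/month/two-digit year

# Stage 1: String -> Date, reading the day first.
parsed = Date.strptime(raw, "%d/%m/%y")

# Stage 2: Date -> String in year-month-day order.
formatted = parsed.strftime("%Y-%m-%d")

# Directive order matters: a month-first pattern cannot read this value,
# because there is no month 24.
month_first_ok = begin
  Date.strptime(raw, "%m/%d/%y")
  true
rescue ArgumentError
  false
end

puts formatted
puts month_first_ok
```

Checking one or two sample values this way before running a full migration is a cheap way to catch a swapped day and month.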
132
+ ### Destructuring Data
133
+
134
+ If your source data is more structured, you can use `scope` and `extract` to navigate the tree:
135
+
136
+ ```ruby
137
+ source = [
138
+ {some:{deeply:{nested:{data: "1", stuff: "2"}}, other: "3"}},
139
+ {some:{deeply:{nested:{data: "4", stuff: "5"}}, other: "6"}},
140
+ {some:{deeply:{nested:{data: "7", stuff: "8"}}, other: "9"}},
141
+ ]
142
+ sink = []
143
+
144
+ Micdrop.migrate source, sink do
145
+ take :some do
146
+ scope do
147
+ # The `scope` method prevents operations in this block from affecting the value in
148
+ # the outer `take` block. The `extract` method traverses down the tree.
149
+ extract :other
150
+ put :some_other
151
+ end
152
+ scope do
153
+ # `extract` can be used multiple times to go down multiple levels
154
+ extract :deeply
155
+ extract :nested
156
+ # And scopes can be nested
157
+ scope do
158
+ extract :data
159
+ put :some_deeply_nested_data
160
+ end
161
+ scope do
162
+ extract :stuff
163
+ put :some_deeply_nested_stuff
164
+ end
165
+ end
166
+ end
167
+ # If you only need a single item in a deeply nested structure, you can chain all the methods
168
+ # directly on the `take` as well
169
+ take(:some).extract(:deeply).extract(:nested).extract(:stuff).put(:some_deeply_nested_stuff)
170
+ # Or even use `take_dig`
171
+ take_dig :some, :deeply, :nested, :stuff, put: :some_deeply_nested_stuff
172
+ end
173
+ ```
174
+
175
+ ### Outputting Structured Data
176
+
177
+ By default, Micdrop assumes your output data follows a normal row/column structure, rather than containing complex structures. Micdrop has some limited support for building up structure, though more complex tools are in the works for the future.
178
+
179
+ The `collect_list` method is currently the primary supported way of building up structure. It takes multiple `take`s and allows them to be operated on in a single pipeline:
180
+
181
+ ```ruby
182
+ source = [
183
+ {person: 1, home_phone: nil, work_phone: "(354) 756-4796", cell_phone: "(234) 678-7564"},
184
+ {person: 2, home_phone: "(867) 123-9748", work_phone: nil, cell_phone: "(475) 364-8365"},
185
+ ]
186
+ sink = []
187
+
188
+ Micdrop.migrate source, sink do
189
+ take :person, put: :person_id
190
+ collect_list(take(:home_phone), take(:work_phone), take(:cell_phone)) do
191
+ # Here, the value is a list containing the values of all three `take`s
192
+ # We can remove the nil values from the list
193
+ compact
194
+ # Then join the remaining as a JSON-formatted list
195
+ format_json
196
+ put :phones
197
+ end
198
+ end
199
+
200
+ # `sink` now looks like this:
201
+ [
202
+ {person_id: 1, phones: '["(354) 756-4796", "(234) 678-7564"]'},
203
+ {person_id: 2, phones: '["(867) 123-9748", "(475) 364-8365"]'},
204
+ ]
205
+ ```
206
+
207
+ There are several other methods that are useful for operating on collected lists as well, such as `filter`, `map`, `coalesce`, and `map_apply`.
208
+
209
+ In addition to `collect_list`, there is also `collect_kv` which takes a hash of `take`s as the first argument:
210
+
211
+ ```ruby
212
+ Micdrop.migrate source, sink do
213
+ take :person, put: :person_id
214
+ collect_kv({"Home"=>take(:home_phone), "Work"=>take(:work_phone), "Cell"=>take(:cell_phone)}) do
215
+ # Here, the value is a hash containing the values of all three `take`s
216
+ end
217
+ end
218
+ ```
219
+
220
+ And also `collect_format_string`, which collects multiple items into a format string:
221
+
222
+
223
+ ```ruby
224
+ Micdrop.migrate source, sink do
225
+ take :person, put: :person_id
226
+ collect_format_string("Home: %s, Work: %s, Cell: %s", take(:home_phone), take(:work_phone), take(:cell_phone)) do
227
+ # Here, the value is a string with the `take`n values inserted
228
+ end
229
+ end
230
+ ```
231
+
232
+ Finally, you can also create nested structures on the fly using the `put_bury` method in place of the normal `put`.
233
+
234
+
235
+ ```ruby
236
+ Micdrop.migrate source, sink do
237
+ take :person, put: :person_id
238
+ take :home_phone do
239
+ put_bury :phones, :home unless value.nil?
240
+ end
241
+ take :work_phone do
242
+ put_bury :phones, :work unless value.nil?
243
+ end
244
+ take :cell_phone do
245
+ put_bury :phones, :cell unless value.nil?
246
+ end
247
+ end
248
+
249
+ # `sink` now looks like this:
250
+
251
+ [
252
+ { person_id: 1, phones: {work: "(354) 756-4796", cell: "(234) 678-7564"} },
253
+ { person_id: 2, phones: {home: "(867) 123-9748", cell: "(475) 364-8365"} },
254
+ ]
255
+ ```
256
+
257
+ `put_bury` operates similarly to Ruby's standard `dig` method, but in reverse.
258
+
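The "`dig` in reverse" idea can be sketched in a few lines of plain Ruby. The `bury` helper below is hypothetical and written only for illustration; it is not Micdrop's implementation of `put_bury`.

```ruby
# Hypothetical helper (not part of Micdrop): store `value` under a path of
# keys, creating intermediate hashes as needed.
def bury(hash, *keys, value)
  *parents, last = keys
  target = parents.reduce(hash) { |node, key| node[key] ||= {} }
  target[last] = value
  hash
end

record = { person_id: 1 }
bury(record, :phones, :work, "(354) 756-4796")
bury(record, :phones, :cell, "(234) 678-7564")

# `dig` walks the same path in the other direction.
p record.dig(:phones, :work)
p record
```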
259
+ ### Creating Multiple Output Records
260
+
261
+ For instances where a single source record maps to multiple sink records, there are techniques for outputting multiple records. The first is simply to use `flush`.
262
+
263
+ ```ruby
264
+ source = [
265
+ {person: 1, home_phone: "(634) 654-2457", work_phone: "(354) 756-4796", cell_phone: "(234) 678-7564"},
266
+ {person: 2, home_phone: "(867) 123-9748", work_phone: "(234) 534-2667", cell_phone: "(475) 364-8365"},
267
+ ]
268
+ sink = []
269
+
270
+ Micdrop.migrate source, sink do
271
+ take :person, put: :person_id
272
+ take :home_phone, put: :number
273
+ static "Home", put: :type
274
+ flush # This creates the first record and resets the collector
275
+ # Now we start the second record
276
+ take :person, put: :person_id
277
+ take :work_phone, put: :number
278
+ static "Work", put: :type
279
+ flush
280
+ # And the third record
281
+ take :person, put: :person_id
282
+ take :cell_phone, put: :number
283
+ static "Cell", put: :type
284
+ # There is an implicit flush at the end of the block, so we don't need an explicit one (though it won't hurt anything)
285
+ end
286
+
287
+ # `sink` now looks like this:
288
+ [
289
+ {person_id: 1, number: "(634) 654-2457", type: "Home"},
290
+ {person_id: 1, number: "(354) 756-4796", type: "Work"},
291
+ {person_id: 1, number: "(234) 678-7564", type: "Cell"},
292
+ {person_id: 2, number: "(867) 123-9748", type: "Home"},
293
+ {person_id: 2, number: "(234) 534-2667", type: "Work"},
294
+ {person_id: 2, number: "(475) 364-8365", type: "Cell"},
295
+ ]
296
+ ```
297
+
298
+ `flush` takes an optional `reset` parameter that is true by default. If set to false, the output will still be generated, but the collector will not be reset.
299
+
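The effect of `reset` is easiest to see in a toy model. The sketch below uses a hypothetical `ToyCollector` class with `reset` as a keyword argument; both are assumptions made for illustration and do not reflect how Micdrop is implemented.

```ruby
# Toy model of a collector with `flush` (hypothetical, not Micdrop's code).
class ToyCollector
  def initialize(sink)
    @sink = sink
    @current = {}
  end

  def put(key, value)
    @current[key] = value
  end

  # Push a copy of the current record to the sink; optionally start a new one.
  def flush(reset: true)
    @sink << @current.dup
    @current = {} if reset
  end
end

sink = []
collector = ToyCollector.new(sink)

collector.put(:person_id, 1)
collector.put(:number, "(634) 654-2457")
collector.flush(reset: false) # :person_id and :number stay in place

collector.put(:number, "(354) 756-4796") # overwrite only what changes
collector.flush

p sink
```

In this model, skipping the reset avoids repeating the `:person_id` put for every output record.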
300
+ In cases where iteration is desired, `each_subrecord` provides a convenient interface:
301
+
302
+ ```ruby
303
+ source = [
304
+ {person: 1, addresses: [{line1: "123 Example St.", city: "Anytown", state: "AL", zip: "12345", role: "Mailing"}]},
305
+ {person: 2, addresses: [{line1: "123 Any Way", city: "Thereabouts", state: "AK", zip: "98765", role: "Home"}, {line1: "PO Box 123", city: "Thereabouts", state: "AK", zip: "98765", role: "Mailing"}]},
306
+ ]
307
+ sink = []
308
+
309
+ Micdrop.migrate source, sink do
310
+ # Save this so we can `put` it separately in each record
311
+ person_id = take :person
312
+ # Iterate each address, and automatically flush and reset after each
313
+ take(:addresses).each_subrecord flush: true, reset: true do
314
+ person_id.put :person_id
315
+ take :line1, put: :line1
316
+ take :city, put: :city
317
+ take :state, put: :state
318
+ take :zip, put: :zip
319
+ take :role, put: :role
320
+ end
321
+ end
322
+
323
+ # `sink` now looks like this:
324
+ [
325
+ {person_id: 1, line1: "123 Example St.", city: "Anytown", state: "AL", zip: "12345", role: "Mailing"},
326
+ {person_id: 2, line1: "123 Any Way", city: "Thereabouts", state: "AK", zip: "98765", role: "Home"},
327
+ {person_id: 2, line1: "PO Box 123", city: "Thereabouts", state: "AK", zip: "98765", role: "Mailing"},
328
+ ]
329
+ ```
330
+
331
+ There may also be cases where multiple sinks are needed, rather than merely multiple records in the same sink. For this use case, it is recommended to simply iterate the same source multiple times, once for each sink.
332
+
333
+ ```ruby
334
+ source = [
335
+ {id: 1, first_name: "Alice", last_name: "Anderson", mail_line1: "123 Example St.", mail_city: "Anytown", mail_state: "AL", mail_zip: "12345"},
336
+ {id: 2, first_name: "Bob", last_name: "Benson", mail_line1: "PO Box 123", mail_city: "Thereabouts", mail_state: "AK", mail_zip: "98765"},
337
+ ]
338
+ person_sink = []
339
+ address_sink = []
340
+
341
+ Micdrop.migrate source, person_sink do
342
+ take :id, put: :id
343
+ take :first_name, put: :fname
344
+ take :last_name, put: :lname
345
+ end
346
+
347
+ Micdrop.migrate source, address_sink do
348
+ take :id, put: :person_id
349
+ take :mail_line1, put: :line1
350
+ take :mail_city, put: :city
351
+ take :mail_state, put: :state
352
+ take :mail_zip, put: :zip
353
+ static "Mailing", put: :role
354
+ end
355
+ ```
356
+
357
+ ### Filling the Gaps
358
+
359
+ If you find yourself writing the same block multiple times, you can instead write it as a proc and apply that to the Takes.
360
+
361
+ ```ruby
362
+ source = [
363
+ {a:1, b:2},
364
+ {a:nil, b:4},
365
+ {a:5, b:nil},
366
+ ]
367
+ sink = []
368
+
369
+ default_0 = proc do
370
+ # This reusable pipeline can be as complex as needed
371
+ default 0
372
+ end
373
+
374
+ Micdrop.migrate source, sink do
375
+ # Both of the following syntaxes are equivalent
376
+ take :a, apply: default_0, put: "A"
377
+ take :b do
378
+ apply default_0
379
+ put "B"
380
+ end
381
+ end
382
+
383
+ # `sink` now looks like this:
384
+ [
385
+ {"A"=>1, "B"=>2},
386
+ {"A"=>0, "B"=>4},
387
+ {"A"=>5, "B"=>0},
388
+ ]
389
+ ```
390
+
391
+ Of course, you may need to modify data in ways that the existing transforms do not support. In that case, you can use plain old Ruby to fill the gaps. There are a few ways to do this:
392
+
393
+ ```ruby
394
+ source = [
395
+ {a:1, b:2, c:3},
396
+ {a:4, b:5, c:6},
397
+ {a:7, b:8, c:9},
398
+ ]
399
+ sink = []
400
+
401
+ Micdrop.migrate source, sink do
402
+ # You can pass a proc (or symbol) to the `convert` parameter
403
+ take :a, convert: proc {it + 1}, put: "A"
404
+ # Or you can use a `convert` block
405
+ take :b do
406
+ convert {it * 2}
407
+ put "B"
408
+ end
409
+ # Or you can use the `update` and `value` methods directly in the main item block
410
+ take :c do
411
+ if value.odd?
412
+ update "Odd"
413
+ else
414
+ update "Even"
415
+ end
416
+ put "C"
417
+ end
418
+ end
419
+
420
+ # `sink` now looks like this:
421
+ [
422
+ {"A"=>2, "B"=>4, "C"=>"Odd"},
423
+ {"A"=>5, "B"=>10, "C"=>"Even"},
424
+ {"A"=>8, "B"=>16, "C"=>"Odd"},
425
+ ]
426
+ ```
427
+
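One Ruby detail matters when writing conditions like the parity check above: only `nil` and `false` are falsy, so the integer `0` counts as true. A short standalone sketch (the `parity_label` helper is hypothetical, introduced just for this example):

```ruby
# Classify integers by parity. A bare `value % 2` would always be truthy,
# because only `nil` and `false` are falsy in Ruby.
def parity_label(value)
  value.odd? ? "Odd" : "Even"
end

remainder = 6 % 2                        # => 0
always_truthy = remainder ? true : false # 0 is still truthy

labels = [3, 6, 9].map { |n| parity_label(n) }

p always_truthy
p labels
```

Using `odd?` / `even?`, or an explicit comparison such as `value % 2 == 1`, keeps the branch honest.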
428
+ Transforms are nothing more than standard Ruby methods; there is no magic going on under the hood (other than the normal Ruby magic). So, if you find yourself needing the same pure-Ruby code often, you can just extend `ItemContext` with an additional method, which can then be used as a transform.
429
+
430
+ ```ruby
431
+ module Micdrop
432
+ class ItemContext
433
+ def subtract(v)
434
+ # Do whatever you like here; just make sure to save the result to @value
435
+ @value = @value - v
436
+ # Also return `self` to enable method chaining
437
+ self
438
+ end
439
+ end
440
+ end
441
+
442
+ Micdrop.migrate source, sink do
443
+ take :a do
444
+ subtract 1
445
+ put "A"
446
+ end
+ end
447
+ ```
448
+
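The pattern described above (transforms as ordinary methods that update a value and return `self`) can be tried in isolation. The sketch below uses a hypothetical `ToyItemContext` class, and its `apply` relies on `instance_exec`; both are assumptions for illustration, and Micdrop's real `ItemContext` may work differently.

```ruby
# Toy stand-in for an item context (hypothetical; the real class may differ).
class ToyItemContext
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # Replace nil with a fallback value.
  def default(fallback)
    @value = fallback if @value.nil?
    self
  end

  # Custom transform: save the result to @value, return self for chaining.
  def subtract(amount)
    @value -= amount
    self
  end

  # Run a reusable proc with this context as `self`, so bare transform
  # calls inside the proc resolve to the methods above.
  def apply(pipeline)
    instance_exec(&pipeline)
    self
  end
end

default_then_decrement = proc do
  default 10
  subtract 1
end

chained = ToyItemContext.new(nil).default(3).subtract(1).value
applied = [nil, 5].map { |raw| ToyItemContext.new(raw).apply(default_then_decrement).value }

p chained
p applied
```

Returning `self` from every transform is what lets the block style and the chained style coexist.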
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "minitest/test_task"
5
+
6
+ Minitest::TestTask.create
7
+
8
+ require "rubocop/rake_task"
9
+
10
+ RuboCop::RakeTask.new
11
+
12
+ task default: %i[test rubocop]
data/TODO.md ADDED
@@ -0,0 +1,15 @@
1
+ ## Short-Term Tasks
2
+
3
+ * CSV Source
4
+ * CSV Sink
5
+ * SQL Table Source
6
+ * SQL Query Source
7
+ * SQL Insert Sink
8
+ * SQL Update Sink
9
+ * SQL Query Sink
10
+ * SQL Lookups
11
+ * Filesystem Operations (Move, Copy, Read, Write)
12
+
13
+ ## Long-Term Ideas
14
+
15
+ * Query language inspired by [jq](https://jqlang.org/)