dump_cleaner 0.5.0

Files changed (45)
  1. checksums.yaml +7 -0
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +25 -0
  4. data/CHANGELOG.md +5 -0
  5. data/LICENSE.txt +21 -0
  6. data/README.md +295 -0
  7. data/Rakefile +8 -0
  8. data/doc/workflow_steps.md +1400 -0
  9. data/dump_cleaner.gemspec +38 -0
  10. data/exe/dump_cleaner +7 -0
  11. data/lib/dump_cleaner/cleaners/base_cleaner.rb +32 -0
  12. data/lib/dump_cleaner/cleaners/mysql_shell_dump_cleaner.rb +47 -0
  13. data/lib/dump_cleaner/cleaners/mysql_shell_dump_helpers.rb +11 -0
  14. data/lib/dump_cleaner/cleaners/mysql_shell_table_cleaner.rb +184 -0
  15. data/lib/dump_cleaner/cleanup/bytesize_helpers.rb +39 -0
  16. data/lib/dump_cleaner/cleanup/cleaning.rb +69 -0
  17. data/lib/dump_cleaner/cleanup/cleaning_steps/add_repetition_suffix.rb +23 -0
  18. data/lib/dump_cleaner/cleanup/cleaning_steps/base.rb +33 -0
  19. data/lib/dump_cleaner/cleanup/cleaning_steps/fill_up_with_string.rb +20 -0
  20. data/lib/dump_cleaner/cleanup/cleaning_steps/generate_random_string.rb +37 -0
  21. data/lib/dump_cleaner/cleanup/cleaning_steps/inspect_context.rb +16 -0
  22. data/lib/dump_cleaner/cleanup/cleaning_steps/randomize_email.rb +78 -0
  23. data/lib/dump_cleaner/cleanup/cleaning_steps/randomize_formatted_number.rb +63 -0
  24. data/lib/dump_cleaner/cleanup/cleaning_steps/randomize_number.rb +29 -0
  25. data/lib/dump_cleaner/cleanup/cleaning_steps/select_data_by_bytesize.rb +17 -0
  26. data/lib/dump_cleaner/cleanup/cleaning_steps/select_data_by_pattern.rb +20 -0
  27. data/lib/dump_cleaner/cleanup/cleaning_steps/take_sample.rb +28 -0
  28. data/lib/dump_cleaner/cleanup/data_source.rb +19 -0
  29. data/lib/dump_cleaner/cleanup/data_source_steps/base.rb +26 -0
  30. data/lib/dump_cleaner/cleanup/data_source_steps/group_by_bytesize.rb +37 -0
  31. data/lib/dump_cleaner/cleanup/data_source_steps/inspect_context.rb +16 -0
  32. data/lib/dump_cleaner/cleanup/data_source_steps/load_yaml_file.rb +24 -0
  33. data/lib/dump_cleaner/cleanup/data_source_steps/remove_accents.rb +29 -0
  34. data/lib/dump_cleaner/cleanup/inspection.rb +37 -0
  35. data/lib/dump_cleaner/cleanup/step_context.rb +46 -0
  36. data/lib/dump_cleaner/cleanup/uniqueness.rb +66 -0
  37. data/lib/dump_cleaner/cleanup/workflow.rb +38 -0
  38. data/lib/dump_cleaner/conditions.rb +42 -0
  39. data/lib/dump_cleaner/config.rb +109 -0
  40. data/lib/dump_cleaner/log.rb +42 -0
  41. data/lib/dump_cleaner/options.rb +46 -0
  42. data/lib/dump_cleaner/processor.rb +37 -0
  43. data/lib/dump_cleaner/version.rb +5 -0
  44. data/lib/dump_cleaner.rb +10 -0
  45. metadata +105 -0
@@ -0,0 +1,1400 @@
1
+ # Workflow steps
2
+
3
+ This is the reference documentation for the individual [workflow steps](/README.md#how-does-dumpcleaner-work) of [DumpCleaner](/README.md).
4
+
5
+ #### [Data source steps](#data-source-steps-1)
6
+
7
+ step | main purpose
8
+ ---------|--------
9
+ [GroupByBytesize](#groupbybytesize) | groups data by length and byte size
10
+ [InspectContext](#inspectcontext) | shows debug information about the current step context
11
+ [LoadYamlFile](#loadyamlfile) | loads a YAML file into data
12
+ [RemoveAccents](#removeaccents) | removes accents from the data
13
+
14
+ #### [Cleaning steps](#cleaning-steps-1)
15
+
16
+ step | main purpose
17
+ ---------|--------
18
+ [AddRepetitionSuffix](#addrepetitionsuffix) | ensures a unique value by adding a repetition suffix to it
19
+ [FillUpWithString](#fillupwithstring) | fills the value with a static string
20
+ [GenerateRandomString](#generaterandomstring) | replaces the value with a random string
21
+ [InspectContext](#inspectcontext-1) | shows debug information about the current step context
22
+ [RandomizeEmail](#randomizeemail) | randomizes parts of an email address
23
+ [RandomizeFormattedNumber](#randomizeformattednumber) | randomizes parts of a formatted number
24
+ [RandomizeNumber](#randomizenumber) | randomly shifts a number to a certain extent
25
+ [SelectDataByBytesize](#selectdatabybytesize) | selects a data subset by value length and byte size
26
+ [SelectDataByPattern](#selectdatabypattern) | selects a data subset by matching a pattern against the value
27
+ [TakeSample](#takesample) | takes a random sample from the data
28
+
29
+ #### [Failure steps](#failure-steps-1)
30
+
31
+ ## Data source steps
32
+
33
+ The data source steps **prepare the data** (**”cleanup data“**) that can be later used in the [cleaning workflows](#cleaning-steps-1). These steps are linked to the [`cleanup_type`](/README.md#cleanup_types) itself and have no notion of the individual records in the dump table data. The data output by the last step in the data source workflow is **cached** and reused as the initial data when running the cleaning workflows for the same type.
34
+
35
+
36
+ Steps are listed in alphabetical order:
37
+
38
+ ### [GroupByBytesize](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/group_by_bytesize.rb)
39
+
40
+ This step takes the cleanup data as a list of values (typically loaded from a dictionary file) and groups them according to their length and byte size (the number of bytes needed to encode the values). It processes the data into a hash, where the keys are constructed as `"<length>-<bytesize>"` strings. It is supposed to be used together with the [`SelectDataByBytesize`](#selectdatabybytesize) cleaning step.
41
+
42
+ The grouping has two goals:
43
+ - it allows replacing source data with random data of the same length which adds fidelity to the cleaned up data (long values stay long and vice versa),
44
+ - it allows obeying the technical limits of the MySQL Shell dump format which separates larger data into ”chunks“ and annotates them using byte-size-indexed files; DumpCleaner does not attempt to update these index files, thus it needs to keep the byte size of the values untouched.
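The grouping described above can be sketched in a few lines of Ruby. This is only an illustration of the `"<length>-<bytesize>"` key scheme, not DumpCleaner's actual implementation; the helper name `group_by_bytesize` is hypothetical.

```ruby
# Hypothetical helper: groups values under "<length>-<bytesize>" keys.
def group_by_bytesize(values)
  values.group_by { |value| "#{value.length}-#{value.bytesize}" }
end

grouped = group_by_bytesize(["newspaper", "show", "rest", "résumé"])
# "résumé" has 6 characters but takes 8 bytes in UTF-8, so it is stored under "6-8".
```

Keeping both numbers in the key means a replacement word can match the original in visible length and in encoded size at the same time.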
45
+
46
+ #### Params:
47
+
48
+ - `under_keys`: notifies DumpCleaner that the cleanup data is not a list but actually a hash of multiple lists and that the grouping should be done only in lists under the specified keys of the data hash. This is useful in cases when the cleanup data needs to hold multiple unrelated lists of values.
49
+
50
+ #### Examples:
51
+
52
+ <table>
53
+ <tr><th>configuration</th><th>input data</th><th>output data</th></tr>
54
+ <tbody>
55
+ <tr>
56
+ <td>
57
+
58
+ ```yaml
59
+ - step: GroupByBytesize
60
+ ```
61
+ </td>
62
+ <td>
63
+
64
+ ```
65
+ ["newspaper", "show", "rest", "résumé"]
66
+ ```
67
+ </td>
68
+ <td>
69
+
70
+ ```
71
+ {
72
+ "9-9" => ["newspaper"],
73
+ "4-4" => ["show", "rest"],
74
+ "6-8" => ["résumé"]
75
+ }
76
+ ```
77
+ </td>
78
+ </tr>
79
+ <tr>
80
+ <td>
81
+
82
+ ```yaml
83
+ - step: GroupByBytesize
84
+   params:
85
+     under_keys:
86
+       - words
87
+ ```
88
+ </td>
89
+ <td>
90
+
91
+ ```
92
+ {
93
+ "words" => ["newspaper", "show", "rest", "résumé"],
94
+ "domains" => ["gmail.com", "example.com"]
95
+ }
96
+
97
+ ```
98
+ </td>
99
+ <td>
100
+
101
+ ```
102
+ {
103
+ "words" => {
104
+ "9-9" => ["newspaper"],
105
+ "4-4" => ["show", "rest"],
106
+ "6-8" => ["résumé"]
107
+ },
108
+ "domains" => ["gmail.com", "example.com"]
109
+ }
110
+ ```
111
+ </td>
112
+ </tr>
113
+ </tbody>
114
+ </table>
115
+
116
+ ### [InspectContext](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/inspect_context.rb)
117
+
118
+ This is purely a debugging step that makes DumpCleaner print the current step context. The step context includes:
119
+
120
+ - `type`: the [cleanup type](/README.md#cleanup_types) that this step is working with
121
+ - `record`: the record context taken from the current record (see the `record_context_columns` option above)
122
+ - `cleanup_data`: the data available for the step (only a subset of all data is shown here)
123
+ - `repetition`: the current iteration in the uniqueness loop.
124
+
125
+ ### [LoadYamlFile](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/load_yaml_file.rb)
126
+
127
+ This is usually the first step in the ”data source“ workflow. It loads some data from a YAML file (a ”dictionary“) and returns it. The data can be, in general, any YAML-supported structure, but most commonly it will be a list of string values or a hash of multiple such lists.
128
+
129
+ Care should be taken when loading string data from dictionaries. At least the [words that have a special meaning](https://yaml.org/type/bool.html) in YAML, such as "no" or "n", must be quoted; otherwise the cleanup process will likely fail.
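The quoting issue can be reproduced with Ruby's standard YAML library alone. The snippet below is a small demonstration with made-up dictionary content; it does not use DumpCleaner itself.

```ruby
require "yaml"

# An unquoted `no` is resolved as a boolean by Ruby's YAML parser,
# while the quoted form stays a string.
unquoted = YAML.load("- no\n- word\n")
quoted   = YAML.load("- \"no\"\n- word\n")
```

Here `unquoted` contains `false` instead of a string, which is the kind of value a string-based cleaning step cannot work with.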
130
+
131
+ #### Params:
132
+
133
+ - `file`: specifies the path to the YAML file; this is a mandatory parameter.
134
+ - `under_key`: optionally makes the step put the loaded data into a hash under the specified key instead of returning the loaded data itself. This is useful for grabbing multiple value lists from different dictionary files.
135
+
136
+ #### Examples:
137
+
138
+ <table>
139
+ <tr><th>configuration</th><th>input data</th><th>output data</th></tr>
140
+ <tbody>
141
+ <tr>
142
+ <td>
143
+
144
+ ```yaml
145
+ - step: LoadYamlFile
146
+   params:
147
+     file: some_file.yml
148
+ ```
149
+
150
+ ```yaml
151
+ # some_file.yml:
152
+ - words
153
+ - to
154
+ - load
155
+ ```
156
+ </td>
157
+ <td>
158
+
159
+ `nil` (or just anything)
160
+ </td>
161
+ <td>
162
+
163
+ ```
164
+ ["words", "to", "load"]
165
+ ```
166
+ </td>
167
+ </tr>
168
+ <tr>
169
+ <td>
170
+
171
+ ```yaml
172
+ - step: LoadYamlFile
173
+   params:
174
+     file: dictionary.yml
175
+     under_key: words
176
+ ```
177
+
178
+ ```yaml
179
+ # dictionary.yml:
180
+ - words
181
+ - to
182
+ - load
183
+ ```
184
+ </td>
185
+ <td>
186
+
187
+ `nil`
188
+ </td>
189
+ <td>
190
+
191
+ ```
192
+ {
193
+ "words" => ["words", "to", "load"]
194
+ }
195
+ ```
196
+ </td>
197
+ </tr>
198
+ <tr>
199
+ <td>
200
+
201
+ ```yaml
202
+ - step: LoadYamlFile
203
+   params:
204
+     file: dictionary.yml
205
+     under_key: words
206
+ ```
207
+
208
+ ```yaml
209
+ # dictionary.yml:
210
+ - words
211
+ - to
212
+ - load
213
+ ```
214
+ </td>
215
+ <td>
216
+
217
+ ```
218
+ {
219
+ "other" => ["some", "other", "words"]
220
+ }
221
+ ```
222
+ </td>
223
+ <td>
224
+
225
+ ```
226
+ {
227
+ "words" => ["words", "to", "load"],
228
+ "other" => ["some", "other", "words"]
229
+ }
230
+ ```
231
+ </td>
232
+ </tr>
233
+ </tbody>
234
+ </table>
235
+
236
+
237
+ ### [RemoveAccents](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/remove_accents.rb)
238
+
239
+ This step uses the [`unicode_normalize`](https://ruby-doc.org/stdlib-2.4.0/libdoc/unicode_normalize/rdoc/String.html) method to remove all accents from all values in the cleanup data; for example, ”naïve“ is converted to ”naive“. This is useful when the same YAML file serves both as a generic random words dictionary and as a dictionary of logins, domains or other words that must not contain accented characters.
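One common way to strip accents with `unicode_normalize` is to decompose the characters and then drop the combining marks. The sketch below assumes that approach; the helper name `remove_accents` is hypothetical and the real step may differ in details.

```ruby
# Hypothetical helper: decompose characters (NFD), then drop the combining marks.
def remove_accents(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

plain = ["naïve", "résumé"].map { |word| remove_accents(word) }
```

Note that the byte size of a word changes when accents are removed, so this step belongs before any grouping by byte size.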
240
+
241
+ #### Params:
242
+
243
+ - `under_keys`: optionally tells the step to only process the list under the specified keys in the cleanup data hash.
244
+
245
+ #### Examples:
246
+
247
+ <table>
248
+ <tr><th>configuration</th><th>input data</th><th>output data</th></tr>
249
+ <tbody>
250
+ <tr>
251
+ <td>
252
+
253
+ ```yaml
254
+ - step: RemoveAccents
255
+ ```
256
+ </td>
257
+ <td>
258
+
259
+ ```
260
+ ["naïve", "résumé"]
261
+ ```
262
+ </td>
263
+ <td>
264
+
265
+ ```
266
+ ["naive", "resume"]
267
+ ```
268
+ </td>
269
+ </tr>
270
+ <tr>
271
+ <td>
272
+
273
+ ```yaml
274
+ - step: RemoveAccents
275
+   params:
276
+     under_keys:
277
+       - accounts
278
+ ```
279
+ </td>
280
+ <td>
281
+
282
+ ```
283
+ {
284
+ "accounts" => ["zoë", "renée"],
285
+ "words" => ["café", "jalapeño"]
286
+ }
287
+ ```
288
+ </td>
289
+ <td>
290
+
291
+ ```
292
+ {
293
+ "accounts" => ["zoe", "renee"],
294
+ "words" => ["café", "jalapeño"]
295
+ }
296
+ ```
297
+ </td>
298
+ </tr>
299
+ </tbody>
300
+ </table>
301
+
302
+ ## Cleaning steps
303
+
304
+ Unlike the data source workflow, steps in the cleaning workflow know about the currently processed record and **can access or modify the currently processed field value**. The first step gets its ”current value“ from the currently processed table record.
305
+
306
+ In general, the steps can either transform the current value or the [cleanup data](#data-source-steps-1) or both. Processing the cleanup data is often useful during the first few workflow steps because it can further prepare the data in relation to the current record / field. Later steps (usually the last one) then, with the help of the prepared data, do the cleaning itself.
+
+ Steps are listed in alphabetical order:
307
+
308
+ ### [AddRepetitionSuffix](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/add_repetition_suffix.rb)
309
+
310
+ This step may replace the end of the current value string with a repetition suffix. This is useful in unique columns when the randomized value conflicts with a value determined for one of the earlier records. As described in the [Uniqueness section](/README.md#unique-values), this step keeps the string length and byte size untouched. The step never adds a suffix for repetition 0, i.e. for the very first iteration of the uniqueness loop or when there is no uniqueness requested in the first place.
311
+
312
+ If the current value is too small to even hold a repetition suffix, it is replaced by a randomly generated alphanumeric string of equal length.
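The behaviour described above can be approximated as follows. This is a rough sketch for single-byte (ASCII) values only; the method name, the `rng` argument and the exact rule for "too small" (value no longer than the suffix) are assumptions made for illustration.

```ruby
SUFFIX_FALLBACK_CHARS = [*"a".."z", *"A".."Z", *"0".."9"].freeze

# Hypothetical sketch: overwrite the end of the value with the repetition number.
def add_repetition_suffix(value, repetition, rng: Random.new)
  return value if repetition.zero?

  suffix = repetition.to_s
  if value.length <= suffix.length
    # Too short to hold a suffix: use a random string of equal length instead.
    return Array.new(value.length) { SUFFIX_FALLBACK_CHARS.sample(random: rng) }.join
  end

  value[0...-suffix.length] + suffix
end
```

Because the suffix overwrites the tail rather than being appended, the length of the value never changes.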
313
+
314
+ #### Examples:
315
+
316
+ <table>
317
+ <tr><th>configuration</th><th>input value</th><th>output value</th></tr>
318
+ <tbody>
319
+ <tr>
320
+ <td>
321
+
322
+ ```yaml
323
+ - step: AddRepetitionSuffix
324
+ ```
325
+ </td>
326
+ <td>
327
+
328
+ ```
329
+ "something"
330
+ ```
331
+
332
+ (when repetition is 0)
333
+ </td>
334
+ <td>
335
+
336
+ ```
337
+ "something"
338
+ ```
339
+ </td>
340
+ </tr>
341
+ <tr>
342
+ <td>
343
+
344
+ ```yaml
345
+ - step: AddRepetitionSuffix
346
+ ```
347
+ </td>
348
+ <td>
349
+
350
+ ```
351
+ "something"
352
+ ```
353
+
354
+ (when repetition is 23)
355
+ </td>
356
+ <td>
357
+
358
+ ```
359
+ "somethi23"
360
+ ```
361
+ </td>
362
+ </tr>
363
+ <tr>
364
+ <td>
365
+
366
+ ```yaml
367
+ - step: AddRepetitionSuffix
368
+ ```
369
+ </td>
370
+ <td>
371
+
372
+ ```
373
+ "a"
374
+ ```
375
+
376
+ (when repetition is 3)
377
+ </td>
378
+ <td>
379
+
380
+ ```
381
+ "M"
382
+ ```
383
+ </td>
384
+ </tr>
385
+ </tbody>
386
+ </table>
387
+
388
+ ### [FillUpWithString](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/fill_up_with_string.rb)
389
+
390
+ This step replaces the current value with a predefined static string. By default, it truncates or prolongs (by repeating) the string so that its byte size is the same as for the original value. If uniqueness is desired for the column, a [repetition suffix](#addrepetitionsuffix) may be added in cases of conflicts.
391
+
392
+ #### Params:
393
+
394
+ - `string`: the string to replace the current value with; it is automatically truncated or prolonged to the desired number of bytes; the default string is `"anonymized <type>"` where `<type>` is the name of the current [`cleanup_type`](/README.md#cleanup_types).
395
+ - `padding`: this parameter should normally be set to a single 1-byte character; by default it is a space `" "`; this parameter serves two distinct roles:
396
+   1. it serves as the separator when the `string` needs to be prolonged by repeating.
397
+   2. if the truncated or prolonged string still does not perfectly fit the desired byte size (due to multi-byte characters in the string and/or original value), the string is padded with the contents of this parameter.
398
+ - `strict_bytesize_check`: if set to true, the step will raise an error if the `string` byte size differs from the byte size of the current value. This is useful for resetting all values of a given table column to the same string and ensuring byte size consistency along the way.
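The interplay of `string` and `padding` can be sketched like this. The method name is hypothetical and the sketch assumes a non-empty `string` and a single 1-byte `padding`; it leaves out `strict_bytesize_check` and the uniqueness handling.

```ruby
# Hypothetical sketch of fitting a static string into the original byte size.
def fill_up_with_string(orig_value, string:, padding: " ")
  target = orig_value.bytesize
  result = string.dup

  # Extend by repeating the string, using the padding as the separator.
  result << padding << string while result.bytesize < target
  # Drop whole characters from the end until the result fits.
  result.chop! while result.bytesize > target
  # A removed multi-byte character may leave a gap; fill it with the padding.
  result << padding while result.bytesize < target

  result
end
```

Truncating by characters rather than by bytes avoids cutting a multi-byte character in half, which is why the final padding pass is needed.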
399
+
400
+ #### Examples:
401
+
402
+ <table>
403
+ <tr><th>configuration</th><th>input value</th><th>output value</th></tr>
404
+ <tbody>
405
+ <tr>
406
+ <td>
407
+
408
+ ```yaml
409
+ - step: FillUpWithString
410
+ ```
411
+ </td>
412
+ <td>
413
+
414
+ ```
415
+ "Santiago de León de Caracas"
416
+ ```
417
+ (when cleanup type is "city")
418
+ </td>
419
+ <td>
420
+
421
+ ```
422
+ "anonymized city anonymized c"
423
+ ```
424
+ </td>
425
+ </tr>
426
+ <tr>
427
+ <td>
428
+
429
+ ```yaml
430
+ - step: FillUpWithString
431
+ ```
432
+ </td>
433
+ <td>
434
+
435
+ ```
436
+ "Caracas"
437
+ ```
438
+
439
+ (when cleanup type is "city")
440
+ </td>
441
+ <td>
442
+
443
+ ```
444
+ "anonymi"
445
+ ```
446
+ </td>
447
+ </tr>
448
+ <tr>
449
+ <td>
450
+
451
+ ```yaml
452
+ - step: FillUpWithString
453
+   params:
454
+     string: City
455
+ ```
456
+ </td>
457
+ <td>
458
+
459
+ ```
460
+ "Caracas"
461
+ ```
462
+
463
+ </td>
464
+ <td>
465
+
466
+ ```
467
+ "City Ci"
468
+ ```
469
+ </td>
470
+ </tr>
471
+ <tr>
472
+ <td>
473
+
474
+ ```yaml
475
+ - step: FillUpWithString
476
+   params:
477
+     string: City
478
+     padding: "-"
479
+ ```
480
+ </td>
481
+ <td>
482
+
483
+ ```
484
+ "Caracas"
485
+ ```
486
+
487
+ </td>
488
+ <td>
489
+
490
+ ```
491
+ "City-Ci"
492
+ ```
493
+ </td>
494
+ </tr>
495
+ <tr>
496
+ <td>
497
+
498
+ ```yaml
499
+ - step: FillUpWithString
500
+   params:
501
+     string: ab€
502
+     padding: "-"
503
+ ```
504
+ </td>
505
+ <td>
506
+
507
+ ```
508
+ "abcd"
509
+ ```
510
+
511
+ </td>
512
+ <td>
513
+
514
+ ```
515
+ "ab--"
516
+ ```
517
+ </td>
518
+ </tr>
519
+ <tr>
520
+ <td>
521
+
522
+ ```yaml
523
+ - step: FillUpWithString
524
+   params:
525
+     string: hash
526
+     strict_bytesize_check: true
527
+ ```
528
+ </td>
529
+ <td>
530
+
531
+ ```
532
+ "3aa9177571b27"
533
+ ```
534
+
535
+ </td>
536
+ <td>
537
+
538
+ N/A (an error is raised)
539
+
540
+ </td>
541
+ </tr>
542
+ </tbody>
543
+ </table>
544
+
545
+ ### [GenerateRandomString](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/generate_random_string.rb)
546
+
547
+ This step replaces the current value with a generated random string of the same byte size. The randomness is [deterministic](/README.md#randomization-is-deterministic). By default, the random string will consist of alphanumeric characters.
548
+
549
+ #### Params:
550
+
551
+ - `character_set`: determines the character set to be used when generating the string. Currently supported predefined values are:
552
+   - `alphanumeric`: the default - lower- and uppercase letters and numbers
553
+   - `alpha`: lower- and uppercase letters only
554
+   - `lowercase`: lowercase letters only
555
+   - `uppercase`: uppercase letters only
556
+   - `numeric`: numbers only
557
+
558
+ Or, the character set may be passed in explicitly as an array of characters.
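A minimal sketch of a repeatable random string generator follows. Seeding the generator from a digest of the original value is an assumption made here to obtain deterministic output; the constant and method names are hypothetical and DumpCleaner may derive its seed differently.

```ruby
require "digest"

CHARACTER_SETS = {
  "alphanumeric" => [*"a".."z", *"A".."Z", *"0".."9"],
  "alpha" => [*"a".."z", *"A".."Z"],
  "lowercase" => [*"a".."z"],
  "uppercase" => [*"A".."Z"],
  "numeric" => [*"0".."9"]
}.freeze

# Hypothetical sketch: the seed is derived from the original value (an assumption),
# so the same input always produces the same output.
def generate_random_string(orig_value, character_set: "alphanumeric")
  chars = character_set.is_a?(Array) ? character_set : CHARACTER_SETS.fetch(character_set)
  rng = Random.new(Digest::SHA256.hexdigest(orig_value).to_i(16))
  # One 1-byte character per original byte keeps the byte size unchanged.
  Array.new(orig_value.bytesize) { chars.sample(random: rng) }.join
end
```

The sketch only preserves the byte size when every character in the set is a single byte.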
559
+
560
+ #### Examples:
561
+
562
+ <table>
563
+ <tr><th>configuration</th><th>input value</th><th>output value</th></tr>
564
+ <tbody>
565
+ <tr>
566
+ <td>
567
+
568
+ ```yaml
569
+ - step: GenerateRandomString
570
+ ```
571
+ </td>
572
+ <td>
573
+
574
+ ```
575
+ "something"
576
+ ```
577
+ </td>
578
+ <td>
579
+
580
+ ```
581
+ "QXSq31BxY"
582
+ ```
583
+ </td>
584
+ </tr>
585
+ <tr>
586
+ <td>
587
+
588
+ ```yaml
589
+ - step: GenerateRandomString
590
+   params:
591
+     character_set: alpha
592
+ ```
593
+ </td>
594
+ <td>
595
+
596
+ ```
597
+ "something"
598
+ ```
599
+ </td>
600
+ <td>
601
+
602
+ ```
603
+ "sktfNNGQZ"
604
+ ```
605
+ </td>
606
+ </tr>
607
+ <tr>
608
+ <td>
609
+
610
+ ```yaml
611
+ - step: GenerateRandomString
612
+   params:
613
+     character_set: lowercase
614
+ ```
615
+ </td>
616
+ <td>
617
+
618
+ ```
619
+ "something"
620
+ ```
621
+ </td>
622
+ <td>
623
+
624
+ ```
625
+ "suyqbynno"
626
+ ```
627
+ </td>
628
+ </tr>
629
+ <tr>
630
+ <td>
631
+
632
+ ```yaml
633
+ - step: GenerateRandomString
634
+   params:
635
+     character_set: uppercase
636
+ ```
637
+ </td>
638
+ <td>
639
+
640
+ ```
641
+ "something"
642
+ ```
643
+ </td>
644
+ <td>
645
+
646
+ ```
647
+ "SUYQBYNNO"
648
+ ```
649
+ </td>
650
+ </tr>
651
+ <tr>
652
+ <td>
653
+
654
+ ```yaml
655
+ - step: GenerateRandomString
656
+   params:
657
+     character_set: numeric
658
+ ```
659
+ </td>
660
+ <td>
661
+
662
+ ```
663
+ "something"
664
+ ```
665
+ </td>
666
+ <td>
667
+
668
+ ```
669
+ "098877228"
670
+ ```
671
+ </td>
672
+ </tr>
673
+ <tr>
674
+ <td>
675
+
676
+ ```yaml
677
+ - step: GenerateRandomString
678
+   params:
679
+     character_set: ["a", "b", "c"]
680
+ ```
681
+ </td>
682
+ <td>
683
+
684
+ ```
685
+ "something"
686
+ ```
687
+ </td>
688
+ <td>
689
+
690
+ ```
691
+ "bccccaaca"
692
+ ```
693
+ </td>
694
+ </tr>
695
+ </tbody>
696
+ </table>
697
+
698
+ ### [InspectContext](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/inspect_context.rb)
699
+
700
+ This is purely a debugging step that makes DumpCleaner print the current step context. The step context includes:
701
+
702
+ - `orig_value`: original value taken from the table record field
703
+ - `current_value`: the running state of the result value in the current workflow
704
+ - `type`: the [cleanup type](/README.md#cleanup_types) that this step is working with
705
+ - `record`: the record context taken from the current record (see the `record_context_columns` option above)
706
+ - `cleanup_data`: the data available for the step (only a subset of all data is shown here)
707
+ - `repetition`: the current iteration in the uniqueness loop.
708
+
709
+ ### [RandomizeEmail](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/randomize_email.rb)
710
+
711
+ This step tries to randomize an email address in a clever, high-fidelity way. In general, it replaces both the mailbox part as well as the domain of the email address with random words taken from a dictionary. For the mailbox part, it keeps the dots (`"."`) in it and randomizes the words surrounding them.
712
+
713
+ If finding a suitable word from the dictionary fails (e.g. there is no word with the necessary byte size present in it), the step generates a random string of the proper size instead.
714
+
715
+ By default, the step keeps the TLD part of the domain. Optionally, it can keep the whole domain, which may be convenient for well-known domains such as gmail.com or example.com. Note that the step makes no particular effort to guarantee that the generated address is not a valid, existing email address.
716
+
717
+ In case unique column values are requested, the step may add a [repetition suffix](#addrepetitionsuffix) to the randomized parts of the email address.
718
+
719
+ If an invalid email address is encountered, a warning is logged and the processing switches to the [failure workflow](#failure-steps-1).
720
+
721
+ #### Params:
722
+
723
+ - `domains_to_keep_data_key`: the key in the cleanup data hash which contains the list of well-known domains; email addresses from such domains will keep the domain part intact; it may be set to `nil` or an empty string, in which case the step will always replace even the domain name with a random word; default value: `domains_to_keep`.
724
+ - `words_data_key`: the key in the cleanup data hash which contains the list of words to take random samples from; the list must be grouped by the byte size (use [`GroupByBytesize`](#groupbybytesize) with the `under_keys` param); default value: `words`.
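The core idea can be sketched as below. This is a simplified illustration with a hypothetical method name and hard-coded default data keys; handling of invalid addresses, uniqueness and the random-string fallback for missing dictionary words is left out.

```ruby
# Hypothetical sketch: replace mailbox words and the domain name with
# dictionary words of the same length and byte size.
def randomize_email(email, cleanup_data, rng: Random.new)
  words = cleanup_data.fetch("words")
  domains_to_keep = cleanup_data.fetch("domains_to_keep", [])
  pick = ->(part) { words.fetch("#{part.length}-#{part.bytesize}").sample(random: rng) }

  mailbox, domain = email.split("@", 2)
  # Keep the dots in the mailbox and replace the words around them.
  new_mailbox = mailbox.gsub(/[^.]+/) { |part| pick.call(part) }

  unless domains_to_keep.include?(domain)
    # Keep the TLD and replace the rest of the domain.
    name, dot, tld = domain.rpartition(".")
    domain = pick.call(name) + dot + tld
  end

  "#{new_mailbox}@#{domain}"
end
```

In this sketch a missing `"<length>-<bytesize>"` group raises a `KeyError`, whereas the step described above falls back to a random string.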
725
+
726
+ #### Examples:
727
+
728
+ <table>
729
+ <tr><th>configuration</th><th>cleanup data</th><th>input value</th><th>output value</th></tr>
730
+ <tbody>
731
+ <tr>
732
+ <td>
733
+
734
+ ```yaml
735
+ - step: RandomizeEmail
736
+ ```
737
+ </td>
738
+ <td>
739
+
740
+ ```ruby
741
+ {
742
+ "domains_to_keep" => ["gmail.com", "example.com"],
743
+ "words" => {
744
+ "6-6" => ["sunfly", "echoes"],
745
+ "7-7" => ["warless", "cadence"]
746
+ }
747
+ }
748
+ ```
749
+ </td>
750
+ <td>
751
+
752
+ ```
753
+ "evelyn@adomain.com"
754
+ ```
755
+ </td>
756
+ <td>
757
+
758
+ ```
759
+ "sunfly@cadence.com"
760
+ ```
761
+ </td>
762
+ </tr>
763
+ <tr>
764
+ <td>
765
+
766
+ ```yaml
767
+ - step: RandomizeEmail
768
+ ```
769
+ </td>
770
+ <td>
771
+
772
+ ```ruby
773
+ {
774
+ "domains_to_keep" => ["gmail.com", "example.com"],
775
+ "words" => {
776
+ "5-5" => ["waste", "octet"],
777
+ "6-6" => ["sunfly", "echoes"]
778
+ }
779
+ }
780
+ ```
781
+ </td>
782
+ <td>
783
+
784
+ ```
785
+ "evelyn.cohen@gmail.com"
786
+ ```
787
+ </td>
788
+ <td>
789
+
790
+ ```
791
+ "sunfly.octet@gmail.com"
792
+ ```
793
+ </td>
794
+ </tr>
795
+ <tr>
796
+ <td>
797
+
798
+ ```yaml
799
+ - step: RandomizeEmail
800
+ ```
801
+ </td>
802
+ <td>
803
+
804
+ ```ruby
805
+ {
806
+ "domains_to_keep" => ["gmail.com", "example.com"],
807
+ "words" => {
808
+ "6-6" => ["sunfly", "echoes"],
809
+ "7-7" => ["warless", "cadence"]
810
+ }
811
+ }
812
+ ```
813
+ </td>
814
+ <td>
815
+
816
+ ```
817
+ "evelyn@adomain.com"
818
+ ```
819
+ </td>
820
+ <td>
821
+
822
+ ```
823
+ "sunfly@cadence.com"
824
+ ```
825
+ </td>
826
+ </tr>
827
+ <tr>
828
+ <td>
829
+
830
+ ```yaml
831
+ - step: RandomizeEmail
832
+   params:
833
+     domains_to_keep_data_key: "domains"
834
+     words_data_key: "dictionary"
835
+ ```
836
+ </td>
837
+ <td>
838
+
839
+ ```ruby
840
+ {
841
+ "domains" => ["gmail.com", "example.com"],
842
+ "dictionary" => {
843
+ "6-6" => ["sunfly", "echoes"],
844
+ "7-7" => ["warless", "cadence"]
845
+ }
846
+ }
847
+ ```
848
+ </td>
849
+ <td>
850
+
851
+ ```
852
+ "evelyn@adomain.com"
853
+ ```
854
+ </td>
855
+ <td>
856
+
857
+ ```
858
+ "sunfly@cadence.com"
859
+ ```
860
+ </td>
861
+ </tr>
862
+ <tr>
863
+ <td>
864
+
865
+ ```yaml
866
+ - step: RandomizeEmail
867
+   params:
868
+     domains_to_keep_data_key: ""
869
+ ```
870
+ </td>
871
+ <td>
872
+
873
+ ```ruby
874
+ {
875
+ "words" => {
876
+ "5-5" => ["waste", "octet"],
877
+ "6-6" => ["sunfly", "echoes"]
878
+ }
879
+ }
880
+ ```
881
+ </td>
882
+ <td>
883
+
884
+ ```
885
+ "evelyn@gmail.com"
886
+ ```
887
+ </td>
888
+ <td>
889
+
890
+ ```
891
+ "sunfly@waste.com"
892
+ ```
893
+ </td>
894
+ </tr>
895
+ </tbody>
896
+ </table>
897
+
898
+ ### [RandomizeFormattedNumber](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/randomize_formatted_number.rb)
899
+
900
+ This step randomizes the specified parts of a formatted number. It can be used for randomizing the last N digits of a number, a phone number, an IP address etc. The step requires a regular expression to be passed in, specifying the [named match groups](https://ruby-doc.org/3.3.2/Regexp.html#class-Regexp-label-Named+Captures) to be randomized or kept intact. The randomized parts of the formatted value are replaced by a random number with a corresponding number of digits.
901
+
902
+ If an invalid formatted number is encountered (i.e. a number which does not match the regexp), a warning is logged and the processing switches to the [failure workflow](#failure-steps-1).
903
+
904
+ #### Params:
905
+
906
+ - `format`: the regexp whose [named match groups](https://ruby-doc.org/3.3.2/Regexp.html#class-Regexp-label-Named+Captures) determine which parts of the value will be replaced by a random number and which will be kept. The match groups in the regexp must cover the whole formatted value, otherwise the length of the resulting value won’t match the original. The parts to be randomized must be covered by match groups named `x…` (beginning with the letter `x`); other parts must be covered by match groups whose names begin with a different letter. Each match group must have a unique name.
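The role of the `x…` groups can be illustrated with Ruby's `MatchData#named_captures`. The sketch below is hypothetical: the method name, the `rng` argument, anchoring the pattern with `\A` and `\z`, and returning `nil` for a non-matching value are all assumptions made for illustration.

```ruby
# Hypothetical sketch: groups named x… get random digits, the others are kept.
def randomize_formatted_number(value, pattern, rng: Random.new)
  match = Regexp.new("\\A#{pattern}\\z").match(value)
  return nil unless match # the step described above switches to the failure workflow

  match.named_captures.map do |name, part|
    next part unless name.start_with?("x")

    # Replace the part with a zero-padded random number of the same width.
    format("%0#{part.length}d", rng.rand(10**part.length))
  end.join
end

phone = randomize_formatted_number(
  "1-123-456-789",
  '(?<front>\d-\d{3}-)(?<x1>\d{3})(?<hyphen>-)(?<x2>\d{3})'
)
```

Joining the captures in order only rebuilds a value of the original length when the groups cover the whole input, which is why the `format` param has that requirement.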
907
+
908
+ #### Examples:
909
+
910
+ <table>
911
+ <tr><th>configuration</th><th>input value</th><th>output value</th></tr>
912
+ <tbody>
913
+ <tr>
914
+ <td>
915
+
916
+ ```yaml
917
+ - step: RandomizeFormattedNumber
918
+   params:
919
+     format: (?<a>[^-]+-)(?<x>\d{3})
920
+ ```
921
+ </td>
922
+ <td>
923
+
924
+ ```
925
+ "anything-123"
926
+ ```
927
+ </td>
928
+ <td>
929
+
930
+ ```
931
+ "anything-879"
932
+ ```
933
+ </td>
934
+ </tr>
935
+ <tr>
936
+ <td>
937
+
938
+ ```yaml
939
+ - step: RandomizeFormattedNumber
940
+   params:
941
+     format: (?<front>\d-\d{3}-)(?<x1>\d{3})(?<hyphen>-)(?<x2>\d{3})
942
+ ```
943
+ </td>
944
+ <td>
945
+
946
+ ```
947
+ "1-123-456-789"
948
+ ```
949
+ </td>
950
+ <td>
951
+
952
+ ```
953
+ "1-123-786-802"
954
+ ```
955
+ </td>
956
+ </tr>
957
+ <tr>
958
+ <td>
959
+
960
+ ```yaml
961
+ - step: RandomizeFormattedNumber
962
+   params:
963
+     format: (?<front>\d)(?<x>\d{3})(?<back>.*)
964
+ ```
965
+ </td>
966
+ <td>
967
+
968
+ ```
969
+ "12345"
970
+ ```
971
+ </td>
972
+ <td>
973
+
974
+ ```
975
+ "18865"
976
+ ```
977
+ </td>
978
+ </tr>
979
+ </tbody>
980
+ </table>
981
+
982
+ ### [RandomizeNumber](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/randomize_number.rb)
983
+
984
+ The goal of this step is to randomize a floating point or integer number to a certain extent. I.e., the current value is not fully replaced, it is randomly shifted within a certain limit instead. The number is converted to a floating point number before randomization and the final sanitized value is rounded to the same decimal places as the original.
985
+
986
+ Note that the sign of the number is never changed, even in cases when the calculation leads to a number with an opposite sign. This is to keep the byte size of the value the same under all circumstances.
987
+
988
+ #### Params:
989
+
990
+ - `difference_within`: the maximum difference between the original value and the randomized one. The limit is exclusive, i.e. the final difference will always be at least a bit smaller and will never reach the maximum difference itself; this param defaults to 1.0.
991
+
992
+ #### Examples:
993
+
994
+ <table>
995
+ <tr><th>configuration</th><th>input value</th><th>output value</th></tr>
996
+ <tbody>
997
+ <tr>
998
+ <td>
999
+
1000
+ ```yaml
1001
+ - step: RandomizeNumber
1002
+ ```
1003
+ </td>
1004
+ <td>
1005
+
1006
+ ```
1007
+ "123.45"
1008
+ ```
1009
+ </td>
1010
+ <td>
1011
+
1012
+ ```
1013
+ "122.82"
1014
+ ```
1015
+ </td>
1016
+ </tr>
1017
+ <tr>
1018
+ <td>
1019
+
1020
+ ```yaml
1021
+ - step: RandomizeNumber
1022
+   params:
1023
+     difference_within: 10
1024
+ ```
1025
+ </td>
1026
+ <td>
1027
+
1028
+ ```
1029
+ "-123"
1030
+ ```
1031
+ </td>
1032
+ <td>
1033
+
1034
+ ```
1035
+ "-127"
1036
+ ```
1037
+ </td>
1038
+ </tr>
1039
+ <tr>
1040
+ <td>
1041
+
1042
+ ```yaml
1043
+ - step: RandomizeNumber
1044
+ ```
1045
+ </td>
1046
+ <td>
1047
+
1048
+ ```
1049
+ "123"
1050
+ ```
1051
+ </td>
1052
+ <td>
1053
+
1054
+ ```
1055
+ "123"
1056
+ ```
1057
+ (the same due to `difference_within` exclusiveness and no decimal point)
1058
+ </td>
1059
+ </tr>
1060
+ </tbody>
1061
+ </table>
1062
+
1063
+ ### [SelectDataByBytesize](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/select_data_by_bytesize.rb)
1064
+
1065
+ This step expects the cleanup data to be a hash of lists grouped by the byte size (see [GroupByBytesize](#groupbybytesize)) and selects the list of values that have the same length and bytesize as the currently processed value. It does not affect the value, only the cleanup data.
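The selection amounts to a hash lookup with a key built from the current value. A minimal sketch, with a hypothetical helper name:

```ruby
# Hypothetical helper: picks the list stored under the value's "<length>-<bytesize>" key.
def select_data_by_bytesize(cleanup_data, value)
  cleanup_data["#{value.length}-#{value.bytesize}"]
end

cleanup_data = {
  "6-6" => ["sunfly", "echoes"],
  "6-7" => ["Brontë", "exposé"]
}
selected = select_data_by_bytesize(cleanup_data, "frappé")
```

The value "frappé" has 6 characters and 7 bytes in UTF-8, so the lookup uses the key `"6-7"`.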
1066
+
1067
+ #### Examples:
1068
+
1069
+ <table>
1070
+ <tr><th>configuration</th><th>input cleanup data</th><th>value</th><th>output cleanup data</th></tr>
1071
+ <tbody>
1072
+ <tr>
1073
+ <td>
1074
+
1075
+ ```yaml
1076
+ - step: SelectDataByBytesize
1077
+ ```
1078
+ </td>
1079
+ <td>
1080
+
1081
+ ```
1082
+ {
1083
+ "5-5" => ["waste", "octet"],
1084
+ "6-6" => ["sunfly", "echoes"]
1085
+ }
1086
+ ```
1087
+ </td>
1088
+ <td>
1089
+
1090
+ ```
1091
+ "Alice"
1092
+ ```
1093
+ </td>
1094
+ <td>
1095
+
1096
+ ```
1097
+ ["waste", "octet"]
1098
+ ```
1099
+ </td>
1100
+ </tr>
1101
+ <tr>
1102
+ <td>
1103
+
1104
+ ```yaml
1105
+ - step: SelectDataByBytesize
1106
+ ```
1107
+ </td>
1108
+ <td>
1109
+
1110
+ ```
1111
+ {
1112
+ "5-5" => ["waste", "octet"],
1113
+ "6-6" => ["sunfly", "echoes"]
1114
+ "6-7" => ["Brontë", "exposé"]
1115
+ }
1116
+ ```
1117
+ </td>
1118
+ <td>
1119
+
1120
+ ```
1121
+ "frappé"
1122
+ ```
1123
+ </td>
1124
+ <td>
1125
+
1126
+ ```
1127
+ ["Brontë", "exposé"]
1128
+ ```
1129
+ </td>
1130
+ </tr>
1131
+ </tbody>
1132
+ </table>
1133
+
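The lookup can be sketched in a few lines of Ruby. The helper name is hypothetical, and the `"<length>-<bytesize>"` key format is inferred from the examples in the table rather than taken from the gem's source, so treat it as an assumption.

```ruby
# Illustrative sketch only; `select_data_by_bytesize` is a hypothetical helper.
# Assumes keys of the form "<character length>-<byte size>", as in the examples.
def select_data_by_bytesize(cleanup_data, value)
  cleanup_data["#{value.length}-#{value.bytesize}"]
end

cleanup_data = {
  "5-5" => %w[waste octet],
  "6-6" => %w[sunfly echoes],
  "6-7" => %w[Brontë exposé]
}

select_data_by_bytesize(cleanup_data, "Alice")  # => ["waste", "octet"]
select_data_by_bytesize(cleanup_data, "frappé") # => ["Brontë", "exposé"]
```

`"frappé"` has six characters but seven bytes in UTF-8, which is why it selects the `"6-7"` list rather than `"6-6"`.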
+ ### [SelectDataByPattern](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/select_data_by_pattern.rb)
+
+ This step expects the cleanup data to be a hash of lists and selects the list determined by matching the current value against a set of regexp patterns. It does not affect the value, only the cleanup data.
+
+ For example, suppose the cleanup data hash holds two lists of names, male and female, and that gender can be guessed reasonably well from a pattern in a person’s last name (which is indeed possible in some languages). We can then set up a pattern that selects the female list and use `default_key` to fall back to the male list for all other values.
+
+ #### Params:
+
+ - `patterns`: a mandatory parameter that is an array of hashes with the following keys:
+   - `pattern`: the regexp pattern to match the current value against
+   - `flags`: optional [regexp flags](https://docs.ruby-lang.org/en/master/Regexp.html#class-Regexp-label-Modes) (modes), such as `"i"` for case-insensitive matching
+   - `key`: the key in cleanup data to select if the `pattern` matches the current value
+ - `default_key`: this key is selected from the cleanup data hash if no `pattern` matches the current value (default: `nil`).
+
+ #### Examples:
+
+ <table>
+ <tr><th>configuration</th><th>input cleanup data</th><th>value</th><th>output cleanup data</th></tr>
+ <tbody>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: SelectDataByPattern
+   params:
+     patterns:
+       - pattern: (ful|tic|ous)$
+         key: adjectives
+       - pattern: (ly|ward|wise)$
+         key: adverbs
+     default_key: words
+ ```
+ </td>
+ <td>
+
+ ```
+ {
+   "adjectives" => ["porous", "calm", "athletic"],
+   "adverbs" => ["softly", "outward", "likewise"],
+   "words" => ["other", "random", "words"]
+ }
+ ```
+ </td>
+ <td>
+
+ ```
+ "thoroughly"
+ ```
+ </td>
+ <td>
+
+ ```
+ ["softly", "outward", "likewise"]
+ ```
+ </td>
+ </tr>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: SelectDataByPattern
+   params:
+     patterns:
+       - pattern: (ful|tic|ous)$
+         key: adjectives
+       - pattern: (ly|ward|wise)$
+         key: adverbs
+     default_key: words
+ ```
+ </td>
+ <td>
+
+ ```
+ {
+   "adjectives" => ["porous", "calm", "athletic"],
+   "adverbs" => ["softly", "outward", "likewise"],
+   "words" => ["other", "random", "words"]
+ }
+ ```
+ </td>
+ <td>
+
+ ```
+ "second"
+ ```
+ </td>
+ <td>
+
+ ```
+ ["other", "random", "words"]
+ ```
+ </td>
+ </tr>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: SelectDataByPattern
+   params:
+     patterns:
+       - pattern: (ful|tic|ous)$
+         key: adjectives
+         flags: i
+ ```
+ </td>
+ <td>
+
+ ```
+ {
+   "adjectives" => ["porous", "calm", "athletic"],
+   "adverbs" => ["softly", "outward", "likewise"]
+ }
+ ```
+ </td>
+ <td>
+
+ ```
+ "PRECIOUS"
+ ```
+ </td>
+ <td>
+
+ ```
+ ["porous", "calm", "athletic"]
+ ```
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
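The selection logic described above might look roughly like the following Ruby sketch. The helper name, the `FLAG_BITS` table, and the first-match-wins ordering are assumptions made for illustration; the gem's own code may differ. The pattern hashes use string keys, as they would after loading the YAML configuration.

```ruby
# Illustrative sketch only; names below are hypothetical, not the gem's API.
FLAG_BITS = {
  "i" => Regexp::IGNORECASE,
  "m" => Regexp::MULTILINE,
  "x" => Regexp::EXTENDED
}.freeze

def select_data_by_pattern(cleanup_data, value, patterns:, default_key: nil)
  patterns.each do |entry|
    # Translate a flags string such as "i" or "im" into Regexp option bits.
    options = entry.fetch("flags", "").each_char.reduce(0) do |bits, flag|
      bits | FLAG_BITS.fetch(flag)
    end
    regexp = Regexp.new(entry.fetch("pattern"), options)
    return cleanup_data[entry.fetch("key")] if regexp.match?(value)
  end

  # No pattern matched: fall back to the default key (nil when none is set).
  cleanup_data[default_key]
end

cleanup_data = {
  "adjectives" => %w[porous calm athletic],
  "adverbs" => %w[softly outward likewise],
  "words" => %w[other random words]
}
patterns = [
  { "pattern" => "(ful|tic|ous)$", "key" => "adjectives", "flags" => "i" },
  { "pattern" => "(ly|ward|wise)$", "key" => "adverbs" }
]

select_data_by_pattern(cleanup_data, "PRECIOUS", patterns: patterns)
# => ["porous", "calm", "athletic"]
select_data_by_pattern(cleanup_data, "second", patterns: patterns, default_key: "words")
# => ["other", "random", "words"]
```

In this sketch, a value that matches no pattern while `default_key` is left at `nil` yields `nil` cleanup data.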
+ ### [TakeSample](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/take_sample.rb)
+
+ This step takes a random sample from a list in the cleanup data. If uniqueness is requested for the currently processed value, the step can either retry taking a sample or add a repetition suffix.
+
+ If the cleanup data is empty or missing, the procedure switches to the [failure workflow](#failure-steps-1).
+
+ #### Params:
+
+ - `uniqueness_strategy`: the strategy this step undertakes when it hits a conflicting value while uniqueness is desired:
+   - `resample`: the step takes another random sample from the same data; this is the default
+   - `suffix`: the step adds a [repetition suffix](#addrepetitionsuffix) to the sample taken in the first repetition loop.
+
+ #### Examples:
+
+ <table>
+ <tr><th>configuration</th><th>cleanup data</th><th>input value</th><th>output value</th></tr>
+ <tbody>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: TakeSample
+ ```
+ </td>
+ <td>
+
+ ```ruby
+ ["sunfly", "echoes", "talent"]
+ ```
+ </td>
+ <td>
+
+ ```
+ "jaguar"
+ ```
+ (when repetition is: 0)
+ </td>
+ <td>
+
+ ```
+ "talent"
+ ```
+ </td>
+ </tr>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: TakeSample
+ ```
+ </td>
+ <td>
+
+ ```ruby
+ ["sunfly", "echoes", "talent"]
+ ```
+ </td>
+ <td>
+
+ ```
+ "jaguar"
+ ```
+ (when repetition is: 1)
+ </td>
+ <td>
+
+ ```
+ "echoes"
+ ```
+ </td>
+ </tr>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: TakeSample
+   params:
+     uniqueness_strategy: suffix
+ ```
+ </td>
+ <td>
+
+ ```ruby
+ ["sunfly", "echoes", "talent"]
+ ```
+ </td>
+ <td>
+
+ ```
+ "jaguar"
+ ```
+ (when repetition is: 0)
+ </td>
+ <td>
+
+ ```
+ "talent"
+ ```
+ </td>
+ </tr>
+ <tr>
+ <td>
+
+ ```yaml
+ - step: TakeSample
+   params:
+     uniqueness_strategy: suffix
+ ```
+ </td>
+ <td>
+
+ ```ruby
+ ["sunfly", "echoes", "talent"]
+ ```
+ </td>
+ <td>
+
+ ```
+ "jaguar"
+ ```
+ (when repetition is: 1)
+ </td>
+ <td>
+
+ ```
+ "talen1"
+ ```
+ </td>
+ </tr>
+ </tbody>
+ </table>
+
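The difference between the two uniqueness strategies can be sketched in Ruby as follows. Everything here is an assumption for illustration: the `take_sample` helper, the explicit `seed:` argument used to make the first sample reproducible, and the way the suffix overwrites the tail of the sample to keep its length (mirroring the `"talen1"` example). The gem's actual seeding and suffix logic may work differently.

```ruby
# Illustrative sketch only; `take_sample` and its arguments are hypothetical.
def take_sample(list, seed:, repetition: 0, uniqueness_strategy: "resample")
  # Empty or missing data: a real workflow would switch to the failure steps.
  return nil if list.nil? || list.empty?

  case uniqueness_strategy
  when "resample"
    # A different generator per repetition, so a retry may pick another item.
    list.sample(random: Random.new(seed + repetition))
  when "suffix"
    # Always start from the sample of the first loop (repetition 0).
    sample = list.sample(random: Random.new(seed))
    return sample if repetition.zero?

    # Overwrite the tail with the repetition number to keep the length.
    suffix = repetition.to_s
    keep = [sample.length - suffix.length, 0].max
    sample[0, keep] + suffix
  else
    raise ArgumentError, "unknown uniqueness_strategy: #{uniqueness_strategy}"
  end
end
```

With `resample`, a retry may still return the same item by chance, so the surrounding uniqueness loop would simply try again. With `suffix`, each repetition yields a distinct value derived from the first sample.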
+ ## Failure steps
+
+ In general, any step from the [Cleaning workflow](#cleaning-steps-1) may be used as a Failure workflow step as well. In practice, some of the steps are used more commonly than others here.
+
+ If even the failure workflow fails to return a value (that is, it returns `nil`), the behavior is not fully specified. An error is logged and a blank value is probably written to the destination dump, which may lead to some “data corruption” warnings when re-importing the dump.