data-anonymization 0.3.0 → 0.5.0

Files changed (86)
  1. data/.gitignore +2 -1
  2. data/.rvmrc +1 -1
  3. data/.travis.yml +2 -0
  4. data/Gemfile +2 -0
  5. data/README.md +295 -258
  6. data/bin/datanon +57 -0
  7. data/data-anonymization.gemspec +2 -1
  8. data/examples/blacklist_dsl.rb +42 -0
  9. data/examples/mongodb_blacklist_dsl.rb +38 -0
  10. data/examples/mongodb_whitelist_dsl.rb +44 -0
  11. data/examples/whitelist_dsl.rb +63 -0
  12. data/lib/core/database.rb +21 -3
  13. data/lib/core/field.rb +5 -2
  14. data/lib/core/fields_missing_strategy.rb +30 -0
  15. data/lib/core/table_errors.rb +32 -0
  16. data/lib/data-anonymization.rb +11 -0
  17. data/lib/parallel/table.rb +8 -1
  18. data/lib/strategy/base.rb +35 -14
  19. data/lib/strategy/blacklist.rb +1 -1
  20. data/lib/strategy/field/anonymize_array.rb +28 -0
  21. data/lib/strategy/field/contact/random_address.rb +12 -0
  22. data/lib/strategy/field/contact/random_city.rb +12 -0
  23. data/lib/strategy/field/contact/random_phone_number.rb +4 -0
  24. data/lib/strategy/field/contact/random_province.rb +12 -0
  25. data/lib/strategy/field/contact/random_zipcode.rb +12 -0
  26. data/lib/strategy/field/datetime/anonymize_date.rb +15 -0
  27. data/lib/strategy/field/datetime/anonymize_datetime.rb +19 -0
  28. data/lib/strategy/field/datetime/anonymize_time.rb +19 -0
  29. data/lib/strategy/field/datetime/date_delta.rb +10 -0
  30. data/lib/strategy/field/datetime/date_time_delta.rb +9 -0
  31. data/lib/strategy/field/datetime/time_delta.rb +8 -0
  32. data/lib/strategy/field/default_anon.rb +4 -1
  33. data/lib/strategy/field/email/gmail_template.rb +8 -0
  34. data/lib/strategy/field/email/random_email.rb +7 -0
  35. data/lib/strategy/field/email/random_mailinator_email.rb +5 -0
  36. data/lib/strategy/field/fields.rb +4 -0
  37. data/lib/strategy/field/name/random_first_name.rb +10 -0
  38. data/lib/strategy/field/name/random_full_name.rb +10 -2
  39. data/lib/strategy/field/name/random_last_name.rb +9 -0
  40. data/lib/strategy/field/name/random_user_name.rb +5 -0
  41. data/lib/strategy/field/number/random_big_decimal_delta.rb +6 -0
  42. data/lib/strategy/field/number/random_float.rb +4 -0
  43. data/lib/strategy/field/number/random_float_delta.rb +6 -0
  44. data/lib/strategy/field/number/random_integer.rb +4 -0
  45. data/lib/strategy/field/number/random_integer_delta.rb +6 -0
  46. data/lib/strategy/field/string/formatted_string_numbers.rb +10 -6
  47. data/lib/strategy/field/string/lorem_ipsum.rb +9 -0
  48. data/lib/strategy/field/string/random_formatted_string.rb +39 -0
  49. data/lib/strategy/field/string/random_string.rb +6 -0
  50. data/lib/strategy/field/string/random_url.rb +7 -1
  51. data/lib/strategy/field/string/select_from_database.rb +7 -5
  52. data/lib/strategy/field/string/select_from_file.rb +7 -0
  53. data/lib/strategy/field/string/select_from_list.rb +8 -0
  54. data/lib/strategy/field/string/string_template.rb +11 -0
  55. data/lib/strategy/mongodb/anonymize_field.rb +44 -0
  56. data/lib/strategy/mongodb/blacklist.rb +29 -0
  57. data/lib/strategy/mongodb/whitelist.rb +62 -0
  58. data/lib/strategy/strategies.rb +10 -1
  59. data/lib/strategy/whitelist.rb +7 -2
  60. data/lib/thor/helpers/mongodb_dsl_generator.rb +66 -0
  61. data/lib/thor/helpers/rdbms_dsl_generator.rb +36 -0
  62. data/lib/thor/templates/mongodb_whitelist_template.erb +15 -0
  63. data/lib/thor/templates/whitelist_template.erb +21 -0
  64. data/lib/utils/database.rb +4 -0
  65. data/lib/utils/parallel_progress_bar.rb +24 -0
  66. data/lib/utils/progress_bar.rb +34 -22
  67. data/lib/utils/random_string.rb +3 -2
  68. data/lib/utils/random_string_chars_only.rb +3 -5
  69. data/lib/utils/template_helper.rb +44 -0
  70. data/lib/version.rb +1 -1
  71. data/spec/acceptance/mongodb_blacklist_spec.rb +75 -0
  72. data/spec/acceptance/mongodb_whitelist_spec.rb +107 -0
  73. data/spec/core/fields_missing_strategy_spec.rb +26 -0
  74. data/spec/strategy/field/name/random_first_name_spec.rb +1 -1
  75. data/spec/strategy/field/name/random_full_name_spec.rb +12 -7
  76. data/spec/strategy/field/name/random_last_name_spec.rb +1 -1
  77. data/spec/strategy/field/string/random_formatted_string_spec.rb +39 -0
  78. data/spec/strategy/field/string/select_from_file_spec.rb +21 -0
  79. data/spec/strategy/mongodb/anonymize_field_spec.rb +52 -0
  80. data/spec/utils/random_float_spec.rb +12 -0
  81. data/spec/utils/random_string_char_only_spec.rb +12 -0
  82. data/spec/utils/template_helper_spec.rb +14 -0
  83. metadata +56 -6
  84. data/blacklist_dsl.rb +0 -17
  85. data/blacklist_nosql_dsl.rb +0 -36
  86. data/whitelist_dsl.rb +0 -42
data/.gitignore CHANGED
@@ -17,4 +17,5 @@ test/version_tmp
17
17
  tmp
18
18
  .idea
19
19
  sample-data/chinook-empty.sqlite
20
- tmp
20
+ tmp
21
+ examples/mongodb_whitelist_generated.rb
data/.rvmrc CHANGED
@@ -1 +1 @@
1
- rvm use 1.9.3-p125@data-anon --create
1
+ rvm use 1.9.3@data-anon --create
data/.travis.yml CHANGED
@@ -1,4 +1,6 @@
1
1
  language: ruby
2
+ services:
3
+ - mongodb
2
4
  before_script: rake empty_dest
3
5
  rvm:
4
6
  - 1.9.2
data/Gemfile CHANGED
@@ -8,5 +8,7 @@ group :development, :test do
8
8
  gem 'rspec'
9
9
  gem 'pry'
10
10
  gem 'sqlite3'
11
+ gem 'mongo'
12
+ gem 'bson_ext'
11
13
  end
12
14
 
data/README.md CHANGED
@@ -1,7 +1,12 @@
1
1
  # Data::Anonymization
2
2
  Tool to create anonymized production data dump to use for PERF and other TEST environments.
3
3
 
4
+ [<img src="https://secure.travis-ci.org/sunitparekh/data-anonymization.png?branch=master">](http://travis-ci.org/sunitparekh/data-anonymization)
5
+ [<img src="https://gemnasium.com/sunitparekh/data-anonymization.png?travis">](https://gemnasium.com/sunitparekh/data-anonymization)
6
+ [<img src="https://codeclimate.com/badge.png">](https://codeclimate.com/github/sunitparekh/data-anonymization)
7
+
4
8
  ## Getting started
9
+
5
10
  Install gem using:
6
11
 
7
12
  $ gem install data-anonymization
@@ -39,15 +44,42 @@ Run using:
39
44
 
40
45
  $ ruby my_dsl.rb
41
46
 
47
+ Liked it? please share
48
+
49
+ [<img src="https://si0.twimg.com/a/1346446870/images/resources/twitter-bird-light-bgs.png" height="35" width="35">](https://twitter.com/share?text=A+simple+ruby+DSL+based+data+anonymization&url=http:%2F%2Fsunitparekh.github.com%2Fdata-anonymization&via=dataanon&hashtags=dataanon)
50
+
42
51
  ## Examples
43
52
 
44
- 1. [Whitelist using Chinoook sample database](https://github.com/sunitparekh/data-anonymization/blob/master/whitelist_dsl.rb)
45
- 2. [Blacklist using Chinoook sample database](https://github.com/sunitparekh/data-anonymization/blob/master/blacklist_dsl.rb)
46
- 3. [Whitelist with composite primary key using DellStore sample database](https://github.com/sunitparekh/test-anonymization/blob/master/dell_whitelist.rb)
47
- 4. [Blacklist with composite primary key using DellStore sample database](https://github.com/sunitparekh/test-anonymization/blob/master/dell_blacklist.rb)
53
+ SQLite database
54
+
55
+ 1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/whitelist_dsl.rb)
56
+ 2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/blacklist_dsl.rb)
57
+
58
+ MongoDB
59
+
60
+ 1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_whitelist_dsl.rb)
61
+ 2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_blacklist_dsl.rb)
62
+
63
+ Postgresql database having **composite primary key**
64
+
65
+ 1. [Whitelist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_whitelist.rb)
66
+ 2. [Blacklist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_blacklist.rb)
67
+
48
68
 
49
69
  ## Changelog
50
70
 
71
+ #### 0.5.0 (rc1 released. install gem using --pre option)
72
+
73
+ Major changes:
74
+
75
+ 1. MongoDB support
76
+ 2. Command line utility to generate whitelist DSL for RDBMS & MongoDB (reduces pain for writing whitelist dsl)
77
+ 3. Added support for reporting fields missing mapping in case of whitelist
78
+ 4. Errors reported at the end of process. Job doesn't fail for a single error.
79
+
80
+
81
+ Please see the [Github 0.5.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=2&state=open) for more details on changes/fixes in release 0.5.0
82
+
51
83
  #### 0.3.0 (Sep 4, 2012)
52
84
 
53
85
  Major changes:
@@ -56,7 +88,7 @@ Major changes:
56
88
  2. Change in default String strategy from LoremIpsum to RandomString based on end user feedback.
57
89
  3. Fixed issue with table column name 'type' as this is default name for STI in activerecord.
58
90
 
59
- Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&page=1&state=open) for more details on changes/fixes in release 0.3.0
91
+ Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&state=closed) for more details on changes/fixes in release 0.3.0
60
92
 
61
93
  #### 0.2.0 (August 16, 2012)
62
94
 
@@ -71,15 +103,10 @@ Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data
71
103
 
72
104
  ## Roadmap
73
105
 
74
- #### 0.4.0
75
-
76
- 1. MongoDB anonymization support (NoSQL document based database support)
106
+ MVP done. Fix defects and support queries, suggestions, enhancements logged in Github issues :-)
77
107
 
78
- #### 0.5.0
108
+ ## Share feedback
79
109
 
80
- 1. Generate DSL from database and build schema from source as part of Whitelist approach.
81
-
82
- #### Share feedback
83
110
  Please use Github [issues](https://github.com/sunitparekh/data-anonymization/issues) to share feedback, feature suggestions and report issues.
84
111
 
85
112
  ## What is data anonymization?
@@ -126,295 +153,305 @@ end
126
153
  2. Change [default field strategies](#default-field-strategies) to avoid using same strategy again and again in your DSL.
127
154
  3. To run anonymization in parallel at Table level, provided no FK constraint on tables use DataAnon::Parallel::Table strategy
128
155
 
129
- ## Running in Parallel
130
- Currently provides capability of running anonymization in parallel at table level provided no FK constraints on tables.
131
- It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
132
- By default it starts multiple parallel ruby processes processing table one by one.
133
- ```ruby
134
- database 'DellStore' do
135
- strategy DataAnon::Strategy::Whitelist
136
- execution_strategy DataAnon::Parallel::Table # by default sequential table processing
137
- ...
138
- end
139
- ```
156
+ ## DSL Generation
140
157
 
158
+ We provide a command line tool to generate whitelist scripts for RDBMS and NoSQL databases. The user needs to supply the connection details to the database and a script is generated by analyzing the schema. Below are examples of how to use the tool to generate the scripts for RDBMS and NoSQL datastores
141
159
 
142
- ## DataAnon::Core::Field
143
- The object that gets passed along with the field strategies.
144
-
145
- has following attribute accessor
160
+ When you install the data-anonymization tool, the **datanon** command become available on the terminal. If you type **datanon --help** and execute you should see the below
146
161
 
147
- - `name` current field/column name
148
- - `value` current field/column value
149
- - `row_number` current row number
150
- - `ar_record` active record of the current row under processing
151
-
152
- ## Field Strategies
162
+ ```
163
+ Tasks:
153
164
 
154
- ### LoremIpsum
155
- Default anonymization strategy for `string` content. Uses default 'Lorem ipsum...' text or text supplied in strategy to generate same length string.
165
+ datanon generate_mongo_dsl -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a Mongo DB using the database schema
166
+ datanon generate_rdbms_dsl -a, --adapter=ADAPTER -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a RDBMS database using the database schema
167
+ datanon help [TASK] # Describe available tasks or one specific task
156
168
 
157
- ```ruby
158
- anonymize('UserName').using FieldStrategy::LoremIpsum.new
159
- ```
160
- ```ruby
161
- anonymize('UserName').using FieldStrategy::LoremIpsum.new("very large string....")
162
- ```
163
- ```ruby
164
- anonymize('UserName').using FieldStrategy::LoremIpsum.new(File.read('my_file.txt'))
165
169
  ```
166
170
 
167
- ### RandomString
168
- Generates random string of same length.
169
- ```ruby
170
- anonymize('UserName').using FieldStrategy::RandomString.new
171
- ```
171
+ ### RDBMS whitelist generation
172
172
 
173
- ### StringTemplate
174
- Simple string evaluation within [DataAnon::Core::Field](#dataanon-core-field) context. Can be used for email, username anonymization.
175
- Make sure to put the string in 'single quote' else it will get evaluated inline.
176
- ```ruby
177
- anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
178
- ```
179
- ```ruby
180
- anonymize('Email').using FieldStrategy::StringTemplate.new('valid.address+#{row_number}@gmail.com')
181
- ```
182
- ```ruby
183
- anonymize('Email').using FieldStrategy::StringTemplate.new('useremail#{row_number}@mailinator.com')
184
- ```
173
+ The gem uses ActiveRecord(AR) abstraction to connect to relational databases. You can generate a whitelist script in seconds for any relational database supported by Active Record. To do so use the following command
185
174
 
186
- ### SelectFromList
187
- Select randomly one of the values specified.
188
- ```ruby
189
- anonymize('State').using FieldStrategy::SelectFromList.new(['New York','Georgia',...])
190
- ```
191
- ```ruby
192
- anonymize('NameTitle').using FieldStrategy::SelectFromList.new(['Mr','Mrs','Dr',...])
193
175
  ```
176
+ datanon generate_rdbms_dsl [options]
194
177
 
195
- ### SelectFromFile
196
- Similar to SelectFromList only difference is the list of values are picked up from file. Classical usage is like states field anonymization.
197
- ```ruby
198
- anonymize('State').using FieldStrategy::SelectFromFile.new('states.txt')
199
178
  ```
200
179
 
201
- ### FormattedStringNumber
202
- Keeping the format same it changes each digit in the string with random digit.
203
- ```ruby
204
- anonymize('CreditCardNumber').using FieldStrategy::FormattedStringNumber.new
205
- ```
180
+ The options available are :
206
181
 
207
- ### SelectFromDatabase
208
- Similar to SelectFromList with difference is the list of values are collected from the database table using distinct column query.
209
- ```ruby
210
- # values are collected using `select distinct state from customers` query
211
- anonymize('State').using FieldStrategy::SelectFromDatabase.new('customers','state')
212
- ```
182
+ 1. adapter(-a) : The activerecord adapter to use to connect to the database (eg. mysql2, postgresql)
183
+ 2. host(-h) : DB host name or IP address
184
+ 3. database(-d) : The name of the database to generate the whitelist script for
185
+ 4. username(-u) : Username for DB authentication
186
+ 5. password(-w) : Password for DB authentication
187
+ 6. port(-p) : The port the database service is running on. Default port provided by AR will be used if nothing is specififed.
213
188
 
214
- ### RandomAddress
215
- Generates address using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
216
- The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
217
- ```ruby
218
- anonymize('Address').using FieldStrategy::RandomAddress.region_US
219
- ```
220
- ```ruby
221
- anonymize('Address').using FieldStrategy::RandomAddress.region_UK
222
- ```
223
- ```ruby
224
- # get your own geo_json file and use it
225
- anonymize('Address').using FieldStrategy::RandomAddress.new('my_geo_json.json')
226
- ```
189
+ The adapter, host and database options are mandatory. The others are optional.
227
190
 
228
- ### RandomCity
229
- Similar to RandomAddress, generates city using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
230
- The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
231
- ```ruby
232
- anonymize('City').using FieldStrategy::RandomCity.region_US
233
- ```
234
- ```ruby
235
- anonymize('City').using FieldStrategy::RandomCity.region_UK
236
- ```
237
- ```ruby
238
- # get your own geo_json file and use it
239
- anonymize('City').using FieldStrategy::RandomCity.new('my_geo_json.json')
240
- ```
191
+ A few examples of the command is shown below
241
192
 
242
- ### RandomProvince
243
- Similar to RandomAddress, generates province using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
244
- The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
245
- ```ruby
246
- anonymize('Province').using FieldStrategy::RandomProvince.region_US
247
- ```
248
- ```ruby
249
- anonymize('Province').using FieldStrategy::RandomProvince.region_UK
250
- ```
251
- ```ruby
252
- # get your own geo_json file and use it
253
- anonymize('Province').using FieldStrategy::RandomProvince.new('my_geo_json.json')
254
193
  ```
194
+ datanon generate_rdbms_dsl -a mysql2 -h db.host.com -p 3306 -d production_db -u root -w password
255
195
 
256
- ### RandomZipcode
257
- Similar to RandomAddress, generates zipcode using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
258
- The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
259
- ```ruby
260
- anonymize('Address').using FieldStrategy::RandomZipcode.region_US
261
- ```
262
- ```ruby
263
- anonymize('Address').using FieldStrategy::RandomZipcode.region_UK
264
- ```
265
- ```ruby
266
- # get your own geo_json file and use it
267
- anonymize('Address').using FieldStrategy::RandomZipcode.new('my_geo_json.json')
268
- ```
196
+ datanon generate_rdbms_dsl -a postgresql -h 123.456.7.8 -d production_db
269
197
 
270
- ### RandomPhoneNumber
271
- Keeping the format same it changes each digit in the string with random digit.
272
- ```ruby
273
- anonymize('PhoneNumber').using FieldStrategy::RandomPhoneNumber.new
274
198
  ```
275
199
 
276
- ### AnonymizeDateTime
277
- Anonymizes each field(except year and seconds) within the natural range (e.g. hour between 1-24 and day within the month) based on true/false
278
- input for that field. By default, all fields are anonymized.
279
- ```ruby
280
- #anonymizes month and hour fields, leaving the day and minute fields untouched
281
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.new(true,false,true,false)
282
- ```
200
+ The relevant db gems must be installed so that AR has the adapters required to establish the connection to the databases. The script generates a file named **rdbms_whitelist_generated.rb** in the same location as the project.
283
201
 
284
- In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
285
- ```ruby
286
- # anonymizes only the month field
287
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_month
288
- # anonymizes only the day field
289
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_day
290
- # anonymizes only the hour field
291
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_hour
292
- # anonymizes only the minute field
293
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_minute
294
- ```
202
+ ### MongoDB whitelist generation
295
203
 
296
- ### AnonymizeTime
297
- Exactly similar to the above DateTime strategy, except that the returned object is of type `Time`
204
+ Similar to the the relational databases, a whitelist script for mongo db can be generated by analysing the database structure
298
205
 
299
- ### AnonymizeDate
300
- Anonmizes day and month fields within natural range based on true/false input for that field. By defaut both fields are
301
- anonymized
302
- ```ruby
303
- # anonymizes month and leaves day unchanged
304
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new(true,false)
305
206
  ```
207
+ datanon generate_mongo_dsl [options]
306
208
 
307
- In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
308
- ```ruby
309
- # anonymizes only the month field
310
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_month
311
- # anonymizes only the day field
312
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_day
313
209
  ```
314
210
 
315
- ### DateTimeDelta
316
- Shifts data randomly within given range. Default shifts date within 10 days + or - and shifts time within 30 minutes.
317
- ```ruby
318
- anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new
319
- ```
320
- ```ruby
321
- # shifts date within 20 days and time within 50 minutes
322
- anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new(20, 50)
323
- ```
211
+ The options available are :
324
212
 
325
- ### TimeDelta
326
- Exactly similar to the above DateTime strategy, except that the returned object is of type `Time`
213
+ 1. host(-h) : DB host name or IP address
214
+ 2. database(-d) : The name of the database to generate the whitelist script for
215
+ 3. username(-u) : Username for DB authentication
216
+ 4. password(-w) : Password for DB authentication
217
+ 5. port(-p) : The port the database service is running on.
218
+ 6. whitelist patterns(-r): A regex expression which can be used to match records in the database to list as whitelisted fields in the generated script.
327
219
 
328
- ### DateDelta
220
+ The host and database options are mandatory. The others are optional.
329
221
 
330
- Shifts date randomly within given delta range. Default shits date within 10 days + or -
331
- ```ruby
332
- anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new
333
- ```
334
- ```ruby
335
- # shifts date within 25 days
336
- anonymize('DateOfBirth').using FieldStrategy::DateDelta.new(25)
337
- ```
222
+ A few examples of the command is shown below
338
223
 
339
- ### RandomEmail
340
- Generates email randomly using the given HOSTNAME and TLD.
341
- By defaults generates hostname randomly along with email id.
342
- ```ruby
343
- anonymize('Email').using FieldStrategy::RandomEmail.new('thoughtworks','com')
344
224
  ```
225
+ datanon generate_mongo_dsl -h db.host.com -d production_db -u root -w password
345
226
 
346
- ### GmailTemplate
347
- Generates a valid unique gmail address by taking advantage of the gmail + strategy. Takes in a valid gmail username and
348
- generates emails of the form username+<number>@gmail.com
349
- ```ruby
350
- anonymize('Email').using FieldStrategy::GmailTemplate.new('username')
351
- ```
227
+ datanon generate_mongo_dsl -h 123.456.7.8 -d production_db
352
228
 
353
- ### RandomMailinatorEmail
354
- Generates random email using mailinator hostname. e.g. <randomstring>@mailinator.com
355
- ```ruby
356
- anonymize('Email').using FieldStrategy::RandomMailinatorEmail.new
357
229
  ```
358
230
 
359
- ### RandomUserName
360
- Generates random user name of same length as original user name.
361
- ```ruby
362
- anonymize('Username').using FieldStrategy::RandomUserName.new
363
- ```
231
+ The **mongo** gem is required in order to install the mongo db drivers. The script generates a file named **mongodb_whitelist_generated.rb** in the same location as the project.
364
232
 
365
- ### RandomFirstName
366
- Randomly picks up first name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt) is part of the gem.
367
- File should contain first name on each line.
368
- ```ruby
369
- anonymize('FirstName').using FieldStrategy::RandomFirstName.new
370
- ```
371
- ```ruby
372
- anonymize('FirstName').using FieldStrategy::RandomFirstName.new('my_first_names.txt')
373
- ```
374
233
 
375
- ### RandomLastName
376
- Randomly picks up last name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/last_names.txt) is part of the gem.
377
- File should contain last name on each line.
378
- ```ruby
379
- anonymize('LastName').using FieldStrategy::RandomLastName.new
380
- ```
381
- ```ruby
382
- anonymize('LastName').using FieldStrategy::RandomLastName.new('my_last_names.txt')
383
- ```
384
234
 
385
- ### RandomFullName
386
- Generates full name using the RandomFirstName and RandomLastName strategies.
387
- It also creates the s
388
- ```ruby
389
- anonymize('FullName').using FieldStrategy::RandomFullName.new
390
- ```
391
- ```ruby
392
- anonymize('FullName').using FieldStrategy::RandomLastName.new('my_first_names.txt', 'my_last_names.txt')
393
- ```
235
+ ## Running in Parallel
236
+ Currently provides capability of running anonymization in parallel at table level provided no FK constraints on tables.
237
+ It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
238
+ By default it starts multiple parallel ruby processes processing table one by one.
394
239
 
395
- ### RandomInteger
396
- Generates random integer number between given two numbers. Default range is 0 to 100.
397
240
  ```ruby
398
- anonymize('Age').using FieldStrategy::RandomInteger.new(18,70)
241
+ database 'DellStore' do
242
+ strategy DataAnon::Strategy::Whitelist
243
+ execution_strategy DataAnon::Parallel::Table # by default sequential table processing
244
+ ...
245
+ end
399
246
  ```
400
247
 
401
- ### RandomIntegerDelta
402
- Shifts the current value randomly within given delta + and -. Default is 10
403
- ```ruby
404
- anonymize('Age').using FieldStrategy::RandomIntegerDelta.new(2)
405
- ```
406
248
 
407
- ### RandomFloat
408
- Generates random float number between given two numbers. Default range is 0.0 to 100.0
409
- ```ruby
410
- anonymize('points').using FieldStrategy::RandomInteger.new(3.0,5.0)
411
- ```
249
+ ## DataAnon::Core::Field
250
+ The object that gets passed along with the field strategies.
251
+
252
+ has following attribute accessor
253
+
254
+ - `name` current field/column name
255
+ - `value` current field/column value
256
+ - `row_number` current row number
257
+ - `ar_record` active record of the current row under processing
258
+
259
+ ## Field Strategies
260
+
261
+
262
+ <table>
263
+ <tr>
264
+ <th align="left">Content</th>
265
+ <th align="left">Name</th>
266
+ <th align="left">Description</th>
267
+ </tr>
268
+ <tr>
269
+ <td align="left">Text</td>
270
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/LoremIpsum">LoremIpsum</a></td>
271
+ <td align="left">Generates a random Lorep Ipsum String</td>
272
+ </tr>
273
+ <tr>
274
+ <td align="left">Text</td>
275
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomString">RandomString</a></td>
276
+ <td align="left">Generates a random string of equal length</td>
277
+ </tr>
278
+ <tr>
279
+ <td align="left">Text</td>
280
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/StringTemplate">StringTemplate</a></td>
281
+ <td align="left">Generates a string based on provided template</td>
282
+ </tr>
283
+ <tr>
284
+ <td align="left">Text</td>
285
+ <td align="left"><a>SelectFromList</a></td>
286
+ <td align="left">Randomly selects a string from a provided list</td>
287
+ </tr>
288
+ <tr>
289
+ <td align="left">Text</td>
290
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromFile">SelectFromFile</a></td>
291
+ <td align="left">Randomly selects a string from a provided file</td>
292
+ </tr>
293
+ <tr>
294
+ <td align="left">Text</td>
295
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/FormattedStringNumber">FormattedStringNumber</a></td>
296
+ <td align="left">Randomize digits in a string while maintaining the format</td>
297
+ </tr>
298
+ <tr>
299
+ <td align="left">Text</td>
300
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromDatabase">SelectFromDatabase</a></td>
301
+ <td align="left">Selects randomly from the result of a query on a database</td>
302
+ </tr>
303
+ <tr>
304
+ <td align="left">Text</td>
305
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUrl">RandomURL</a></td>
306
+ <td align="left">Anonymizes a URL while mainting the structure</td>
307
+ </tr>
308
+ </table><table>
309
+ <tr>
310
+ <th align="left">Content</th>
311
+ <th align="left">Name</th>
312
+ <th align="left">Description</th>
313
+ </tr>
314
+ <tr>
315
+ <td align="left">Number</td>
316
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomInteger">RandomInteger</a></td>
317
+ <td align="left">Generates a random integer between provided limits (default 0 to 100)</td>
318
+ </tr>
319
+ <tr>
320
+ <td align="left">Number</td>
321
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomIntegerDelta">RandomIntegerDelta</a></td>
322
+ <td align="left">Generates a random integer within -delta and delta of original integer</td>
323
+ </tr>
324
+ <tr>
325
+ <td align="left">Number</td>
326
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFloat">RandomFloat</a></td>
327
+ <td align="left">Generates a random float between provided limits (default 0.0 to 100.0)</td>
328
+ </tr>
329
+ <tr>
330
+ <td align="left">Number</td>
331
+ <td align="left"><a>RandomFloatDelta</a></td>
332
+ <td align="left">Generates a random float within -delta and delta of original float</td>
333
+ </tr>
334
+ <tr>
335
+ <td align="left">Number</td>
336
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomBigDecimalDelta">RandomBigDecimalDelta</a></td>
337
+ <td align="left">Similar to previous but creates a big decimal object</td>
338
+ </tr>
339
+ </table><table>
340
+ <tr>
341
+ <th align="left">Content</th>
342
+ <th align="left">Name</th>
343
+ <th align="left">Description</th>
344
+ </tr>
345
+ <tr>
346
+ <td align="left">Address</td>
347
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomAddress">RandomAddress</a></td>
348
+ <td align="left">Randomly selects an address from a geojson flat file [Default US address]</td>
349
+ </tr>
350
+ <tr>
351
+ <td align="left">City</td>
352
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomCity">RandomCity</a></td>
353
+ <td align="left">Similar to address, picks a random city from a geojson flafile [Default US cities]</td>
354
+ </tr>
355
+ <tr>
356
+ <td align="left">Province</td>
357
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomProvince">RandomProvince</a></td>
358
+ <td align="left">Similar to address, picks a random city from a geojson flafile [Default US provinces]</td>
359
+ </tr>
360
+ <tr>
361
+ <td align="left">Zip code</td>
362
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomZipcode">RandomZipcode</a></td>
363
+ <td align="left">Similar to address, picks a random zipcode from a geojson flat file [Default US zipcodes]</td>
364
+ </tr>
365
+ <tr>
366
+ <td align="left">Phone number</td>
367
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomPhoneNumber">RandomPhoneNumber</a></td>
368
+ <td align="left">Randomizes a phone number while preserving locale-specific formatting</td>
369
+ </tr>
370
+ </table><table>
371
+ <tr>
372
+ <th align="left">Content</th>
373
+ <th align="left">Name</th>
374
+ <th align="left">Description</th>
375
+ </tr>
376
+ <tr>
377
+ <td align="left">DateTime</td>
378
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDateTime">AnonymizeDateTime</a></td>
379
+ <td align="left">Anonymizes each field (except year and seconds) within the natural range of that field, depending on the true/false flag provided for it</td>
380
+ </tr>
381
+ <tr>
382
+ <td align="left">Time</td>
383
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeTime">AnonymizeTime</a></td>
384
+ <td align="left">Same as above, except the returned object is of type 'Time'</td>
385
+ </tr>
386
+ <tr>
387
+ <td align="left">Date</td>
388
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDate">AnonymizeDate</a></td>
389
+ <td align="left">Anonymizes day and month within natural ranges based on true/false flag</td>
390
+ </tr>
391
+ <tr>
392
+ <td align="left">DateTimeDelta</td>
393
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateTimeDelta">DateTimeDelta</a></td>
394
+ <td align="left">Shifts date and time randomly within the given range. By default, shifts the date by up to 10 days in either direction and the time by up to 30 minutes.</td>
395
+ </tr>
396
+ <tr>
397
+ <td align="left">TimeDelta</td>
398
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/TimeDelta">TimeDelta</a></td>
399
+ <td align="left">Same as above, except the returned object is of type 'Time'</td>
400
+ </tr>
401
+ <tr>
402
+ <td align="left">DateDelta</td>
403
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateDelta">DateDelta</a></td>
404
+ <td align="left">Shifts date randomly within the given delta range. By default, shifts the date by up to 10 days in either direction.</td>
405
+ </tr>
406
+ </table><table>
407
+ <tr>
408
+ <th align="left">Content</th>
409
+ <th align="left">Name</th>
410
+ <th align="left">Description</th>
411
+ </tr>
412
+ <tr>
413
+ <td align="left">Email</td>
414
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomEmail">RandomEmail</a></td>
415
+ <td align="left">Generates email randomly using the given HOSTNAME and TLD.</td>
416
+ </tr>
417
+ <tr>
418
+ <td align="left">Email</td>
419
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/GmailTemplate">GmailTemplate</a></td>
420
+ <td align="left">Generates a valid, unique Gmail address by taking advantage of Gmail's + (plus) addressing</td>
421
+ </tr>
422
+ <tr>
423
+ <td align="left">Email</td>
424
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomMailinatorEmail">RandomMailinatorEmail</a></td>
425
+ <td align="left">Generates random email using mailinator hostname.</td>
426
+ </tr>
427
+ </table><table>
428
+ <tr>
429
+ <th align="left">Content</th>
430
+ <th align="left">Name</th>
431
+ <th align="left">Description</th>
432
+ </tr>
433
+ <tr>
434
+ <td align="left">First name</td>
435
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFirstName">RandomFirstName</a></td>
436
+ <td align="left">Randomly picks a first name from the predefined list in the file. Default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
437
+ </tr>
438
+ <tr>
439
+ <td align="left">Last name</td>
440
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomLastName">RandomLastName</a></td>
441
+ <td align="left">Randomly picks a last name from the predefined list in the file. Default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
442
+ </tr>
443
+ <tr>
444
+ <td align="left">Full Name</td>
445
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFullName">RandomFullName</a></td>
446
+ <td align="left">Generates full name using the RandomFirstName and RandomLastName strategies.</td>
447
+ </tr>
448
+ <tr>
449
+ <td align="left">User name</td>
450
+ <td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUserName">RandomUserName</a></td>
451
+ <td align="left">Generates random user name of same length as original user name.</td>
452
+ </tr>
453
+ </table>
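
Several strategies in the tables above (RandomFloatDelta, DateDelta and friends) share one idea: keep the original value roughly intact and move it by a random amount inside a bounded window. The sketch below illustrates that idea in plain Ruby. The helper names `shift_float` and `shift_date` are made up for this example and are not the gem's implementation; the defaults simply mirror the ones described in the tables.

```ruby
require 'date'

# Hypothetical helpers for illustration only, not part of the gem.

# Returns a float drawn from [value - delta, value + delta].
def shift_float(value, delta = 10.0)
  value + rand(-delta..delta)
end

# Returns a date moved by a whole number of days in [-day_delta, +day_delta].
def shift_date(date, day_delta = 10)
  date + rand(-day_delta..day_delta)
end

puts shift_float(42.0, 2.5)
puts shift_date(Date.new(2012, 9, 1))
```

Because the shift is bounded, anonymized values stay plausible (a score stays near its original range, a date stays in the same season) while no longer matching the source data exactly.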
412
454
 
413
- ### RandomFloatDelta
414
- Shifts the current value randomly within given delta + and -. Default is 10.0
415
- ```ruby
416
- anonymize('points').using FieldStrategy::RandomFloatDelta.new(2.5)
417
- ```
418
455
 
419
456
  ## Write your own field strategies
420
457
  The `field` parameter in the following code is a [DataAnon::Core::Field](#dataanon-core-field)
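
As a rough, self-contained illustration of the shape such a strategy can take, the sketch below assumes that a strategy only needs an `anonymize` method that receives the field and returns the new value. `SketchField` is a stand-in defined here so the example runs without the gem; its attribute names and the `MaskAllButLastFour` class are assumptions made for this example.

```ruby
# Stand-in for DataAnon::Core::Field so the sketch runs without the gem.
# The attribute names are assumptions made for this example.
SketchField = Struct.new(:name, :value, :row_number)

# Hypothetical strategy: masks every character except the last four.
class MaskAllButLastFour
  def anonymize(field)
    text = field.value.to_s
    return text if text.length <= 4
    ('*' * (text.length - 4)) + text[-4..-1]
  end
end

field = SketchField.new('card_number', '4111111111111111', 1)
puts MaskAllButLastFour.new.anonymize(field)
```

Keeping the strategy a plain object with a single method makes it easy to unit test in isolation before wiring it into a table definition.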
@@ -444,11 +481,11 @@ write your own anonymous field strategies within DSL,
444
481
  ## Default field strategies
445
482
 
446
483
  ```ruby
447
- # Work in progress... TO BE COMPLETED
448
- DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
484
+ DEFAULT_STRATEGIES = {:string => FieldStrategy::RandomString.new,
449
485
  :fixnum => FieldStrategy::RandomIntegerDelta.new(5),
450
486
  :bignum => FieldStrategy::RandomIntegerDelta.new(5000),
451
487
  :float => FieldStrategy::RandomFloatDelta.new(5.0),
488
+ :bigdecimal => FieldStrategy::RandomBigDecimalDelta.new(500.0),
452
489
  :datetime => FieldStrategy::DateTimeDelta.new,
453
490
  :time => FieldStrategy::TimeDelta.new,
454
491
  :date => FieldStrategy::DateDelta.new,
@@ -457,7 +494,7 @@ DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
457
494
  }
458
495
  ```
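
One way to picture how a type-keyed table like this could be consulted is sketched below. It assumes the lookup key is the downcased class name of the value and that user-supplied entries take precedence over built-in ones; both are assumptions for illustration, and `BUILT_IN_DEFAULTS` and `resolve_default` are hypothetical names, with lambdas standing in for strategy objects.

```ruby
# Hypothetical stand-ins: lambdas play the role of strategy objects.
BUILT_IN_DEFAULTS = {
  string: ->(value) { 'x' * value.length },
  float:  ->(value) { value + rand(-5.0..5.0) }
}.freeze

# Looks up a strategy by the downcased class name of the value.
# User-supplied entries win over built-in ones.
def resolve_default(value, user_defaults = {})
  key = value.class.to_s.downcase.to_sym
  strategy = user_defaults[key] || BUILT_IN_DEFAULTS[key]
  raise ArgumentError, "No default strategy for #{value.class}" unless strategy
  strategy
end

# An override can also cover a type that has no built-in entry.
overrides = { symbol: ->(_value) { :anonymized } }

puts resolve_default('secret').call('secret')
puts resolve_default(:admin, overrides).call(:admin)
```

Raising on an unknown type, rather than silently passing the value through, keeps un-anonymized data from slipping into the output unnoticed.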
459
496
 
460
- Overriding default field strategies,
497
+ Override the default field strategies as shown below; the same mechanism can be used to provide a default strategy for a data type that has none.
461
498
 
462
499
  ```ruby
463
500
  database 'Chinook' do
@@ -497,7 +534,7 @@ DataAnon::Utils::Logging.logger.level = Logger::INFO
497
534
  ## Credits
498
535
 
499
536
  - [ThoughtWorks Inc](http://www.thoughtworks.com), for allowing us to build this tool and make it open source.
500
- - [Birinder](https://twitter.com/birinder_) and [Panda](https://twitter.com/sarbashrestha) for reviewing the documentation.
537
+ - [Panda](https://twitter.com/sarbashrestha) for reviewing the documentation.
501
538
  - [Dan Abel](http://www.linkedin.com/pub/dan-abel/0/61b/9b0) for introducing me to Blacklist and Whitelist approach for data anonymization.
502
539
  - [Chirag Doshi](https://twitter.com/chiragsdoshi) for encouraging me to get this done.
503
540
  - [Aditya Karle](https://twitter.com/adityakarle) for the Logo. (Coming Soon...)