data-anonymization 0.3.0 → 0.5.0
- data/.gitignore +2 -1
- data/.rvmrc +1 -1
- data/.travis.yml +2 -0
- data/Gemfile +2 -0
- data/README.md +295 -258
- data/bin/datanon +57 -0
- data/data-anonymization.gemspec +2 -1
- data/examples/blacklist_dsl.rb +42 -0
- data/examples/mongodb_blacklist_dsl.rb +38 -0
- data/examples/mongodb_whitelist_dsl.rb +44 -0
- data/examples/whitelist_dsl.rb +63 -0
- data/lib/core/database.rb +21 -3
- data/lib/core/field.rb +5 -2
- data/lib/core/fields_missing_strategy.rb +30 -0
- data/lib/core/table_errors.rb +32 -0
- data/lib/data-anonymization.rb +11 -0
- data/lib/parallel/table.rb +8 -1
- data/lib/strategy/base.rb +35 -14
- data/lib/strategy/blacklist.rb +1 -1
- data/lib/strategy/field/anonymize_array.rb +28 -0
- data/lib/strategy/field/contact/random_address.rb +12 -0
- data/lib/strategy/field/contact/random_city.rb +12 -0
- data/lib/strategy/field/contact/random_phone_number.rb +4 -0
- data/lib/strategy/field/contact/random_province.rb +12 -0
- data/lib/strategy/field/contact/random_zipcode.rb +12 -0
- data/lib/strategy/field/datetime/anonymize_date.rb +15 -0
- data/lib/strategy/field/datetime/anonymize_datetime.rb +19 -0
- data/lib/strategy/field/datetime/anonymize_time.rb +19 -0
- data/lib/strategy/field/datetime/date_delta.rb +10 -0
- data/lib/strategy/field/datetime/date_time_delta.rb +9 -0
- data/lib/strategy/field/datetime/time_delta.rb +8 -0
- data/lib/strategy/field/default_anon.rb +4 -1
- data/lib/strategy/field/email/gmail_template.rb +8 -0
- data/lib/strategy/field/email/random_email.rb +7 -0
- data/lib/strategy/field/email/random_mailinator_email.rb +5 -0
- data/lib/strategy/field/fields.rb +4 -0
- data/lib/strategy/field/name/random_first_name.rb +10 -0
- data/lib/strategy/field/name/random_full_name.rb +10 -2
- data/lib/strategy/field/name/random_last_name.rb +9 -0
- data/lib/strategy/field/name/random_user_name.rb +5 -0
- data/lib/strategy/field/number/random_big_decimal_delta.rb +6 -0
- data/lib/strategy/field/number/random_float.rb +4 -0
- data/lib/strategy/field/number/random_float_delta.rb +6 -0
- data/lib/strategy/field/number/random_integer.rb +4 -0
- data/lib/strategy/field/number/random_integer_delta.rb +6 -0
- data/lib/strategy/field/string/formatted_string_numbers.rb +10 -6
- data/lib/strategy/field/string/lorem_ipsum.rb +9 -0
- data/lib/strategy/field/string/random_formatted_string.rb +39 -0
- data/lib/strategy/field/string/random_string.rb +6 -0
- data/lib/strategy/field/string/random_url.rb +7 -1
- data/lib/strategy/field/string/select_from_database.rb +7 -5
- data/lib/strategy/field/string/select_from_file.rb +7 -0
- data/lib/strategy/field/string/select_from_list.rb +8 -0
- data/lib/strategy/field/string/string_template.rb +11 -0
- data/lib/strategy/mongodb/anonymize_field.rb +44 -0
- data/lib/strategy/mongodb/blacklist.rb +29 -0
- data/lib/strategy/mongodb/whitelist.rb +62 -0
- data/lib/strategy/strategies.rb +10 -1
- data/lib/strategy/whitelist.rb +7 -2
- data/lib/thor/helpers/mongodb_dsl_generator.rb +66 -0
- data/lib/thor/helpers/rdbms_dsl_generator.rb +36 -0
- data/lib/thor/templates/mongodb_whitelist_template.erb +15 -0
- data/lib/thor/templates/whitelist_template.erb +21 -0
- data/lib/utils/database.rb +4 -0
- data/lib/utils/parallel_progress_bar.rb +24 -0
- data/lib/utils/progress_bar.rb +34 -22
- data/lib/utils/random_string.rb +3 -2
- data/lib/utils/random_string_chars_only.rb +3 -5
- data/lib/utils/template_helper.rb +44 -0
- data/lib/version.rb +1 -1
- data/spec/acceptance/mongodb_blacklist_spec.rb +75 -0
- data/spec/acceptance/mongodb_whitelist_spec.rb +107 -0
- data/spec/core/fields_missing_strategy_spec.rb +26 -0
- data/spec/strategy/field/name/random_first_name_spec.rb +1 -1
- data/spec/strategy/field/name/random_full_name_spec.rb +12 -7
- data/spec/strategy/field/name/random_last_name_spec.rb +1 -1
- data/spec/strategy/field/string/random_formatted_string_spec.rb +39 -0
- data/spec/strategy/field/string/select_from_file_spec.rb +21 -0
- data/spec/strategy/mongodb/anonymize_field_spec.rb +52 -0
- data/spec/utils/random_float_spec.rb +12 -0
- data/spec/utils/random_string_char_only_spec.rb +12 -0
- data/spec/utils/template_helper_spec.rb +14 -0
- metadata +56 -6
- data/blacklist_dsl.rb +0 -17
- data/blacklist_nosql_dsl.rb +0 -36
- data/whitelist_dsl.rb +0 -42
data/.gitignore
CHANGED
data/.rvmrc
CHANGED
@@ -1 +1 @@
|
|
1
|
-
rvm use 1.9.3
|
1
|
+
rvm use 1.9.3@data-anon --create
|
data/.travis.yml
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,7 +1,12 @@
|
|
1
1
|
# Data::Anonymization
|
2
2
|
Tool to create anonymized production data dump to use for PERF and other TEST environments.
|
3
3
|
|
4
|
+
[<img src="https://secure.travis-ci.org/sunitparekh/data-anonymization.png?branch=master">](http://travis-ci.org/sunitparekh/data-anonymization)
|
5
|
+
[<img src="https://gemnasium.com/sunitparekh/data-anonymization.png?travis">](https://gemnasium.com/sunitparekh/data-anonymization)
|
6
|
+
[<img src="https://codeclimate.com/badge.png">](https://codeclimate.com/github/sunitparekh/data-anonymization)
|
7
|
+
|
4
8
|
## Getting started
|
9
|
+
|
5
10
|
Install gem using:
|
6
11
|
|
7
12
|
$ gem install data-anonymization
|
@@ -39,15 +44,42 @@ Run using:
|
|
39
44
|
|
40
45
|
$ ruby my_dsl.rb
|
41
46
|
|
47
|
+
Liked it? please share
|
48
|
+
|
49
|
+
[<img src="https://si0.twimg.com/a/1346446870/images/resources/twitter-bird-light-bgs.png" height="35" width="35">](https://twitter.com/share?text=A+simple+ruby+DSL+based+data+anonymization&url=http:%2F%2Fsunitparekh.github.com%2Fdata-anonymization&via=dataanon&hashtags=dataanon)
|
50
|
+
|
42
51
|
## Examples
|
43
52
|
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
53
|
+
SQLite database
|
54
|
+
|
55
|
+
1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/whitelist_dsl.rb)
|
56
|
+
2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/blacklist_dsl.rb)
|
57
|
+
|
58
|
+
MongoDB
|
59
|
+
|
60
|
+
1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_whitelist_dsl.rb)
|
61
|
+
2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_blacklist_dsl.rb)
|
62
|
+
|
63
|
+
Postgresql database having **composite primary key**
|
64
|
+
|
65
|
+
1. [Whitelist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_whitelist.rb)
|
66
|
+
2. [Blacklist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_blacklist.rb)
|
67
|
+
|
48
68
|
|
49
69
|
## Changelog
|
50
70
|
|
71
|
+
#### 0.5.0 (rc1 released. install gem using --pre option)
|
72
|
+
|
73
|
+
Major changes:
|
74
|
+
|
75
|
+
1. MongoDB support
|
76
|
+
2. Command line utility to generate whitelist DSL for RDBMS & MongoDB (reduces pain for writing whitelist dsl)
|
77
|
+
3. Added support for reporting fields missing mapping in case of whitelist
|
78
|
+
4. Errors reported at the end of process. Job doesn't fail for a single error.
|
79
|
+
|
80
|
+
|
81
|
+
Please see the [Github 0.5.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=2&state=open) for more details on changes/fixes in release 0.5.0
|
82
|
+
|
51
83
|
#### 0.3.0 (Sep 4, 2012)
|
52
84
|
|
53
85
|
Major changes:
|
@@ -56,7 +88,7 @@ Major changes:
|
|
56
88
|
2. Change in default String strategy from LoremIpsum to RandomString based on end user feedback.
|
57
89
|
3. Fixed issue with table column name 'type' as this is default name for STI in activerecord.
|
58
90
|
|
59
|
-
Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&
|
91
|
+
Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&state=closed) for more details on changes/fixes in release 0.3.0
|
60
92
|
|
61
93
|
#### 0.2.0 (August 16, 2012)
|
62
94
|
|
@@ -71,15 +103,10 @@ Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data
|
|
71
103
|
|
72
104
|
## Roadmap
|
73
105
|
|
74
|
-
|
75
|
-
|
76
|
-
1. MongoDB anonymization support (NoSQL document based database support)
|
106
|
+
MVP done. Fix defects and support queries, suggestions, enhancements logged in Github issues :-)
|
77
107
|
|
78
|
-
|
108
|
+
## Share feedback
|
79
109
|
|
80
|
-
1. Generate DSL from database and build schema from source as part of Whitelist approach.
|
81
|
-
|
82
|
-
#### Share feedback
|
83
110
|
Please use Github [issues](https://github.com/sunitparekh/data-anonymization/issues) to share feedback, feature suggestions and report issues.
|
84
111
|
|
85
112
|
## What is data anonymization?
|
@@ -126,295 +153,305 @@ end
|
|
126
153
|
2. Change [default field strategies](#default-field-strategies) to avoid using same strategy again and again in your DSL.
|
127
154
|
3. To run anonymization in parallel at Table level, provided no FK constraint on tables use DataAnon::Parallel::Table strategy
|
128
155
|
|
129
|
-
##
|
130
|
-
Currently provides capability of running anonymization in parallel at table level provided no FK constraints on tables.
|
131
|
-
It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
|
132
|
-
By default it starts multiple parallel ruby processes processing table one by one.
|
133
|
-
```ruby
|
134
|
-
database 'DellStore' do
|
135
|
-
strategy DataAnon::Strategy::Whitelist
|
136
|
-
execution_strategy DataAnon::Parallel::Table # by default sequential table processing
|
137
|
-
...
|
138
|
-
end
|
139
|
-
```
|
156
|
+
## DSL Generation
|
140
157
|
|
158
|
+
We provide a command line tool to generate whitelist scripts for RDBMS and NoSQL databases. You supply the database connection details, and the tool generates a script by analyzing the schema. The examples below show how to generate the scripts for RDBMS and NoSQL datastores.
|
141
159
|
|
142
|
-
|
143
|
-
The object that gets passed along with the field strategies.
|
144
|
-
|
145
|
-
has following attribute accessor
|
160
|
+
When you install the data-anonymization tool, the **datanon** command becomes available on the terminal. Running **datanon --help** should print the following:
|
146
161
|
|
147
|
-
|
148
|
-
|
149
|
-
- `row_number` current row number
|
150
|
-
- `ar_record` active record of the current row under processing
|
151
|
-
|
152
|
-
## Field Strategies
|
162
|
+
```
|
163
|
+
Tasks:
|
153
164
|
|
154
|
-
|
155
|
-
|
165
|
+
datanon generate_mongo_dsl -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a Mongo DB using the database schema
|
166
|
+
datanon generate_rdbms_dsl -a, --adapter=ADAPTER -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a RDBMS database using the database schema
|
167
|
+
datanon help [TASK] # Describe available tasks or one specific task
|
156
168
|
|
157
|
-
```ruby
|
158
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new
|
159
|
-
```
|
160
|
-
```ruby
|
161
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new("very large string....")
|
162
|
-
```
|
163
|
-
```ruby
|
164
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new(File.read('my_file.txt'))
|
165
169
|
```
|
166
170
|
|
167
|
-
###
|
168
|
-
Generates random string of same length.
|
169
|
-
```ruby
|
170
|
-
anonymize('UserName').using FieldStrategy::RandomString.new
|
171
|
-
```
|
171
|
+
### RDBMS whitelist generation
|
172
172
|
|
173
|
-
|
174
|
-
Simple string evaluation within [DataAnon::Core::Field](#dataanon-core-field) context. Can be used for email, username anonymization.
|
175
|
-
Make sure to put the string in 'single quote' else it will get evaluated inline.
|
176
|
-
```ruby
|
177
|
-
anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
|
178
|
-
```
|
179
|
-
```ruby
|
180
|
-
anonymize('Email').using FieldStrategy::StringTemplate.new('valid.address+#{row_number}@gmail.com')
|
181
|
-
```
|
182
|
-
```ruby
|
183
|
-
anonymize('Email').using FieldStrategy::StringTemplate.new('useremail#{row_number}@mailinator.com')
|
184
|
-
```
|
173
|
+
The gem uses the ActiveRecord (AR) abstraction to connect to relational databases, so you can generate a whitelist script in seconds for any relational database supported by ActiveRecord. To do so, use the following command:
|
185
174
|
|
186
|
-
### SelectFromList
|
187
|
-
Select randomly one of the values specified.
|
188
|
-
```ruby
|
189
|
-
anonymize('State').using FieldStrategy::SelectFromList.new(['New York','Georgia',...])
|
190
|
-
```
|
191
|
-
```ruby
|
192
|
-
anonymize('NameTitle').using FieldStrategy::SelectFromList.new(['Mr','Mrs','Dr',...])
|
193
175
|
```
|
176
|
+
datanon generate_rdbms_dsl [options]
|
194
177
|
|
195
|
-
### SelectFromFile
|
196
|
-
Similar to SelectFromList only difference is the list of values are picked up from file. Classical usage is like states field anonymization.
|
197
|
-
```ruby
|
198
|
-
anonymize('State').using FieldStrategy::SelectFromFile.new('states.txt')
|
199
178
|
```
|
200
179
|
|
201
|
-
|
202
|
-
Keeping the format same it changes each digit in the string with random digit.
|
203
|
-
```ruby
|
204
|
-
anonymize('CreditCardNumber').using FieldStrategy::FormattedStringNumber.new
|
205
|
-
```
|
180
|
+
The options available are :
|
206
181
|
|
207
|
-
|
208
|
-
|
209
|
-
|
210
|
-
|
211
|
-
|
212
|
-
|
182
|
+
1. adapter(-a) : The activerecord adapter to use to connect to the database (eg. mysql2, postgresql)
|
183
|
+
2. host(-h) : DB host name or IP address
|
184
|
+
3. database(-d) : The name of the database to generate the whitelist script for
|
185
|
+
4. username(-u) : Username for DB authentication
|
186
|
+
5. password(-w) : Password for DB authentication
|
187
|
+
6. port(-p) : The port the database service is running on. The default port provided by AR will be used if nothing is specified.
|
213
188
|
|
214
|
-
|
215
|
-
Generates address using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
216
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
217
|
-
```ruby
|
218
|
-
anonymize('Address').using FieldStrategy::RandomAddress.region_US
|
219
|
-
```
|
220
|
-
```ruby
|
221
|
-
anonymize('Address').using FieldStrategy::RandomAddress.region_UK
|
222
|
-
```
|
223
|
-
```ruby
|
224
|
-
# get your own geo_json file and use it
|
225
|
-
anonymize('Address').using FieldStrategy::RandomAddress.new('my_geo_json.json')
|
226
|
-
```
|
189
|
+
The adapter, host and database options are mandatory. The others are optional.
|
227
190
|
|
228
|
-
|
229
|
-
Similar to RandomAddress, generates city using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
230
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
231
|
-
```ruby
|
232
|
-
anonymize('City').using FieldStrategy::RandomCity.region_US
|
233
|
-
```
|
234
|
-
```ruby
|
235
|
-
anonymize('City').using FieldStrategy::RandomCity.region_UK
|
236
|
-
```
|
237
|
-
```ruby
|
238
|
-
# get your own geo_json file and use it
|
239
|
-
anonymize('City').using FieldStrategy::RandomCity.new('my_geo_json.json')
|
240
|
-
```
|
191
|
+
A few examples of the command are shown below:
|
241
192
|
|
242
|
-
### RandomProvince
|
243
|
-
Similar to RandomAddress, generates province using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
244
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
245
|
-
```ruby
|
246
|
-
anonymize('Province').using FieldStrategy::RandomProvince.region_US
|
247
|
-
```
|
248
|
-
```ruby
|
249
|
-
anonymize('Province').using FieldStrategy::RandomProvince.region_UK
|
250
|
-
```
|
251
|
-
```ruby
|
252
|
-
# get your own geo_json file and use it
|
253
|
-
anonymize('Province').using FieldStrategy::RandomProvince.new('my_geo_json.json')
|
254
193
|
```
|
194
|
+
datanon generate_rdbms_dsl -a mysql2 -h db.host.com -p 3306 -d production_db -u root -w password
|
255
195
|
|
256
|
-
|
257
|
-
Similar to RandomAddress, generates zipcode using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
258
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
259
|
-
```ruby
|
260
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.region_US
|
261
|
-
```
|
262
|
-
```ruby
|
263
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.region_UK
|
264
|
-
```
|
265
|
-
```ruby
|
266
|
-
# get your own geo_json file and use it
|
267
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.new('my_geo_json.json')
|
268
|
-
```
|
196
|
+
datanon generate_rdbms_dsl -a postgresql -h 123.456.7.8 -d production_db
|
269
197
|
|
270
|
-
### RandomPhoneNumber
|
271
|
-
Keeping the format same it changes each digit in the string with random digit.
|
272
|
-
```ruby
|
273
|
-
anonymize('PhoneNumber').using FieldStrategy::RandomPhoneNumber.new
|
274
198
|
```
|
275
199
|
|
276
|
-
|
277
|
-
Anonymizes each field(except year and seconds) within the natural range (e.g. hour between 1-24 and day within the month) based on true/false
|
278
|
-
input for that field. By default, all fields are anonymized.
|
279
|
-
```ruby
|
280
|
-
#anonymizes month and hour fields, leaving the day and minute fields untouched
|
281
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.new(true,false,true,false)
|
282
|
-
```
|
200
|
+
The relevant db gems must be installed so that AR has the adapters required to establish the connection to the databases. The script generates a file named **rdbms_whitelist_generated.rb** in the same location as the project.
|
283
201
|
|
284
|
-
|
285
|
-
```ruby
|
286
|
-
# anonymizes only the month field
|
287
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_month
|
288
|
-
# anonymizes only the day field
|
289
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_day
|
290
|
-
# anonymizes only the hour field
|
291
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_hour
|
292
|
-
# anonymizes only the minute field
|
293
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_minute
|
294
|
-
```
|
202
|
+
### MongoDB whitelist generation
|
295
203
|
|
296
|
-
|
297
|
-
Exactly similar to the above DateTime strategy, except that the returned object is of type `Time`
|
204
|
+
Similar to the relational databases, a whitelist script for MongoDB can be generated by analysing the database structure:
|
298
205
|
|
299
|
-
### AnonymizeDate
|
300
|
-
Anonmizes day and month fields within natural range based on true/false input for that field. By defaut both fields are
|
301
|
-
anonymized
|
302
|
-
```ruby
|
303
|
-
# anonymizes month and leaves day unchanged
|
304
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new(true,false)
|
305
206
|
```
|
207
|
+
datanon generate_mongo_dsl [options]
|
306
208
|
|
307
|
-
In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
|
308
|
-
```ruby
|
309
|
-
# anonymizes only the month field
|
310
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_month
|
311
|
-
# anonymizes only the day field
|
312
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_day
|
313
209
|
```
|
314
210
|
|
315
|
-
|
316
|
-
Shifts data randomly within given range. Default shifts date within 10 days + or - and shifts time within 30 minutes.
|
317
|
-
```ruby
|
318
|
-
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new
|
319
|
-
```
|
320
|
-
```ruby
|
321
|
-
# shifts date within 20 days and time within 50 minutes
|
322
|
-
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new(20, 50)
|
323
|
-
```
|
211
|
+
The options available are :
|
324
212
|
|
325
|
-
|
326
|
-
|
213
|
+
1. host(-h) : DB host name or IP address
|
214
|
+
2. database(-d) : The name of the database to generate the whitelist script for
|
215
|
+
3. username(-u) : Username for DB authentication
|
216
|
+
4. password(-w) : Password for DB authentication
|
217
|
+
5. port(-p) : The port the database service is running on.
|
218
|
+
6. whitelist patterns(-r): A regular expression which can be used to match records in the database to list as whitelisted fields in the generated script.
|
327
219
|
|
328
|
-
|
220
|
+
The host and database options are mandatory. The others are optional.
|
329
221
|
|
330
|
-
|
331
|
-
```ruby
|
332
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new
|
333
|
-
```
|
334
|
-
```ruby
|
335
|
-
# shifts date within 25 days
|
336
|
-
anonymize('DateOfBirth').using FieldStrategy::DateDelta.new(25)
|
337
|
-
```
|
222
|
+
A few examples of the command are shown below:
|
338
223
|
|
339
|
-
### RandomEmail
|
340
|
-
Generates email randomly using the given HOSTNAME and TLD.
|
341
|
-
By defaults generates hostname randomly along with email id.
|
342
|
-
```ruby
|
343
|
-
anonymize('Email').using FieldStrategy::RandomEmail.new('thoughtworks','com')
|
344
224
|
```
|
225
|
+
datanon generate_mongo_dsl -h db.host.com -d production_db -u root -w password
|
345
226
|
|
346
|
-
|
347
|
-
Generates a valid unique gmail address by taking advantage of the gmail + strategy. Takes in a valid gmail username and
|
348
|
-
generates emails of the form username+<number>@gmail.com
|
349
|
-
```ruby
|
350
|
-
anonymize('Email').using FieldStrategy::GmailTemplate.new('username')
|
351
|
-
```
|
227
|
+
datanon generate_mongo_dsl -h 123.456.7.8 -d production_db
|
352
228
|
|
353
|
-
### RandomMailinatorEmail
|
354
|
-
Generates random email using mailinator hostname. e.g. <randomstring>@mailinator.com
|
355
|
-
```ruby
|
356
|
-
anonymize('Email').using FieldStrategy::RandomMailinatorEmail.new
|
357
229
|
```
|
358
230
|
|
359
|
-
|
360
|
-
Generates random user name of same length as original user name.
|
361
|
-
```ruby
|
362
|
-
anonymize('Username').using FieldStrategy::RandomUserName.new
|
363
|
-
```
|
231
|
+
The **mongo** gem must be installed, as it provides the MongoDB driver. The script generates a file named **mongodb_whitelist_generated.rb** in the same location as the project.
|
364
232
|
|
365
|
-
### RandomFirstName
|
366
|
-
Randomly picks up first name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt) is part of the gem.
|
367
|
-
File should contain first name on each line.
|
368
|
-
```ruby
|
369
|
-
anonymize('FirstName').using FieldStrategy::RandomFirstName.new
|
370
|
-
```
|
371
|
-
```ruby
|
372
|
-
anonymize('FirstName').using FieldStrategy::RandomFirstName.new('my_first_names.txt')
|
373
|
-
```
|
374
233
|
|
375
|
-
### RandomLastName
|
376
|
-
Randomly picks up last name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/last_names.txt) is part of the gem.
|
377
|
-
File should contain last name on each line.
|
378
|
-
```ruby
|
379
|
-
anonymize('LastName').using FieldStrategy::RandomLastName.new
|
380
|
-
```
|
381
|
-
```ruby
|
382
|
-
anonymize('LastName').using FieldStrategy::RandomLastName.new('my_last_names.txt')
|
383
|
-
```
|
384
234
|
|
385
|
-
|
386
|
-
|
387
|
-
It
|
388
|
-
|
389
|
-
anonymize('FullName').using FieldStrategy::RandomFullName.new
|
390
|
-
```
|
391
|
-
```ruby
|
392
|
-
anonymize('FullName').using FieldStrategy::RandomLastName.new('my_first_names.txt', 'my_last_names.txt')
|
393
|
-
```
|
235
|
+
## Running in Parallel
|
236
|
+
Currently provides capability of running anonymization in parallel at table level provided no FK constraints on tables.
|
237
|
+
It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
|
238
|
+
By default it starts multiple parallel Ruby processes, each processing one table at a time.
|
394
239
|
|
395
|
-
### RandomInteger
|
396
|
-
Generates random integer number between given two numbers. Default range is 0 to 100.
|
397
240
|
```ruby
|
398
|
-
|
241
|
+
database 'DellStore' do
|
242
|
+
strategy DataAnon::Strategy::Whitelist
|
243
|
+
execution_strategy DataAnon::Parallel::Table # by default sequential table processing
|
244
|
+
...
|
245
|
+
end
|
399
246
|
```
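
To make the idea of table-level parallelism concrete, here is a minimal sketch using only Ruby's standard library. It is not the gem's implementation: the gem delegates to the Parallel gem and starts separate processes, whereas this sketch uses threads and a work queue. The names `anonymize_table` and `run_in_parallel` are hypothetical and exist only for illustration.

```ruby
# Hypothetical stand-in for the per-table anonymization work.
def anonymize_table(name)
  "#{name}: done"
end

# Distributes tables across a fixed number of worker threads.
# Each worker takes one table at a time from a shared queue,
# so tables must not depend on each other (no FK constraints).
def run_in_parallel(tables, workers: 2)
  pending = Queue.new
  tables.each { |table| pending << table }
  finished = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        table = begin
          pending.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        finished << anonymize_table(table)
      end
    end
  end
  threads.each(&:join)

  # Completion order is not deterministic, so sort for a stable result.
  Array.new(finished.size) { finished.pop }.sort
end

puts run_in_parallel(%w[customers orders products])
```

Because workers finish in no particular order, anything that relies on one table being processed before another is unsafe here, which is the same reason the strategy requires tables without FK constraints.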
|
400
247
|
|
401
|
-
### RandomIntegerDelta
|
402
|
-
Shifts the current value randomly within given delta + and -. Default is 10
|
403
|
-
```ruby
|
404
|
-
anonymize('Age').using FieldStrategy::RandomIntegerDelta.new(2)
|
405
|
-
```
|
406
248
|
|
407
|
-
|
408
|
-
|
409
|
-
|
410
|
-
|
411
|
-
|
249
|
+
## DataAnon::Core::Field
|
250
|
+
The object that gets passed along with the field strategies.
|
251
|
+
|
252
|
+
It has the following attribute accessors:
|
253
|
+
|
254
|
+
- `name` current field/column name
|
255
|
+
- `value` current field/column value
|
256
|
+
- `row_number` current row number
|
257
|
+
- `ar_record` active record of the current row under processing
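
As a rough illustration of how a field strategy might use these accessors, the sketch below builds a value from `row_number`. It assumes a strategy is any object that responds to `anonymize(field)` and returns the replacement value. `Field` here is a hypothetical stand-in for `DataAnon::Core::Field` and `RowNumberedUserName` is an invented example class; neither is part of the gem.

```ruby
# Hypothetical stand-in exposing only the four accessors listed above.
# The real DataAnon::Core::Field is provided by the gem.
Field = Struct.new(:name, :value, :row_number, :ar_record)

# Illustrative strategy (assumption: strategies respond to anonymize(field)).
# It ignores the original value and derives a unique name from the row number.
class RowNumberedUserName
  def initialize(prefix = 'user')
    @prefix = prefix
  end

  def anonymize(field)
    "#{@prefix}#{field.row_number}"
  end
end

field = Field.new('UserName', 'jsmith', 42, nil)
puts RowNumberedUserName.new.anonymize(field) # prints "user42"
```

Deriving the value from `row_number` keeps the output unique per row, which matters for columns with a uniqueness constraint.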
|
258
|
+
|
259
|
+
## Field Strategies
|
260
|
+
|
261
|
+
|
262
|
+
<table>
|
263
|
+
<tr>
|
264
|
+
<th align="left">Content</th>
|
265
|
+
<th align="left">Name</th>
|
266
|
+
<th align="left">Description</th>
|
267
|
+
</tr>
|
268
|
+
<tr>
|
269
|
+
<td align="left">Text</td>
|
270
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/LoremIpsum">LoremIpsum</a></td>
|
271
|
+
<td align="left">Generates a random Lorem Ipsum string</td>
|
272
|
+
</tr>
|
273
|
+
<tr>
|
274
|
+
<td align="left">Text</td>
|
275
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomString">RandomString</a></td>
|
276
|
+
<td align="left">Generates a random string of equal length</td>
|
277
|
+
</tr>
|
278
|
+
<tr>
|
279
|
+
<td align="left">Text</td>
|
280
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/StringTemplate">StringTemplate</a></td>
|
281
|
+
<td align="left">Generates a string based on provided template</td>
|
282
|
+
</tr>
|
283
|
+
<tr>
|
284
|
+
<td align="left">Text</td>
|
285
|
+
<td align="left"><a>SelectFromList</a></td>
|
286
|
+
<td align="left">Randomly selects a string from a provided list</td>
|
287
|
+
</tr>
|
288
|
+
<tr>
|
289
|
+
<td align="left">Text</td>
|
290
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromFile">SelectFromFile</a></td>
|
291
|
+
<td align="left">Randomly selects a string from a provided file</td>
|
292
|
+
</tr>
|
293
|
+
<tr>
|
294
|
+
<td align="left">Text</td>
|
295
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/FormattedStringNumber">FormattedStringNumber</a></td>
|
296
|
+
<td align="left">Randomize digits in a string while maintaining the format</td>
|
297
|
+
</tr>
|
298
|
+
<tr>
|
299
|
+
<td align="left">Text</td>
|
300
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromDatabase">SelectFromDatabase</a></td>
|
301
|
+
<td align="left">Selects randomly from the result of a query on a database</td>
|
302
|
+
</tr>
|
303
|
+
<tr>
|
304
|
+
<td align="left">Text</td>
|
305
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUrl">RandomURL</a></td>
|
306
|
+
<td align="left">Anonymizes a URL while maintaining the structure</td>
|
307
|
+
</tr>
|
308
|
+
</table><table>
|
309
|
+
<tr>
|
310
|
+
<th align="left">Content</th>
|
311
|
+
<th align="left">Name</th>
|
312
|
+
<th align="left">Description</th>
|
313
|
+
</tr>
|
314
|
+
<tr>
|
315
|
+
<td align="left">Number</td>
|
316
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomInteger">RandomInteger</a></td>
|
317
|
+
<td align="left">Generates a random integer between provided limits (default 0 to 100)</td>
|
318
|
+
</tr>
|
319
|
+
<tr>
|
320
|
+
<td align="left">Number</td>
|
321
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomIntegerDelta">RandomIntegerDelta</a></td>
|
322
|
+
<td align="left">Generates a random integer within -delta and delta of original integer</td>
|
323
|
+
</tr>
|
324
|
+
<tr>
|
325
|
+
<td align="left">Number</td>
|
326
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFloat">RandomFloat</a></td>
|
327
|
+
<td align="left">Generates a random float between provided limits (default 0.0 to 100.0)</td>
|
328
|
+
</tr>
|
329
|
+
<tr>
|
330
|
+
<td align="left">Number</td>
|
331
|
+
<td align="left"><a>RandomFloatDelta</a></td>
|
332
|
+
<td align="left">Generates a random float within -delta and delta of original float</td>
|
333
|
+
</tr>
|
334
|
+
<tr>
|
335
|
+
<td align="left">Number</td>
|
336
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomBigDecimalDelta">RandomBigDecimalDelta</a></td>
|
337
|
+
<td align="left">Similar to previous but creates a big decimal object</td>
|
338
|
+
</tr>
|
339
|
+
</table><table>
|
340
|
+
<tr>
|
341
|
+
<th align="left">Content</th>
|
342
|
+
<th align="left">Name</th>
|
343
|
+
<th align="left">Description</th>
|
344
|
+
</tr>
|
345
|
+
<tr>
|
346
|
+
<td align="left">Address</td>
|
347
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomAddress">RandomAddress</a></td>
|
348
|
+
<td align="left">Randomly selects an address from a geojson flat file [Default US address]</td>
|
349
|
+
</tr>
|
350
|
+
<tr>
|
351
|
+
<td align="left">City</td>
|
352
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomCity">RandomCity</a></td>
|
353
|
+
<td align="left">Similar to address, picks a random city from a geojson flat file [Default US cities]</td>
|
354
|
+
</tr>
|
355
|
+
<tr>
|
356
|
+
<td align="left">Province</td>
|
357
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomProvince">RandomProvince</a></td>
|
358
|
+
<td align="left">Similar to address, picks a random province from a geojson flat file [Default US provinces]</td>
|
359
|
+
</tr>
|
360
|
+
<tr>
|
361
|
+
<td align="left">Zip code</td>
|
362
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomZipcode">RandomZipcode</a></td>
|
363
|
+
<td align="left">Similar to address, picks a random zipcode from a geojson flat file [Default US zipcodes]</td>
|
364
|
+
</tr>
|
365
|
+
<tr>
|
366
|
+
<td align="left">Phone number</td>
|
367
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomPhoneNumber">RandomPhoneNumber</a></td>
|
368
|
+
<td align="left">Randomizes a phone number while preserving locale specific formatting</td>
|
369
|
+
</tr>
|
370
|
+
</table><table>
|
371
|
+
<tr>
|
372
|
+
<th align="left">Content</th>
|
373
|
+
<th align="left">Name</th>
|
374
|
+
<th align="left">Description</th>
|
375
|
+
</tr>
|
376
|
+
<tr>
|
377
|
+
<td align="left">DateTime</td>
|
378
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDateTime">AnonymizeDateTime</a></td>
|
379
|
+
<td align="left">Anonymizes each field (except year and seconds) within the natural range of that field, depending on the true/false flag provided</td>
|
380
|
+
</tr>
|
381
|
+
<tr>
|
382
|
+
<td align="left">Time</td>
|
383
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeTime">AnonymizeTime</a></td>
|
384
|
+
<td align="left">Same as AnonymizeDateTime, except the returned object is of type 'Time'</td>
|
385
|
+
</tr>
|
386
|
+
<tr>
|
387
|
+
<td align="left">Date</td>
|
388
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDate">AnonymizeDate</a></td>
|
389
|
+
<td align="left">Anonymizes day and month within their natural ranges, depending on the true/false flag provided</td>
|
390
|
+
</tr>
|
391
|
+
<tr>
|
392
|
+
<td align="left">DateTimeDelta</td>
|
393
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateTimeDelta">DateTimeDelta</a></td>
|
394
|
+
<td align="left">Shifts the date and time randomly within the given range. By default, shifts the date within + or - 10 days and the time within 30 minutes.</td>
|
395
|
+
</tr>
|
396
|
+
<tr>
|
397
|
+
<td align="left">TimeDelta</td>
|
398
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/TimeDelta">TimeDelta</a></td>
|
399
|
+
<td align="left">Same as DateTimeDelta, except the returned object is of type 'Time'</td>
|
400
|
+
</tr>
|
401
|
+
<tr>
|
402
|
+
<td align="left">DateDelta</td>
|
403
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateDelta">DateDelta</a></td>
|
404
|
+
<td align="left">Shifts the date randomly within the given delta range. By default, shifts the date within + or - 10 days.</td>
|
405
|
+
</tr>
|
406
|
+
</table><table>
|
407
|
+
<tr>
|
408
|
+
<th align="left">Content</th>
|
409
|
+
<th align="left">Name</th>
|
410
|
+
<th align="left">Description</th>
|
411
|
+
</tr>
|
412
|
+
<tr>
|
413
|
+
<td align="left">Email</td>
|
414
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomEmail">RandomEmail</a></td>
|
415
|
+
<td align="left">Generates a random email address using the given HOSTNAME and TLD.</td>
|
416
|
+
</tr>
|
417
|
+
<tr>
|
418
|
+
<td align="left">Email</td>
|
419
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/GmailTemplate">GmailTemplate</a></td>
|
420
|
+
<td align="left">Generates a valid unique gmail address by taking advantage of the gmail + strategy</td>
|
421
|
+
</tr>
|
422
|
+
<tr>
|
423
|
+
<td align="left">Email</td>
|
424
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomMailinatorEmail">RandomMailinatorEmail</a></td>
|
425
|
+
<td align="left">Generates a random email address using the mailinator hostname.</td>
|
426
|
+
</tr>
|
427
|
+
</table><table>
|
428
|
+
<tr>
|
429
|
+
<th align="left">Content</th>
|
430
|
+
<th align="left">Name</th>
|
431
|
+
<th align="left">Description</th>
|
432
|
+
</tr>
|
433
|
+
<tr>
|
434
|
+
<td align="left">First name</td>
|
435
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFirstName">RandomFirstName</a></td>
|
436
|
+
<td align="left">Randomly picks a first name from the predefined list in the file. The default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
|
437
|
+
</tr>
|
438
|
+
<tr>
|
439
|
+
<td align="left">Last name</td>
|
440
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomLastName">RandomLastName</a></td>
|
441
|
+
<td align="left">Randomly picks a last name from the predefined list in the file. The default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
|
442
|
+
</tr>
|
443
|
+
<tr>
|
444
|
+
<td align="left">Full Name</td>
|
445
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFullName">RandomFullName</a></td>
|
446
|
+
<td align="left">Generates full name using the RandomFirstName and RandomLastName strategies.</td>
|
447
|
+
</tr>
|
448
|
+
<tr>
|
449
|
+
<td align="left">User name</td>
|
450
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUserName">RandomUserName</a></td>
|
451
|
+
<td align="left">Generates a random user name of the same length as the original user name.</td>
|
452
|
+
</tr>
|
453
|
+
</table>
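
The delta strategies listed in the tables above shift a value by a random offset instead of replacing it outright. As a minimal, standalone sketch of that idea (the `shift_date` helper is hypothetical and is not the gem's implementation; the 10-day default is taken from the DateDelta description), a date can be shifted like this:

```ruby
require 'date'

# Hypothetical helper, not part of the gem: shifts a Date by a random
# whole number of days in the range -max_days..max_days.
def shift_date(date, max_days = 10, rng = Random.new)
  date + rng.rand(-max_days..max_days)
end

original = Date.new(2012, 9, 15)
shifted  = shift_date(original)
puts "#{original} -> #{shifted}"
```

Because the offset is bounded, the shifted value stays close to the original, which keeps date-dependent data plausible after anonymization.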
|
412
454
|
|
413
|
-
### RandomFloatDelta
|
414
|
-
Shifts the current value randomly within given delta + and -. Default is 10.0
|
415
|
-
```ruby
|
416
|
-
anonymize('points').using FieldStrategy::RandomFloatDelta.new(2.5)
|
417
|
-
```
|
418
455
|
|
419
456
|
## Write your own field strategies
|
420
457
|
The `field` parameter in the following code is a [DataAnon::Core::Field](#dataanon-core-field)
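
As a rough, self-contained sketch of the shape such a strategy might take, assuming the contract is a single `anonymize(field)` method that returns the new value: the `Field` Struct below is a hypothetical stand-in for DataAnon::Core::Field, and `MaskAllButFirst` is a made-up strategy used only for illustration.

```ruby
# Hypothetical stand-in for DataAnon::Core::Field, for illustration only.
Field = Struct.new(:name, :value)

# Hypothetical custom strategy: keeps the first character and masks the rest.
class MaskAllButFirst
  def anonymize(field)
    value = field.value.to_s
    return value if value.empty?

    value[0] + '*' * (value.length - 1)
  end
end

puts MaskAllButFirst.new.anonymize(Field.new('first_name', 'Sunit'))
```

The strategy only reads `field.value` here; a real strategy could also use other attributes of the field to decide how to anonymize it.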
|
@@ -444,11 +481,11 @@ write your own anonymous field strategies within DSL,
|
|
444
481
|
## Default field strategies
|
445
482
|
|
446
483
|
```ruby
|
447
|
-
|
448
|
-
DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
|
484
|
+
DEFAULT_STRATEGIES = {:string => FieldStrategy::RandomString.new,
|
449
485
|
:fixnum => FieldStrategy::RandomIntegerDelta.new(5),
|
450
486
|
:bignum => FieldStrategy::RandomIntegerDelta.new(5000),
|
451
487
|
:float => FieldStrategy::RandomFloatDelta.new(5.0),
|
488
|
+
:bigdecimal => FieldStrategy::RandomBigDecimalDelta.new(500.0),
|
452
489
|
:datetime => FieldStrategy::DateTimeDelta.new,
|
453
490
|
:time => FieldStrategy::TimeDelta.new,
|
454
491
|
:date => FieldStrategy::DateDelta.new,
|
@@ -457,7 +494,7 @@ DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
|
|
457
494
|
}
|
458
495
|
```
|
459
496
|
|
460
|
-
Overriding default field strategies
|
497
|
+
The default field strategies can be overridden; the same mechanism can be used to provide a default strategy for a data type that is missing from the list.
|
461
498
|
|
462
499
|
```ruby
|
463
500
|
database 'Chinook' do
|
@@ -497,7 +534,7 @@ DataAnon::Utils::Logging.logger.level = Logger::INFO
|
|
497
534
|
## Credits
|
498
535
|
|
499
536
|
- [ThoughtWorks Inc](http://www.thoughtworks.com), for allowing us to build this tool and make it open source.
|
500
|
-
- [
|
537
|
+
- [Panda](https://twitter.com/sarbashrestha) for reviewing the documentation.
|
501
538
|
- [Dan Abel](http://www.linkedin.com/pub/dan-abel/0/61b/9b0) for introducing me to Blacklist and Whitelist approach for data anonymization.
|
502
539
|
- [Chirag Doshi](https://twitter.com/chiragsdoshi) for encouraging me to get this done.
|
503
540
|
- [Aditya Karle](https://twitter.com/adityakarle) for the Logo. (Coming Soon...)
|