data-anonymization 0.3.0 → 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +2 -1
- data/.rvmrc +1 -1
- data/.travis.yml +2 -0
- data/Gemfile +2 -0
- data/README.md +295 -258
- data/bin/datanon +57 -0
- data/data-anonymization.gemspec +2 -1
- data/examples/blacklist_dsl.rb +42 -0
- data/examples/mongodb_blacklist_dsl.rb +38 -0
- data/examples/mongodb_whitelist_dsl.rb +44 -0
- data/examples/whitelist_dsl.rb +63 -0
- data/lib/core/database.rb +21 -3
- data/lib/core/field.rb +5 -2
- data/lib/core/fields_missing_strategy.rb +30 -0
- data/lib/core/table_errors.rb +32 -0
- data/lib/data-anonymization.rb +11 -0
- data/lib/parallel/table.rb +8 -1
- data/lib/strategy/base.rb +35 -14
- data/lib/strategy/blacklist.rb +1 -1
- data/lib/strategy/field/anonymize_array.rb +28 -0
- data/lib/strategy/field/contact/random_address.rb +12 -0
- data/lib/strategy/field/contact/random_city.rb +12 -0
- data/lib/strategy/field/contact/random_phone_number.rb +4 -0
- data/lib/strategy/field/contact/random_province.rb +12 -0
- data/lib/strategy/field/contact/random_zipcode.rb +12 -0
- data/lib/strategy/field/datetime/anonymize_date.rb +15 -0
- data/lib/strategy/field/datetime/anonymize_datetime.rb +19 -0
- data/lib/strategy/field/datetime/anonymize_time.rb +19 -0
- data/lib/strategy/field/datetime/date_delta.rb +10 -0
- data/lib/strategy/field/datetime/date_time_delta.rb +9 -0
- data/lib/strategy/field/datetime/time_delta.rb +8 -0
- data/lib/strategy/field/default_anon.rb +4 -1
- data/lib/strategy/field/email/gmail_template.rb +8 -0
- data/lib/strategy/field/email/random_email.rb +7 -0
- data/lib/strategy/field/email/random_mailinator_email.rb +5 -0
- data/lib/strategy/field/fields.rb +4 -0
- data/lib/strategy/field/name/random_first_name.rb +10 -0
- data/lib/strategy/field/name/random_full_name.rb +10 -2
- data/lib/strategy/field/name/random_last_name.rb +9 -0
- data/lib/strategy/field/name/random_user_name.rb +5 -0
- data/lib/strategy/field/number/random_big_decimal_delta.rb +6 -0
- data/lib/strategy/field/number/random_float.rb +4 -0
- data/lib/strategy/field/number/random_float_delta.rb +6 -0
- data/lib/strategy/field/number/random_integer.rb +4 -0
- data/lib/strategy/field/number/random_integer_delta.rb +6 -0
- data/lib/strategy/field/string/formatted_string_numbers.rb +10 -6
- data/lib/strategy/field/string/lorem_ipsum.rb +9 -0
- data/lib/strategy/field/string/random_formatted_string.rb +39 -0
- data/lib/strategy/field/string/random_string.rb +6 -0
- data/lib/strategy/field/string/random_url.rb +7 -1
- data/lib/strategy/field/string/select_from_database.rb +7 -5
- data/lib/strategy/field/string/select_from_file.rb +7 -0
- data/lib/strategy/field/string/select_from_list.rb +8 -0
- data/lib/strategy/field/string/string_template.rb +11 -0
- data/lib/strategy/mongodb/anonymize_field.rb +44 -0
- data/lib/strategy/mongodb/blacklist.rb +29 -0
- data/lib/strategy/mongodb/whitelist.rb +62 -0
- data/lib/strategy/strategies.rb +10 -1
- data/lib/strategy/whitelist.rb +7 -2
- data/lib/thor/helpers/mongodb_dsl_generator.rb +66 -0
- data/lib/thor/helpers/rdbms_dsl_generator.rb +36 -0
- data/lib/thor/templates/mongodb_whitelist_template.erb +15 -0
- data/lib/thor/templates/whitelist_template.erb +21 -0
- data/lib/utils/database.rb +4 -0
- data/lib/utils/parallel_progress_bar.rb +24 -0
- data/lib/utils/progress_bar.rb +34 -22
- data/lib/utils/random_string.rb +3 -2
- data/lib/utils/random_string_chars_only.rb +3 -5
- data/lib/utils/template_helper.rb +44 -0
- data/lib/version.rb +1 -1
- data/spec/acceptance/mongodb_blacklist_spec.rb +75 -0
- data/spec/acceptance/mongodb_whitelist_spec.rb +107 -0
- data/spec/core/fields_missing_strategy_spec.rb +26 -0
- data/spec/strategy/field/name/random_first_name_spec.rb +1 -1
- data/spec/strategy/field/name/random_full_name_spec.rb +12 -7
- data/spec/strategy/field/name/random_last_name_spec.rb +1 -1
- data/spec/strategy/field/string/random_formatted_string_spec.rb +39 -0
- data/spec/strategy/field/string/select_from_file_spec.rb +21 -0
- data/spec/strategy/mongodb/anonymize_field_spec.rb +52 -0
- data/spec/utils/random_float_spec.rb +12 -0
- data/spec/utils/random_string_char_only_spec.rb +12 -0
- data/spec/utils/template_helper_spec.rb +14 -0
- metadata +56 -6
- data/blacklist_dsl.rb +0 -17
- data/blacklist_nosql_dsl.rb +0 -36
- data/whitelist_dsl.rb +0 -42
data/.gitignore
CHANGED
data/.rvmrc
CHANGED
@@ -1 +1 @@
|
|
1
|
-
rvm use 1.9.3
|
1
|
+
rvm use 1.9.3@data-anon --create
|
data/.travis.yml
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,7 +1,12 @@
|
|
1
1
|
# Data::Anonymization
|
2
2
|
Tool to create anonymized production data dump to use for PERF and other TEST environments.
|
3
3
|
|
4
|
+
[<img src="https://secure.travis-ci.org/sunitparekh/data-anonymization.png?branch=master">](http://travis-ci.org/sunitparekh/data-anonymization)
|
5
|
+
[<img src="https://gemnasium.com/sunitparekh/data-anonymization.png?travis">](https://gemnasium.com/sunitparekh/data-anonymization)
|
6
|
+
[<img src="https://codeclimate.com/badge.png">](https://codeclimate.com/github/sunitparekh/data-anonymization)
|
7
|
+
|
4
8
|
## Getting started
|
9
|
+
|
5
10
|
Install gem using:
|
6
11
|
|
7
12
|
$ gem install data-anonymization
|
@@ -39,15 +44,42 @@ Run using:
|
|
39
44
|
|
40
45
|
$ ruby my_dsl.rb
|
41
46
|
|
47
|
+
Liked it? please share
|
48
|
+
|
49
|
+
[<img src="https://si0.twimg.com/a/1346446870/images/resources/twitter-bird-light-bgs.png" height="35" width="35">](https://twitter.com/share?text=A+simple+ruby+DSL+based+data+anonymization&url=http:%2F%2Fsunitparekh.github.com%2Fdata-anonymization&via=dataanon&hashtags=dataanon)
|
50
|
+
|
42
51
|
## Examples
|
43
52
|
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
53
|
+
SQLite database
|
54
|
+
|
55
|
+
1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/whitelist_dsl.rb)
|
56
|
+
2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/blacklist_dsl.rb)
|
57
|
+
|
58
|
+
MongoDB
|
59
|
+
|
60
|
+
1. [Whitelist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_whitelist_dsl.rb)
|
61
|
+
2. [Blacklist](https://github.com/sunitparekh/data-anonymization/blob/master/examples/mongodb_blacklist_dsl.rb)
|
62
|
+
|
63
|
+
Postgresql database having **composite primary key**
|
64
|
+
|
65
|
+
1. [Whitelist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_whitelist.rb)
|
66
|
+
2. [Blacklist](https://github.com/sunitparekh/test-anonymization/blob/master/dell_blacklist.rb)
|
67
|
+
|
48
68
|
|
49
69
|
## Changelog
|
50
70
|
|
71
|
+
#### 0.5.0 (rc1 released. install gem using --pre option)
|
72
|
+
|
73
|
+
Major changes:
|
74
|
+
|
75
|
+
1. MongoDB support
|
76
|
+
2. Command line utility to generate whitelist DSL for RDBMS & MongoDB (reduces pain for writing whitelist dsl)
|
77
|
+
3. Added support for reporting fields missing mapping in case of whitelist
|
78
|
+
4. Errors reported at the end of process. Job doesn't fail for a single error.
|
79
|
+
|
80
|
+
|
81
|
+
Please see the [Github 0.5.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=2&state=open) for more details on changes/fixes in release 0.5.0
|
82
|
+
|
51
83
|
#### 0.3.0 (Sep 4, 2012)
|
52
84
|
|
53
85
|
Major changes:
|
@@ -56,7 +88,7 @@ Major changes:
|
|
56
88
|
2. Change in default String strategy from LoremIpsum to RandomString based on end user feedback.
|
57
89
|
3. Fixed issue with table column name 'type' as this is default name for STI in activerecord.
|
58
90
|
|
59
|
-
Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&
|
91
|
+
Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data-anonymization/issues?milestone=1&state=closed) for more details on changes/fixes in release 0.3.0
|
60
92
|
|
61
93
|
#### 0.2.0 (August 16, 2012)
|
62
94
|
|
@@ -71,15 +103,10 @@ Please see the [Github 0.3.0 milestone page](https://github.com/sunitparekh/data
|
|
71
103
|
|
72
104
|
## Roadmap
|
73
105
|
|
74
|
-
|
75
|
-
|
76
|
-
1. MongoDB anonymization support (NoSQL document based database support)
|
106
|
+
MVP done. Fix defects and support queries, suggestions, enhancements logged in Github issues :-)
|
77
107
|
|
78
|
-
|
108
|
+
## Share feedback
|
79
109
|
|
80
|
-
1. Generate DSL from database and build schema from source as part of Whitelist approach.
|
81
|
-
|
82
|
-
#### Share feedback
|
83
110
|
Please use Github [issues](https://github.com/sunitparekh/data-anonymization/issues) to share feedback, feature suggestions and report issues.
|
84
111
|
|
85
112
|
## What is data anonymization?
|
@@ -126,295 +153,305 @@ end
|
|
126
153
|
2. Change [default field strategies](#default-field-strategies) to avoid using same strategy again and again in your DSL.
|
127
154
|
3. To run anonymization in parallel at Table level, provided no FK constraint on tables use DataAnon::Parallel::Table strategy
|
128
155
|
|
129
|
-
##
|
130
|
-
Currently provides capability of running anonymization in parallel at table level provided no FK constraints on tables.
|
131
|
-
It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
|
132
|
-
By default it starts multiple parallel ruby processes processing table one by one.
|
133
|
-
```ruby
|
134
|
-
database 'DellStore' do
|
135
|
-
strategy DataAnon::Strategy::Whitelist
|
136
|
-
execution_strategy DataAnon::Parallel::Table # by default sequential table processing
|
137
|
-
...
|
138
|
-
end
|
139
|
-
```
|
156
|
+
## DSL Generation
|
140
157
|
|
158
|
+
We provide a command line tool to generate whitelist scripts for RDBMS and NoSQL databases. You supply the connection details for the database, and the tool generates a script by analyzing the schema. The examples below show how to use the tool to generate scripts for RDBMS and NoSQL datastores.
|
141
159
|
|
142
|
-
|
143
|
-
The object that gets passed along with the field strategies.
|
144
|
-
|
145
|
-
has following attribute accessor
|
160
|
+
When you install the data-anonymization tool, the **datanon** command becomes available on the terminal. Running **datanon --help** should print the following:
|
146
161
|
|
147
|
-
|
148
|
-
|
149
|
-
- `row_number` current row number
|
150
|
-
- `ar_record` active record of the current row under processing
|
151
|
-
|
152
|
-
## Field Strategies
|
162
|
+
```
|
163
|
+
Tasks:
|
153
164
|
|
154
|
-
|
155
|
-
|
165
|
+
datanon generate_mongo_dsl -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a Mongo DB using the database schema
|
166
|
+
datanon generate_rdbms_dsl -a, --adapter=ADAPTER -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a RDBMS database using the database schema
|
167
|
+
datanon help [TASK] # Describe available tasks or one specific task
|
156
168
|
|
157
|
-
```ruby
|
158
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new
|
159
|
-
```
|
160
|
-
```ruby
|
161
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new("very large string....")
|
162
|
-
```
|
163
|
-
```ruby
|
164
|
-
anonymize('UserName').using FieldStrategy::LoremIpsum.new(File.read('my_file.txt'))
|
165
169
|
```
|
166
170
|
|
167
|
-
###
|
168
|
-
Generates random string of same length.
|
169
|
-
```ruby
|
170
|
-
anonymize('UserName').using FieldStrategy::RandomString.new
|
171
|
-
```
|
171
|
+
### RDBMS whitelist generation
|
172
172
|
|
173
|
-
|
174
|
-
Simple string evaluation within [DataAnon::Core::Field](#dataanon-core-field) context. Can be used for email, username anonymization.
|
175
|
-
Make sure to put the string in 'single quote' else it will get evaluated inline.
|
176
|
-
```ruby
|
177
|
-
anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
|
178
|
-
```
|
179
|
-
```ruby
|
180
|
-
anonymize('Email').using FieldStrategy::StringTemplate.new('valid.address+#{row_number}@gmail.com')
|
181
|
-
```
|
182
|
-
```ruby
|
183
|
-
anonymize('Email').using FieldStrategy::StringTemplate.new('useremail#{row_number}@mailinator.com')
|
184
|
-
```
|
173
|
+
The gem uses the ActiveRecord (AR) abstraction to connect to relational databases. You can generate a whitelist script in seconds for any relational database supported by ActiveRecord. To do so, use the following command:
|
185
174
|
|
186
|
-
### SelectFromList
|
187
|
-
Select randomly one of the values specified.
|
188
|
-
```ruby
|
189
|
-
anonymize('State').using FieldStrategy::SelectFromList.new(['New York','Georgia',...])
|
190
|
-
```
|
191
|
-
```ruby
|
192
|
-
anonymize('NameTitle').using FieldStrategy::SelectFromList.new(['Mr','Mrs','Dr',...])
|
193
175
|
```
|
176
|
+
datanon generate_rdbms_dsl [options]
|
194
177
|
|
195
|
-
### SelectFromFile
|
196
|
-
Similar to SelectFromList only difference is the list of values are picked up from file. Classical usage is like states field anonymization.
|
197
|
-
```ruby
|
198
|
-
anonymize('State').using FieldStrategy::SelectFromFile.new('states.txt')
|
199
178
|
```
|
200
179
|
|
201
|
-
|
202
|
-
Keeping the format same it changes each digit in the string with random digit.
|
203
|
-
```ruby
|
204
|
-
anonymize('CreditCardNumber').using FieldStrategy::FormattedStringNumber.new
|
205
|
-
```
|
180
|
+
The available options are:
|
206
181
|
|
207
|
-
|
208
|
-
|
209
|
-
|
210
|
-
|
211
|
-
|
212
|
-
|
182
|
+
1. adapter(-a) : The activerecord adapter to use to connect to the database (eg. mysql2, postgresql)
|
183
|
+
2. host(-h) : DB host name or IP address
|
184
|
+
3. database(-d) : The name of the database to generate the whitelist script for
|
185
|
+
4. username(-u) : Username for DB authentication
|
186
|
+
5. password(-w) : Password for DB authentication
|
187
|
+
6. port(-p) : The port the database service is running on. The default port provided by AR is used if nothing is specified.
|
213
188
|
|
214
|
-
|
215
|
-
Generates address using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
216
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
217
|
-
```ruby
|
218
|
-
anonymize('Address').using FieldStrategy::RandomAddress.region_US
|
219
|
-
```
|
220
|
-
```ruby
|
221
|
-
anonymize('Address').using FieldStrategy::RandomAddress.region_UK
|
222
|
-
```
|
223
|
-
```ruby
|
224
|
-
# get your own geo_json file and use it
|
225
|
-
anonymize('Address').using FieldStrategy::RandomAddress.new('my_geo_json.json')
|
226
|
-
```
|
189
|
+
The adapter, host and database options are mandatory. The others are optional.
|
227
190
|
|
228
|
-
|
229
|
-
Similar to RandomAddress, generates city using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
230
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
231
|
-
```ruby
|
232
|
-
anonymize('City').using FieldStrategy::RandomCity.region_US
|
233
|
-
```
|
234
|
-
```ruby
|
235
|
-
anonymize('City').using FieldStrategy::RandomCity.region_UK
|
236
|
-
```
|
237
|
-
```ruby
|
238
|
-
# get your own geo_json file and use it
|
239
|
-
anonymize('City').using FieldStrategy::RandomCity.new('my_geo_json.json')
|
240
|
-
```
|
191
|
+
A few examples of the command are shown below:
|
241
192
|
|
242
|
-
### RandomProvince
|
243
|
-
Similar to RandomAddress, generates province using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
244
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
245
|
-
```ruby
|
246
|
-
anonymize('Province').using FieldStrategy::RandomProvince.region_US
|
247
|
-
```
|
248
|
-
```ruby
|
249
|
-
anonymize('Province').using FieldStrategy::RandomProvince.region_UK
|
250
|
-
```
|
251
|
-
```ruby
|
252
|
-
# get your own geo_json file and use it
|
253
|
-
anonymize('Province').using FieldStrategy::RandomProvince.new('my_geo_json.json')
|
254
193
|
```
|
194
|
+
datanon generate_rdbms_dsl -a mysql2 -h db.host.com -p 3306 -d production_db -u root -w password
|
255
195
|
|
256
|
-
|
257
|
-
Similar to RandomAddress, generates zipcode using the [geojson](http://www.geojson.org/geojson-spec.html) format file. The default US/UK file chooses randomly from 300 addresses.
|
258
|
-
The large data set can be downloaded from [here](http://www.infochimps.com/datasets/simplegeo-places-dump)
|
259
|
-
```ruby
|
260
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.region_US
|
261
|
-
```
|
262
|
-
```ruby
|
263
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.region_UK
|
264
|
-
```
|
265
|
-
```ruby
|
266
|
-
# get your own geo_json file and use it
|
267
|
-
anonymize('Address').using FieldStrategy::RandomZipcode.new('my_geo_json.json')
|
268
|
-
```
|
196
|
+
datanon generate_rdbms_dsl -a postgresql -h 123.456.7.8 -d production_db
|
269
197
|
|
270
|
-
### RandomPhoneNumber
|
271
|
-
Keeping the format same it changes each digit in the string with random digit.
|
272
|
-
```ruby
|
273
|
-
anonymize('PhoneNumber').using FieldStrategy::RandomPhoneNumber.new
|
274
198
|
```
|
275
199
|
|
276
|
-
|
277
|
-
Anonymizes each field(except year and seconds) within the natural range (e.g. hour between 1-24 and day within the month) based on true/false
|
278
|
-
input for that field. By default, all fields are anonymized.
|
279
|
-
```ruby
|
280
|
-
#anonymizes month and hour fields, leaving the day and minute fields untouched
|
281
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.new(true,false,true,false)
|
282
|
-
```
|
200
|
+
The relevant db gems must be installed so that AR has the adapters required to establish the connection to the databases. The script generates a file named **rdbms_whitelist_generated.rb** in the same location as the project.
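
To give a rough idea of what schema-driven generation involves, here is a minimal sketch that renders a whitelist skeleton from a table-to-columns hash using ERB. The `SCHEMA` hash, the `render_whitelist` helper and the template text are hypothetical illustrations, not the gem's actual generator or its shipped template; the real tool reads the schema through ActiveRecord. Treating the first column as the primary key is also only an assumption made for the example.

```ruby
require 'erb'

# Hypothetical schema: table name => column names. The real tool would
# read this from the database through ActiveRecord.
SCHEMA = {
  'customers' => %w[id first_name email],
  'orders'    => %w[id customer_id total]
}.freeze

# Illustrative template only, not the one shipped with the gem.
TEMPLATE = <<~'TPL'
  database '<%= db_name %>' do
    strategy DataAnon::Strategy::Whitelist
  <% schema.each do |table, columns| -%>

    table '<%= table %>' do
      primary_key '<%= columns.first %>'
      whitelist <%= columns.map { |c| "'#{c}'" }.join(', ') %>
    end
  <% end -%>
  end
TPL

# Render the whitelist skeleton for the given database name and schema.
# Assumption: the first column of each table is its primary key.
def render_whitelist(db_name, schema)
  ERB.new(TEMPLATE, trim_mode: '-').result_with_hash(db_name: db_name, schema: schema)
end

puts render_whitelist('production_db', SCHEMA)
```

Every column starts out whitelisted in this sketch; the idea is that you then edit the generated file and swap sensitive columns over to `anonymize` calls.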
|
283
201
|
|
284
|
-
|
285
|
-
```ruby
|
286
|
-
# anonymizes only the month field
|
287
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_month
|
288
|
-
# anonymizes only the day field
|
289
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_day
|
290
|
-
# anonymizes only the hour field
|
291
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_hour
|
292
|
-
# anonymizes only the minute field
|
293
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_minute
|
294
|
-
```
|
202
|
+
### MongoDB whitelist generation
|
295
203
|
|
296
|
-
|
297
|
-
Exactly similar to the above DateTime strategy, except that the returned object is of type `Time`
|
204
|
+
Similar to relational databases, a whitelist script for MongoDB can be generated by analysing the database structure:
|
298
205
|
|
299
|
-
### AnonymizeDate
|
300
|
-
Anonmizes day and month fields within natural range based on true/false input for that field. By defaut both fields are
|
301
|
-
anonymized
|
302
|
-
```ruby
|
303
|
-
# anonymizes month and leaves day unchanged
|
304
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new(true,false)
|
305
206
|
```
|
207
|
+
datanon generate_mongo_dsl [options]
|
306
208
|
|
307
|
-
In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
|
308
|
-
```ruby
|
309
|
-
# anonymizes only the month field
|
310
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_month
|
311
|
-
# anonymizes only the day field
|
312
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_day
|
313
209
|
```
|
314
210
|
|
315
|
-
|
316
|
-
Shifts data randomly within given range. Default shifts date within 10 days + or - and shifts time within 30 minutes.
|
317
|
-
```ruby
|
318
|
-
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new
|
319
|
-
```
|
320
|
-
```ruby
|
321
|
-
# shifts date within 20 days and time within 50 minutes
|
322
|
-
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new(20, 50)
|
323
|
-
```
|
211
|
+
The available options are:
|
324
212
|
|
325
|
-
|
326
|
-
|
213
|
+
1. host(-h) : DB host name or IP address
|
214
|
+
2. database(-d) : The name of the database to generate the whitelist script for
|
215
|
+
3. username(-u) : Username for DB authentication
|
216
|
+
4. password(-w) : Password for DB authentication
|
217
|
+
5. port(-p) : The port the database service is running on.
|
218
|
+
6. whitelist patterns(-r): A regular expression that can be used to match records in the database to list as whitelisted fields in the generated script.
|
327
219
|
|
328
|
-
|
220
|
+
The host and database options are mandatory. The others are optional.
|
329
221
|
|
330
|
-
|
331
|
-
```ruby
|
332
|
-
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new
|
333
|
-
```
|
334
|
-
```ruby
|
335
|
-
# shifts date within 25 days
|
336
|
-
anonymize('DateOfBirth').using FieldStrategy::DateDelta.new(25)
|
337
|
-
```
|
222
|
+
A few examples of the command are shown below:
|
338
223
|
|
339
|
-
### RandomEmail
|
340
|
-
Generates email randomly using the given HOSTNAME and TLD.
|
341
|
-
By defaults generates hostname randomly along with email id.
|
342
|
-
```ruby
|
343
|
-
anonymize('Email').using FieldStrategy::RandomEmail.new('thoughtworks','com')
|
344
224
|
```
|
225
|
+
datanon generate_mongo_dsl -h db.host.com -d production_db -u root -w password
|
345
226
|
|
346
|
-
|
347
|
-
Generates a valid unique gmail address by taking advantage of the gmail + strategy. Takes in a valid gmail username and
|
348
|
-
generates emails of the form username+<number>@gmail.com
|
349
|
-
```ruby
|
350
|
-
anonymize('Email').using FieldStrategy::GmailTemplate.new('username')
|
351
|
-
```
|
227
|
+
datanon generate_mongo_dsl -h 123.456.7.8 -d production_db
|
352
228
|
|
353
|
-
### RandomMailinatorEmail
|
354
|
-
Generates random email using mailinator hostname. e.g. <randomstring>@mailinator.com
|
355
|
-
```ruby
|
356
|
-
anonymize('Email').using FieldStrategy::RandomMailinatorEmail.new
|
357
229
|
```
|
358
230
|
|
359
|
-
|
360
|
-
Generates random user name of same length as original user name.
|
361
|
-
```ruby
|
362
|
-
anonymize('Username').using FieldStrategy::RandomUserName.new
|
363
|
-
```
|
231
|
+
The **mongo** gem is required, as it provides the MongoDB driver. The script generates a file named **mongodb_whitelist_generated.rb** in the same location as the project.
|
364
232
|
|
365
|
-
### RandomFirstName
|
366
|
-
Randomly picks up first name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt) is part of the gem.
|
367
|
-
File should contain first name on each line.
|
368
|
-
```ruby
|
369
|
-
anonymize('FirstName').using FieldStrategy::RandomFirstName.new
|
370
|
-
```
|
371
|
-
```ruby
|
372
|
-
anonymize('FirstName').using FieldStrategy::RandomFirstName.new('my_first_names.txt')
|
373
|
-
```
|
374
233
|
|
375
|
-
### RandomLastName
|
376
|
-
Randomly picks up last name from the predefined list in the file. Default [file](https://raw.github.com/sunitparekh/data-anonymization/master/resources/last_names.txt) is part of the gem.
|
377
|
-
File should contain last name on each line.
|
378
|
-
```ruby
|
379
|
-
anonymize('LastName').using FieldStrategy::RandomLastName.new
|
380
|
-
```
|
381
|
-
```ruby
|
382
|
-
anonymize('LastName').using FieldStrategy::RandomLastName.new('my_last_names.txt')
|
383
|
-
```
|
384
234
|
|
385
|
-
|
386
|
-
|
387
|
-
It
|
388
|
-
|
389
|
-
anonymize('FullName').using FieldStrategy::RandomFullName.new
|
390
|
-
```
|
391
|
-
```ruby
|
392
|
-
anonymize('FullName').using FieldStrategy::RandomLastName.new('my_first_names.txt', 'my_last_names.txt')
|
393
|
-
```
|
235
|
+
## Running in Parallel
|
236
|
+
Anonymization can currently run in parallel at the table level, provided there are no FK constraints on the tables.
|
237
|
+
It uses [Parallel gem](https://github.com/grosser/parallel) provided by Michael Grosser.
|
238
|
+
By default it starts multiple parallel Ruby processes, each processing tables one by one.
|
394
239
|
|
395
|
-
### RandomInteger
|
396
|
-
Generates random integer number between given two numbers. Default range is 0 to 100.
|
397
240
|
```ruby
|
398
|
-
|
241
|
+
database 'DellStore' do
|
242
|
+
strategy DataAnon::Strategy::Whitelist
|
243
|
+
execution_strategy DataAnon::Parallel::Table # by default sequential table processing
|
244
|
+
...
|
245
|
+
end
|
399
246
|
```
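
The idea of table-level parallelism can be sketched with a small worker pool. Note that this sketch uses plain Ruby threads from the standard library as a stand-in, whereas the gem itself relies on the Parallel gem and separate processes. `anonymize_table` and `process_tables_in_parallel` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the per-table anonymization work.
def anonymize_table(name)
  "#{name}: done"
end

# Process each table with a small pool of worker threads. Each worker
# takes the next table off the queue, so every table is handled by
# exactly one worker. This only makes sense when tables are independent
# of each other (for example, no FK constraints between them).
def process_tables_in_parallel(tables, workers: 4)
  queue = Queue.new
  tables.each { |t| queue << t }
  queue.close # pop returns nil once the closed queue is empty
  results = Queue.new

  Array.new(workers) do
    Thread.new do
      while (table = queue.pop)
        results << anonymize_table(table)
      end
    end
  end.each(&:join)

  Array.new(results.size) { results.pop }
end

puts process_tables_in_parallel(%w[customers orders products]).sort
```

Results arrive in whatever order the workers finish, which is why the usage line sorts them before printing.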
|
400
247
|
|
401
|
-
### RandomIntegerDelta
|
402
|
-
Shifts the current value randomly within given delta + and -. Default is 10
|
403
|
-
```ruby
|
404
|
-
anonymize('Age').using FieldStrategy::RandomIntegerDelta.new(2)
|
405
|
-
```
|
406
248
|
|
407
|
-
|
408
|
-
|
409
|
-
|
410
|
-
|
411
|
-
|
249
|
+
## DataAnon::Core::Field
|
250
|
+
The object that gets passed along with the field strategies.
|
251
|
+
|
252
|
+
It has the following attribute accessors:
|
253
|
+
|
254
|
+
- `name` current field/column name
|
255
|
+
- `value` current field/column value
|
256
|
+
- `row_number` current row number
|
257
|
+
- `ar_record` active record of the current row under processing
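
As an illustration of how these accessors might be used, here is a minimal sketch of a custom strategy that builds a per-row email address from the field name and row number. The `Field` struct is a hypothetical stand-in for `DataAnon::Core::Field`, and `RowNumberEmail` and its `anonymize(field)` method are assumptions made for the example rather than part of the gem.

```ruby
# Hypothetical stand-in for DataAnon::Core::Field, exposing the same
# four accessors listed above.
Field = Struct.new(:name, :value, :row_number, :ar_record)

# Illustrative custom strategy: ignores the original value and builds
# an address that is unique per row.
class RowNumberEmail
  def initialize(domain = 'example.com')
    @domain = domain
  end

  # Assumed interface: receives the field object, returns the new value.
  def anonymize(field)
    "#{field.name.downcase}#{field.row_number}@#{@domain}"
  end
end

field = Field.new('Email', 'real.person@company.com', 42, nil)
puts RowNumberEmail.new.anonymize(field)
```

Because the result depends only on `name` and `row_number`, the output is deterministic for a given row, which can be handy when anonymized dumps need to be reproducible.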
|
258
|
+
|
259
|
+
## Field Strategies
|
260
|
+
|
261
|
+
|
262
|
+
<table>
|
263
|
+
<tr>
|
264
|
+
<th align="left">Content</th>
|
265
|
+
<th align="left">Name</th>
|
266
|
+
<th align="left">Description</th>
|
267
|
+
</tr>
|
268
|
+
<tr>
|
269
|
+
<td align="left">Text</td>
|
270
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/LoremIpsum">LoremIpsum</a></td>
|
271
|
+
<td align="left">Generates a random Lorem Ipsum string</td>
|
272
|
+
</tr>
|
273
|
+
<tr>
|
274
|
+
<td align="left">Text</td>
|
275
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomString">RandomString</a></td>
|
276
|
+
<td align="left">Generates a random string of equal length</td>
|
277
|
+
</tr>
|
278
|
+
<tr>
|
279
|
+
<td align="left">Text</td>
|
280
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/StringTemplate">StringTemplate</a></td>
|
281
|
+
<td align="left">Generates a string based on provided template</td>
|
282
|
+
</tr>
|
283
|
+
<tr>
|
284
|
+
<td align="left">Text</td>
|
285
|
+
<td align="left"><a>SelectFromList</a></td>
|
286
|
+
<td align="left">Randomly selects a string from a provided list</td>
|
287
|
+
</tr>
|
288
|
+
<tr>
|
289
|
+
<td align="left">Text</td>
|
290
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromFile">SelectFromFile</a></td>
|
291
|
+
<td align="left">Randomly selects a string from a provided file</td>
|
292
|
+
</tr>
|
293
|
+
<tr>
|
294
|
+
<td align="left">Text</td>
|
295
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/FormattedStringNumber">FormattedStringNumber</a></td>
|
296
|
+
<td align="left">Randomize digits in a string while maintaining the format</td>
|
297
|
+
</tr>
|
298
|
+
<tr>
|
299
|
+
<td align="left">Text</td>
|
300
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/SelectFromDatabase">SelectFromDatabase</a></td>
|
301
|
+
<td align="left">Selects randomly from the result of a query on a database</td>
|
302
|
+
</tr>
|
303
|
+
<tr>
|
304
|
+
<td align="left">Text</td>
|
305
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUrl">RandomURL</a></td>
|
306
|
+
<td align="left">Anonymizes a URL while maintaining the structure</td>
|
307
|
+
</tr>
|
308
|
+
</table><table>
|
309
|
+
<tr>
|
310
|
+
<th align="left">Content</th>
|
311
|
+
<th align="left">Name</th>
|
312
|
+
<th align="left">Description</th>
|
313
|
+
</tr>
|
314
|
+
<tr>
|
315
|
+
<td align="left">Number</td>
|
316
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomInteger">RandomInteger</a></td>
|
317
|
+
<td align="left">Generates a random integer between provided limits (default 0 to 100)</td>
|
318
|
+
</tr>
|
319
|
+
<tr>
|
320
|
+
<td align="left">Number</td>
|
321
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomIntegerDelta">RandomIntegerDelta</a></td>
|
322
|
+
<td align="left">Generates a random integer within -delta and delta of original integer</td>
|
323
|
+
</tr>
|
324
|
+
<tr>
|
325
|
+
<td align="left">Number</td>
|
326
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFloat">RandomFloat</a></td>
|
327
|
+
<td align="left">Generates a random float between provided limits (default 0.0 to 100.0)</td>
|
328
|
+
</tr>
|
329
|
+
<tr>
|
330
|
+
<td align="left">Number</td>
|
331
|
+
<td align="left"><a>RandomFloatDelta</a></td>
|
332
|
+
<td align="left">Generates a random float within -delta and delta of original float</td>
|
333
|
+
</tr>
|
334
|
+
<tr>
|
335
|
+
<td align="left">Number</td>
|
336
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomBigDecimalDelta">RandomBigDecimalDelta</a></td>
|
337
|
+
<td align="left">Similar to previous but creates a big decimal object</td>
|
338
|
+
</tr>
|
339
|
+
</table><table>
|
340
|
+
<tr>
|
341
|
+
<th align="left">Content</th>
|
342
|
+
<th align="left">Name</th>
|
343
|
+
<th align="left">Description</th>
|
344
|
+
</tr>
|
345
|
+
<tr>
|
346
|
+
<td align="left">Address</td>
|
347
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomAddress">RandomAddress</a></td>
|
348
|
+
<td align="left">Randomly selects an address from a geojson flat file [Default US address]</td>
|
349
|
+
</tr>
|
350
|
+
<tr>
|
351
|
+
<td align="left">City</td>
|
352
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomCity">RandomCity</a></td>
|
353
|
+
<td align="left">Similar to address, picks a random city from a geojson flat file [Default US cities]</td>
|
354
|
+
</tr>
|
355
|
+
<tr>
|
356
|
+
<td align="left">Province</td>
|
357
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomProvince">RandomProvince</a></td>
|
358
|
+
<td align="left">Similar to address, picks a random province from a geojson flat file [Default US provinces]</td>
|
359
|
+
</tr>
|
360
|
+
<tr>
|
361
|
+
<td align="left">Zip code</td>
|
362
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomZipcode">RandomZipcode</a></td>
|
363
|
+
<td align="left">Similar to address, picks a random zipcode from a geojson flat file [Default US zipcodes]</td>
|
364
|
+
</tr>
|
365
|
+
<tr>
|
366
|
+
<td align="left">Phone number</td>
|
367
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomPhoneNumber">RandomPhoneNumber</a></td>
|
368
|
+
<td align="left">Randomizes a phone number while preserving locale-specific formatting</td>
|
369
|
+
</tr>
|
370
|
+
</table><table>
|
371
|
+
<tr>
|
372
|
+
<th align="left">Content</th>
|
373
|
+
<th align="left">Name</th>
|
374
|
+
<th align="left">Description</th>
|
375
|
+
</tr>
|
376
|
+
<tr>
|
377
|
+
<td align="left">DateTime</td>
|
378
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDateTime">AnonymizeDateTime</a></td>
|
379
|
+
<td align="left">Anonymizes each field (except year and seconds) within the natural range of that field, depending on the true/false flag provided</td>
|
380
|
+
</tr>
|
381
|
+
<tr>
|
382
|
+
<td align="left">Time</td>
|
383
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeTime">AnonymizeTime</a></td>
|
384
|
+
<td align="left">Same as above, except the returned object is of type 'Time'</td>
|
385
|
+
</tr>
|
386
|
+
<tr>
|
387
|
+
<td align="left">Date</td>
|
388
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/AnonymizeDate">AnonymizeDate</a></td>
|
389
|
+
<td align="left">Anonymizes day and month within their natural ranges, depending on the true/false flag provided</td>
|
390
|
+
</tr>
|
391
|
+
<tr>
|
392
|
+
<td align="left">DateTimeDelta</td>
|
393
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateTimeDelta">DateTimeDelta</a></td>
|
394
|
+
<td align="left">Shifts date and time randomly within the given range. By default, shifts the date by up to 10 days and the time by up to 30 minutes, in either direction.</td>
|
395
|
+
</tr>
|
396
|
+
<tr>
|
397
|
+
<td align="left">TimeDelta</td>
|
398
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/TimeDelta">TimeDelta</a></td>
|
399
|
+
<td align="left">Same as above, except the returned object is of type 'Time'</td>
|
400
|
+
</tr>
|
401
|
+
<tr>
|
402
|
+
<td align="left">DateDelta</td>
|
403
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/DateDelta">DateDelta</a></td>
|
404
|
+
<td align="left">Shifts date randomly within the given delta range. By default, shifts the date by up to 10 days in either direction.</td>
|
405
|
+
</tr>
|
406
|
+
</table><table>
|
407
|
+
<tr>
|
408
|
+
<th align="left">Content</th>
|
409
|
+
<th align="left">Name</th>
|
410
|
+
<th align="left">Description</th>
|
411
|
+
</tr>
|
412
|
+
<tr>
|
413
|
+
<td align="left">Email</td>
|
414
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomEmail">RandomEmail</a></td>
|
415
|
+
<td align="left">Generates email randomly using the given HOSTNAME and TLD.</td>
|
416
|
+
</tr>
|
417
|
+
<tr>
|
418
|
+
<td align="left">Email</td>
|
419
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/GmailTemplate">GmailTemplate</a></td>
|
420
|
+
<td align="left">Generates a valid unique gmail address by taking advantage of the gmail + strategy</td>
|
421
|
+
</tr>
|
422
|
+
<tr>
|
423
|
+
<td align="left">Email</td>
|
424
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomMailinatorEmail">RandomMailinatorEmail</a></td>
|
425
|
+
<td align="left">Generates random email using mailinator hostname.</td>
|
426
|
+
</tr>
|
427
|
+
</table><table>
|
428
|
+
<tr>
|
429
|
+
<th align="left">Content</th>
|
430
|
+
<th align="left">Name</th>
|
431
|
+
<th align="left">Description</th>
|
432
|
+
</tr>
|
433
|
+
<tr>
|
434
|
+
<td align="left">First name</td>
|
435
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFirstName">RandomFirstName</a></td>
|
436
|
+
<td align="left">Randomly picks a first name from the predefined list in the file. Default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
|
437
|
+
</tr>
|
438
|
+
<tr>
|
439
|
+
<td align="left">Last name</td>
|
440
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomLastName">RandomLastName</a></td>
|
441
|
+
<td align="left">Randomly picks a last name from the predefined list in the file. Default <a href="https://raw.github.com/sunitparekh/data-anonymization/master/resources/first_names.txt">file</a> is part of the gem.</td>
|
442
|
+
</tr>
|
443
|
+
<tr>
|
444
|
+
<td align="left">Full Name</td>
|
445
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomFullName">RandomFullName</a></td>
|
446
|
+
<td align="left">Generates full name using the RandomFirstName and RandomLastName strategies.</td>
|
447
|
+
</tr>
|
448
|
+
<tr>
|
449
|
+
<td align="left">User name</td>
|
450
|
+
<td align="left"><a href="http://rubydoc.info/github/sunitparekh/data-anonymization/DataAnon/Strategy/Field/RandomUserName">RandomUserName</a></td>
|
451
|
+
<td align="left">Generates a random user name of the same length as the original user name.</td>
|
452
|
+
</tr>
|
453
|
+
</table>
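
To make the delta-style strategies in the tables above more concrete, here is a minimal sketch of shifting a date by a random number of days within a fixed range. It is written in plain Ruby and does not use the gem; `shift_date`, `max_days` and `rng` are hypothetical names chosen for illustration, and the gem's own `DateDelta` may work differently.

```ruby
require 'date'

# Hypothetical helper, not part of the data-anonymization gem:
# shifts a date by a random whole number of days in [-max_days, +max_days].
def shift_date(date, max_days = 10, rng = Random.new)
  offset = rng.rand(-max_days..max_days) # inclusive integer range
  date + offset                          # Date#+ adds whole days
end

original = Date.new(2012, 9, 15)
shifted  = shift_date(original)
puts "#{original} -> #{shifted}"
```

The shifted value stays close to the original, which keeps the data realistic while hiding the exact date.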
|
412
454
|
|
413
|
-
### RandomFloatDelta
|
414
|
-
Shifts the current value randomly within given delta + and -. Default is 10.0
|
415
|
-
```ruby
|
416
|
-
anonymize('points').using FieldStrategy::RandomFloatDelta.new(2.5)
|
417
|
-
```
|
418
455
|
|
419
456
|
## Write your own field strategies
|
420
457
|
The `field` parameter in the following code is [DataAnon::Core::Field](#dataanon-core-field)
|
@@ -444,11 +481,11 @@ write your own anonymous field strategies within DSL,
|
|
444
481
|
## Default field strategies
|
445
482
|
|
446
483
|
```ruby
|
447
|
-
|
448
|
-
DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
|
484
|
+
DEFAULT_STRATEGIES = {:string => FieldStrategy::RandomString.new,
|
449
485
|
:fixnum => FieldStrategy::RandomIntegerDelta.new(5),
|
450
486
|
:bignum => FieldStrategy::RandomIntegerDelta.new(5000),
|
451
487
|
:float => FieldStrategy::RandomFloatDelta.new(5.0),
|
488
|
+
:bigdecimal => FieldStrategy::RandomBigDecimalDelta.new(500.0),
|
452
489
|
:datetime => FieldStrategy::DateTimeDelta.new,
|
453
490
|
:time => FieldStrategy::TimeDelta.new,
|
454
491
|
:date => FieldStrategy::DateDelta.new,
|
@@ -457,7 +494,7 @@ DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
|
|
457
494
|
}
|
458
495
|
```
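
The idea behind a type-keyed table like the one above can be sketched as a lookup on the value's class, with an error when no entry exists for that type. This is only an illustration under stated assumptions: `SKETCH_DEFAULTS` and `anonymize_by_type` are hypothetical names, the lambdas are stand-ins for strategy objects, and the gem's real lookup logic may differ.

```ruby
# Hypothetical stand-ins for strategy objects: each lambda takes a value and
# returns an anonymized value. These are NOT the gem's FieldStrategy classes.
SKETCH_DEFAULTS = {
  string: ->(v) { 'x' * v.length },      # mask every character
  integer: ->(v) { v + rand(-5..5) },    # shift by at most 5
  float: ->(v) { v + rand(-5.0..5.0) }   # shift by at most 5.0
}.freeze

# Picks a strategy by the value's class name and applies it.
# Raises ArgumentError when no strategy is registered for that type.
def anonymize_by_type(value, strategies = SKETCH_DEFAULTS)
  key = value.class.name.downcase.to_sym
  strategy = strategies.fetch(key) do
    raise ArgumentError, "no default strategy for #{key}"
  end
  strategy.call(value)
end

puts anonymize_by_type('secret')

# Supplying an entry for a type that has no default:
extended = SKETCH_DEFAULTS.merge(symbol: ->(_v) { :anonymized })
puts anonymize_by_type(:admin, extended)
```

Passing a merged table, as in the last two lines, is one way to cover a data type that the defaults do not handle.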
|
459
496
|
|
460
|
-
Overriding default field strategies
|
497
|
+
Overriding default field strategies; this can also be used to provide a default strategy for a missing data type.
|
461
498
|
|
462
499
|
```ruby
|
463
500
|
database 'Chinook' do
|
@@ -497,7 +534,7 @@ DataAnon::Utils::Logging.logger.level = Logger::INFO
|
|
497
534
|
## Credits
|
498
535
|
|
499
536
|
- [ThoughtWorks Inc](http://www.thoughtworks.com), for allowing us to build this tool and make it open source.
|
500
|
-
- [
|
537
|
+
- [Panda](https://twitter.com/sarbashrestha) for reviewing the documentation.
|
501
538
|
- [Dan Abel](http://www.linkedin.com/pub/dan-abel/0/61b/9b0) for introducing me to Blacklist and Whitelist approach for data anonymization.
|
502
539
|
- [Chirga Doshi](https://twitter.com/chiragsdoshi) for encouraging me to get this done.
|
503
540
|
- [Aditya Karle](https://twitter.com/adityakarle) for the Logo. (Coming Soon...)
|