idscrub 1.1.2__tar.gz → 2.0.0__tar.gz
- {idscrub-1.1.2 → idscrub-2.0.0}/PKG-INFO +58 -12
- {idscrub-1.1.2 → idscrub-2.0.0}/README.md +55 -10
- idscrub-2.0.0/idscrub/scrub.py +1187 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub.egg-info/PKG-INFO +58 -12
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub.egg-info/SOURCES.txt +5 -3
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub.egg-info/requires.txt +2 -1
- {idscrub-1.1.2 → idscrub-2.0.0}/notebooks/basic_usage.ipynb +294 -351
- {idscrub-1.1.2 → idscrub-2.0.0}/pyproject.toml +2 -1
- idscrub-2.0.0/test/conftest.py +58 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/test/test_dataframe.py +8 -8
- idscrub-2.0.0/test/test_errors.py +32 -0
- idscrub-2.0.0/test/test_exclude.py +22 -0
- idscrub-2.0.0/test/test_group.py +9 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/test/test_huggingface.py +3 -3
- idscrub-2.0.0/test/test_id.py +25 -0
- idscrub-2.0.0/test/test_label.py +32 -0
- idscrub-2.0.0/test/test_overlap.py +86 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/test/test_phonenumbers.py +2 -2
- {idscrub-1.1.2 → idscrub-2.0.0}/test/test_presidio.py +13 -6
- idscrub-2.0.0/test/test_regex.py +220 -0
- idscrub-2.0.0/test/test_scrub.py +58 -0
- idscrub-2.0.0/test/test_scrub_text.py +22 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/test/test_spacy.py +14 -10
- {idscrub-1.1.2 → idscrub-2.0.0}/uv.lock +210 -213
- idscrub-1.1.2/idscrub/scrub.py +0 -1020
- idscrub-1.1.2/test/conftest.py +0 -22
- idscrub-1.1.2/test/test_all.py +0 -39
- idscrub-1.1.2/test/test_chain.py +0 -54
- idscrub-1.1.2/test/test_id.py +0 -24
- idscrub-1.1.2/test/test_label.py +0 -17
- idscrub-1.1.2/test/test_log.py +0 -17
- idscrub-1.1.2/test/test_regex.py +0 -169
- idscrub-1.1.2/test/test_scrub.py +0 -48
- {idscrub-1.1.2 → idscrub-2.0.0}/.github/pull_request_template.md +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/.github/workflows/cd.yml +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/.github/workflows/ci.yml +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/.gitignore +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/.pre-commit-config.yaml +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/CODEOWNERS +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/LICENSE +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/Makefile +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/SECURITY_CHECKLIST.md +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub/__init__.py +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub/locations.py +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub.egg-info/dependency_links.txt +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/idscrub.egg-info/top_level.txt +0 -0
- {idscrub-1.1.2 → idscrub-2.0.0}/setup.cfg +0 -0
{idscrub-1.1.2 → idscrub-2.0.0}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: idscrub
-Version:
+Version: 2.0.0
 Author: Department for Business and Trade
 Classifier: Development Status :: 3 - Alpha
 Requires-Python: >=3.12
@@ -12,7 +12,8 @@ Requires-Dist: numpy>=2.3.4
 Requires-Dist: pandas<3.0
 Requires-Dist: phonenumbers>=9.0.18
 Requires-Dist: pip>=25.3
-Requires-Dist: spacy
+Requires-Dist: spacy
+Requires-Dist: transformers
 Requires-Dist: tqdm>=4.67.1
 Requires-Dist: presidio-analyzer
 Requires-Dist: presidio-anonymizer
@@ -33,16 +34,20 @@ Dynamic: license-file

 ## Installation

-`idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+`idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+
+We recommend installing with the SpaCy transformer model (`en_core_web_trf`) as a dependency:

 ```console
-pip install idscrub
+pip install idscrub[trf]
 ```
-
+
+If you do not need SpaCy:

 ```console
-pip install idscrub
+pip install idscrub
 ```
+
 ## How to use the code

 Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
@@ -50,18 +55,56 @@ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/
 ```python
 from idscrub import IDScrub

-scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
-scrubbed_texts = scrub.scrub(
+scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+scrubbed_texts = scrub.scrub(
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)

 print(scrubbed_texts)

 # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
 ```
-## Personal data types supported

-
+This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevent names. **We therefore highly recommend manually removing scrubbed data identified by `idscrub` from your original dataset on a case-by-case basis.**
+
+Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):

-
+```python
+import pandas as pd
+from idscrub import IDScrub
+
+# From lists of text:
+scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+scrubbed_texts = scrub.scrub(
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)
+scrubbed_df = scrub.get_scrubbed_data()
+print(scrubbed_df)
+
+# From a Pandas DataFrame:
+scrubbed_df, scrubbed_data = IDScrub.dataframe(
+    df=pd.read_csv('path/to/csv'),
+    id_col="ID",
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)
+print(scrubbed_df)
+```
+
+## Personal data types supported
+
+| Method | Scrubs |
 |-------------------------|------------------------------------------------------------------------|
 | `all` | All supported personal data types (see `IDScrub.all()` for further customisation) |
 | `spacy_entities` | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
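
Note on the hunk above: 2.0.0 passes the scrubbing steps to `scrub()` as a `pipeline` list of dictionaries, each with a `"method"` key plus optional arguments. The following is a minimal sketch of how such a list could be dispatched to methods by name. It is not idscrub's implementation: `MiniScrubber`, its two simplified regex patterns, and the `label` parameter are hypothetical and exist only to illustrate the calling convention shown in the README.

```python
import re
from typing import Any


class MiniScrubber:
    """Hypothetical stand-in for illustration; not idscrub's IDScrub class."""

    def __init__(self, texts: list[str]) -> None:
        self.texts = list(texts)

    def uk_phone_numbers(self, label: str = "[PHONENO]") -> None:
        # Simplified pattern (assumption): +44 followed by exactly ten digits.
        pattern = re.compile(r"\+44\d{10}")
        self.texts = [pattern.sub(label, t) for t in self.texts]

    def uk_postcodes(self, label: str = "[POSTCODE]") -> None:
        # Simplified pattern (assumption): it does not cover every valid postcode.
        pattern = re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b")
        self.texts = [pattern.sub(label, t) for t in self.texts]

    def scrub(self, pipeline: list[dict[str, Any]]) -> list[str]:
        """Run each step in order; keys other than "method" become keyword arguments."""
        for step in pipeline:
            kwargs = dict(step)          # copy so the caller's dict is left untouched
            name = kwargs.pop("method")
            method = getattr(self, name, None)
            if name.startswith("_") or name == "scrub" or not callable(method):
                raise ValueError(f"Unknown scrub method: {name!r}")
            method(**kwargs)
        return self.texts


scrubbed = MiniScrubber(["My number is +441111111111 and I live at AA11 1AA."]).scrub(
    pipeline=[{"method": "uk_phone_numbers"}, {"method": "uk_postcodes"}]
)
print(scrubbed)
# Output: ['My number is [PHONENO] and I live at [POSTCODE].']
```

Keeping each step as a plain dictionary means a pipeline can be stored as JSON or YAML and reused across datasets, which may be one motivation for the new argument.
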
@@ -70,12 +113,15 @@ Personal data can either be scrubbed as methods with arguments for extra customi
 | `email_addresses` | Email addresses (e.g. john@email.com) |
 | `titles` | Titles (e.g. Mr., Mrs., Dr.) |
 | `handles` | Social media handles (e.g. @username) |
+| `urls` | URLs (e.g. www.bbc.co.uk) |
 | `ip_addresses` | IP addresses (e.g. 8.8.8.8) |
 | `uk_postcodes` | UK postal codes (e.g. SW1A 2AA) |
 | `uk_addresses` | UK addresses (e.g. 10 Downing Street) |
 | `uk_phone_numbers` | UK phone numbers (e.g. +441111111111) |
 | `google_phone_numbers` | Phone numbers detected by Google's [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) |

+Method arguments for further customisation can be viewed by viewing the docstring e.g. `?IDScrub.spacy_entities`.
+
 ## Considerations before use

 - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
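
Note on the hunk above: the `?IDScrub.spacy_entities` syntax in the added README line only works in IPython and Jupyter. In a plain Python session, the built-in `help()` or the standard-library `inspect` module gives the same information. The sketch below uses a hypothetical `Demo` class, with an invented signature and docstring, so that it runs without `idscrub` installed. With the package installed, `help(IDScrub.spacy_entities)` should behave the same way.

```python
import inspect


class Demo:
    """Hypothetical stand-in for a class such as IDScrub."""

    def spacy_entities(self, entity_types=None, label="[ENTITY]"):
        """Scrub entities of the given types (illustrative docstring only)."""


# Plain-Python equivalents of IPython's `?Demo.spacy_entities`:
doc = inspect.getdoc(Demo.spacy_entities)     # docstring with indentation cleaned up
sig = inspect.signature(Demo.spacy_entities)  # parameter names and default values

print(sig)
print(doc)
```
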
@@ -130,7 +176,7 @@ This project is managed by [uv](https://docs.astral.sh/uv/).
 To install all dependencies for this project, run:

 ```console
-uv sync
+uv sync
 ```

 If you do not have Python 3.12, run:
{idscrub-1.1.2 → idscrub-2.0.0}/README.md
@@ -11,16 +11,20 @@

 ## Installation

-`idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+`idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+
+We recommend installing with the SpaCy transformer model (`en_core_web_trf`) as a dependency:

 ```console
-pip install idscrub
+pip install idscrub[trf]
 ```
-
+
+If you do not need SpaCy:

 ```console
-pip install idscrub
+pip install idscrub
 ```
+
 ## How to use the code

 Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
@@ -28,18 +32,56 @@ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/
 ```python
 from idscrub import IDScrub

-scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
-scrubbed_texts = scrub.scrub(
+scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+scrubbed_texts = scrub.scrub(
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)

 print(scrubbed_texts)

 # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
 ```
-## Personal data types supported

-
+This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevent names. **We therefore highly recommend manually removing scrubbed data identified by `idscrub` from your original dataset on a case-by-case basis.**
+
+Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):

-
+```python
+import pandas as pd
+from idscrub import IDScrub
+
+# From lists of text:
+scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+scrubbed_texts = scrub.scrub(
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)
+scrubbed_df = scrub.get_scrubbed_data()
+print(scrubbed_df)
+
+# From a Pandas DataFrame:
+scrubbed_df, scrubbed_data = IDScrub.dataframe(
+    df=pd.read_csv('path/to/csv'),
+    id_col="ID",
+    pipeline=[
+        {"method": "spacy_entities", "entity_types": ["PERSON"]},
+        {"method": "uk_phone_numbers"},
+        {"method": "uk_postcodes"},
+    ]
+)
+print(scrubbed_df)
+```
+
+## Personal data types supported
+
+| Method | Scrubs |
 |-------------------------|------------------------------------------------------------------------|
 | `all` | All supported personal data types (see `IDScrub.all()` for further customisation) |
 | `spacy_entities` | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
@@ -48,12 +90,15 @@ Personal data can either be scrubbed as methods with arguments for extra customi
 | `email_addresses` | Email addresses (e.g. john@email.com) |
 | `titles` | Titles (e.g. Mr., Mrs., Dr.) |
 | `handles` | Social media handles (e.g. @username) |
+| `urls` | URLs (e.g. www.bbc.co.uk) |
 | `ip_addresses` | IP addresses (e.g. 8.8.8.8) |
 | `uk_postcodes` | UK postal codes (e.g. SW1A 2AA) |
 | `uk_addresses` | UK addresses (e.g. 10 Downing Street) |
 | `uk_phone_numbers` | UK phone numbers (e.g. +441111111111) |
 | `google_phone_numbers` | Phone numbers detected by Google's [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) |

+Method arguments for further customisation can be viewed by viewing the docstring e.g. `?IDScrub.spacy_entities`.
+
 ## Considerations before use

 - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
@@ -108,7 +153,7 @@ This project is managed by [uv](https://docs.astral.sh/uv/).
 To install all dependencies for this project, run:

 ```console
-uv sync
+uv sync
 ```

 If you do not have Python 3.12, run: