idscrub 0.2.1__py3-none-any.whl → 0.2.2__py3-none-any.whl
- {idscrub-0.2.1.dist-info → idscrub-0.2.2.dist-info}/METADATA +43 -59
- {idscrub-0.2.1.dist-info → idscrub-0.2.2.dist-info}/RECORD +5 -5
- {idscrub-0.2.1.dist-info → idscrub-0.2.2.dist-info}/WHEEL +0 -0
- {idscrub-0.2.1.dist-info → idscrub-0.2.2.dist-info}/licenses/LICENSE +0 -0
- {idscrub-0.2.1.dist-info → idscrub-0.2.2.dist-info}/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: idscrub
|
|
3
|
-
Version: 0.2.
|
|
3
|
+
Version: 0.2.2
|
|
4
4
|
Author: Department for Business and Trade
|
|
5
5
|
Requires-Python: >=3.12
|
|
6
6
|
Description-Content-Type: text/markdown
|
|
@@ -21,36 +21,45 @@ Dynamic: license-file
|
|
|
21
21
|
|
|
22
22
|
# idscrub 🧽✨
|
|
23
23
|
|
|
24
|
-
|
|
24
|
+
* Names and other personally identifying information are often present in text.
|
|
25
|
+
* This information may need to be removed prior to further analysis in many cases.
|
|
26
|
+
* `idscrub` identifies and removes (*✨scrubs✨*) personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
|
|
25
27
|
|
|
26
|
-
|
|
28
|
+
## Installation
|
|
27
29
|
|
|
28
|
-
|
|
29
|
-
> You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
|
|
30
|
-
>
|
|
31
|
-
> Specifically, you must:
|
|
32
|
-
>
|
|
33
|
-
> - **Update privacy notices**: Clearly state this processing activity in new or existing privacy notices before using the package.
|
|
34
|
-
> - **Ensure secure deletion**: Remove any temporary or intermediary files and outputs in a secure manner.
|
|
35
|
-
> - **Ensure data subject rights upheld**: Ensure individuals can access, correct, or erase their data as required.
|
|
36
|
-
> - **Maintain processing records**: Document how personal data is handled and for what purpose.
|
|
30
|
+
`idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
|
|
37
31
|
|
|
38
|
-
|
|
32
|
+
```console
|
|
33
|
+
pip install idscrub
|
|
34
|
+
```
|
|
35
|
+
or with the spaCy transformer model (`en_core_web_trf`) already installed:
|
|
39
36
|
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
37
|
+
```console
|
|
38
|
+
pip install idscrub[trf]
|
|
39
|
+
```
|
|
40
|
+
## How to use the code
|
|
43
41
|
|
|
44
|
-
|
|
42
|
+
Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
|
|
45
43
|
|
|
46
|
-
|
|
44
|
+
```python
|
|
45
|
+
from idscrub import IDScrub
|
|
47
46
|
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
> * Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
|
|
51
|
-
> * It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
|
|
47
|
+
scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
|
|
48
|
+
scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
|
|
52
49
|
|
|
53
|
-
|
|
50
|
+
print(scrubbed_texts)
|
|
51
|
+
|
|
52
|
+
# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
|
|
53
|
+
```
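To illustrate the regular-expression side of the usage example above, here is a minimal standard-library sketch of placeholder substitution. It is not `idscrub`'s implementation: the two patterns and the `scrub_texts` helper are hypothetical, deliberately simplified, and would miss many real-world postcode and phone number formats.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# They are NOT the expressions used inside idscrub.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")
UK_PHONE = re.compile(r"(?:\+44|\b0)\d{9,10}\b")


def scrub_texts(texts):
    """Replace each pattern match with a placeholder token."""
    scrubbed = []
    for text in texts:
        text = UK_PHONE.sub("[PHONENO]", text)
        text = UK_POSTCODE.sub("[POSTCODE]", text)
        scrubbed.append(text)
    return scrubbed


print(scrub_texts(["My number is +441111111111 and I live at AA11 1AA."]))
```

Names are harder to capture with fixed patterns, which is why the README pairs regular expressions with named-entity recognition for the `spacy_persons` method.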
|
|
54
|
+
|
|
55
|
+
## Considerations before use
|
|
56
|
+
|
|
57
|
+
- You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
|
|
58
|
+
- This package has been designed as a *first pass* for standardised personal data removal.
|
|
59
|
+
- Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
|
|
60
|
+
- It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
|
|
61
|
+
|
|
62
|
+
### Input data
|
|
54
63
|
|
|
55
64
|
- This package is designed for text-based documents structured as a list of strings.
|
|
56
65
|
- It performs best when contextual meaning can be inferred from the text.
|
|
@@ -68,50 +77,25 @@ Dynamic: license-file
|
|
|
68
77
|
> [!IMPORTANT]
|
|
69
78
|
> * See [our wiki](https://github.com/uktrade/idscrub/wiki/Evaluation) for further details and notes on our evaluation of `idscrub`.
|
|
70
79
|
|
|
71
|
-
### Models
|
|
80
|
+
### Models
|
|
72
81
|
|
|
73
82
|
* Only Spacy's `en_core_web_trf` and no Hugging Face models have been formally evaluated.
|
|
74
83
|
* We therefore recommend that the current default `en_core_web_trf` is used for name scrubbing. **Other models need to be evaluated by the user.**
|
|
75
84
|
|
|
76
|
-
> [!IMPORTANT]
|
|
77
|
-
> Spacy and Hugging Face models have high memory requirements. To avoid memory-related errors, clear the auto-generated `huggingface` folder when it is not in use. Do not push the `huggingface` folder (or user-defined equivalent) to GitHub.
|
|
78
|
-
|
|
79
85
|
## Similar Python packages
|
|
80
86
|
|
|
81
|
-
* Similar packages exist for undertaking this task, such as [
|
|
82
|
-
* Development of `idscrub` was undertaken to:
|
|
83
|
-
* To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
|
|
84
|
-
|
|
85
|
-
|
|
86
|
-
## Installation
|
|
87
|
-
|
|
88
|
-
`idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
|
|
89
|
-
|
|
90
|
-
```console
|
|
91
|
-
pip install idscrub
|
|
92
|
-
```
|
|
93
|
-
or with the spaCy transformer model (`en_core_web_trf`) already installed:
|
|
94
|
-
|
|
95
|
-
```console
|
|
96
|
-
pip install idscrub[trf]
|
|
97
|
-
```
|
|
98
|
-
|
|
99
|
-
## How to use the code
|
|
100
|
-
|
|
101
|
-
Basic usage example (see `notebooks/basic_usage.ipynb` for further examples):
|
|
87
|
+
* Similar packages exist for undertaking this task, such as [Presidio](https://microsoft.github.io/presidio/), [Scrubadub](https://github.com/LeapBeyond/scrubadub) and [Sanityze](https://github.com/UBC-MDS/sanityze).
|
|
88
|
+
* Development of `idscrub` was undertaken to:
|
|
102
89
|
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
|
|
112
|
-
```
|
|
90
|
+
* Bring together different scrubbing methods across the Department for Business and Trade.
|
|
91
|
+
* Adhere to infrastructure requirements.
|
|
92
|
+
* Guarantee future stability and maintainability.
|
|
93
|
+
* Encourage future scrubbing methods to be added collaboratively and transparently.
|
|
94
|
+
* Allow for full flexibility depending on the use case and required outputs.
|
|
95
|
+
|
|
96
|
+
* To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
|
|
113
97
|
|
|
114
|
-
## AI
|
|
98
|
+
## AI declaration
|
|
115
99
|
|
|
116
100
|
AI has been used in the development of `idscrub`, primarily to develop regular expressions, suggest code refinements and draft documentation.
|
|
117
101
|
|
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
idscrub/__init__.py,sha256=cRugJv27q1q--bl-VNLpfiScJb_ROlUxyLFhaF55S1w,38
|
|
2
2
|
idscrub/locations.py,sha256=7fMNOcGMYe7sX8TrfhMW6oYGAlc1WVYVQKQbpxE3pqo,217
|
|
3
3
|
idscrub/scrub.py,sha256=K4Sw4DxKhYJnnu_vpRhUcqj-AbeGr8SwDB0XrDLEciM,34940
|
|
4
|
-
idscrub-0.2.
|
|
4
|
+
idscrub-0.2.2.dist-info/licenses/LICENSE,sha256=JJnuf10NSx7YXglte1oH_N9ZP3AcWR_Y8irvQb_wnsg,1090
|
|
5
5
|
notebooks/basic_usage.ipynb,sha256=eQFU3mOyRXbCwFz3jVUKCxWRtIP5Jptny8fj-KYoBwA,39784
|
|
6
6
|
test/conftest.py,sha256=ph1S3LMvzlzvOsb3l2YhpyHSdmg4uV7p61ge_JVCGv0,267
|
|
7
7
|
test/test_all.py,sha256=z6v9O2Ts9dWITlhvZwRMyKUZsO7ncaT3znqqBCKJ6Wc,1141
|
|
@@ -15,7 +15,7 @@ test/test_phonenumbers.py,sha256=hZsXgwhn5R-7426TTWwCH9gWQwhyHtjLUstN10jnX6c,607
|
|
|
15
15
|
test/test_regex.py,sha256=EQGx3PHwJJzIdy6xwR8gEsSRDtlWHR-U81EPI811eZA,4474
|
|
16
16
|
test/test_scrub.py,sha256=pohmw3frtlkmZDMvOEbmvVJgtcVdFlEDL3TxR5-y-0Q,1422
|
|
17
17
|
test/test_spacy.py,sha256=mrUGUulvzDGgQRttdG0tgL2sGBRmYfg1fDNp7SFq8as,961
|
|
18
|
-
idscrub-0.2.
|
|
19
|
-
idscrub-0.2.
|
|
20
|
-
idscrub-0.2.
|
|
21
|
-
idscrub-0.2.
|
|
18
|
+
idscrub-0.2.2.dist-info/METADATA,sha256=IHoFTVY6cJARkeeKoQlpunA7Nboc4y32bpSoS-IgSoM,5352
|
|
19
|
+
idscrub-0.2.2.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
|
|
20
|
+
idscrub-0.2.2.dist-info/top_level.txt,sha256=D4EEodXGCjGiX35ObiBTmjjBAdouN-eCvH-LezGGtks,23
|
|
21
|
+
idscrub-0.2.2.dist-info/RECORD,,
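For readers checking the RECORD entries above: each line has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with trailing `=` padding removed. A minimal sketch of how one such entry could be recomputed follows; the `record_entry` helper is a hypothetical name and is not part of `idscrub` or of any packaging tool.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a wheel RECORD line for a file's bytes (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# An empty file always hashes to the same well-known digest.
print(record_entry("pkg/__init__.py", b""))
```

Comparing the result for an extracted file against its RECORD line is one way to confirm that the file matches what was published.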
|
|
File without changes
|
|
File without changes
|
|
File without changes
|