idscrub 0.2.1__tar.gz → 0.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (36)
  1. {idscrub-0.2.1 → idscrub-0.2.2}/PKG-INFO +43 -59
  2. {idscrub-0.2.1 → idscrub-0.2.2}/README.md +42 -58
  3. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub.egg-info/PKG-INFO +43 -59
  4. {idscrub-0.2.1 → idscrub-0.2.2}/.github/pull_request_template.md +0 -0
  5. {idscrub-0.2.1 → idscrub-0.2.2}/.github/workflows/cd.yml +0 -0
  6. {idscrub-0.2.1 → idscrub-0.2.2}/.github/workflows/ci.yml +0 -0
  7. {idscrub-0.2.1 → idscrub-0.2.2}/.gitignore +0 -0
  8. {idscrub-0.2.1 → idscrub-0.2.2}/.pre-commit-config.yaml +0 -0
  9. {idscrub-0.2.1 → idscrub-0.2.2}/CODEOWNERS +0 -0
  10. {idscrub-0.2.1 → idscrub-0.2.2}/LICENSE +0 -0
  11. {idscrub-0.2.1 → idscrub-0.2.2}/Makefile +0 -0
  12. {idscrub-0.2.1 → idscrub-0.2.2}/SECURITY.md +0 -0
  13. {idscrub-0.2.1 → idscrub-0.2.2}/SECURITY_CHECKLIST.md +0 -0
  14. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub/__init__.py +0 -0
  15. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub/locations.py +0 -0
  16. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub/scrub.py +0 -0
  17. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub.egg-info/SOURCES.txt +0 -0
  18. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub.egg-info/dependency_links.txt +0 -0
  19. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub.egg-info/requires.txt +0 -0
  20. {idscrub-0.2.1 → idscrub-0.2.2}/idscrub.egg-info/top_level.txt +0 -0
  21. {idscrub-0.2.1 → idscrub-0.2.2}/notebooks/basic_usage.ipynb +0 -0
  22. {idscrub-0.2.1 → idscrub-0.2.2}/pyproject.toml +0 -0
  23. {idscrub-0.2.1 → idscrub-0.2.2}/setup.cfg +0 -0
  24. {idscrub-0.2.1 → idscrub-0.2.2}/test/conftest.py +0 -0
  25. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_all.py +0 -0
  26. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_chain.py +0 -0
  27. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_dataframe.py +0 -0
  28. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_huggingface.py +0 -0
  29. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_id.py +0 -0
  30. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_log.py +0 -0
  31. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_persidio.py +0 -0
  32. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_phonenumbers.py +0 -0
  33. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_regex.py +0 -0
  34. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_scrub.py +0 -0
  35. {idscrub-0.2.1 → idscrub-0.2.2}/test/test_spacy.py +0 -0
  36. {idscrub-0.2.1 → idscrub-0.2.2}/uv.lock +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: idscrub
- Version: 0.2.1
+ Version: 0.2.2
  Author: Department for Business and Trade
  Requires-Python: >=3.12
  Description-Content-Type: text/markdown
@@ -21,36 +21,45 @@ Dynamic: license-file
 
  # idscrub 🧽✨
 
- ## Project Information
+ * Names and other personally identifying information are often present in text.
+ * This information may need to be removed prior to further analysis in many cases.
+ * `idscrub` identifies and removes (*✨scrubs✨*) personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
 
- * This package removes (*✨scrubs✨*) identifying personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
+ ## Installation
 
- > [!WARNING]
- > You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
- >
- > Specifically, you must:
- >
- > - **Update privacy notices**: Clearly state this processing activity in new or existing privacy notices before using the package.
- > - **Ensure secure deletion**: Remove any temporary or intermediary files and outputs in a secure manner.
- > - **Ensure data subject rights upheld**: Ensure individuals can access, correct, or erase their data as required.
- > - **Maintain processing records**: Document how personal data is handled and for what purpose.
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
 
- ### Description
+ ```console
+ pip install idscrub
+ ```
+ or with the spaCy transformer model (`en_core_web_trf`) already installed:
 
- * Names and other personally identifying information are often present in text.
- * This information may need to be removed prior to further analysis in many cases.
- * `idscrub` provides a standardised way to do this in the Department for Business and Trade.
+ ```console
+ pip install idscrub[trf]
+ ```
+ ## How to use the code
 
- ### Expected Outputs
+ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
 
- * A list of text with names and other identifying information removed.
+ ```python
+ from idscrub import IDScrub
 
- > [!WARNING]
- > * This package has been designed as a *first pass* for standardised personal data removal.
- > * Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
- > * It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
+ scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
 
- ### Data
+ print(scrubbed_texts)
+
+ # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
+ ```
+
+ ## Considerations before use
+
+ - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
+ - This package has been designed as a *first pass* for standardised personal data removal.
+ - Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
+ - It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+
+ ### Input data
 
  - This package is designed for text-based documents structured as a list of strings.
  - It performs best when contextual meaning can be inferred from the text.
@@ -68,50 +77,25 @@ Dynamic: license-file
  > [!IMPORTANT]
  > * See [our wiki](https://github.com/uktrade/idscrub/wiki/Evaluation) for further details and notes on our evaluation of `idscrub`.
 
- ### Models and Memory
+ ### Models
 
  * Only Spacy's `en_core_web_trf` and no Hugging Face models have been formally evaluated.
  * We therefore recommend that the current default `en_core_web_trf` is used for name scrubbing. **Other models need to be evaluated by the user.**
 
- > [!IMPORTANT]
- > Spacy and Hugging Face models have high memory requirements. To avoid memory-related errors. Clear the auto-generated `huggingface` folder if not in use. Do not push the `huggingface` folder (or user-defined equivalent) to GitHub.
-
  ## Similar Python packages
 
- * Similar packages exist for undertaking this task, such as [presidio](https://microsoft.github.io/presidio/), [scrubadub](https://github.com/LeapBeyond/scrubadub) and [sanityze](https://github.com/UBC-MDS/sanityze).
- * Development of `idscrub` was undertaken to: bring together different scrubbing methods across the department, adhere to infrastructure requirements, guarantee future stability and maintainability, and encourage future scrubbing methods to be added collaboratively and transparently.
- * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
-
-
- ## Installation
-
- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
-
- ```console
- pip install idscrub
- ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
-
- ```console
- pip instll idscrub[trf]
- ```
-
- ## How to use the code
-
- Basic usage example (see `notebooks/basic_usage.ipynb` for further examples):
+ * Similar packages exist for undertaking this task, such as [Presidio](https://microsoft.github.io/presidio/), [Scrubadub](https://github.com/LeapBeyond/scrubadub) and [Sanityze](https://github.com/UBC-MDS/sanityze).
+ * Development of `idscrub` was undertaken to:
 
- ```python
- from idscrub import IDScrub
-
- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
-
- print(scrubbed_texts)
-
- # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
- ```
+ * Bring together different scrubbing methods across the Department for Business and Trade.
+ * Adhere to infrastructure requirements.
+ * Guarantee future stability and maintainability.
+ * Encourage future scrubbing methods to be added collaboratively and transparently.
+ * Allow for full flexibility depending on the use case and required outputs.
+
+ * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
 
- ## AI Declaration
+ ## AI declaration
 
  AI has been used in the development of `idscrub`, primarily to develop regular expressions, suggest code refinements and draft documentation.
 
@@ -1,35 +1,44 @@
  # idscrub 🧽✨
 
- ## Project Information
+ * Names and other personally identifying information are often present in text.
+ * This information may need to be removed prior to further analysis in many cases.
+ * `idscrub` identifies and removes (*✨scrubs✨*) personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
 
- * This package removes (*✨scrubs✨*) identifying personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
+ ## Installation
 
- > [!WARNING]
- > You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
- >
- > Specifically, you must:
- >
- > - **Update privacy notices**: Clearly state this processing activity in new or existing privacy notices before using the package.
- > - **Ensure secure deletion**: Remove any temporary or intermediary files and outputs in a secure manner.
- > - **Ensure data subject rights upheld**: Ensure individuals can access, correct, or erase their data as required.
- > - **Maintain processing records**: Document how personal data is handled and for what purpose.
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
 
- ### Description
+ ```console
+ pip install idscrub
+ ```
+ or with the spaCy transformer model (`en_core_web_trf`) already installed:
 
- * Names and other personally identifying information are often present in text.
- * This information may need to be removed prior to further analysis in many cases.
- * `idscrub` provides a standardised way to do this in the Department for Business and Trade.
+ ```console
+ pip install idscrub[trf]
+ ```
+ ## How to use the code
 
- ### Expected Outputs
+ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
 
- * A list of text with names and other identifying information removed.
+ ```python
+ from idscrub import IDScrub
 
- > [!WARNING]
- > * This package has been designed as a *first pass* for standardised personal data removal.
- > * Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
- > * It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
+ scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
 
- ### Data
+ print(scrubbed_texts)
+
+ # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
+ ```
+
+ ## Considerations before use
+
+ - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
+ - This package has been designed as a *first pass* for standardised personal data removal.
+ - Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
+ - It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+
+ ### Input data
 
  - This package is designed for text-based documents structured as a list of strings.
  - It performs best when contextual meaning can be inferred from the text.
@@ -47,50 +56,25 @@
  > [!IMPORTANT]
  > * See [our wiki](https://github.com/uktrade/idscrub/wiki/Evaluation) for further details and notes on our evaluation of `idscrub`.
 
- ### Models and Memory
+ ### Models
 
  * Only Spacy's `en_core_web_trf` and no Hugging Face models have been formally evaluated.
  * We therefore recommend that the current default `en_core_web_trf` is used for name scrubbing. **Other models need to be evaluated by the user.**
 
- > [!IMPORTANT]
- > Spacy and Hugging Face models have high memory requirements. To avoid memory-related errors. Clear the auto-generated `huggingface` folder if not in use. Do not push the `huggingface` folder (or user-defined equivalent) to GitHub.
-
  ## Similar Python packages
 
- * Similar packages exist for undertaking this task, such as [presidio](https://microsoft.github.io/presidio/), [scrubadub](https://github.com/LeapBeyond/scrubadub) and [sanityze](https://github.com/UBC-MDS/sanityze).
- * Development of `idscrub` was undertaken to: bring together different scrubbing methods across the department, adhere to infrastructure requirements, guarantee future stability and maintainability, and encourage future scrubbing methods to be added collaboratively and transparently.
- * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
-
-
- ## Installation
-
- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
-
- ```console
- pip install idscrub
- ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
-
- ```console
- pip instll idscrub[trf]
- ```
-
- ## How to use the code
-
- Basic usage example (see `notebooks/basic_usage.ipynb` for further examples):
+ * Similar packages exist for undertaking this task, such as [Presidio](https://microsoft.github.io/presidio/), [Scrubadub](https://github.com/LeapBeyond/scrubadub) and [Sanityze](https://github.com/UBC-MDS/sanityze).
+ * Development of `idscrub` was undertaken to:
 
- ```python
- from idscrub import IDScrub
-
- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
-
- print(scrubbed_texts)
-
- # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
- ```
+ * Bring together different scrubbing methods across the Department for Business and Trade.
+ * Adhere to infrastructure requirements.
+ * Guarantee future stability and maintainability.
+ * Encourage future scrubbing methods to be added collaboratively and transparently.
+ * Allow for full flexibility depending on the use case and required outputs.
+
+ * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
 
- ## AI Declaration
+ ## AI declaration
 
  AI has been used in the development of `idscrub`, primarily to develop regular expressions, suggest code refinements and draft documentation.
 
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: idscrub
- Version: 0.2.1
+ Version: 0.2.2
  Author: Department for Business and Trade
  Requires-Python: >=3.12
  Description-Content-Type: text/markdown
@@ -21,36 +21,45 @@ Dynamic: license-file
 
  # idscrub 🧽✨
 
- ## Project Information
+ * Names and other personally identifying information are often present in text.
+ * This information may need to be removed prior to further analysis in many cases.
+ * `idscrub` identifies and removes (*✨scrubs✨*) personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
 
- * This package removes (*✨scrubs✨*) identifying personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
+ ## Installation
 
- > [!WARNING]
- > You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
- >
- > Specifically, you must:
- >
- > - **Update privacy notices**: Clearly state this processing activity in new or existing privacy notices before using the package.
- > - **Ensure secure deletion**: Remove any temporary or intermediary files and outputs in a secure manner.
- > - **Ensure data subject rights upheld**: Ensure individuals can access, correct, or erase their data as required.
- > - **Maintain processing records**: Document how personal data is handled and for what purpose.
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
 
- ### Description
+ ```console
+ pip install idscrub
+ ```
+ or with the spaCy transformer model (`en_core_web_trf`) already installed:
 
- * Names and other personally identifying information are often present in text.
- * This information may need to be removed prior to further analysis in many cases.
- * `idscrub` provides a standardised way to do this in the Department for Business and Trade.
+ ```console
+ pip install idscrub[trf]
+ ```
+ ## How to use the code
 
- ### Expected Outputs
+ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
 
- * A list of text with names and other identifying information removed.
+ ```python
+ from idscrub import IDScrub
 
- > [!WARNING]
- > * This package has been designed as a *first pass* for standardised personal data removal.
- > * Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
- > * It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
+ scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
 
- ### Data
+ print(scrubbed_texts)
+
+ # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
+ ```
+
+ ## Considerations before use
+
+ - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
+ - This package has been designed as a *first pass* for standardised personal data removal.
+ - Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
+ - It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
+
+ ### Input data
 
  - This package is designed for text-based documents structured as a list of strings.
  - It performs best when contextual meaning can be inferred from the text.
@@ -68,50 +77,25 @@ Dynamic: license-file
  > [!IMPORTANT]
  > * See [our wiki](https://github.com/uktrade/idscrub/wiki/Evaluation) for further details and notes on our evaluation of `idscrub`.
 
- ### Models and Memory
+ ### Models
 
  * Only Spacy's `en_core_web_trf` and no Hugging Face models have been formally evaluated.
  * We therefore recommend that the current default `en_core_web_trf` is used for name scrubbing. **Other models need to be evaluated by the user.**
 
- > [!IMPORTANT]
- > Spacy and Hugging Face models have high memory requirements. To avoid memory-related errors. Clear the auto-generated `huggingface` folder if not in use. Do not push the `huggingface` folder (or user-defined equivalent) to GitHub.
-
  ## Similar Python packages
 
- * Similar packages exist for undertaking this task, such as [presidio](https://microsoft.github.io/presidio/), [scrubadub](https://github.com/LeapBeyond/scrubadub) and [sanityze](https://github.com/UBC-MDS/sanityze).
- * Development of `idscrub` was undertaken to: bring together different scrubbing methods across the department, adhere to infrastructure requirements, guarantee future stability and maintainability, and encourage future scrubbing methods to be added collaboratively and transparently.
- * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
-
-
- ## Installation
-
- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
-
- ```console
- pip install idscrub
- ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
-
- ```console
- pip instll idscrub[trf]
- ```
-
- ## How to use the code
-
- Basic usage example (see `notebooks/basic_usage.ipynb` for further examples):
+ * Similar packages exist for undertaking this task, such as [Presidio](https://microsoft.github.io/presidio/), [Scrubadub](https://github.com/LeapBeyond/scrubadub) and [Sanityze](https://github.com/UBC-MDS/sanityze).
+ * Development of `idscrub` was undertaken to:
 
- ```python
- from idscrub import IDScrub
-
- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
-
- print(scrubbed_texts)
-
- # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
- ```
+ * Bring together different scrubbing methods across the Department for Business and Trade.
+ * Adhere to infrastructure requirements.
+ * Guarantee future stability and maintainability.
+ * Encourage future scrubbing methods to be added collaboratively and transparently.
+ * Allow for full flexibility depending on the use case and required outputs.
+
+ * To leverage the power of other packages, we have added methods that allow you to interact with them. These include: `IDScrub.presidio()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.
 
- ## AI Declaration
+ ## AI declaration
 
  AI has been used in the development of `idscrub`, primarily to develop regular expressions, suggest code refinements and draft documentation.
 
File without changes