idscrub 1.1.2__tar.gz → 2.0.1__tar.gz

Files changed (47)
  1. {idscrub-1.1.2 → idscrub-2.0.1}/PKG-INFO +58 -12
  2. {idscrub-1.1.2 → idscrub-2.0.1}/README.md +55 -10
  3. idscrub-2.0.1/idscrub/scrub.py +1189 -0
  4. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub.egg-info/PKG-INFO +58 -12
  5. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub.egg-info/SOURCES.txt +5 -3
  6. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub.egg-info/requires.txt +2 -1
  7. {idscrub-1.1.2 → idscrub-2.0.1}/notebooks/basic_usage.ipynb +294 -351
  8. {idscrub-1.1.2 → idscrub-2.0.1}/pyproject.toml +2 -1
  9. idscrub-2.0.1/test/conftest.py +58 -0
  10. {idscrub-1.1.2 → idscrub-2.0.1}/test/test_dataframe.py +8 -8
  11. idscrub-2.0.1/test/test_errors.py +32 -0
  12. idscrub-2.0.1/test/test_exclude.py +22 -0
  13. idscrub-2.0.1/test/test_group.py +9 -0
  14. {idscrub-1.1.2 → idscrub-2.0.1}/test/test_huggingface.py +3 -3
  15. idscrub-2.0.1/test/test_id.py +25 -0
  16. idscrub-2.0.1/test/test_label.py +32 -0
  17. idscrub-2.0.1/test/test_overlap.py +86 -0
  18. {idscrub-1.1.2 → idscrub-2.0.1}/test/test_phonenumbers.py +2 -2
  19. {idscrub-1.1.2 → idscrub-2.0.1}/test/test_presidio.py +21 -6
  20. idscrub-2.0.1/test/test_regex.py +220 -0
  21. idscrub-2.0.1/test/test_scrub.py +58 -0
  22. idscrub-2.0.1/test/test_scrub_text.py +22 -0
  23. {idscrub-1.1.2 → idscrub-2.0.1}/test/test_spacy.py +16 -12
  24. {idscrub-1.1.2 → idscrub-2.0.1}/uv.lock +210 -213
  25. idscrub-1.1.2/idscrub/scrub.py +0 -1020
  26. idscrub-1.1.2/test/conftest.py +0 -22
  27. idscrub-1.1.2/test/test_all.py +0 -39
  28. idscrub-1.1.2/test/test_chain.py +0 -54
  29. idscrub-1.1.2/test/test_id.py +0 -24
  30. idscrub-1.1.2/test/test_label.py +0 -17
  31. idscrub-1.1.2/test/test_log.py +0 -17
  32. idscrub-1.1.2/test/test_regex.py +0 -169
  33. idscrub-1.1.2/test/test_scrub.py +0 -48
  34. {idscrub-1.1.2 → idscrub-2.0.1}/.github/pull_request_template.md +0 -0
  35. {idscrub-1.1.2 → idscrub-2.0.1}/.github/workflows/cd.yml +0 -0
  36. {idscrub-1.1.2 → idscrub-2.0.1}/.github/workflows/ci.yml +0 -0
  37. {idscrub-1.1.2 → idscrub-2.0.1}/.gitignore +0 -0
  38. {idscrub-1.1.2 → idscrub-2.0.1}/.pre-commit-config.yaml +0 -0
  39. {idscrub-1.1.2 → idscrub-2.0.1}/CODEOWNERS +0 -0
  40. {idscrub-1.1.2 → idscrub-2.0.1}/LICENSE +0 -0
  41. {idscrub-1.1.2 → idscrub-2.0.1}/Makefile +0 -0
  42. {idscrub-1.1.2 → idscrub-2.0.1}/SECURITY_CHECKLIST.md +0 -0
  43. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub/__init__.py +0 -0
  44. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub/locations.py +0 -0
  45. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub.egg-info/dependency_links.txt +0 -0
  46. {idscrub-1.1.2 → idscrub-2.0.1}/idscrub.egg-info/top_level.txt +0 -0
  47. {idscrub-1.1.2 → idscrub-2.0.1}/setup.cfg +0 -0
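
The README hunks below replace the 1.x `scrub_methods` list of strings with a 2.x `pipeline` list of dicts, each carrying a `"method"` key plus optional keyword arguments. The following is a minimal migration sketch that assumes only that shape. `to_pipeline` and its `options` parameter are hypothetical helpers, not part of `idscrub`, and the sketch neither imports nor calls the package.

```python
from typing import Any, Optional


def to_pipeline(
    scrub_methods: list[str],
    options: Optional[dict[str, dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Convert a 1.x-style list of method names into 2.x-style pipeline steps.

    Hypothetical helper: `options` maps a method name to the extra keyword
    arguments to place alongside the "method" key in that step.
    """
    options = options or {}
    pipeline = []
    for name in scrub_methods:
        step: dict[str, Any] = {"method": name}
        step.update(options.get(name, {}))  # merge per-method arguments, if any
        pipeline.append(step)
    return pipeline


# Method names taken from the 1.x README example in the diff.
old_style = ["spacy_entities", "uk_phone_numbers", "uk_postcodes"]
pipeline = to_pipeline(
    old_style,
    options={"spacy_entities": {"entity_types": ["PERSON"]}},
)
print(pipeline)
```

The resulting list has the same shape as the `pipeline=[...]` argument shown in the 2.0.1 README; whether a given method accepts a given argument should be checked against that method's docstring.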
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: idscrub
- Version: 1.1.2
+ Version: 2.0.1
  Author: Department for Business and Trade
  Classifier: Development Status :: 3 - Alpha
  Requires-Python: >=3.12
@@ -12,7 +12,8 @@ Requires-Dist: numpy>=2.3.4
  Requires-Dist: pandas<3.0
  Requires-Dist: phonenumbers>=9.0.18
  Requires-Dist: pip>=25.3
- Requires-Dist: spacy-transformers>=1.3.9
+ Requires-Dist: spacy
+ Requires-Dist: transformers
  Requires-Dist: tqdm>=4.67.1
  Requires-Dist: presidio-analyzer
  Requires-Dist: presidio-anonymizer
@@ -33,16 +34,20 @@ Dynamic: license-file

  ## Installation

- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+
+ We recommend installing with the SpaCy transformer model (`en_core_web_trf`) as a dependency:

  ```console
- pip install idscrub
+ pip install idscrub[trf]
  ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
+
+ If you do not need SpaCy:

  ```console
- pip install idscrub[trf]
+ pip install idscrub
  ```
+
  ## How to use the code

  Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
@@ -50,18 +55,56 @@ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/
  ```python
  from idscrub import IDScrub

- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_entities', 'uk_phone_numbers', 'uk_postcodes'])
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )

  print(scrubbed_texts)

  # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
  ```
- ## Personal data types supported

- Personal data can either be scrubbed as methods with arguments for extra customisation, e.g. `IDScrub.google_phone_numbers(region="GB")`, or as a string arguments with default configurations (see above). The method name and its string representation are the same.
+ This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevent names. **We therefore highly recommend manually removing scrubbed data identified by `idscrub` from your original dataset on a case-by-case basis.**
+
+ Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):

- | Argument | Scrubs |
+ ```python
+ import pandas as pd
+ from idscrub import IDScrub
+
+ # From lists of text:
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ scrubbed_df = scrub.get_scrubbed_data()
+ print(scrubbed_df)
+
+ # From a Pandas DataFrame:
+ scrubbed_df, scrubbed_data = IDScrub.dataframe(
+     df=pd.read_csv('path/to/csv'),
+     id_col="ID",
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ print(scrubbed_df)
+ ```
+
+ ## Personal data types supported
+
+ | Method | Scrubs |
  |-------------------------|------------------------------------------------------------------------|
  | `all` | All supported personal data types (see `IDScrub.all()` for further customisation) |
  | `spacy_entities` | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
@@ -70,12 +113,15 @@ Personal data can either be scrubbed as methods with arguments for extra customi
  | `email_addresses` | Email addresses (e.g. john@email.com) |
  | `titles` | Titles (e.g. Mr., Mrs., Dr.) |
  | `handles` | Social media handles (e.g. @username) |
+ | `urls` | URLs (e.g. www.bbc.co.uk) |
  | `ip_addresses` | IP addresses (e.g. 8.8.8.8) |
  | `uk_postcodes` | UK postal codes (e.g. SW1A 2AA) |
  | `uk_addresses` | UK addresses (e.g. 10 Downing Street) |
  | `uk_phone_numbers` | UK phone numbers (e.g. +441111111111) |
  | `google_phone_numbers` | Phone numbers detected by Google's [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) |

+ Method arguments for further customisation can be viewed by viewing the docstring e.g. `?IDScrub.spacy_entities`.
+
  ## Considerations before use

  - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
@@ -130,7 +176,7 @@ This project is managed by [uv](https://docs.astral.sh/uv/).
  To install all dependencies for this project, run:

  ```console
- uv sync --all-extras
+ uv sync
  ```

  If you do not have Python 3.12, run:
@@ -11,16 +11,20 @@

  ## Installation

- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+
+ We recommend installing with the SpaCy transformer model (`en_core_web_trf`) as a dependency:

  ```console
- pip install idscrub
+ pip install idscrub[trf]
  ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
+
+ If you do not need SpaCy:

  ```console
- pip install idscrub[trf]
+ pip install idscrub
  ```
+
  ## How to use the code

  Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
@@ -28,18 +32,56 @@ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/
  ```python
  from idscrub import IDScrub

- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_entities', 'uk_phone_numbers', 'uk_postcodes'])
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )

  print(scrubbed_texts)

  # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
  ```
- ## Personal data types supported

- Personal data can either be scrubbed as methods with arguments for extra customisation, e.g. `IDScrub.google_phone_numbers(region="GB")`, or as a string arguments with default configurations (see above). The method name and its string representation are the same.
+ This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevent names. **We therefore highly recommend manually removing scrubbed data identified by `idscrub` from your original dataset on a case-by-case basis.**
+
+ Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):

- | Argument | Scrubs |
+ ```python
+ import pandas as pd
+ from idscrub import IDScrub
+
+ # From lists of text:
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ scrubbed_df = scrub.get_scrubbed_data()
+ print(scrubbed_df)
+
+ # From a Pandas DataFrame:
+ scrubbed_df, scrubbed_data = IDScrub.dataframe(
+     df=pd.read_csv('path/to/csv'),
+     id_col="ID",
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ print(scrubbed_df)
+ ```
+
+ ## Personal data types supported
+
+ | Method | Scrubs |
  |-------------------------|------------------------------------------------------------------------|
  | `all` | All supported personal data types (see `IDScrub.all()` for further customisation) |
  | `spacy_entities` | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
@@ -48,12 +90,15 @@ Personal data can either be scrubbed as methods with arguments for extra customi
  | `email_addresses` | Email addresses (e.g. john@email.com) |
  | `titles` | Titles (e.g. Mr., Mrs., Dr.) |
  | `handles` | Social media handles (e.g. @username) |
+ | `urls` | URLs (e.g. www.bbc.co.uk) |
  | `ip_addresses` | IP addresses (e.g. 8.8.8.8) |
  | `uk_postcodes` | UK postal codes (e.g. SW1A 2AA) |
  | `uk_addresses` | UK addresses (e.g. 10 Downing Street) |
  | `uk_phone_numbers` | UK phone numbers (e.g. +441111111111) |
  | `google_phone_numbers` | Phone numbers detected by Google's [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) |

+ Method arguments for further customisation can be viewed by viewing the docstring e.g. `?IDScrub.spacy_entities`.
+
  ## Considerations before use

  - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
@@ -108,7 +153,7 @@ This project is managed by [uv](https://docs.astral.sh/uv/).
  To install all dependencies for this project, run:

  ```console
- uv sync --all-extras
+ uv sync
  ```

  If you do not have Python 3.12, run: