idscrub 1.1.2__py3-none-any.whl → 2.0.0__py3-none-any.whl

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: idscrub
- Version: 1.1.2
+ Version: 2.0.0
  Author: Department for Business and Trade
  Classifier: Development Status :: 3 - Alpha
  Requires-Python: >=3.12
@@ -12,7 +12,8 @@ Requires-Dist: numpy>=2.3.4
  Requires-Dist: pandas<3.0
  Requires-Dist: phonenumbers>=9.0.18
  Requires-Dist: pip>=25.3
- Requires-Dist: spacy-transformers>=1.3.9
+ Requires-Dist: spacy
+ Requires-Dist: transformers
  Requires-Dist: tqdm>=4.67.1
  Requires-Dist: presidio-analyzer
  Requires-Dist: presidio-anonymizer
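Note on the hunk above: `spacy-transformers>=1.3.9` is replaced by unpinned `spacy` and `transformers`, so the versions that resolve after an upgrade may differ between environments. The following is a minimal sketch, using only the standard library, for reporting what is actually installed. The helper name `installed_versions` is hypothetical and not part of `idscrub`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the Requires-Dist lines in this hunk.
    names = ["spacy", "transformers", "spacy-transformers"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this before and after the upgrade gives a quick record of which of the three distributions changed.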
@@ -33,16 +34,20 @@ Dynamic: license-file
 
  ## Installation
 
- `idscrub` can be installed using `pip` into a Python **>=3.12** environment. Example:
+ `idscrub` can be installed using `pip` into a Python **>=3.12** environment.
+
+ We recommend installing with the SpaCy transformer model (`en_core_web_trf`) as a dependency:
 
  ```console
- pip install idscrub
+ pip install idscrub[trf]
  ```
- or with the spaCy transformer model (`en_core_web_trf`) already installed:
+
+ If you do not need SpaCy:
 
  ```console
- pip install idscrub[trf]
+ pip install idscrub
  ```
+
  ## How to use the code
 
  Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):
@@ -50,18 +55,56 @@ Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/
  ```python
  from idscrub import IDScrub
 
- scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])x
- scrubbed_texts = scrub.scrub(scrub_methods=['spacy_entities', 'uk_phone_numbers', 'uk_postcodes'])
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
 
  print(scrubbed_texts)
 
  # Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
  ```
- ## Personal data types supported
 
- Personal data can either be scrubbed as methods with arguments for extra customisation, e.g. `IDScrub.google_phone_numbers(region="GB")`, or as a string arguments with default configurations (see above). The method name and its string representation are the same.
+ This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevant names. **We therefore highly recommend manually removing scrubbed data identified by `idscrub` from your original dataset on a case-by-case basis.**
+
+ Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):
 
- | Argument | Scrubs |
+ ```python
+ import pandas as pd
+ from idscrub import IDScrub
+
+ # From lists of text:
+ scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
+ scrubbed_texts = scrub.scrub(
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ scrubbed_df = scrub.get_scrubbed_data()
+ print(scrubbed_df)
+
+ # From a Pandas DataFrame:
+ scrubbed_df, scrubbed_data = IDScrub.dataframe(
+     df=pd.read_csv('path/to/csv'),
+     id_col="ID",
+     pipeline=[
+         {"method": "spacy_entities", "entity_types": ["PERSON"]},
+         {"method": "uk_phone_numbers"},
+         {"method": "uk_postcodes"},
+     ]
+ )
+ print(scrubbed_df)
+ ```
+
+ ## Personal data types supported
+
+ | Method | Scrubs |
  |-------------------------|------------------------------------------------------------------------|
  | `all` | All supported personal data types (see `IDScrub.all()` for further customisation) |
  | `spacy_entities` | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
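Note on the hunk above: the README example moves from `scrub_methods=[...]` (a list of method-name strings) to `pipeline=[...]` (a list of dictionaries, each with a `"method"` key plus optional arguments). Callers holding 1.x-style lists could translate them mechanically. The sketch below is plain Python and does not call `idscrub`. The helper `to_pipeline` is hypothetical, and whether 2.0.0 still accepts `scrub_methods` is not shown in this diff.

```python
def to_pipeline(methods, options=None):
    """Convert 1.x-style method names into 2.0.0-style pipeline steps.

    `options` optionally maps a method name to extra keyword arguments,
    which are merged into that method's step dictionary.
    """
    options = options or {}
    return [{"method": name, **options.get(name, {})} for name in methods]


# The method list from the removed 1.1.2 example.
old_methods = ["spacy_entities", "uk_phone_numbers", "uk_postcodes"]

# Restrict spaCy to PERSON entities, as the 2.0.0 example does.
pipeline = to_pipeline(old_methods, {"spacy_entities": {"entity_types": ["PERSON"]}})
print(pipeline)
```

The resulting list has the same shape as the `pipeline` argument in the 2.0.0 README example, so it could be passed as `scrub.scrub(pipeline=pipeline)`.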
@@ -70,12 +113,15 @@ Personal data can either be scrubbed as methods with arguments for extra customi
  | `email_addresses` | Email addresses (e.g. john@email.com) |
  | `titles` | Titles (e.g. Mr., Mrs., Dr.) |
  | `handles` | Social media handles (e.g. @username) |
+ | `urls` | URLs (e.g. www.bbc.co.uk) |
  | `ip_addresses` | IP addresses (e.g. 8.8.8.8) |
  | `uk_postcodes` | UK postal codes (e.g. SW1A 2AA) |
  | `uk_addresses` | UK addresses (e.g. 10 Downing Street) |
  | `uk_phone_numbers` | UK phone numbers (e.g. +441111111111) |
  | `google_phone_numbers` | Phone numbers detected by Google's [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) |
 
+ Each method's arguments for further customisation are listed in its docstring, e.g. `?IDScrub.spacy_entities`.
+
  ## Considerations before use
 
  - You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
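Note on the hunk above: the docstring hint added to the README uses the `?name` syntax, which only works in IPython and Jupyter. In a plain Python session the standard `inspect` module gives the same information for any callable. The sketch below demonstrates this on a stand-in function. `show_arguments` and `example` are hypothetical names used only for illustration.

```python
import inspect


def show_arguments(func):
    """Return the call signature and docstring of any callable as text."""
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{inspect.signature(func)}\n\n{doc}"


def example(text, region="GB"):
    """Stand-in function used only to demonstrate the helper."""
    return text


print(show_arguments(example))
```

With `idscrub` installed, calling `show_arguments(IDScrub.spacy_entities)` should give roughly what `?IDScrub.spacy_entities` shows in a notebook.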
@@ -130,7 +176,7 @@ This project is managed by [uv](https://docs.astral.sh/uv/).
  To install all dependencies for this project, run:
 
  ```console
- uv sync --all-extras
+ uv sync
  ```
 
  If you do not have Python 3.12, run:
@@ -0,0 +1,24 @@
+ idscrub/__init__.py,sha256=cRugJv27q1q--bl-VNLpfiScJb_ROlUxyLFhaF55S1w,38
+ idscrub/locations.py,sha256=7fMNOcGMYe7sX8TrfhMW6oYGAlc1WVYVQKQbpxE3pqo,217
+ idscrub/scrub.py,sha256=ow5gDejdchXxi3iN1vEqlCaJGGkaEnvFJ8nAau6w9XI,45308
+ idscrub-2.0.0.dist-info/licenses/LICENSE,sha256=JJnuf10NSx7YXglte1oH_N9ZP3AcWR_Y8irvQb_wnsg,1090
+ notebooks/basic_usage.ipynb,sha256=kfCgdN8DnjIUd9wZs93mpLKMT5NA3E8ty8a6mQUAZGo,40195
+ test/conftest.py,sha256=CD_fYlo-qjkDgsW-ZRMZjW5Famqpt8fzN2HOcJ2MBDU,1450
+ test/test_dataframe.py,sha256=MWHQVJ_lCKLR86lnGGz9Nlke-pPQwvim11Cs5INxmh8,6395
+ test/test_errors.py,sha256=IA4vuYUfq_sEhHK-6Ba25y8WjUfHzODio53zjhLBKCU,944
+ test/test_exclude.py,sha256=TuO4W6TUVZyPurPst1ZSbqHCRFCvrruLkNSoxks_arI,601
+ test/test_group.py,sha256=um6UlWeUx94cDADoNd525z8-Uw02oO4clB2hP6iy4PY,235
+ test/test_huggingface.py,sha256=awIhTN4iBErvm1Mt6ZAHlgCBqSeSCl12X6t7sMtprD0,935
+ test/test_id.py,sha256=z2Za6sk-Km-r2FN6i03_q2IY5I21OkwfOtc4p9sRvJg,1108
+ test/test_label.py,sha256=uiUF72nZFt5GWE7HWO2ura3XU4hSKfe6ZTOyc7oRyzg,1053
+ test/test_overlap.py,sha256=kM6jePx94evk-VRkMzeH_UiuYPR7FFe63oBgLsvfvZw,2380
+ test/test_phonenumbers.py,sha256=LqAUzSYdYoQQyO_mAHYQkkRCgh4uArX4oxxKVhMi548,661
+ test/test_presidio.py,sha256=Y6kgPOZ_ie3shVThbNfJ2i7w1yifS5ukI4eAo1Lgm6k,1658
+ test/test_regex.py,sha256=Z_D8ttPYnRCr4gl13W9IEb7vS6QtWMReAc9Rufbtv50,6789
+ test/test_scrub.py,sha256=aelzLpF2S_5F754VcZtWeEzQIcUZ0ex0m6A8mTAnWQg,1591
+ test/test_scrub_text.py,sha256=WadP66U3jrSqXlbUrIrguDYSfhRKFJW73lR6eudC22o,605
+ test/test_spacy.py,sha256=5wIPnHirs6fx1Msz9IDDF8dxyh3t1asWaFEvSXeiBvU,1910
+ idscrub-2.0.0.dist-info/METADATA,sha256=ARod5-Azd--n-lWnZ3yHF4jt9MSjpPrcnMePzmpp8BM,8537
+ idscrub-2.0.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
+ idscrub-2.0.0.dist-info/top_level.txt,sha256=D4EEodXGCjGiX35ObiBTmjjBAdouN-eCvH-LezGGtks,23
+ idscrub-2.0.0.dist-info/RECORD,,
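Note on the RECORD rows above: each row is `path,algorithm=digest,size`. The digest is the URL-safe base64 encoding of the file's hash with trailing `=` padding removed, and the row for `RECORD` itself leaves hash and size empty. The sketch below parses a row and computes a digest in that form using only the standard library. The function names `record_digest` and `parse_record_line` are hypothetical.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style digest: URL-safe base64 of SHA-256, padding stripped."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size)."""
    path, hash_field, size = next(csv.reader([line]))
    algorithm, _, digest = hash_field.partition("=")
    # Empty fields (as on the RECORD row itself) are reported as None.
    return path, algorithm or None, digest or None, int(size) if size else None


row = "idscrub/__init__.py,sha256=cRugJv27q1q--bl-VNLpfiScJb_ROlUxyLFhaF55S1w,38"
print(parse_record_line(row))
print(parse_record_line("idscrub-2.0.0.dist-info/RECORD,,"))
```

To verify an unpacked wheel, compare `record_digest(open(path, "rb").read())` against the parsed digest and the byte length against the parsed size for each row.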