privfill-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- privfill-0.1.0/LICENSE +21 -0
- privfill-0.1.0/PKG-INFO +92 -0
- privfill-0.1.0/README.md +71 -0
- privfill-0.1.0/pyproject.toml +29 -0
- privfill-0.1.0/setup.cfg +4 -0
- privfill-0.1.0/src/privfill/__init__.py +35 -0
- privfill-0.1.0/src/privfill/main.py +71 -0
- privfill-0.1.0/src/privfill/mechanisms.py +197 -0
- privfill-0.1.0/src/privfill.egg-info/PKG-INFO +92 -0
- privfill-0.1.0/src/privfill.egg-info/SOURCES.txt +11 -0
- privfill-0.1.0/src/privfill.egg-info/dependency_links.txt +1 -0
- privfill-0.1.0/src/privfill.egg-info/requires.txt +7 -0
- privfill-0.1.0/src/privfill.egg-info/top_level.txt +1 -0
privfill-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025-2026 Stephen Meisenbacher
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
privfill-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,92 @@
+Metadata-Version: 2.4
+Name: privfill
+Version: 0.1.0
+Summary: LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models.
+Author-email: Stephen Meisenbacher <stephen.meisenbacher@tum.de>
+License: MIT
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas
+Requires-Dist: nltk
+Requires-Dist: numpy
+Requires-Dist: tqdm
+Requires-Dist: torch
+Requires-Dist: transformers
+Requires-Dist: mpmath
+Dynamic: license-file
+
+<div align="center">
+
+# PrivFill
+
+[PyPI](https://pypi.org/project/privfill/)
+[GitHub stars](https://github.com/sjmeis/PrivFill/stargazers)
+[License](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
+
+</div>
+
+`privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
+This software was originally presented in the NAACL 2025 Findings paper *On the Impact of Noise in Differentially Private Text Rewriting*.
+
+## Installation
+
+Install the package from PyPI:
+
+```bash
+pip install privfill
+```
+
+### Core Prerequisites:
+
+- Python $\geq$ 3.9
+- PyTorch (CUDA recommended for faster inference)
+- Transformers & NLTK
+
+## Basic Usage & Model Selection
+Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
+
+```python
+import privfill
+
+# Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
+
+text = "This is a long private document ... which contains sensitive information and should be privatized."
+private_text = engine.privatize(text, epsilon=10)
+
+print(private_text)
+```
+
+As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. The usage is very similar:
+
+```python
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
+private_text = engine.privatize(text)
+```
+
+### Available Models
+
+| Enum                          | Hugging Face Repository              | Base Mechanism          |
+|-------------------------------|--------------------------------------|-------------------------|
+| SupportedModels.FLAN_T5_BASE  | sjmeis/flan-t5-base-infill-combined  | DP-Prompt               |
+| SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt               |
+| SupportedModels.BART_LARGE    | sjmeis/bart-large-infill-combined    | DP-BART                 |
+
+## Models ##
+We make our three sentence-infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
+
+## Comparison Code ##
+The package also includes the `DP-BART` and `DP-Prompt` comparison mechanisms used in the paper, in `privfill.mechanisms`.
+
+```python
+X = privfill.mechanisms.DPPrompt()
+# or
+X = privfill.mechanisms.DPBart()
+
+# then
+X.privatize(text, epsilon)
+```
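The metadata header above uses the `Key: value` layout of Python core metadata, which the standard-library `email` parser can read. The following is a minimal sketch, assuming the header text is available as a string; `PKG_INFO_HEADER` and `read_requirements` are illustrative names and not part of `privfill`.

```python
from email.parser import Parser

# Excerpt of the PKG-INFO header shown above (illustrative constant).
PKG_INFO_HEADER = """\
Metadata-Version: 2.4
Name: privfill
Version: 0.1.0
Requires-Python: >=3.9
Requires-Dist: pandas
Requires-Dist: nltk
Requires-Dist: numpy
Requires-Dist: tqdm
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: mpmath
"""


def read_requirements(metadata_text: str) -> list:
    """Return every Requires-Dist value found in core-metadata text."""
    message = Parser().parsestr(metadata_text)
    # get_all returns None when the header is absent, so fall back to [].
    return message.get_all("Requires-Dist") or []


metadata = Parser().parsestr(PKG_INFO_HEADER)
print(metadata["Name"], metadata["Version"])
print(read_requirements(PKG_INFO_HEADER))
```

For an installed distribution, `importlib.metadata` exposes the same fields without manual parsing.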
privfill-0.1.0/README.md
ADDED
@@ -0,0 +1,71 @@
+<div align="center">
+
+# PrivFill
+
+[PyPI](https://pypi.org/project/privfill/)
+[GitHub stars](https://github.com/sjmeis/PrivFill/stargazers)
+[License](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
+
+</div>
+
+`privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
+This software was originally presented in the NAACL 2025 Findings paper *On the Impact of Noise in Differentially Private Text Rewriting*.
+
+## Installation
+
+Install the package from PyPI:
+
+```bash
+pip install privfill
+```
+
+### Core Prerequisites:
+
+- Python $\geq$ 3.9
+- PyTorch (CUDA recommended for faster inference)
+- Transformers & NLTK
+
+## Basic Usage & Model Selection
+Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
+
+```python
+import privfill
+
+# Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
+
+text = "This is a long private document ... which contains sensitive information and should be privatized."
+private_text = engine.privatize(text, epsilon=10)
+
+print(private_text)
+```
+
+As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. The usage is very similar:
+
+```python
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
+private_text = engine.privatize(text)
+```
+
+### Available Models
+
+| Enum                          | Hugging Face Repository              | Base Mechanism          |
+|-------------------------------|--------------------------------------|-------------------------|
+| SupportedModels.FLAN_T5_BASE  | sjmeis/flan-t5-base-infill-combined  | DP-Prompt               |
+| SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt               |
+| SupportedModels.BART_LARGE    | sjmeis/bart-large-infill-combined    | DP-BART                 |
+
+## Models ##
+We make our three sentence-infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
+
+## Comparison Code ##
+The package also includes the `DP-BART` and `DP-Prompt` comparison mechanisms used in the paper, in `privfill.mechanisms`.
+
+```python
+X = privfill.mechanisms.DPPrompt()
+# or
+X = privfill.mechanisms.DPBart()
+
+# then
+X.privatize(text, epsilon)
+```
privfill-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,29 @@
+[build-system]
+requires = ["setuptools>=61.0.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "privfill"
+version = "0.1.0"
+description = "LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models."
+readme = "README.md"
+authors = [{ name = "Stephen Meisenbacher", email = "stephen.meisenbacher@tum.de" }]
+license = { text = "MIT" }
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+requires-python = ">=3.9"
+dependencies = [
+    "pandas",
+    "nltk",
+    "numpy",
+    "tqdm",
+    "torch",
+    "transformers",
+    "mpmath"
+]
+
+[tool.setuptools.packages.find]
+where = ["src"]
privfill-0.1.0/setup.cfg
ADDED
privfill-0.1.0/src/privfill/__init__.py
ADDED
@@ -0,0 +1,35 @@
+from enum import Enum
+from .main import PrivFill, PrivFillDPBart, PrivFillDP
+
+class SupportedModels(Enum):
+    FLAN_T5_BASE = "sjmeis/flan-t5-base-infill-combined"
+    FLAN_T5_LARGE = "sjmeis/flan-t5-large-infill-combined"
+    BART_LARGE = "sjmeis/bart-large-infill-combined"
+
+def load_pipeline(model_choice: SupportedModels, DP: bool = False, **kwargs):
+    """
+    Loads the appropriate privatization engine based on model choice and DP toggle.
+
+    Args:
+        model_choice (SupportedModels): The chosen model from the Enum.
+        DP (bool): If True, applies the model's Differential Privacy mechanism.
+                   If False, falls back to the standard PrivFill wrapper.
+    """
+    if not isinstance(model_choice, SupportedModels):
+        raise ValueError(
+            "Invalid model choice. Please choose an option from privfill.SupportedModels. "
+            f"Available choices: {list(SupportedModels.__members__.keys())}"
+        )
+
+    checkpoint = model_choice.value
+
+    if DP:
+        if model_choice == SupportedModels.BART_LARGE:
+            return PrivFillDPBart(model_checkpoint=checkpoint, **kwargs)
+        else:
+            return PrivFillDP(model_checkpoint=checkpoint, **kwargs)
+    else:
+        return PrivFill(model_checkpoint=checkpoint, **kwargs)
+
+
+__all__ = ["PrivFill", "PrivFillDPBart", "PrivFillDP", "SupportedModels", "load_pipeline"]
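The dispatch performed by `load_pipeline` above can be exercised without downloading any model. The sketch below mirrors only the branching logic; `select_engine_name` is a hypothetical helper introduced here for illustration, and it returns class names as strings instead of instantiating the real wrappers.

```python
from enum import Enum


class SupportedModels(Enum):
    # Same members and values as the enum in the hunk above.
    FLAN_T5_BASE = "sjmeis/flan-t5-base-infill-combined"
    FLAN_T5_LARGE = "sjmeis/flan-t5-large-infill-combined"
    BART_LARGE = "sjmeis/bart-large-infill-combined"


def select_engine_name(model_choice: SupportedModels, DP: bool = False) -> str:
    """Return the name of the wrapper class that the dispatch would pick."""
    if not isinstance(model_choice, SupportedModels):
        raise ValueError("model_choice must be a SupportedModels member")
    if not DP:
        # Non-DP infilling uses the same wrapper for every checkpoint.
        return "PrivFill"
    if model_choice is SupportedModels.BART_LARGE:
        # The BART checkpoint pairs with the DP-BART mechanism.
        return "PrivFillDPBart"
    # Both FLAN-T5 checkpoints pair with the DP-Prompt mechanism.
    return "PrivFillDP"


for member in SupportedModels:
    print(member.name, select_engine_name(member, DP=True), select_engine_name(member, DP=False))
```

Note that `DP` defaults to `False`, so calling `load_pipeline` with only a model choice yields the non-DP `PrivFill` wrapper, whose `privatize` takes no `epsilon` argument.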
privfill-0.1.0/src/privfill/main.py
ADDED
@@ -0,0 +1,71 @@
+import nltk
+import torch
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+from privfill.mechanisms import DPBart, DPPrompt
+
+class PrivFill:
+    def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512, base_model=None):
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.model_checkpoint = model_checkpoint
+        self.max_new_tokens = max_new_tokens
+        self.max_input_length = max_input_length
+        self.base_model = base_model
+
+        self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
+        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_checkpoint).to(self.device)
+
+    def privatize(self, text):
+        sentences = nltk.sent_tokenize(text)
+        replace = []
+        for s in sentences:
+            temp = text.replace(s, "[blank]")  # mask the current sentence
+            inputs = [temp]
+            inputs = self.tokenizer(inputs, max_length=self.max_input_length, truncation=True, return_tensors="pt").input_ids.to(self.device)
+            output = self.model.generate(inputs, min_new_tokens=5, do_sample=True, max_new_tokens=self.max_new_tokens, pad_token_id=self.tokenizer.pad_token_id)
+            decoded_output = self.tokenizer.decode(output[0], skip_special_tokens=True).replace(temp, "")
+
+            if self.base_model is None:
+                replace.append(decoded_output)
+            else:
+                replace.append(nltk.sent_tokenize(decoded_output.strip())[0])
+        return " ".join(replace)
+
+
+class PrivFillDPBart:
+    def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512):
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.model_checkpoint = model_checkpoint
+        self.max_new_tokens = max_new_tokens
+        self.max_input_length = max_input_length
+
+        self.model = DPBart(model=model_checkpoint)
+
+    def privatize(self, text, epsilon):
+        sentences = nltk.sent_tokenize(text)
+        eps = epsilon / len(sentences)  # split the budget evenly across sentences
+        inputs = []
+        for s in sentences:
+            temp = text.replace(s, "[blank]")
+            inputs.append(temp)
+
+        return self.model.privatize_batch(inputs, epsilon=eps)
+
+
+class PrivFillDP:
+    def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512):
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.model_checkpoint = model_checkpoint
+        self.max_new_tokens = max_new_tokens
+        self.max_input_length = max_input_length
+
+        self.model = DPPrompt(model_checkpoint=model_checkpoint)
+
+    def privatize(self, text, epsilon):
+        sentences = nltk.sent_tokenize(text)
+        inputs = []
+        for s in sentences:
+            temp = text.replace(s, "[blank]")
+            inputs.append(temp)
+
+        output = self.model.privatize_dp(inputs, epsilon, max_new_tokens=self.max_new_tokens)
+        return " ".join(output)
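All three wrappers above build their model inputs the same way: each sentence in turn is replaced by the literal `[blank]` placeholder while the rest of the document is kept as context. The sketch below reproduces that step with the standard library only. The regex splitter is a rough stand-in for `nltk.sent_tokenize`, and the helper names are illustrative rather than part of `privfill`.

```python
import re


def split_sentences(text: str) -> list:
    """Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_infill_inputs(text: str, placeholder: str = "[blank]") -> list:
    """Create one masked copy of the document per sentence."""
    return [text.replace(sentence, placeholder) for sentence in split_sentences(text)]


def per_sentence_epsilon(epsilon: float, num_sentences: int) -> float:
    """Even budget split, as done in PrivFillDPBart.privatize."""
    if num_sentences < 1:
        raise ValueError("the document must contain at least one sentence")
    return epsilon / num_sentences


text = "Alice lives in Munich. She works at a bank. Her salary is high."
inputs = build_infill_inputs(text)
for masked in inputs:
    print(masked)
print(per_sentence_epsilon(30, len(inputs)))
```

In the hunk above, only `PrivFillDPBart` divides `epsilon` by the number of sentences; `PrivFillDP` passes the value it receives to the mechanism unchanged. Because `str.replace` substitutes every occurrence, a sentence that appears more than once in the document is masked at each position.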
privfill-0.1.0/src/privfill/mechanisms.py
ADDED
@@ -0,0 +1,197 @@
+import numpy as np
+import torch
+from torch.utils.data import Dataset
+from transformers import (
+    AutoModelForSeq2SeqLM,
+    AutoTokenizer,
+    LogitsProcessor,
+    LogitsProcessorList,
+    pipeline,
+    BartTokenizer,
+    BartModel,
+    BartForConditionalGeneration
+)
+import mpmath
+from mpmath import mp
+import nltk
+
+try:
+    nltk.data.find('tokenizers/punkt')
+except LookupError:
+    nltk.download('punkt', quiet=True)
+
+class ListDataset(Dataset):
+    def __init__(self, original_list):
+        self.original_list = original_list
+
+    def __len__(self):
+        return len(self.original_list)
+
+    def __getitem__(self, i):
+        return self.original_list[i]
+
+
+class ClipLogitsProcessor(LogitsProcessor):
+    def __init__(self, min=-100, max=100):
+        self.min = min
+        self.max = max
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        return torch.clamp(scores, min=self.min, max=self.max)
+
+
+class DPPrompt:
+    def __init__(self, model_checkpoint="google/flan-t5-large", min_logit=-95, max_logit=8, batch_size=16):
+        self.model_checkpoint = model_checkpoint
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.batch_size = batch_size
+
+        self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
+        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_checkpoint).to(self.device)
+
+        self.min_logit = min_logit
+        self.max_logit = max_logit
+        self.sensitivity = abs(self.max_logit - self.min_logit)  # width of the clipping range
+        self.logits_processor = LogitsProcessorList([ClipLogitsProcessor(self.min_logit, self.max_logit)])
+
+        self.pipe = pipeline("text2text-generation", model=self.model, tokenizer=self.tokenizer, device=0 if self.device == "cuda" else -1, truncation=True)
+        self.pipe.tokenizer.pad_token_id = self.model.config.eos_token_id
+
+    def prompt_template_fn(self, doc):
+        return f"Document : {doc}\nParaphrase of the document :"
+
+    def privatize(self, text, epsilon=100):
+        temperature = 2 * self.sensitivity / epsilon  # smaller epsilon gives a higher temperature
+        prompt = self.prompt_template_fn(text)
+        model_inputs = self.tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt").to(self.device)
+
+        output = self.model.generate(
+            **model_inputs,
+            do_sample=True,
+            top_k=0,
+            top_p=1.0,
+            temperature=temperature,
+            max_new_tokens=len(model_inputs["input_ids"][0]),
+            logits_processor=self.logits_processor
+        )
+        return self.tokenizer.decode(output[0], skip_special_tokens=True)
+
+    def privatize_dp(self, texts, epsilon=100, max_new_tokens=32):
+        temperature = 2 * self.sensitivity / epsilon
+        prompts = ListDataset(texts)
+        private_texts = []
+        for r in self.pipe(prompts, do_sample=True, top_k=0, top_p=1.0, temperature=temperature, logits_processor=self.logits_processor, max_new_tokens=max_new_tokens, batch_size=self.batch_size):
+            private_texts.append(r[0]["generated_text"])
+        return private_texts
+
+
+class DPBart:
+    def __init__(self, model='facebook/bart-large', num_sigmas=1/2):
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.tokenizer = BartTokenizer.from_pretrained(model)
+        self.model = BartModel.from_pretrained(model).to(self.device)
+        self.decoder = BartForConditionalGeneration.from_pretrained(model).to(self.device)
+
+        self.delta = 1e-5
+        self.sigma = 0.2
+        self.num_sigmas = num_sigmas
+        self.c_min = -self.sigma
+        self.c_max = self.num_sigmas * self.sigma
+
+    def clip(self, vector):
+        return torch.clip(vector, self.c_min, self.c_max)
+
+    def calibrateAnalyticGaussianMechanism_precision(self, epsilon, delta, GS, tol=1.e-12):
+        if epsilon <= 1000:
+            mp.dps = 500
+        elif epsilon <= 2500:
+            mp.dps = 1100
+        else:
+            mp.dps = 2200
+
+        def Phi(t):
+            return 0.5 * (1.0 + mpmath.erf(t / mpmath.sqrt(2.0)))
+
+        def caseA(eps, s):
+            return Phi(mpmath.sqrt(eps * s)) - mpmath.exp(eps) * Phi(-mpmath.sqrt(eps * (s + 2.0)))
+
+        def caseB(eps, s):
+            return Phi(-mpmath.sqrt(eps * s)) - mpmath.exp(eps) * Phi(-mpmath.sqrt(eps * (s + 2.0)))
+
+        def doubling_trick(predicate_stop, s_inf, s_sup):
+            while not predicate_stop(s_sup):
+                s_inf = s_sup
+                s_sup = 2.0 * s_inf
+            return s_inf, s_sup
+
+        def binary_search(predicate_stop, predicate_left, s_inf, s_sup):
+            s_mid = s_inf + (s_sup - s_inf) / 2.0
+            while not predicate_stop(s_mid):
+                if predicate_left(s_mid):
+                    s_sup = s_mid
+                else:
+                    s_inf = s_mid
+                s_mid = s_inf + (s_sup - s_inf) / 2.0
+            return s_mid
+
+        delta_thr = caseA(epsilon, 0.0)
+
+        if delta == delta_thr:
+            alpha = 1.0
+        else:
+            if delta > delta_thr:
+                predicate_stop_DT = lambda s: caseA(epsilon, s) >= delta
+                func_s_to_delta = lambda s: caseA(epsilon, s)
+                predicate_left_BS = lambda s: func_s_to_delta(s) > delta
+                func_s_to_alpha = lambda s: mpmath.sqrt(1.0 + s / 2.0) - mpmath.sqrt(s / 2.0)
+            else:
+                predicate_stop_DT = lambda s: caseB(epsilon, s) <= delta
+                func_s_to_delta = lambda s: caseB(epsilon, s)
+                predicate_left_BS = lambda s: func_s_to_delta(s) < delta
+                func_s_to_alpha = lambda s: mpmath.sqrt(1.0 + s / 2.0) + mpmath.sqrt(s / 2.0)
+
+            predicate_stop_BS = lambda s: abs(func_s_to_delta(s) - delta) <= tol
+            s_inf, s_sup = doubling_trick(predicate_stop_DT, 0.0, 1.0)
+            s_final = binary_search(predicate_stop_BS, predicate_left_BS, s_inf, s_sup)
+            alpha = func_s_to_alpha(s_final)
+
+        sigma = alpha * GS / mpmath.sqrt(2.0 * epsilon)
+        return float(sigma)
+
+    def noise(self, vector, epsilon, delta=1e-5, method="analytic_gaussian"):
+        k = vector.shape[-1]
+        if method == "laplace":
+            sensitivity = 2 * self.sigma * self.num_sigmas * k
+            Z = torch.from_numpy(np.random.laplace(0, sensitivity / epsilon, size=k))
+        elif method == 'gaussian':
+            sensitivity = 2 * self.sigma * self.num_sigmas * np.sqrt(k)
+            scale = np.sqrt((sensitivity**2 / epsilon**2) * 2 * np.log(1.25 / delta))
+            Z = torch.from_numpy(np.random.normal(0, scale, size=k))
+        elif method == "analytic_gaussian":
+            sensitivity = 2 * self.sigma * self.num_sigmas * np.sqrt(k)
+            analytic_scale = self.calibrateAnalyticGaussianMechanism_precision(epsilon, delta, sensitivity)
+            Z = torch.from_numpy(np.random.normal(0, analytic_scale, size=k))
+        return vector + Z
+
+    def privatize(self, text, epsilon=100, method="analytic_gaussian"):
+        inputs = self.tokenizer(text, max_length=512, truncation=True, return_tensors="pt").to(self.device)
+        num_tokens = len(inputs["input_ids"][0])
+
+        enc_output = self.model.encoder(**inputs)
+        enc_output["last_hidden_state"] = self.noise(self.clip(enc_output["last_hidden_state"].cpu()), epsilon=epsilon, delta=self.delta, method=method).float().to(self.device)
+
+        dec_out = self.decoder.generate(encoder_outputs=enc_output, max_new_tokens=num_tokens)
+        private_text = self.tokenizer.decode(dec_out[0], skip_special_tokens=True)
+        return private_text.strip()
+
+    def privatize_batch(self, texts, epsilon=100, method="analytic_gaussian"):
+        inputs = self.tokenizer(texts, max_length=512, truncation=True, padding=True, return_tensors="pt").to(self.device)
+        num_tokens = [len(x) for x in inputs["input_ids"]]
+
+        enc_output = self.model.encoder(**inputs)
+        for i, x in enumerate(enc_output["last_hidden_state"].cpu()):
+            enc_output["last_hidden_state"][i] = self.noise(self.clip(x), epsilon=epsilon, delta=self.delta, method=method).float().to(self.device)
+
+        dec_out = self.decoder.generate(encoder_outputs=enc_output, max_new_tokens=max(num_tokens))
+        private_text = [self.tokenizer.decode(x, skip_special_tokens=True).strip() for x in dec_out]
+        return " ".join(private_text)
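The two mechanisms above turn `epsilon` into noise in different ways: `DPPrompt` clips logits and samples at temperature `2 * sensitivity / epsilon`, while the `gaussian` branch of `DPBart.noise` scales Gaussian noise by `sqrt((sensitivity**2 / epsilon**2) * 2 * ln(1.25 / delta))`. The sketch below restates both formulas with the standard library so their behaviour can be checked without loading a model. The function names and the toy logits are illustrative assumptions, and the analytic Gaussian calibration is not reproduced here.

```python
import math


def dp_prompt_temperature(epsilon, min_logit=-95.0, max_logit=8.0):
    """Sampling temperature from the DPPrompt formula: 2 * sensitivity / epsilon."""
    sensitivity = abs(max_logit - min_logit)
    return 2.0 * sensitivity / epsilon


def clip_logits(logits, min_logit=-95.0, max_logit=8.0):
    """Clamp each logit into [min_logit, max_logit], like ClipLogitsProcessor."""
    return [min(max(x, min_logit), max_logit) for x in logits]


def sampling_probabilities(logits, temperature):
    """Softmax over temperature-scaled logits (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def classic_gaussian_scale(sensitivity, epsilon, delta=1e-5):
    """Noise scale used by the 'gaussian' branch of DPBart.noise."""
    return math.sqrt((sensitivity ** 2 / epsilon ** 2) * 2.0 * math.log(1.25 / delta))


toy_logits = clip_logits([-200.0, 0.0, 50.0])  # toy values, not model output
for eps in (10, 100, 1000):
    temperature = dp_prompt_temperature(eps)
    print(eps, temperature, sampling_probabilities(toy_logits, temperature))

# Sensitivity for a 1024-dimensional vector with sigma=0.2 and num_sigmas=0.5.
sensitivity = 2 * 0.2 * 0.5 * math.sqrt(1024)
print(classic_gaussian_scale(sensitivity, epsilon=100))
```

A smaller `epsilon` raises the temperature and flattens the sampling distribution, and it also enlarges the Gaussian noise scale; in both cases the output depends less on the input.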
privfill-0.1.0/src/privfill.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,92 @@
+Metadata-Version: 2.4
+Name: privfill
+Version: 0.1.0
+Summary: LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models.
+Author-email: Stephen Meisenbacher <stephen.meisenbacher@tum.de>
+License: MIT
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas
+Requires-Dist: nltk
+Requires-Dist: numpy
+Requires-Dist: tqdm
+Requires-Dist: torch
+Requires-Dist: transformers
+Requires-Dist: mpmath
+Dynamic: license-file
+
+<div align="center">
+
+# PrivFill
+
+[PyPI](https://pypi.org/project/privfill/)
+[GitHub stars](https://github.com/sjmeis/PrivFill/stargazers)
+[License](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
+
+</div>
+
+`privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
+This software was originally presented in the NAACL 2025 Findings paper *On the Impact of Noise in Differentially Private Text Rewriting*.
+
+## Installation
+
+Install the package from PyPI:
+
+```bash
+pip install privfill
+```
+
+### Core Prerequisites:
+
+- Python $\geq$ 3.9
+- PyTorch (CUDA recommended for faster inference)
+- Transformers & NLTK
+
+## Basic Usage & Model Selection
+Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
+
+```python
+import privfill
+
+# Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
+
+text = "This is a long private document ... which contains sensitive information and should be privatized."
+private_text = engine.privatize(text, epsilon=10)
+
+print(private_text)
+```
+
+As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. The usage is very similar:
+
+```python
+engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
+private_text = engine.privatize(text)
+```
+
+### Available Models
+
+| Enum                          | Hugging Face Repository              | Base Mechanism          |
+|-------------------------------|--------------------------------------|-------------------------|
+| SupportedModels.FLAN_T5_BASE  | sjmeis/flan-t5-base-infill-combined  | DP-Prompt               |
+| SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt               |
+| SupportedModels.BART_LARGE    | sjmeis/bart-large-infill-combined    | DP-BART                 |
+
+## Models ##
+We make our three sentence-infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
+
+## Comparison Code ##
+The package also includes the `DP-BART` and `DP-Prompt` comparison mechanisms used in the paper, in `privfill.mechanisms`.
+
+```python
+X = privfill.mechanisms.DPPrompt()
+# or
+X = privfill.mechanisms.DPBart()
+
+# then
+X.privatize(text, epsilon)
+```
privfill-0.1.0/src/privfill.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,11 @@
+LICENSE
+README.md
+pyproject.toml
+src/privfill/__init__.py
+src/privfill/main.py
+src/privfill/mechanisms.py
+src/privfill.egg-info/PKG-INFO
+src/privfill.egg-info/SOURCES.txt
+src/privfill.egg-info/dependency_links.txt
+src/privfill.egg-info/requires.txt
+src/privfill.egg-info/top_level.txt
privfill-0.1.0/src/privfill.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@
+
privfill-0.1.0/src/privfill.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
+privfill