privfill-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
privfill-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025-2026 Stephen Meisenbacher
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
privfill-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.4
2
+ Name: privfill
3
+ Version: 0.1.0
4
+ Summary: LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models.
5
+ Author-email: Stephen Meisenbacher <stephen.meisenbacher@tum.de>
6
+ License: MIT
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: License :: OSI Approved :: MIT License
9
+ Classifier: Operating System :: OS Independent
10
+ Requires-Python: >=3.9
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: pandas
14
+ Requires-Dist: nltk
15
+ Requires-Dist: numpy
16
+ Requires-Dist: tqdm
17
+ Requires-Dist: torch
18
+ Requires-Dist: transformers
19
+ Requires-Dist: mpmath
20
+ Dynamic: license-file
21
+
22
+ <div align="center">
23
+
24
+ # PrivFill
25
+
26
+ [![PyPI version](https://img.shields.io/pypi/v/privfill.svg)](https://pypi.org/project/privfill/)
27
+ [![GitHub stars](https://img.shields.io/github/stars/sjmeis/PrivFill.svg?style=social)](https://github.com/sjmeis/PrivFill/stargazers)
28
+ [![License](https://img.shields.io/github/license/sjmeis/PrivFill.svg)](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
29
+
30
+ </div>
31
+
32
+ `privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
33
+ This software was originally presented in the NAACL 2025 findings paper: *On the Impact of Noise in Differentially Private Text Rewriting*
34
+
35
+ ## Installation
36
+
37
+ Install the package from PyPI:
38
+
39
+ ```bash
40
+ pip install privfill
41
+ ```
42
+
43
+ ### Core Prerequisites:
44
+
45
+ - Python $\geq$ 3.9
46
+ - PyTorch (CUDA recommended for faster inference)
47
+ - Transformers & NLTK
48
+
49
+ ## Basic Usage & Model Selection
50
+ Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
51
+
52
+ ```python
53
+ import privfill
54
+
55
+ # Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
56
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
57
+
58
+ text = "This is a long private document ... which contains sensitive information and should be privatized."
59
+ private_text = engine.privatize(text, epsilon=10)
60
+
61
+ print(private_text)
62
+ ```
63
+
64
+ As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. Its usage is very similar:
65
+
66
+ ```python
67
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
68
+ private_text = engine.privatize(text)
69
+ ```
70
+
71
+ ### Available Models
72
+
73
+ | Enum | Hugging Face Repository | Base Mechanism |
74
+ |-------------------------------|--------------------------------------|-------------------------|
75
+ | SupportedModels.FLAN_T5_BASE | sjmeis/flan-t5-base-infill-combined | DP-Prompt |
76
+ | SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt |
77
+ | SupportedModels.BART_LARGE | sjmeis/bart-large-infill-combined | DP-BART |
78
+
79
+ ## Models ##
80
+ We make our three sentence infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
81
+
82
+ ## Comparison Code ##
83
+ We also include the `DPPrompt` and `DPBart` classes (in the `privfill.mechanisms` module) that implement `DP-Prompt` and `DP-BART`, as used in the paper.
84
+
85
+ ```python
86
+ X = privfill.mechanisms.DPPrompt()
87
+ # or
88
+ X = privfill.mechanisms.DPBart()
89
+
90
+ # then
91
+ X.privatize(text, epsilon)
92
+ ```
privfill-0.1.0/README.md ADDED
@@ -0,0 +1,71 @@
1
+ <div align="center">
2
+
3
+ # PrivFill
4
+
5
+ [![PyPI version](https://img.shields.io/pypi/v/privfill.svg)](https://pypi.org/project/privfill/)
6
+ [![GitHub stars](https://img.shields.io/github/stars/sjmeis/PrivFill.svg?style=social)](https://github.com/sjmeis/PrivFill/stargazers)
7
+ [![License](https://img.shields.io/github/license/sjmeis/PrivFill.svg)](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
8
+
9
+ </div>
10
+
11
+ `privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
12
+ This software was originally presented in the NAACL 2025 findings paper: *On the Impact of Noise in Differentially Private Text Rewriting*
13
+
14
+ ## Installation
15
+
16
+ Install the package from PyPI:
17
+
18
+ ```bash
19
+ pip install privfill
20
+ ```
21
+
22
+ ### Core Prerequisites:
23
+
24
+ - Python $\geq$ 3.9
25
+ - PyTorch (CUDA recommended for faster inference)
26
+ - Transformers & NLTK
27
+
28
+ ## Basic Usage & Model Selection
29
+ Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
30
+
31
+ ```python
32
+ import privfill
33
+
34
+ # Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
35
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
36
+
37
+ text = "This is a long private document ... which contains sensitive information and should be privatized."
38
+ private_text = engine.privatize(text, epsilon=10)
39
+
40
+ print(private_text)
41
+ ```
42
+
43
+ As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. Its usage is very similar:
44
+
45
+ ```python
46
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
47
+ private_text = engine.privatize(text)
48
+ ```
49
+
50
+ ### Available Models
51
+
52
+ | Enum | Hugging Face Repository | Base Mechanism |
53
+ |-------------------------------|--------------------------------------|-------------------------|
54
+ | SupportedModels.FLAN_T5_BASE | sjmeis/flan-t5-base-infill-combined | DP-Prompt |
55
+ | SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt |
56
+ | SupportedModels.BART_LARGE | sjmeis/bart-large-infill-combined | DP-BART |
57
+
58
+ ## Models ##
59
+ We make our three sentence infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
60
+
61
+ ## Comparison Code ##
62
+ We also include the `DPPrompt` and `DPBart` classes (in the `privfill.mechanisms` module) that implement `DP-Prompt` and `DP-BART`, as used in the paper.
63
+
64
+ ```python
65
+ X = privfill.mechanisms.DPPrompt()
66
+ # or
67
+ X = privfill.mechanisms.DPBart()
68
+
69
+ # then
70
+ X.privatize(text, epsilon)
71
+ ```
privfill-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,29 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "privfill"
7
+ version = "0.1.0"
8
+ description = "LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models."
9
+ readme = "README.md"
10
+ authors = [{ name = "Stephen Meisenbacher", email = "stephen.meisenbacher@tum.de" }]
11
+ license = { text = "MIT" }
12
+ classifiers = [
13
+ "Programming Language :: Python :: 3",
14
+ "License :: OSI Approved :: MIT License",
15
+ "Operating System :: OS Independent",
16
+ ]
17
+ requires-python = ">=3.9"
18
+ dependencies = [
19
+ "pandas",
20
+ "nltk",
21
+ "numpy",
22
+ "tqdm",
23
+ "torch",
24
+ "transformers",
25
+ "mpmath"
26
+ ]
27
+
28
+ [tool.setuptools.packages.find]
29
+ where = ["src"]
privfill-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
privfill-0.1.0/src/privfill/__init__.py ADDED
@@ -0,0 +1,35 @@
1
+ from enum import Enum
2
+ from .main import PrivFill, PrivFillDPBart, PrivFillDP
3
+
4
+ class SupportedModels(Enum):
5
+ FLAN_T5_BASE = "sjmeis/flan-t5-base-infill-combined"
6
+ FLAN_T5_LARGE = "sjmeis/flan-t5-large-infill-combined"
7
+ BART_LARGE = "sjmeis/bart-large-infill-combined"
8
+
9
+ def load_pipeline(model_choice: SupportedModels, DP: bool = False, **kwargs):
10
+ """
11
+ Loads the appropriate privatization engine based on model choice and DP toggle.
12
+
13
+ Args:
14
+ model_choice (SupportedModels): The chosen model from the Enum.
15
+         DP (bool): If True, applies the model's Differential Privacy mechanism.
16
+ If False, falls back to the standard PrivFill wrapper.
17
+ """
18
+ if not isinstance(model_choice, SupportedModels):
19
+ raise ValueError(
20
+ f"Invalid model choice. Please choose an option from privfill.SupportedModels. "
21
+ f"Available choices: {list(SupportedModels.__members__.keys())}"
22
+ )
23
+
24
+ checkpoint = model_choice.value
25
+
26
+ if DP:
27
+ if model_choice == SupportedModels.BART_LARGE:
28
+ return PrivFillDPBart(model_checkpoint=checkpoint, **kwargs)
29
+ else:
30
+ return PrivFillDP(model_checkpoint=checkpoint, **kwargs)
31
+ else:
32
+ return PrivFill(model_checkpoint=checkpoint, **kwargs)
33
+
34
+
35
+ __all__ = ["PrivFill", "PrivFillDPBart", "PrivFillDP", "SupportedModels", "load_pipeline"]
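The dispatch in `load_pipeline` above reduces to a two-level decision on the enum member and the `DP` flag. A minimal sketch of that mapping follows. It returns the wrapper class names as plain strings so that no model is downloaded; the `select_engine` helper is hypothetical and not part of the package.

```python
from enum import Enum


class SupportedModels(Enum):
    # Same members and checkpoint values as in the package's __init__.py.
    FLAN_T5_BASE = "sjmeis/flan-t5-base-infill-combined"
    FLAN_T5_LARGE = "sjmeis/flan-t5-large-infill-combined"
    BART_LARGE = "sjmeis/bart-large-infill-combined"


def select_engine(model_choice, DP=False):
    """Hypothetical helper: name of the wrapper class that load_pipeline
    would instantiate for this combination, without loading any model."""
    if not isinstance(model_choice, SupportedModels):
        raise ValueError("model_choice must be a SupportedModels member")
    if not DP:
        # Non-DP path: every checkpoint goes through the plain PrivFill wrapper.
        return "PrivFill"
    if model_choice is SupportedModels.BART_LARGE:
        # DP path for BART: noise is added to the encoder output (DP-BART).
        return "PrivFillDPBart"
    # DP path for the FLAN-T5 checkpoints: temperature sampling (DP-Prompt).
    return "PrivFillDP"


for member in SupportedModels:
    print(member.name, select_engine(member, DP=True), select_engine(member, DP=False))
```

Passing a raw repository string instead of an enum member raises `ValueError`, which mirrors the `isinstance` check in `load_pipeline`.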
privfill-0.1.0/src/privfill/main.py ADDED
@@ -0,0 +1,71 @@
1
+ import nltk
2
+ import torch
3
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
4
+ from privfill.mechanisms import DPBart, DPPrompt
5
+
6
+ class PrivFill:
7
+ def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512, base_model=None):
8
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
9
+ self.model_checkpoint = model_checkpoint
10
+ self.max_new_tokens = max_new_tokens
11
+ self.max_input_length = max_input_length
12
+ self.base_model = base_model
13
+
14
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
15
+ self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_checkpoint).to(self.device)
16
+
17
+ def privatize(self, text):
18
+ sentences = nltk.sent_tokenize(text)
19
+ replace = []
20
+ for s in sentences:
21
+ temp = text.replace(s, "[blank]")
22
+ inputs = [temp]
23
+ inputs = self.tokenizer(inputs, max_length=self.max_input_length, truncation=True, return_tensors="pt").input_ids.to(self.device)
24
+             output = self.model.generate(inputs, min_new_tokens=5, do_sample=True, max_new_tokens=self.max_new_tokens, pad_token_id=self.tokenizer.pad_token_id)
25
+ decoded_output = self.tokenizer.decode(output[0], skip_special_tokens=True).replace(temp, "")
26
+
27
+ if self.base_model is None:
28
+ replace.append(decoded_output)
29
+ else:
30
+ replace.append(nltk.sent_tokenize(decoded_output.strip())[0])
31
+ return " ".join(replace)
32
+
33
+
34
+ class PrivFillDPBart:
35
+ def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512):
36
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
37
+ self.model_checkpoint = model_checkpoint
38
+ self.max_new_tokens = max_new_tokens
39
+ self.max_input_length = max_input_length
40
+
41
+ self.model = DPBart(model=model_checkpoint)
42
+
43
+ def privatize(self, text, epsilon):
44
+ sentences = nltk.sent_tokenize(text)
45
+ eps = epsilon / len(sentences)
46
+ inputs = []
47
+ for s in sentences:
48
+ temp = text.replace(s, "[blank]")
49
+ inputs.append(temp)
50
+
51
+ return self.model.privatize_batch(inputs, epsilon=eps)
52
+
53
+
54
+ class PrivFillDP:
55
+ def __init__(self, model_checkpoint, max_new_tokens=32, max_input_length=512):
56
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
57
+ self.model_checkpoint = model_checkpoint
58
+ self.max_new_tokens = max_new_tokens
59
+ self.max_input_length = max_input_length
60
+
61
+ self.model = DPPrompt(model_checkpoint=model_checkpoint)
62
+
63
+ def privatize(self, text, epsilon):
64
+ sentences = nltk.sent_tokenize(text)
65
+ inputs = []
66
+ for s in sentences:
67
+ temp = text.replace(s, "[blank]")
68
+ inputs.append(temp)
69
+
70
+ output = self.model.privatize_dp(inputs, epsilon)
71
+ return " ".join(output)
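All three wrappers in `main.py` build their model inputs the same way: the document is split into sentences, and for each sentence a copy of the full document is produced with that sentence replaced by `[blank]`. The sketch below isolates that step. It assumes a simple regex splitter in place of `nltk.sent_tokenize` so that it runs without NLTK data; the helper names are illustrative only.

```python
import re


def split_sentences(text):
    """Stand-in splitter (assumption): break after '.', '!' or '?' followed by
    whitespace. The package itself uses nltk.sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_infill_inputs(text, placeholder="[blank]"):
    """One input per sentence: the full document with that sentence masked."""
    return [text.replace(s, placeholder) for s in split_sentences(text)]


def per_sentence_epsilon(epsilon, n_sentences):
    """Even split of the document-level budget across sentences."""
    if n_sentences == 0:
        raise ValueError("no sentences to privatize")
    return epsilon / n_sentences


doc = "Alice lives in Munich. She works at a bank. Her salary is high."
inputs = build_infill_inputs(doc)
for masked in inputs:
    print(masked)
print(per_sentence_epsilon(10, len(inputs)))
```

Two details of the source are worth noting. `PrivFillDPBart` divides `epsilon` by the number of sentences before calling the mechanism, whereas `PrivFillDP` passes `epsilon` unchanged for every masked input. Also, because `str.replace` substitutes every occurrence, a sentence that appears verbatim more than once is masked at all of its positions.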
privfill-0.1.0/src/privfill/mechanisms.py ADDED
@@ -0,0 +1,197 @@
1
+ import numpy as np
2
+ import torch
3
+ from torch.utils.data import Dataset
4
+ from transformers import (
5
+ AutoModelForSeq2SeqLM,
6
+ AutoTokenizer,
7
+ LogitsProcessor,
8
+ LogitsProcessorList,
9
+ pipeline,
10
+ BartTokenizer,
11
+ BartModel,
12
+ BartForConditionalGeneration
13
+ )
14
+ import mpmath
15
+ from mpmath import mp
16
+ import nltk
17
+
18
+ try:
19
+ nltk.data.find('tokenizers/punkt')
20
+ except LookupError:
21
+ nltk.download('punkt', quiet=True)
22
+
23
+ class ListDataset(Dataset):
24
+ def __init__(self, original_list):
25
+ self.original_list = original_list
26
+
27
+ def __len__(self):
28
+ return len(self.original_list)
29
+
30
+ def __getitem__(self, i):
31
+ return self.original_list[i]
32
+
33
+
34
+ class ClipLogitsProcessor(LogitsProcessor):
35
+ def __init__(self, min=-100, max=100):
36
+ self.min = min
37
+ self.max = max
38
+
39
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
40
+ return torch.clamp(scores, min=self.min, max=self.max)
41
+
42
+
43
+ class DPPrompt:
44
+ def __init__(self, model_checkpoint="google/flan-t5-large", min_logit=-95, max_logit=8, batch_size=16):
45
+ self.model_checkpoint = model_checkpoint
46
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
47
+ self.batch_size = batch_size
48
+
49
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
50
+ self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_checkpoint).to(self.device)
51
+
52
+ self.min_logit = min_logit
53
+ self.max_logit = max_logit
54
+ self.sensitivity = abs(self.max_logit - self.min_logit)
55
+ self.logits_processor = LogitsProcessorList([ClipLogitsProcessor(self.min_logit, self.max_logit)])
56
+
57
+ self.pipe = pipeline("text2text-generation", model=self.model, tokenizer=self.tokenizer, device=0 if self.device == "cuda" else -1, truncation=True)
58
+ self.pipe.tokenizer.pad_token_id = self.model.config.eos_token_id
59
+
60
+ def prompt_template_fn(self, doc):
61
+ return f"Document : {doc}\nParaphrase of the document :"
62
+
63
+ def privatize(self, text, epsilon=100):
64
+ temperature = 2 * self.sensitivity / epsilon
65
+ prompt = self.prompt_template_fn(text)
66
+ model_inputs = self.tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt").to(self.device)
67
+
68
+ output = self.model.generate(
69
+ **model_inputs,
70
+ do_sample=True,
71
+ top_k=0,
72
+ top_p=1.0,
73
+ temperature=temperature,
74
+ max_new_tokens=len(model_inputs["input_ids"][0]),
75
+ logits_processor=self.logits_processor
76
+ )
77
+ return self.tokenizer.decode(output[0], skip_special_tokens=True)
78
+
79
+ def privatize_dp(self, texts, epsilon=100, max_new_tokens=32):
80
+ temperature = 2 * self.sensitivity / epsilon
81
+ prompts = ListDataset(texts)
82
+ private_texts = []
83
+ for r in self.pipe(prompts, do_sample=True, top_k=0, top_p=1.0, temperature=temperature, logits_processor=self.logits_processor, max_new_tokens=max_new_tokens, batch_size=self.batch_size):
84
+ private_texts.append(r[0]["generated_text"])
85
+ return private_texts
86
+
87
+
88
+ class DPBart:
89
+ def __init__(self, model='facebook/bart-large', num_sigmas=1/2):
90
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
91
+ self.tokenizer = BartTokenizer.from_pretrained(model)
92
+ self.model = BartModel.from_pretrained(model).to(self.device)
93
+ self.decoder = BartForConditionalGeneration.from_pretrained(model).to(self.device)
94
+
95
+ self.delta = 1e-5
96
+ self.sigma = 0.2
97
+ self.num_sigmas = num_sigmas
98
+ self.c_min = -self.sigma
99
+ self.c_max = self.num_sigmas * self.sigma
100
+
101
+ def clip(self, vector):
102
+ return torch.clip(vector, self.c_min, self.c_max)
103
+
104
+ def calibrateAnalyticGaussianMechanism_precision(self, epsilon, delta, GS, tol=1.e-12):
105
+ if epsilon <= 1000:
106
+ mp.dps = 500
107
+ elif epsilon <= 2500:
108
+ mp.dps = 1100
109
+ else:
110
+ mp.dps = 2200
111
+
112
+ def Phi(t):
113
+ return 0.5 * (1.0 + mpmath.erf(t / mpmath.sqrt(2.0)))
114
+
115
+ def caseA(eps, s):
116
+ return Phi(mpmath.sqrt(eps * s)) - mpmath.exp(eps) * Phi(-mpmath.sqrt(eps * (s + 2.0)))
117
+
118
+ def caseB(eps, s):
119
+ return Phi(-mpmath.sqrt(eps * s)) - mpmath.exp(eps) * Phi(-mpmath.sqrt(eps * (s + 2.0)))
120
+
121
+ def doubling_trick(predicate_stop, s_inf, s_sup):
122
+ while not predicate_stop(s_sup):
123
+ s_inf = s_sup
124
+ s_sup = 2.0 * s_inf
125
+ return s_inf, s_sup
126
+
127
+ def binary_search(predicate_stop, predicate_left, s_inf, s_sup):
128
+ s_mid = s_inf + (s_sup - s_inf) / 2.0
129
+ while not predicate_stop(s_mid):
130
+ if predicate_left(s_mid):
131
+ s_sup = s_mid
132
+ else:
133
+ s_inf = s_mid
134
+ s_mid = s_inf + (s_sup - s_inf) / 2.0
135
+ return s_mid
136
+
137
+ delta_thr = caseA(epsilon, 0.0)
138
+
139
+ if delta == delta_thr:
140
+ alpha = 1.0
141
+ else:
142
+ if delta > delta_thr:
143
+ predicate_stop_DT = lambda s: caseA(epsilon, s) >= delta
144
+ func_s_to_delta = lambda s: caseA(epsilon, s)
145
+ predicate_left_BS = lambda s: func_s_to_delta(s) > delta
146
+ func_s_to_alpha = lambda s: mpmath.sqrt(1.0 + s / 2.0) - mpmath.sqrt(s / 2.0)
147
+ else:
148
+ predicate_stop_DT = lambda s: caseB(epsilon, s) <= delta
149
+ func_s_to_delta = lambda s: caseB(epsilon, s)
150
+ predicate_left_BS = lambda s: func_s_to_delta(s) < delta
151
+ func_s_to_alpha = lambda s: mpmath.sqrt(1.0 + s / 2.0) + mpmath.sqrt(s / 2.0)
152
+
153
+ predicate_stop_BS = lambda s: abs(func_s_to_delta(s) - delta) <= tol
154
+ s_inf, s_sup = doubling_trick(predicate_stop_DT, 0.0, 1.0)
155
+ s_final = binary_search(predicate_stop_BS, predicate_left_BS, s_inf, s_sup)
156
+ alpha = func_s_to_alpha(s_final)
157
+
158
+ sigma = alpha * GS / mpmath.sqrt(2.0 * epsilon)
159
+ return float(sigma)
160
+
161
+ def noise(self, vector, epsilon, delta=1e-5, method="analytic_gaussian"):
162
+ k = vector.shape[-1]
163
+ if method == "laplace":
164
+ sensitivity = 2 * self.sigma * self.num_sigmas * k
165
+ Z = torch.from_numpy(np.random.laplace(0, sensitivity / epsilon, size=k))
166
+ elif method == 'gaussian':
167
+ sensitivity = 2 * self.sigma * self.num_sigmas * np.sqrt(k)
168
+             scale = np.sqrt((sensitivity**2 / epsilon**2) * 2 * np.log(1.25 / delta))
169
+ Z = torch.from_numpy(np.random.normal(0, scale, size=k))
170
+ elif method == "analytic_gaussian":
171
+ sensitivity = 2 * self.sigma * self.num_sigmas * np.sqrt(k)
172
+             analytic_scale = self.calibrateAnalyticGaussianMechanism_precision(epsilon, delta, sensitivity)
173
+ Z = torch.from_numpy(np.random.normal(0, analytic_scale, size=k))
174
+ return vector + Z
175
+
176
+ def privatize(self, text, epsilon=100, method="analytic_gaussian"):
177
+ inputs = self.tokenizer(text, max_length=512, truncation=True, return_tensors="pt").to(self.device)
178
+ num_tokens = len(inputs["input_ids"][0])
179
+
180
+ enc_output = self.model.encoder(**inputs)
181
+ enc_output["last_hidden_state"] = self.noise(self.clip(enc_output["last_hidden_state"].cpu()), epsilon=epsilon, delta=self.delta, method=method).float().to(self.device)
182
+
183
+ dec_out = self.decoder.generate(encoder_outputs=enc_output, max_new_tokens=num_tokens)
184
+ private_text = self.tokenizer.decode(dec_out[0], skip_special_tokens=True)
185
+ return private_text.strip()
186
+
187
+ def privatize_batch(self, texts, epsilon=100, method="analytic_gaussian"):
188
+ inputs = self.tokenizer(texts, max_length=512, truncation=True, padding=True, return_tensors="pt").to(self.device)
189
+ num_tokens = [len(x) for x in inputs["input_ids"]]
190
+
191
+ enc_output = self.model.encoder(**inputs)
192
+ for i, x in enumerate(enc_output["last_hidden_state"].cpu()):
193
+ enc_output["last_hidden_state"][i] = self.noise(self.clip(x), epsilon=epsilon, delta=self.delta, method=method).float().to(self.device)
194
+
195
+ dec_out = self.decoder.generate(encoder_outputs=enc_output, max_new_tokens=max(num_tokens))
196
+ private_text = [self.tokenizer.decode(x, skip_special_tokens=True).strip() for x in dec_out]
197
+ return " ".join(private_text)
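The two mechanisms in `mechanisms.py` turn `epsilon` into a noise level in different ways: `DPPrompt` converts it into a sampling temperature, and `DPBart` converts it into the scale of the noise added to the clipped encoder output. The sketch below reproduces the arithmetic of `DPPrompt.privatize` and of the `"gaussian"` branch of `DPBart.noise` as standalone functions. The function names are illustrative, and the defaults are the ones that appear in the source.

```python
import math


def dp_prompt_temperature(epsilon, min_logit=-95.0, max_logit=8.0):
    """Temperature used by DPPrompt: T = 2 * sensitivity / epsilon, where the
    sensitivity is the width of the logit clipping range."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = abs(max_logit - min_logit)
    return 2.0 * sensitivity / epsilon


def classic_gaussian_scale(epsilon, delta, k, sigma=0.2, num_sigmas=0.5):
    """Noise scale of the 'gaussian' branch in DPBart.noise for a vector of
    dimension k, with L2 sensitivity 2 * sigma * num_sigmas * sqrt(k)."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("epsilon must be positive and delta in (0, 1)")
    sensitivity = 2.0 * sigma * num_sigmas * math.sqrt(k)
    return math.sqrt((sensitivity ** 2 / epsilon ** 2) * 2.0 * math.log(1.25 / delta))


print(dp_prompt_temperature(100))
print(dp_prompt_temperature(10))
print(classic_gaussian_scale(epsilon=1.0, delta=1e-5, k=1024))
```

With the default clipping range of -95 to 8 the sensitivity is 103, so `epsilon=100` gives a temperature of 2.06 and `epsilon=10` gives 20.6: a smaller budget flattens the sampling distribution. The classical Gaussian calibration is only proven for `epsilon < 1`; the analytic calibration in `calibrateAnalyticGaussianMechanism_precision`, which `DPBart` uses by default, does not carry that restriction.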
privfill-0.1.0/src/privfill.egg-info/PKG-INFO ADDED
@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.4
2
+ Name: privfill
3
+ Version: 0.1.0
4
+ Summary: LLM-based Differential Privacy mechanisms for sentence-based text rewriting with infilling models.
5
+ Author-email: Stephen Meisenbacher <stephen.meisenbacher@tum.de>
6
+ License: MIT
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: License :: OSI Approved :: MIT License
9
+ Classifier: Operating System :: OS Independent
10
+ Requires-Python: >=3.9
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: pandas
14
+ Requires-Dist: nltk
15
+ Requires-Dist: numpy
16
+ Requires-Dist: tqdm
17
+ Requires-Dist: torch
18
+ Requires-Dist: transformers
19
+ Requires-Dist: mpmath
20
+ Dynamic: license-file
21
+
22
+ <div align="center">
23
+
24
+ # PrivFill
25
+
26
+ [![PyPI version](https://img.shields.io/pypi/v/privfill.svg)](https://pypi.org/project/privfill/)
27
+ [![GitHub stars](https://img.shields.io/github/stars/sjmeis/PrivFill.svg?style=social)](https://github.com/sjmeis/PrivFill/stargazers)
28
+ [![License](https://img.shields.io/github/license/sjmeis/PrivFill.svg)](https://github.com/sjmeis/PrivFill/blob/main/LICENSE)
29
+
30
+ </div>
31
+
32
+ `privfill` is a Python package providing LLM-based local Differential Privacy (DP) mechanisms for text privatization via sentence infilling. It offers easy-to-use wrappers for fine-tuned Hugging Face models.
33
+ This software was originally presented in the NAACL 2025 findings paper: *On the Impact of Noise in Differentially Private Text Rewriting*
34
+
35
+ ## Installation
36
+
37
+ Install the package from PyPI:
38
+
39
+ ```bash
40
+ pip install privfill
41
+ ```
42
+
43
+ ### Core Prerequisites:
44
+
45
+ - Python $\geq$ 3.9
46
+ - PyTorch (CUDA recommended for faster inference)
47
+ - Transformers & NLTK
48
+
49
+ ## Basic Usage & Model Selection
50
+ Instead of typing Hugging Face repository paths, you can choose from the three built-in models using the `SupportedModels` enum.
51
+
52
+ ```python
53
+ import privfill
54
+
55
+ # Choose between FLAN_T5_BASE, FLAN_T5_LARGE, and BART_LARGE
56
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=True)
57
+
58
+ text = "This is a long private document ... which contains sensitive information and should be privatized."
59
+ private_text = engine.privatize(text, epsilon=10)
60
+
61
+ print(private_text)
62
+ ```
63
+
64
+ As described in the paper, we also provide an analogous, non-DP variant of `PrivFill`. Its usage is very similar:
65
+
66
+ ```python
67
+ engine = privfill.load_pipeline(privfill.SupportedModels.FLAN_T5_BASE, DP=False)
68
+ private_text = engine.privatize(text)
69
+ ```
70
+
71
+ ### Available Models
72
+
73
+ | Enum | Hugging Face Repository | Base Mechanism |
74
+ |-------------------------------|--------------------------------------|-------------------------|
75
+ | SupportedModels.FLAN_T5_BASE | sjmeis/flan-t5-base-infill-combined | DP-Prompt |
76
+ | SupportedModels.FLAN_T5_LARGE | sjmeis/flan-t5-large-infill-combined | DP-Prompt |
77
+ | SupportedModels.BART_LARGE | sjmeis/bart-large-infill-combined | DP-BART |
78
+
79
+ ## Models ##
80
+ We make our three sentence infilling models public. They can be found at this [link](https://drive.google.com/drive/folders/12m1av9PY1X7S-cwd9y_8nepBPMtVju0C?usp=sharing).
81
+
82
+ ## Comparison Code ##
83
+ We also include the `DPPrompt` and `DPBart` classes (in the `privfill.mechanisms` module) that implement `DP-Prompt` and `DP-BART`, as used in the paper.
84
+
85
+ ```python
86
+ X = privfill.mechanisms.DPPrompt()
87
+ # or
88
+ X = privfill.mechanisms.DPBart()
89
+
90
+ # then
91
+ X.privatize(text, epsilon)
92
+ ```
privfill-0.1.0/src/privfill.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,11 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ src/privfill/__init__.py
5
+ src/privfill/main.py
6
+ src/privfill/mechanisms.py
7
+ src/privfill.egg-info/PKG-INFO
8
+ src/privfill.egg-info/SOURCES.txt
9
+ src/privfill.egg-info/dependency_links.txt
10
+ src/privfill.egg-info/requires.txt
11
+ src/privfill.egg-info/top_level.txt
privfill-0.1.0/src/privfill.egg-info/requires.txt ADDED
@@ -0,0 +1,7 @@
1
+ pandas
2
+ nltk
3
+ numpy
4
+ tqdm
5
+ torch
6
+ transformers
7
+ mpmath
privfill-0.1.0/src/privfill.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
1
+ privfill