tokenbreak-scanner 0.1.0 (tokenbreak_scanner-0.1.0.tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- tokenbreak_scanner-0.1.0/LICENSE +33 -0
- tokenbreak_scanner-0.1.0/MANIFEST.in +3 -0
- tokenbreak_scanner-0.1.0/PKG-INFO +324 -0
- tokenbreak_scanner-0.1.0/README.md +284 -0
- tokenbreak_scanner-0.1.0/pyproject.toml +75 -0
- tokenbreak_scanner-0.1.0/setup.cfg +4 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/__init__.py +7 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/cli.py +203 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/inspector.py +294 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/models.py +67 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/tokenizers.py +388 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner/validator.py +361 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/PKG-INFO +324 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/SOURCES.txt +19 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/dependency_links.txt +1 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/entry_points.txt +2 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/requires.txt +14 -0
- tokenbreak_scanner-0.1.0/src/tokenbreak_scanner.egg-info/top_level.txt +1 -0
- tokenbreak_scanner-0.1.0/tests/test_cli.py +61 -0
- tokenbreak_scanner-0.1.0/tests/test_inspector.py +125 -0
- tokenbreak_scanner-0.1.0/tests/test_tokenizers.py +115 -0

tokenbreak_scanner-0.1.0/LICENSE
@@ -0,0 +1,33 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2026 TokenBreak Scanner Contributors

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.

You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL,
see <https://www.gnu.org/licenses/>.
tokenbreak_scanner-0.1.0/PKG-INFO
@@ -0,0 +1,324 @@
Metadata-Version: 2.4
Name: tokenbreak-scanner
Version: 0.1.0
Summary: Scanner for TokenBreak vulnerabilities in NLP model artifacts
Author: TokenBreak Scanner Contributors
License: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/your-org/tokenbreak-scanner
Project-URL: Repository, https://github.com/your-org/tokenbreak-scanner
Project-URL: Documentation, https://github.com/your-org/tokenbreak-scanner#readme
Project-URL: Issues, https://github.com/your-org/tokenbreak-scanner/issues
Keywords: tokenbreak,tokenizer,security,nlp,adversarial,bpe,wordpiece,unigram,huggingface,vulnerability-scanner
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Topic :: Security
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers>=4.40.0
Requires-Dist: tokenizers>=0.19.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: numpy>=1.17
Provides-Extra: attack
Requires-Dist: torch>=2.0.0; extra == "attack"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: httpx>=0.24.0; extra == "dev"
Dynamic: license-file

# TokenBreak Scanner – NLP Model Vulnerability Auditor

Detect **TokenBreak adversarial vulnerabilities** in LLMs, classifiers, and encoders before they hit production.

[PyPI](https://pypi.org/project/tokenbreak-scanner) · [License](LICENSE) · [CI](https://github.com/your-org/tokenbreak-scanner/actions)

[TokenBreak Paper](https://arxiv.org/html/2506.07948v1) · [Quick Start](#installation) · [Usage](#usage) · [CI Integration](#ci-integration)

---

## What is TokenBreak?

**TokenBreak** is an adversarial attack that bypasses text-classification and LLM guardrails by prepending a single character to targeted words. This tiny perturbation corrupts **BPE** and **WordPiece** tokenization, causing models to misclassify malicious input as benign, while humans and downstream LLMs still understand the original meaning.

> **Paper**: [TokenBreak: Bypassing Text Classification Models Through Token Manipulation](https://arxiv.org/html/2506.07948v1)
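
To see the perturbation concretely, here is a minimal sketch (our illustration, not part of the package) that tokenizes a word with and without a prepended character using a WordPiece tokenizer from `transformers`. The exact split depends on the vocabulary, so the comments describe typical rather than guaranteed output.

```python
# Illustration only (not shipped with tokenbreak-scanner): one prepended
# character changes the WordPiece segmentation of the whole word.
# Requires `transformers`; downloads the bert-base-uncased tokenizer on first use.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("deposit"))   # typically a single token: ['deposit']
print(tok.tokenize("adeposit"))  # typically splits into unrelated subwords
```

The classifier now sees unfamiliar subwords, while a human reader still parses "adeposit" as "deposit".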
---

## What TokenBreak Scanner Does

| Capability | Description |
|---|---|
| **Inspect model artifacts** | Reads `config.json`, `tokenizer.json`, and `tokenizer_config.json` to determine the tokenizer architecture (see the sketch after this table) |
| **Identify tokenization algorithm** | Detects **BPE**, **WordPiece**, **Unigram**, or **SentencePiece** |
| **Assess TokenBreak vulnerability** | Flags BPE/WordPiece models as **High Risk**; Unigram as **Safe** |
| **Explain the decision** | Evidence tree showing which signals (config, runtime, source) contributed to the confidence score |
| **Recommend a defense** | Suggests the Unigram pre-mapping defense (Section 5 of the paper) or migration to a safer model family |
| **Live attack validation** *(optional)* | Loads model weights and runs the `BreakPrompt` algorithm to verify the bypass experimentally |
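
As a concrete example of the artifact inspection above: a HuggingFace fast-tokenizer `tokenizer.json` declares its algorithm under the top-level `model.type` key (`"BPE"`, `"WordPiece"`, or `"Unigram"`). Below is a stand-alone sketch of that single signal; the function name is ours, and the package's real `inspector.py` combines this with five other weighted signals.

```python
# Minimal sketch of the tokenizer.json signal; not the package's API.
import json
from pathlib import Path
from typing import Optional

def tokenizer_json_algorithm(model_dir: str) -> Optional[str]:
    """Return the algorithm declared in tokenizer.json, if present."""
    path = Path(model_dir) / "tokenizer.json"
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    # Fast tokenizers store e.g. {"model": {"type": "WordPiece", ...}, ...}
    return data.get("model", {}).get("type")

print(tokenizer_json_algorithm("./models/my-spam-classifier"))  # e.g. "WordPiece"
```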
---

## Installation

### From PyPI (recommended)

```bash
pip install tokenbreak-scanner
```

### Optional extras

- **Attack validation** (requires PyTorch):
  ```bash
  pip install "tokenbreak-scanner[attack]"
  ```
- **Development dependencies** (pytest, coverage):
  ```bash
  pip install "tokenbreak-scanner[dev]"
  ```

### From source

```bash
git clone https://github.com/your-org/tokenbreak-scanner.git
cd tokenbreak-scanner
pip install -e ".[dev]"
```

---

## Quick Start

### Scan a local model directory

```bash
tokenbreak-scan ./models/my-spam-classifier/
```

### Scan a HuggingFace model ID (auto-download)

```bash
tokenbreak-scan distilbert-base-uncased --download
```

### Scan Qwen3-0.6B (production example)

```bash
tokenbreak-scan Qwen/Qwen3-0.6B --download --trust-remote-code
```

> Expected result: **HIGH risk**, since Qwen uses BPE tokenization and is vulnerable to TokenBreak.

### JSON output (for CI pipelines)

```bash
tokenbreak-scan <model> --output json
```

### Run live attack validation

```bash
tokenbreak-scan <model> --test-attack
```

---

## Example Output

### Table format (vulnerable model)

```
======================================================================
TOKENBREAK SCANNER REPORT
======================================================================
Model Name: distilbert-base-uncased
Model Type: distilbert
Family: DistilBERT
Tokenizer Class: DistilBertTokenizerFast
Algorithm: WordPiece
Vocab Size: 30522
Confidence: 0.85
Vulnerable: YES
Risk Level: High
======================================================================
Detection Sources:
1. [tokenizer.json model.type] weight=0.40 -> WordPiece
2. [tokenizer_config.json class] weight=0.20 -> WordPiece
3. [config.json model_type] weight=0.15 -> WordPiece
======================================================================
Recommendation:
Model is vulnerable to TokenBreak. Recommended: Implement the
Unigram pre-mapping defense by inserting a Unigram tokenizer
before classification and remapping tokens, or migrate to a
model family using Unigram tokenization.
======================================================================
```

### JSON format

```json
{
  "model_name": "distilbert-base-uncased",
  "model_type": "distilbert",
  "model_family": "DistilBERT",
  "tokenizer_class": "DistilBertTokenizerFast",
  "tokenizer_algorithm": "WordPiece",
  "vocab_size": 30522,
  "confidence_score": 0.85,
  "vulnerable_to_tokenbreak": true,
  "risk_level": "High",
  "detection_sources": [...],
  "recommendation": "...",
  "source": "/path/to/model"
}
```

---

## CI Integration

Use exit codes to gate vulnerable models in your MLOps pipeline; a sketch of such a gate follows the table:

| Exit code | Meaning |
|---|---|
| `0` | Model is **safe** (Unigram) or unknown |
| `1` | Model is **vulnerable** (BPE or WordPiece); block deployment |
| `2` | Error (path not found, download failed, etc.) |
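
For pipelines that are not shell-based, the documented exit codes can be consumed directly from Python. This gate is a sketch of ours built on the exit-code table above, not an API shipped with the package:

```python
# Sketch: gate a deployment step on tokenbreak-scan's exit codes
# (0 = safe/unknown, 1 = vulnerable, 2 = error).
import subprocess
import sys

result = subprocess.run(
    ["tokenbreak-scan", "./models-to-deploy/", "--output", "json"],
    capture_output=True,
    text=True,
)
if result.returncode == 1:
    sys.exit("Deployment blocked: model is vulnerable to TokenBreak")
if result.returncode == 2:
    sys.exit(f"Scan failed: {result.stderr.strip()}")
print(result.stdout)  # exit code 0: safe or unknown, pass the report along
```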
### GitHub Actions example

```yaml
- name: Scan model for TokenBreak vulnerability
  run: |
    pip install tokenbreak-scanner
    tokenbreak-scan ./models-to-deploy/ --output json > scan-report.json
    cat scan-report.json
```

### Python SDK example

```python
from tokenbreak_scanner.inspector import inspect_model
from tokenbreak_scanner.models import RiskLevel

model_path = "./models/my-spam-classifier"  # local directory containing the artifacts
report = inspect_model(model_path, download=False)
if report.risk_level == RiskLevel.HIGH:
    raise RuntimeError(f"Deployment blocked: {report.model_name} is vulnerable to TokenBreak")
```

---

## How the Attack Works

1. **Attack vector**: Prepend a single character (A–Z or a–z) to high-impact words, forcing BPE/WordPiece to split tokens differently.
2. **Effect**: The protection model misclassifies the input (a false negative), while the downstream LLM or human reviewer still understands it.
3. **Defense**: Insert a Unigram tokenizer before the target tokenizer. Unigram tokenizes based on probability rather than left-to-right merges, making it naturally resistant to character-level perturbations (see the sketch after this list).
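
A quick way to see why the defense works: a Unigram (SentencePiece) tokenizer chooses the most probable segmentation, which tends to split off the prepended character and keep the original word intact as one piece, ready for remapping. The sketch below is our illustration using `xlm-roberta-base` (a Unigram model, per the table in the next section); actual pieces depend on the vocabulary.

```python
# Illustration only: Unigram segmentation often isolates the prepended
# character, so the original word survives as a recognizable piece.
# Requires `transformers`; downloads the xlm-roberta-base tokenizer.
from transformers import AutoTokenizer

uni = AutoTokenizer.from_pretrained("xlm-roberta-base")

print(uni.tokenize("adeposit"))  # often something like ['▁a', 'deposit'],
                                 # leaving 'deposit' intact for the classifier
```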
---

## Supported Model Families

| Family | Algorithm | TokenBreak Risk |
|---|---|---|
| GPT-2, GPT-J, GPT-Neo, GPT-NeoX | BPE | **HIGH** |
| LLaMA, Mistral, Mixtral, Falcon | BPE | **HIGH** |
| Qwen, Qwen2, Qwen3 | BPE | **HIGH** |
| Gemma, Gemma 2 | BPE | **HIGH** |
| Phi-3, Phi-4 | BPE | **HIGH** |
| BLOOM, BigScience | BPE | **HIGH** |
| Cohere, Command R | BPE | **HIGH** |
| BERT, DistilBERT, RoBERTa | WordPiece / BPE | **HIGH** |
| DeBERTa-v2, DeBERTa-v3 | Unigram | **LOW** |
| XLM-RoBERTa | Unigram | **LOW** |
| ALBERT | Unigram | **LOW** |
| mT5, T5 (SentencePiece Unigram) | Unigram | **LOW** |

> Full mapping in `src/tokenbreak_scanner/tokenizers.py`

---

## Architecture

```
tokenbreak_scanner/
├── __init__.py     # Package version
├── cli.py          # Click CLI with Rich table / JSON output
├── inspector.py    # Core introspection engine (6-signal weighted aggregation)
├── models.py       # Pydantic schemas: ScannerReport, DetectionSource, RiskLevel
├── tokenizers.py   # Tokenizer type detection, model-family mapping, runtime inspection
└── validator.py    # Optional live attack validation via BreakPrompt
```

### Detection signals (weighted evidence tree)

| Signal | Weight | Source |
|---|---|---|
| `tokenizer.json` model type | 0.40 | HuggingFace tokenizer artifact |
| Runtime `_tokenizer.model` | 0.40 | Live Rust backend inspection |
| Source-code fingerprint | 0.30 | Python module keyword matching |
| Remote source file | 0.30 | HF Hub downloaded tokenizer module |
| `tokenizer_config.json` class | 0.20 | Config metadata |
| `config.json` model_type | 0.15 | Architecture fallback |
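
The exact aggregation lives in `inspector.py`. As a rough mental model only (the function and formula below are our invention, not the package's code), the signals can be treated as weighted votes for an algorithm, with the winner's share of total weight reported as confidence:

```python
# Hypothetical sketch of weighted-evidence aggregation; see inspector.py
# for the scanner's actual logic and confidence formula.
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate(signals: List[Tuple[str, float, str]]) -> Tuple[str, float]:
    """signals: (source_name, weight, detected_algorithm) votes."""
    totals: Dict[str, float] = defaultdict(float)
    for _source, weight, algorithm in signals:
        totals[algorithm] += weight
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

signals = [
    ("tokenizer.json model.type", 0.40, "WordPiece"),
    ("tokenizer_config.json class", 0.20, "WordPiece"),
    ("config.json model_type", 0.15, "WordPiece"),
]
print(aggregate(signals))  # ('WordPiece', 1.0) when every signal agrees
```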
---

## Testing

```bash
pytest tests/ -v
```

The 33+ tests cover:
- Local model directory scanning (BPE, WordPiece, Unigram)
- Missing `tokenizer.json` fallback
- CLI table and JSON output modes
- Tokenizer class name → algorithm mapping (a sketch of such a test follows)
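
A test for the last bullet might look like the sketch below. The helper name `detect_algorithm_from_class` is an assumption for illustration; check `tokenizers.py` for the actual API.

```python
# Hypothetical test sketch; `detect_algorithm_from_class` is an assumed
# name, and the real helper in tokenbreak_scanner.tokenizers may differ.
import pytest

from tokenbreak_scanner import tokenizers

@pytest.mark.parametrize(
    ("class_name", "expected"),
    [
        ("GPT2TokenizerFast", "BPE"),
        ("DistilBertTokenizerFast", "WordPiece"),
        ("XLMRobertaTokenizerFast", "Unigram"),
    ],
)
def test_class_name_maps_to_algorithm(class_name, expected):
    assert tokenizers.detect_algorithm_from_class(class_name) == expected
```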
---

## Contributing

Contributions are welcome! Please open an issue or pull request.

1. Fork the repository
2. Create a feature branch (`git checkout -b feat/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feat/amazing-feature`)
5. Open a Pull Request

---

## License

This project is licensed under the **GNU Affero General Public License v3.0 or later (AGPL-3.0+)**.

- Freedom to use, modify, and distribute
- Copyleft: derivative works and services must also be open-sourced under the AGPL
- Network use counts as distribution: remote users are entitled to the source code

See [LICENSE](LICENSE) for the full text: <https://www.gnu.org/licenses/agpl-3.0.html>

---

## Related

- [TokenBreak Paper (arXiv 2506.07948v1)](https://arxiv.org/html/2506.07948v1)
- [HuggingFace Transformers](https://github.com/huggingface/transformers)
- [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox)
- [OWASP Machine Learning Security Top 10](https://owasp.org/www-project-machine-learning-security-top-10/)
- [AI Village](https://aivillage.org/)

tokenbreak_scanner-0.1.0/README.md
@@ -0,0 +1,284 @@
(Identical to the README embedded as the long description in PKG-INFO above.)