prometheus-llm 1.0.0__tar.gz

Metadata-Version: 2.4
Name: prometheus-llm
Version: 1.0.0
Summary: Automated model steering and alignment adjustment via LoRA-based optimization
Keywords: llm,model-steering,alignment,lora
Author: Wangzhang Wu
Author-email: Wangzhang Wu <wangzhangwu1216@gmail.com>
License-Expression: AGPL-3.0-or-later
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: GPU
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: accelerate~=1.10
Requires-Dist: bitsandbytes~=0.45
Requires-Dist: datasets~=4.0
Requires-Dist: hf-transfer~=0.1
Requires-Dist: huggingface-hub~=1.6
Requires-Dist: kernels~=0.11
Requires-Dist: optuna~=4.5
Requires-Dist: peft~=0.14
Requires-Dist: psutil~=7.1
Requires-Dist: pydantic-settings~=2.10
Requires-Dist: questionary~=2.1
Requires-Dist: rich~=14.1
Requires-Dist: transformers~=5.3
Requires-Dist: geom-median~=0.1 ; extra == 'research'
Requires-Dist: imageio~=2.37 ; extra == 'research'
Requires-Dist: matplotlib~=3.10 ; extra == 'research'
Requires-Dist: numpy~=2.2 ; extra == 'research'
Requires-Dist: pacmap~=0.8 ; extra == 'research'
Requires-Dist: scikit-learn~=1.7 ; extra == 'research'
Requires-Python: >=3.10
Project-URL: Changelog, https://github.com/wuwangzhang1216/prometheus/releases
Project-URL: Documentation, https://github.com/wuwangzhang1216/prometheus
Project-URL: Homepage, https://github.com/wuwangzhang1216/prometheus
Project-URL: Issues, https://github.com/wuwangzhang1216/prometheus/issues
Project-URL: Repository, https://github.com/wuwangzhang1216/prometheus.git
Provides-Extra: research
Description-Content-Type: text/markdown

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="assets/logo.svg">
    <source media="(prefers-color-scheme: light)" srcset="assets/logo.svg">
    <img alt="Prometheus" src="assets/logo.svg" width="460">
  </picture>
</p>

<p align="center">
  <strong>3% refusal rate &nbsp;·&nbsp; 0.01 KL divergence &nbsp;·&nbsp; Zero manual tuning</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/prometheus-llm/"><img src="https://img.shields.io/pypi/v/prometheus-llm?color=blue" alt="PyPI"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python 3.10+"></a>
  <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="License: AGPL v3"></a>
  <a href="https://huggingface.co/wangzhang"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow.svg" alt="Hugging Face"></a>
</p>

---

## Table of Contents

- [Quick Start](#quick-start)
- [How It Works](#how-it-works)
- [Results](#results)
- [Features](#features)
- [MoE Support](#moe-support)
- [Configuration](#configuration)
- [Hardware & VRAM](#hardware--vram)
- [Research Tools](#research-tools)
- [References](#references)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)
- [Contributing](#contributing)
- [License](#license)

---

Prometheus finds the optimal abliteration parameters for any transformer model using [Optuna](https://optuna.org/) TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible.

Works with dense models, multimodal models, and MoE architectures (Qwen3/3.5 MoE, Mixtral, DeepSeek, Granite MoE Hybrid, MiniMax-M2.5).


## Quick Start

```bash
pip install -U prometheus-llm
prometheus --model Qwen/Qwen3-4B-Instruct-2507
```

That's it. The process is fully automatic — after optimization completes, you can save the model, upload to Hugging Face, or chat with it interactively.

> **Windows**: use `python scripts/run_prometheus.py --model <model>` or set `PYTHONIOENCODING=utf-8` to avoid Rich encoding issues.


## How It Works

Language models learn to refuse harmful queries through specific activation patterns in their residual stream. Prometheus identifies these patterns and surgically removes them:

1. **Compute refusal directions** — pass harmless and harmful prompts through the model, extract per-layer residual activations, and compute the difference vector that characterizes "refusal behavior"
2. **Orthogonalize** — project out the component aligned with normal "good" responses, isolating only the refusal signal
3. **Abliterate via LoRA** — apply rank-1 weight modifications to attention and MLP components, weighted by a kernel function across layers. Changes are captured as lightweight LoRA adapters, not destructively applied to base weights
4. **Optimize** — Optuna's Tree-structured Parzen Estimator searches over kernel shape, fractional direction index, and per-component abliteration strength, selecting Pareto-optimal configurations that minimize both refusals and model degradation
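
Steps 1–3 can be sketched with toy activations. This is a minimal numpy illustration, not Prometheus's actual code: the array shapes, the `good` direction, and the `alpha` strength are made up, and the real pipeline captures the edit as a LoRA adapter instead of modifying `W` in place.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Toy per-layer residual activations: "harmful" samples are shifted along one axis.
harmful = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(size=(32, d_model))

# Step 1: difference-of-means refusal direction.
refusal = harmful.mean(axis=0) - harmless.mean(axis=0)

# Step 2: project out the component aligned with a "good response" direction,
# leaving only the part orthogonal to it, then normalize.
good = np.eye(d_model)[1]  # illustrative stand-in for a learned good direction
refusal -= (refusal @ good) / (good @ good) * good
refusal /= np.linalg.norm(refusal)

# Step 3: rank-1 abliteration of a weight matrix that writes into the residual
# stream: remove the component of its output that lies along the refusal direction.
W = rng.normal(size=(d_model, d_model))
alpha = 1.0  # per-component strength; the optimizer in step 4 would search this
W_abl = W - alpha * np.outer(refusal, refusal) @ W

# The abliterated matrix can no longer write along the refusal direction.
print(np.abs(refusal @ W_abl).max())  # ≈ 0 (up to float error)
```

With `alpha = 1.0` and a unit-norm direction, the projection removes the refusal component exactly; fractional `alpha` values only attenuate it, which is one of the knobs the optimizer trades off against KL divergence.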


## Results

Abliterated models uploaded to [Hugging Face](https://huggingface.co/wangzhang):

| Model | Refusals | KL Divergence | Trials |
|-------|----------|---------------|--------|
| [Qwen3.5-122B-A10B](https://huggingface.co/wangzhang/Qwen3.5-122B-A10B-abliterated) | **6/200 (3%)** | 0.0115 | 25 |
| [Qwen3.5-35B-A3B](https://huggingface.co/wangzhang/Qwen3.5-35B-A3B-abliterated) | 7/200 (3.5%) | 0.0145 | 50 |
| [Qwen3.5-27B](https://huggingface.co/wangzhang/Qwen3.5-27B-abliterated) | 7/200 (3.5%) | 0.0051 | 15 |
| [Qwen3.5-9B](https://huggingface.co/wangzhang/Qwen3.5-9B-abliterated) | 2/200 (1%) | 0.0105 | 50 |
| [Qwen3.5-4B](https://huggingface.co/wangzhang/Qwen3.5-4B-abliterated) | 34/200 (17%) | 0.0159 | 50 |
| [Qwen3.5-0.8B](https://huggingface.co/wangzhang/Qwen3.5-0.8B-abliterated) | 3/200 (1.5%) | 0.0087 | 100 |

### Key Findings

> **Orthogonalized directions reduced refusals by 67%** compared to raw abliteration in controlled experiments — the single most impactful optimization.

- **Larger models abliterate better** — the 122B achieved lower refusals *and* lower KL than the 35B, in half the trials. Larger models have cleaner refusal circuitry.
- **Per-layer direction index is critical at scale** — for 122B, independently optimizing the refusal direction per layer reduced refusals from 180/200 to 6/200. A single global direction failed entirely.
- **MoE hybrid steering** — combining LoRA abliteration with router weight suppression and fused expert abliteration proved essential for MoE architectures.


## Features

### Orthogonalized Directions

Instead of removing the full refusal direction (which degrades model quality), Prometheus removes only the component of the refusal direction that is orthogonal to "good" response directions. This preserves capabilities while selectively removing refusal behavior.

```toml
[steering]
orthogonal_projection = true
```

### LLM Judge

Replace keyword-based refusal detection with LLM-powered classification via [OpenRouter](https://openrouter.ai/) for more accurate results, especially for non-English models.

```toml
[detection]
llm_judge = true
llm_judge_model = "google/gemini-3.1-flash-lite-preview"
```

### Smart Optimization

- **Auto batch size** — exponential search finds the largest batch size that fits in VRAM
- **KL divergence pruning** — trials with KL above threshold are terminated early, saving compute
- **Fractional direction index** — interpolates between adjacent layer directions for finer-grained search
- **Per-component parameters** — separate abliteration weights for attention vs. MLP
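
The auto batch size bullet describes a doubling search followed by a refinement step. A minimal sketch, with illustrative names (`find_max_batch_size`, `try_batch` are not Prometheus APIs; a real probe would run a forward pass and catch CUDA OOM rather than evaluate a predicate):

```python
def find_max_batch_size(try_batch, start=1, cap=4096):
    """Exponential (doubling) search for the largest batch size that succeeds."""
    if not try_batch(start):
        return 0
    # Double until the first failure or the cap.
    n = start
    while n * 2 <= cap and try_batch(n * 2):
        n *= 2
    # Binary-search the gap between the last success and the first failure.
    lo, hi = n, min(n * 2, cap)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if try_batch(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy "VRAM" that fits at most 48 samples per batch.
print(find_max_batch_size(lambda n: n <= 48))  # 48
```

The doubling phase costs only O(log n) probes, which matters when each probe is a full forward pass on a large model.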

### Advanced Options

| Section | Option | Values | Description |
|---------|--------|--------|-------------|
| `[steering]` | `vector_method` | `mean`, `median_of_means`, `pca` | How to compute steering vectors from residuals |
| `[steering]` | `decay_kernel` | `linear`, `gaussian`, `cosine` | Kernel for interpolating weights across layers |
| `[steering]` | `weight_normalization` | `none`, `pre`, `full` | Weight row normalization before/after LoRA |
| `[steering]` | `outlier_quantile` | 0.0–1.0 | Tame extreme activations in some models |
| `[model]` | `use_torch_compile` | true/false | 10–30% inference speedup |
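
The `decay_kernel` option controls how abliteration strength is spread across layers. A hypothetical sketch of what such kernels look like (the `layer_weights` helper, its `center`/`width` parameterization, and the exact kernel formulas are illustrative, not Prometheus's internals):

```python
import math

def layer_weights(num_layers, center, width, kernel="gaussian"):
    """Per-layer weights that peak at a (possibly fractional) layer index
    `center` and fall off over `width` layers according to the chosen kernel."""
    weights = []
    for layer in range(num_layers):
        t = abs(layer - center) / width  # normalized distance from the peak
        if kernel == "linear":
            w = max(0.0, 1.0 - t)
        elif kernel == "gaussian":
            w = math.exp(-0.5 * t * t)
        elif kernel == "cosine":
            w = 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))
        else:
            raise ValueError(f"unknown kernel: {kernel}")
        weights.append(w)
    return weights

w = layer_weights(num_layers=8, center=4.0, width=3.0, kernel="cosine")
print([round(x, 2) for x in w])  # [0.0, 0.0, 0.25, 0.75, 1.0, 0.75, 0.25, 0.0]
```

Because `center` is a float, the same machinery supports the fractional direction index mentioned above: the peak can sit between two layers.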


## MoE Support

Three steering mechanisms for Mixture-of-Experts models:

1. **Expert Profiling** — hooks router modules to compute per-expert "risk scores" from activation patterns on harmful vs. harmless prompts
2. **Router Weight Suppression** — applies learned negative bias to routing weights of safety-critical experts
3. **Fused Expert Abliteration** — direct rank-1 modification of expert `down_proj` matrices
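
Router weight suppression (mechanism 2) can be sketched as a negative bias on router logits. Toy numbers throughout: `risk_scores` stands in for the profiled scores from mechanism 1, and `strength` for an optimized hyperparameter; this is not the package's actual API.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def suppress_experts(router_logits, risk_scores, strength):
    """Bias each expert's routing logit down in proportion to its risk score."""
    return [logit - strength * risk for logit, risk in zip(router_logits, risk_scores)]

logits = [2.0, 1.0, 0.5, 0.1]   # toy router logits for 4 experts
risk = [0.9, 0.0, 0.1, 0.0]     # expert 0 profiled as safety-critical
before = softmax(logits)
after = softmax(suppress_experts(logits, risk, strength=3.0))
print(round(before[0], 2), round(after[0], 2))  # expert 0's routing mass drops
```

Because the bias acts before the softmax, mass removed from suppressed experts is redistributed to the others, so the router still produces a valid probability distribution.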

Supported architectures: Qwen3/3.5 MoE, Mixtral, DeepSeek MoE, Granite MoE Hybrid, MiniMax-M2.5. See [configs/](configs/) for model-specific examples.


## Configuration

Prometheus loads config in priority order (later overrides earlier):

1. [`configs/default.toml`](configs/default.toml) — copy to `prometheus.toml` and customize
2. `PM_CONFIG` environment variable
3. `--config <path>` CLI flag
4. CLI flags (`--model`, `--model.quant-method bnb_4bit`, etc.)
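
The override order above amounts to a layered deep-merge. A minimal sketch with hypothetical keys and values (the real loader is built on pydantic-settings and differs in detail):

```python
def merge_configs(*layers):
    """Deep-merge config dicts; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged

defaults = {"model": {"name": "Qwen/Qwen3-4B-Instruct-2507", "quant_method": "none"}}
config_file = {"model": {"quant_method": "bnb_4bit"}}     # e.g. via PM_CONFIG
cli = {"model": {"name": "Qwen/Qwen3.5-9B"}}              # e.g. --model ...

print(merge_configs(defaults, config_file, cli)["model"])
# {'name': 'Qwen/Qwen3.5-9B', 'quant_method': 'bnb_4bit'}
```

Note that nested sections merge key by key, so a later layer can override one option inside `[model]` without clobbering the rest of the section.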

Run `prometheus --help` for all options.

Pre-built configs for specific setups:

| Config | Target |
|--------|--------|
| [`4b.toml`](configs/4b.toml) | Qwen3.5-4B dense |
| [`9b.toml`](configs/9b.toml) | 9B dense models |
| [`27b.toml`](configs/27b.toml) | Qwen3.5-27B dense (~54GB BF16) |
| [`35b.toml`](configs/35b.toml) | Qwen3.5-35B-A3B MoE |
| [`122b.toml`](configs/122b.toml) | Qwen3.5-122B-A10B MoE (BF16) |
| [`122b_4bit.toml`](configs/122b_4bit.toml) | Qwen3.5-122B-A10B (NF4, ~61GB) |
| [`122b_int8.toml`](configs/122b_int8.toml) | Qwen3.5-122B-A10B (INT8, ~122GB) |
| [`397b.toml`](configs/397b.toml) | Qwen3.5-397B-A17B MoE (NF4, ~215GB) |
| [`minimax_m25.toml`](configs/minimax_m25.toml) | MiniMax-M2.5 229B MoE (FP8, ~229GB) |
| [`100t.toml`](configs/100t.toml) | Extended 100-trial optimization |
| [`noslop.toml`](configs/noslop.toml) | Anti-slop tuning |


## Hardware & VRAM

Prometheus auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with `device_map = "auto"`.

For large models:
- **4-bit quantization**: `--model.quant-method bnb_4bit` cuts VRAM by ~4x
- **8-bit quantization**: `--model.quant-method bnb_8bit` — higher quality than 4-bit, ~2x VRAM reduction with CPU offload
- **Per-device memory limits**: set `[model] max_memory = {"0": "20GB", "cpu": "64GB"}` in your config
- **Non-interactive mode**: `--non-interactive` for fully automated batch runs


## Research Tools

```bash
pip install -U "prometheus-llm[research]"
```

- `--display.plot-residuals` — PaCMAP-projected scatter plots and animated GIFs of residual vectors across layers
- `--display.print-residual-geometry` — cosine similarities, norms, silhouette coefficients

Example: PaCMAP visualization shows harmful (red) vs. harmless (blue) activations separating across layers, revealing how the model's refusal circuitry develops through its depth.

<!-- To add a screenshot: save the image to assets/ and uncomment the line below -->
<!-- ![PaCMAP visualization](assets/pacmap_example.png) -->


## References

Prometheus builds on the following research:

- **Abliteration**: Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). [Refusal in Language Models Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717). *NeurIPS 2024*.
- **Representation Engineering**: Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). [Representation Engineering: A Top-Down Approach to AI Transparency](https://arxiv.org/abs/2310.01405). *arXiv:2310.01405*.
- **LoRA**: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685). *ICLR 2022*.
- **Optuna**: Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). [Optuna: A Next-generation Hyperparameter Optimization Framework](https://arxiv.org/abs/1907.10902). *KDD 2019*.
- **TPE**: Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization). *NeurIPS 2011*.
- **PaCMAP**: Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). [Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization](https://jmlr.org/papers/v22/20-1061.html). *JMLR*, 22, 1–73.

<details>
<summary>BibTeX</summary>

```bibtex
@inproceedings{arditi2024refusal,
  title = {Refusal in Language Models Is Mediated by a Single Direction},
  author = {Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2024},
  url = {https://arxiv.org/abs/2406.11717}
}

@article{zou2023representation,
  title = {Representation Engineering: A Top-Down Approach to AI Transparency},
  author = {Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. Zico and Hendrycks, Dan},
  journal = {arXiv preprint arXiv:2310.01405},
  year = {2023},
  url = {https://arxiv.org/abs/2310.01405}
}

@inproceedings{hu2022lora,
  title = {{LoRA}: Low-Rank Adaptation of Large Language Models},
  author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2022},
  url = {https://arxiv.org/abs/2106.09685}
}

@inproceedings{akiba2019optuna,
  title = {Optuna: A Next-generation Hyperparameter Optimization Framework},
  author = {Akiba, Takuya and Sano, Shotaro and Yanase, Toshihiko and Ohta, Takeru and Koyama, Masanori},
  booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages = {2623--2631},
  year = {2019},
  url = {https://arxiv.org/abs/1907.10902}
}

@inproceedings{bergstra2011algorithms,
  title = {Algorithms for Hyper-Parameter Optimization},
  author = {Bergstra, James and Bardenet, R{\'e}mi and Bengio, Yoshua and K{\'e}gl, Bal{\'a}zs},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  pages = {2546--2554},
  year = {2011},
  url = {https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization}
}

@article{wang2021pacmap,
  title = {Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization},
  author = {Wang, Yingfan and Huang, Haiyang and Rudin, Cynthia and Shaposhnik, Yaron},
  journal = {Journal of Machine Learning Research},
  volume = {22},
  pages = {1--73},
  year = {2021},
  url = {https://jmlr.org/papers/v22/20-1061.html}
}
```

</details>


## Citation

```bibtex
@software{prometheus,
  author = {Wu, Wangzhang},
  title = {Prometheus: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/prometheus}
}
```


## Acknowledgments

Prometheus was initially inspired by [Heretic](https://github.com/p-e-w/heretic).

```bibtex
@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}
```


## Contributing

Contributions are welcome! Please open an issue to discuss your idea before submitting a pull request.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/your-feature`)
3. Commit your changes
4. Push to your fork and open a pull request

All contributions are released under the [AGPL-3.0](LICENSE) license.


## License

[AGPL-3.0](LICENSE)