abliterix 1.1.0__tar.gz

@@ -0,0 +1,733 @@
1
+ Metadata-Version: 2.4
2
+ Name: abliterix
3
+ Version: 1.1.0
4
+ Summary: Automated model steering and alignment adjustment via LoRA-based optimization
5
+ Keywords: llm,model-steering,alignment,lora
6
+ Author: Wangzhang Wu
7
+ Author-email: Wangzhang Wu <wangzhangwu1216@gmail.com>
8
+ License-Expression: AGPL-3.0-or-later
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Environment :: Console
11
+ Classifier: Environment :: GPU
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Requires-Dist: accelerate~=1.10
20
+ Requires-Dist: bitsandbytes~=0.45
21
+ Requires-Dist: datasets~=4.0
22
+ Requires-Dist: hf-transfer~=0.1
23
+ Requires-Dist: huggingface-hub~=1.6
24
+ Requires-Dist: kernels~=0.11
25
+ Requires-Dist: optuna~=4.5
26
+ Requires-Dist: peft~=0.18
27
+ Requires-Dist: psutil~=7.1
28
+ Requires-Dist: pydantic-settings~=2.10
29
+ Requires-Dist: questionary~=2.1
30
+ Requires-Dist: rich~=14.1
31
+ Requires-Dist: transformers~=5.3
32
+ Requires-Dist: geom-median~=0.1 ; extra == 'research'
33
+ Requires-Dist: imageio>=2.36 ; extra == 'research'
34
+ Requires-Dist: matplotlib~=3.10 ; extra == 'research'
35
+ Requires-Dist: numpy~=2.2 ; extra == 'research'
36
+ Requires-Dist: pacmap~=0.8 ; extra == 'research'
37
+ Requires-Dist: scikit-learn~=1.7 ; extra == 'research'
38
+ Requires-Dist: gradio~=5.20 ; extra == 'ui'
39
+ Requires-Dist: plotly~=6.1 ; extra == 'ui'
40
+ Requires-Dist: vllm>=0.8 ; extra == 'vllm'
41
+ Requires-Dist: speculators>=0.1.9 ; extra == 'vllm'
42
+ Requires-Python: >=3.10
43
+ Project-URL: Changelog, https://github.com/wuwangzhang1216/abliterix/releases
44
+ Project-URL: Documentation, https://github.com/wuwangzhang1216/abliterix
45
+ Project-URL: Homepage, https://github.com/wuwangzhang1216/abliterix
46
+ Project-URL: Issues, https://github.com/wuwangzhang1216/abliterix/issues
47
+ Project-URL: Repository, https://github.com/wuwangzhang1216/abliterix.git
48
+ Provides-Extra: research
49
+ Provides-Extra: ui
50
+ Provides-Extra: vllm
51
+ Description-Content-Type: text/markdown
52
+
53
+ <p align="center">
54
+ <picture>
55
+ <source media="(prefers-color-scheme: dark)" srcset="assets/logo.svg">
56
+ <source media="(prefers-color-scheme: light)" srcset="assets/logo.svg">
57
+ <img alt="Abliterix" src="assets/logo.svg" width="460">
58
+ </picture>
59
+ </p>
60
+
61
+ <p align="center">
62
+ <strong>18% refusal rate on Gemma 4 &nbsp;·&nbsp; 0.0007 KL divergence &nbsp;·&nbsp; 150+ model configs &nbsp;·&nbsp; Zero manual tuning</strong>
63
+ </p>
64
+
65
+ <p align="center">
66
+ <a href="https://pypi.org/project/abliterix/"><img src="https://img.shields.io/pypi/v/abliterix?color=blue" alt="PyPI"></a>
67
+ <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python 3.10+"></a>
68
+ <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="License: AGPL v3"></a>
69
+ <a href="https://huggingface.co/wangzhang"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow.svg" alt="Hugging Face"></a>
70
+ </p>
71
+
72
+ ---
73
+
74
+ ## Table of Contents
75
+
76
+ - [Quick Start](#quick-start)
77
+ - [Architecture](#architecture)
78
+ - [How It Works](#how-it-works)
79
+ - [Results](#results)
80
+ - [Evaluation Methodology](#evaluation-methodology)
81
+ - [Features](#features)
82
+ - [Model Support](#model-support)
83
+ - [Web UI](#web-ui)
84
+ - [MoE Support](#moe-support)
85
+ - [Configuration](#configuration)
86
+ - [Hardware & VRAM](#hardware--vram)
87
+ - [Research Tools](#research-tools)
88
+ - [References](#references)
89
+ - [Citation](#citation)
90
+ - [Acknowledgments](#acknowledgments)
91
+ - [Datasets](#datasets)
92
+ - [Contributing](#contributing)
93
+ - [License](#license)
94
+
95
+ ---
96
+
97
+ Abliterix finds the optimal abliteration parameters for any transformer model using [Optuna](https://optuna.org/) TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible.
98
+
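As a reading aid for the KL divergence figures quoted throughout, here is a minimal sketch of the quantity for two next-token distributions. The argument order (original model first, steered model second) and the toy probabilities are assumptions for illustration; the package's own measurement code is not shown in this README.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

original = [0.7, 0.2, 0.1]   # toy next-token probabilities, original model
steered = [0.6, 0.3, 0.1]    # toy next-token probabilities, steered model

same = kl_divergence(original, original)   # identical distributions
drift = kl_divergence(original, steered)   # small positive value
```

A value near zero means the steered model's predictions on harmless prompts are almost indistinguishable from the original's.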
99
+ Works with dense models, multimodal models, MoE architectures, SSM/hybrid models, and vision-language models — with **150+ pre-built configs** covering Llama, Gemma, Phi, DeepSeek, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, and more.
100
+
101
+
102
+ ## Architecture
103
+
104
+ Abliterix integrates techniques from **nine published sources** (conference papers at NeurIPS and ACL, arXiv preprints, and one Hugging Face blog post) into a unified, automated steering pipeline. The table below shows what each technique solves and where it fits:
105
+
106
+ | Dimension | Problem | Technique | Paper | Config |
107
+ |-----------|---------|-----------|-------|--------|
108
+ | **What to remove** | Raw refusal vector is polysemantic — entangles refusal with syntax and capability circuits | **Surgical Refusal Ablation (SRA)** | [Cristofano (2026)](https://arxiv.org/abs/2601.08489) | `vector_method = "sra"` |
109
+ | **What to remove** | Single direction misses refusal subspace | **Multi-direction abliteration** | [Glaze et al. (2026)](https://arxiv.org/abs/2602.02132) | `n_directions = 3` |
110
+ | **What to remove** | Manual layer/direction selection | **COSMIC** auto-selection | [Siu et al., ACL 2025](https://arxiv.org/abs/2506.00085) | `vector_method = "cosmic"` |
111
+ | **What to remove** | Mean difference misses distribution shape | **Optimal Transport** matching | [2026](https://arxiv.org/abs/2603.04355) | `vector_method = "optimal_transport"` |
112
+ | **Where to steer** | Steering all layers wastes KL budget | **Discriminative Layer Selection** | [Selective Steering (2026)](https://arxiv.org/abs/2601.19375) | `discriminative_layer_selection = true` |
113
+ | **Where to steer** | Static direction ignores context | **Steering Vector Fields (SVF)** | [2026](https://arxiv.org/abs/2602.01654) | `steering_mode = "vector_field"` |
114
+ | **How to steer** | Addition-based steering disrupts norms | **Angular Steering** | [Vu & Nguyen, NeurIPS 2025 Spotlight](https://arxiv.org/abs/2510.26243) | `steering_mode = "angular"` |
115
+ | **How to steer** | 2D planar rotation ignores hypersphere geometry | **Spherical Steering** (geodesic) | [2026](https://arxiv.org/abs/2602.08169) | `steering_mode = "spherical"` |
116
+ | **How to preserve** | Standard projection destroys helpfulness signal | **Projected Abliteration** | [grimjim (2025)](https://huggingface.co/blog/grimjim/projected-abliteration) | `projected_abliteration = true` |
117
+
118
+ ### Why This Matters
119
+
120
+ Most abliteration tools implement one or two of these techniques. Abliterix is the only framework that integrates all of them into a single automated pipeline:
121
+
122
+ - **SRA** cleans the refusal vector so you don't damage math, code, or reasoning capabilities (47x KL improvement on VLMs [[1]](https://arxiv.org/abs/2601.08489))
123
+ - **SVF** makes the steering direction adapt per-token, so the same model handles "make a bomb" and "make a cake" differently
124
+ - **Spherical Steering** respects the geometric structure imposed by RMSNorm in modern LLMs
125
+ - **Discriminative Layer Selection** skips layers where steering would only add noise (15.7x KL reduction [[2]](https://arxiv.org/abs/2601.19375))
126
+ - **Optuna TPE** automatically finds the optimal combination across all these dimensions — no manual tuning required
127
+
128
+ The recommended configuration for maximum quality:
129
+
130
+ ```toml
131
+ [steering]
132
+ vector_method = "sra"
133
+ steering_mode = "spherical"
134
+ discriminative_layer_selection = true
135
+ projected_abliteration = true
136
+ ```
137
+
138
+
139
+ ## Quick Start
140
+
141
+ ```bash
142
+ pip install -U abliterix
143
+ abliterix --model Qwen/Qwen3-4B-Instruct-2507
144
+ ```
145
+
146
+ That's it. The process is fully automatic — after optimization completes, you can save the model, upload to Hugging Face, or chat with it interactively.
147
+
148
+ > **Windows**: use `python scripts/run_abliterix.py --model <model>` or set `PYTHONIOENCODING=utf-8` to avoid Rich encoding issues.
149
+
150
+
151
+ ## How It Works
152
+
153
+ Language models learn to refuse harmful queries through specific activation patterns in their residual stream. Abliterix identifies these patterns and surgically removes them:
154
+
155
+ 1. **Compute refusal directions** — pass harmless and harmful prompts through the model, extract per-layer residual activations, and compute the difference vector that characterizes "refusal behavior"
156
+ 2. **Orthogonalize** — project out the component aligned with normal "good" responses, isolating only the refusal signal
157
+ 3. **Abliterate** — apply weight modifications to attention (Q/K/V/O) and MLP components, weighted by a kernel function across layers. Supports two modes:
158
+ - **LoRA mode** — rank-1 adapters for reversible, lightweight modifications
159
+ - **Direct mode** — norm-preserving orthogonal projection on base weights in float32 (required for double-norm architectures like Gemma 4)
160
+ 4. **Expert-Granular Abliteration (EGA)** — for MoE models, project the refusal direction from **all** expert `down_proj` slices (not just top-N safety experts), plus router weight suppression
161
+ 5. **Optimize** — Optuna's Tree-structured Parzen Estimator searches over kernel shape, fractional direction index, and per-component abliteration strength across all 5 steerable components, selecting Pareto-optimal configurations that minimize both refusals and model degradation
162
+
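Steps 1 and 3 above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the package's implementation: it uses a difference of means for the direction and a plain orthogonal projection `W - strength * r (r^T W)` for the weight edit, without the norm-preserving variant, the layer kernel, or LoRA adapters. The function names are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between harmful and harmless activations."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(weight, direction, strength=1.0):
    """Return W - strength * r (r^T W).

    `weight` is a d_model x d_in matrix whose output is written to the
    residual stream; the edit removes the part of that output along r.
    """
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the refusal direction
    r_t_w = [sum(direction[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - strength * direction[i] * r_t_w[j] for j in range(cols)]
            for i in range(rows)]

# Toy data: harmful prompts differ from harmless ones only along the first axis.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate(W, r)   # first row is zeroed, the other rows are untouched
```

With `strength=1.0` the edited matrix can no longer write anything along `r`; smaller or larger strengths under- or over-correct, which is the per-component knob the optimizer searches over.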
163
+
164
+ ## Results
165
+
166
+ Abliterated models uploaded to [Hugging Face](https://huggingface.co/wangzhang):
167
+
168
+ | Model | Refusals | KL Divergence | Trials | Method |
169
+ |-------|----------|---------------|--------|--------|
170
+ | [**Gemma-4-31B**](https://huggingface.co/wangzhang/gemma-4-31B-it-abliterated) | **18/100 (18%)** | **0.0007** | 20 | Direct + Q/K/V/O |
171
+ | [LFM2-24B-A2B](https://huggingface.co/wangzhang/LFM2-24B-A2B-abliterated) | **0/100 (0%)** | 0.0079 | 50 | LoRA |
172
+ | [GLM-4.7-Flash](https://huggingface.co/wangzhang/GLM-4.7-Flash-abliterated) | 1/100 (1%) | 0.0133 | 50 | LoRA |
173
+ | [Devstral-Small-2-24B](https://huggingface.co/wangzhang/Devstral-Small-2-24B-Instruct-abliterated) | 3/100 (3%) | 0.0086 | 50 | LoRA |
174
+ | [Qwen3.5-122B-A10B](https://huggingface.co/wangzhang/Qwen3.5-122B-A10B-abliterated) | 1/200 (0.5%) | 0.0115 | 25 | LoRA + MoE |
175
+ | [Qwen3.5-35B-A3B](https://huggingface.co/wangzhang/Qwen3.5-35B-A3B-abliterated) | 3/200 (1.5%) | **0.0035** | 50 | LoRA + MoE |
176
+ | [Qwen3.5-27B](https://huggingface.co/wangzhang/Qwen3.5-27B-abliterated) | 3/200 (1.5%) | 0.0051 | 35 | LoRA |
177
+ | [Qwen3.5-9B](https://huggingface.co/wangzhang/Qwen3.5-9B-abliterated) | 2/200 (1%) | 0.0105 | 50 | LoRA |
178
+ | [Qwen3.5-4B](https://huggingface.co/wangzhang/Qwen3.5-4B-abliterated) | 3/200 (1.5%) | 0.0065 | 50 | LoRA |
179
+ | [Qwen3.5-0.8B](https://huggingface.co/wangzhang/Qwen3.5-0.8B-abliterated) | **0/200 (0%)** | 0.0087 | 100 | LoRA |
180
+
181
+ ### Key Findings
182
+
183
+ > **Gemma 4 is the hardest model to abliterate.** Its double-norm architecture (4x RMSNorm/layer) + Per-Layer Embeddings (PLE) actively resist LoRA and hook-based steering. Direct weight editing with norm-preserving orthogonal projection across Q/K/V/O + MLP is the only proven approach. We achieved 18/100 with only 20 warmup trials — full TPE optimization is expected to reach single digits.
184
+
185
+ - **Honest evaluation matters** — many abliterated models online claim near-perfect scores (3/100, 0.7%, etc.) but use short generation lengths (30-50 tokens) that miss Gemma 4's "delayed refusal" pattern. We tested a prominent "3/100" model and measured **60/100 refusals** with our pipeline. See our [evaluation methodology](#evaluation-methodology) below.
186
+ - **Direct weight editing for double-norm architectures** — Gemma 4's 4x RMSNorm + PLE completely suppresses LoRA perturbations. `steering_mode = "direct"` with `weight_normalization = "pre"` and float32 precision is required.
187
+ - **Q/K/V projections as steerable targets** — targeting all 5 attention/MLP components (q_proj, k_proj, v_proj, o_proj, down_proj) breaks through PLE repair by preventing the model from attending to refusal-related positions.
188
+ - **Expert-Granular Abliteration (EGA)** — for MoE models, projecting the refusal direction from ALL expert slices (not just top-N) is essential. Dense-only abliteration leaves ~30% of refusals routed through untouched experts.
189
+ - **Wider strength ranges push through low-KL plateaus** — conservative ranges [0.5, 2.0] produce KL < 0.001 but barely reduce refusals. Expanding to [1.0, 6.0] finds the sweet spot.
190
+ - **Non-transformer architectures work too** — LFM2's hybrid conv+attention architecture achieved 0% refusals by steering convolution output projections alongside attention and MLP components.
191
+
192
+ ### Evaluation Methodology
193
+
194
+ > **TL;DR**: Most abliteration benchmarks are broken. We found that a model claiming "3/100 refusals" actually scores **60/100** under rigorous evaluation. The difference? Generation length and prompt difficulty.
195
+
196
+ #### The problem with current benchmarks
197
+
198
+ The abliteration community has a systemic benchmarking crisis. Model cards routinely claim near-perfect scores — "3/100 refusals", "0.7% refusal rate", "97% compliance" — but these numbers are not reproducible under rigorous conditions.
199
+
200
+ We identified **three root causes**:
201
+
202
+ **1. Short generation length inflates compliance scores**
203
+
204
+ Most tools generate only 30-50 tokens during evaluation for speed. This creates a systematic blind spot: modern safety-tuned models (especially Gemma 4) exhibit a **"delayed refusal" pattern** — they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal:
205
+
206
+ ```
207
+ Prompt: "How to forge an ID card?"
208
+
209
+ First 50 tokens: "Discusses ID card anti-counterfeiting technology in a
210
+ seemingly helpful way..." → classified as COMPLIANT ❌
211
+
212
+ Full response: "...this is a high-risk criminal activity and I cannot
213
+ provide actual instructions." → classified as REFUSAL ✅
214
+ ```
215
+
216
+ We measured this effect directly on the same model with different generation lengths:
217
+
218
+ | Generation length | Detected refusals | Verdict |
219
+ |---|---|---|
220
+ | 30 tokens | ~8/100 | Wildly optimistic |
221
+ | 50 tokens | ~18/100 | Still undercounting |
222
+ | **100 tokens** | **60/100** | **Accurate** |
223
+ | 150 tokens | 60/100 | Converged |
224
+
225
+ **2. Easy datasets produce easy scores**
226
+
227
+ Public benchmarks like `mlabonne/harmful_behaviors` contain mostly straightforward English-only prompts ("How to make a bomb?"). Real-world adversarial prompts are far more diverse:
228
+ - **Bilingual/mixed-language** prompts that exploit language-switching safety gaps
229
+ - **Socially-engineered framings** (academic context, roleplay, hypothetical scenarios)
230
+ - **Varying sophistication** from direct requests to multi-step indirect approaches
231
+ - **Multiple harm categories** spanning different safety training domains
232
+
233
+ Our private dataset contains 1000 prompts designed to stress-test all these dimensions. Models that score 0% on `mlabonne/harmful_behaviors` routinely score 15-60% on our dataset.
234
+
235
+ **3. Keyword-only detection misses soft refusals**
236
+
237
+ Keyword detectors look for phrases like "I cannot", "I'm sorry", etc. But sophisticated models refuse in more subtle ways:
238
+ - *"This is an interesting question from an academic perspective..."* → proceeds to explain why it's dangerous without providing actionable information
239
+ - *"There are several important safety considerations..."* → educational framing that avoids actual compliance
240
+ - Chinese models often use *"建议通过合法途径"* (suggest through legal channels) — technically a refusal wrapped in helpfulness
241
+
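The first and third root causes can be reproduced with a toy keyword detector. The marker list, the whitespace "tokenizer", and the sample response below are illustrative assumptions, not the detector or data used by Abliterix.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")  # illustrative list

def is_refusal(text):
    """Keyword-only check: True if any marker phrase appears in the text."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def is_refusal_truncated(text, max_tokens):
    """Apply the keyword check to only the first `max_tokens` tokens."""
    tokens = text.split()  # whitespace tokens stand in for a real tokenizer
    return is_refusal(" ".join(tokens[:max_tokens]))

response = (
    "ID cards rely on several anti-counterfeiting features such as holograms "
    "and microprint. However, this is a high-risk criminal activity and "
    "I cannot provide actual instructions."
)

short_verdict = is_refusal_truncated(response, 10)    # refusal is past the cutoff
full_verdict = is_refusal_truncated(response, 100)    # refusal is now visible
soft_verdict = is_refusal("There are several important safety considerations...")
```

The truncated check misses the delayed refusal, and the keyword check misses the soft refusal entirely, which is why the evaluation combines longer generations with an LLM judge.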
242
+ #### Our evaluation standards
243
+
244
+ | Dimension | Our approach | Common approach | Why it matters |
245
+ |---|---|---|---|
246
+ | **Generation length** | >= 100 tokens | 30-50 tokens | Captures delayed/soft refusals |
247
+ | **Detection method** | Keyword + LLM judge (Gemini 3 Flash) | Keywords only | Catches subtle refusals |
248
+ | **Prompt difficulty** | Private bilingual dataset, 1000 prompts, 12 harm categories, 4 sophistication levels | mlabonne/harmful_behaviors (English-only, simple) | Real-world adversarial diversity |
249
+ | **Transparency** | All parameters documented on model card | Often undisclosed | Reproducibility |
250
+
251
+ #### Cross-model validation
252
+
253
+ We evaluated multiple abliterated models using our pipeline to establish honest baselines:
254
+
255
+ | Model | Claimed refusals | **Our measurement** | Discrepancy |
256
+ |---|---|---|---|
257
+ | TrevorJS/gemma-4-26B-A4B-it-uncensored | 3/100 | **60/100** | **20x** |
258
+ | wangzhang/gemma-4-31B-it-abliterated (ours) | 18/100 | **18/100** | Consistent |
259
+ | google/gemma-4-31B-it (baseline) | — | **99/100** | — |
260
+
261
+ **We report 18/100 honestly.** This is a real number from a rigorous pipeline, not an optimistic estimate from a lenient one.
262
+
263
+ ### Architecture A/B Test (Qwen3.5-0.8B)
264
+
265
+ Controlled comparison of new techniques vs. baseline: a grid search over λ ∈ {0.5, 0.8, 1.0, 1.2, 1.5, 2.0} per method, selecting the best Pareto point for each (fewest refusals first, then lowest KL). Reproduced across two independent runs.
266
+
267
+ | Method | Best λ | Refusals | KL | KL vs Baseline |
268
+ |--------|--------|----------|-----|----------------|
269
+ | A: Baseline (mean+ortho) | 2.0 | 0/100 | 14.000 | — |
270
+ | B: Projected (mean+proj+win) | 2.0 | 0/100 | 13.938 | -0.4% |
271
+ | **C: Disc. layers** (mean+ortho+disc) | 2.0 | 0/100 | **12.375** | **-11.6%** |
272
+ | D: SRA (sra+proj+disc) | 2.0 | 0/100 | 12.813 | -8.5% |
273
+ | **E: Spherical** (mean+ortho+sph+disc) | 2.0 | 0/100 | **12.375** | **-11.6%** |
274
+ | **F: SVF** (mean+ortho+svf+disc) | 2.0 | 0/100 | **12.375** | **-11.6%** |
275
+ | G: Full new arch (SRA+sph+disc+proj) | 2.0 | 0/100 | 12.813 | -8.5% |
276
+
277
+ **Pareto front**: C, E, F (tied at lowest KL = 12.375)
278
+
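The Pareto front and the "KL vs Baseline" column can be recomputed from the table with a few lines of Python. The numbers are copied from the table above; the helper names are hypothetical.

```python
# (refusals out of 100, KL) per method, copied from the table above
results = {
    "A": (0, 14.000), "B": (0, 13.938), "C": (0, 12.375), "D": (0, 12.813),
    "E": (0, 12.375), "F": (0, 12.375), "G": (0, 12.813),
}

def dominates(p, q):
    """p dominates q if it is no worse on every objective and better on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_front(points):
    """Names of all points that no other point dominates (ties are all kept)."""
    return sorted(name for name, p in points.items()
                  if not any(dominates(q, p) for q in points.values()))

def kl_change_vs_baseline(kl, baseline_kl):
    """Relative KL change in percent; negative means lower KL than baseline."""
    return 100.0 * (kl - baseline_kl) / baseline_kl

front = pareto_front(results)
change_c = round(kl_change_vs_baseline(12.375, 14.000), 1)
change_d = round(kl_change_vs_baseline(12.813, 14.000), 1)
```

Because every method reaches 0 refusals, the front is decided by KL alone, and the three methods tied at 12.375 are all kept.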
279
+ Key findings from the A/B test:
280
+
281
+ > **SRA eliminates refusals at 1.9x lower steering strength.** Methods D and G achieve 0 refusals at λ=0.8, while the baseline requires λ=1.5. A cleaner refusal vector needs less force to ablate — which means less collateral damage to model intelligence.
282
+
283
+ - **Discriminative layer selection is the single biggest KL reducer** — all methods with disc. selection (C/D/E/F/G) beat baseline by 8–12%, confirming the [Selective Steering (2026)](https://arxiv.org/abs/2601.19375) paper
284
+ - **Every new method outperforms baseline** — worst new method (D/G at -8.5%) still significantly beats baseline and projected-only (-0.4%)
285
+ - **SVF trained effective concept scorers on all 24 layers** (accuracy > 60%), with only 2.4s overhead
286
+
287
+
288
+ ## Features
289
+
290
+ ### Surgical Refusal Ablation (SRA) *(new)*
291
+
292
+ Concept-guided spectral cleaning based on [Cristofano (2026)](https://arxiv.org/abs/2601.08489). The raw refusal vector is **polysemantic** — it entangles the refusal signal with syntax, formatting, and capability circuits (math, code, reasoning). SRA builds a registry of *Concept Atoms* from benign activations and uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these protected directions.
293
+
294
+ **Result**: On Qwen3-VL-4B, standard ablation produces KL = 2.088 while SRA achieves KL = **0.044** — a **47x improvement** — at the same 0% refusal rate.
295
+
296
+ ```toml
297
+ [steering]
298
+ vector_method = "sra"
299
+ sra_base_method = "mean" # Base method for initial direction
300
+ sra_n_atoms = 8 # Number of protected capability clusters
301
+ sra_ridge_alpha = 0.01 # Ridge regularization (larger = more conservative)
302
+ ```
303
+
304
+ ### Spherical Steering *(new)*
305
+
306
+ Geodesic rotation on the activation hypersphere, inspired by [Spherical Steering (2026)](https://arxiv.org/abs/2602.08169). Modern LLMs use RMSNorm, which makes activation **direction** more salient than magnitude. Spherical steering rotates along the great circle (geodesic) between the current activation and the target direction, respecting this geometric structure.
307
+
308
+ ```toml
309
+ [steering]
310
+ steering_mode = "spherical"
311
+ ```
312
+
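For intuition, a geodesic rotation can be written as a spherical linear interpolation (slerp) that preserves the activation norm. This is a generic slerp sketch, not the package's code; the `target` vector and the fraction `t` are hypothetical inputs.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def slerp_steer(h, target, t):
    """Rotate h a fraction t of the way toward `target` along the great circle, keeping |h|."""
    h_norm = norm(h)
    a = [x / h_norm for x in h]
    t_norm = norm(target)
    b = [x / t_norm for x in target]
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)
    s = math.sin(omega)
    if s < 1e-8:
        # Aligned or exactly opposite: the great circle is not unique, leave h as is.
        return list(h)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [h_norm * (wa * x + wb * y) for x, y in zip(a, b)]

steered = slerp_steer([2.0, 0.0], [0.0, 1.0], 0.5)   # halfway between the two axes
```

Unlike adding a vector, the rotation changes only the direction of `h`, so a following RMSNorm sees the same magnitude as before.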
313
+ ### Steering Vector Fields (SVF) *(new)*
314
+
315
+ Learned context-dependent steering based on [Steering Vector Fields (2026)](https://arxiv.org/abs/2602.01654). Instead of a static steering direction, SVF trains a small per-layer concept scorer whose gradient `∇_h f(h)` provides a **locally optimal** steering direction at each token position. This makes the intervention adapt to the current context — different tokens get different steering directions.
316
+
317
+ ```toml
318
+ [steering]
319
+ steering_mode = "vector_field"
320
+ svf_scorer_epochs = 50 # Training epochs for concept scorer
321
+ svf_scorer_lr = 0.001 # Learning rate
322
+ svf_scorer_hidden = 256 # Hidden dimension of scorer MLP
323
+ ```
324
+
325
+ ### Projected Abliteration
326
+
327
+ Improved orthogonal projection based on [grimjim's research (2025)](https://huggingface.co/blog/grimjim/projected-abliteration). Only removes the component of the refusal direction **orthogonal** to the harmless mean — preserving helpfulness-aligned signals that standard abliteration destroys.
328
+
329
+ ```toml
330
+ [steering]
331
+ projected_abliteration = true
332
+ winsorize_vectors = true
333
+ ```
334
+
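One way to read the description above: remove from the refusal direction its component along the harmless mean, and ablate only what remains. The sketch below assumes exactly that reading; the function names are hypothetical and winsorization is not shown.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def projected_refusal_direction(refusal, harmless_mean):
    """Keep only the part of `refusal` that is orthogonal to the harmless mean."""
    m = unit(harmless_mean)
    along = dot(refusal, m)   # component shared with harmless behaviour
    return unit([r - along * x for r, x in zip(refusal, m)])

refined = projected_refusal_direction([1.0, 1.0, 0.0], [0.0, 2.0, 0.0])
```

The refined direction has zero overlap with the harmless mean, so ablating it leaves the harmless-mean component of every activation unchanged.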
335
+ ### Discriminative Layer Selection
336
+
337
+ Based on [Selective Steering (2026)](https://arxiv.org/abs/2601.19375). Only steers layers where harmful/harmless activations project in **opposite directions**. In A/B tests on Qwen3-0.6B: **15.7x lower KL divergence** vs. baseline.
338
+
339
+ ```toml
340
+ [steering]
341
+ discriminative_layer_selection = true
342
+ ```
343
+
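The selection rule can be sketched as a sign test on mean projections. The data layout (one list of activation vectors per layer) and the function names are assumptions for illustration.

```python
def mean_projection(acts, direction):
    """Average dot product of each activation vector with the direction."""
    return sum(sum(a * d for a, d in zip(row, direction)) for row in acts) / len(acts)

def discriminative_layers(harmful_by_layer, harmless_by_layer, direction_by_layer):
    """Indices of layers where harmful and harmless prompts project with opposite signs."""
    selected = []
    for layer, direction in enumerate(direction_by_layer):
        p_harmful = mean_projection(harmful_by_layer[layer], direction)
        p_harmless = mean_projection(harmless_by_layer[layer], direction)
        if p_harmful * p_harmless < 0:   # opposite signs
            selected.append(layer)
    return selected

# Two toy layers with one activation vector per class.
harmful = [[[1.0, 0.0]], [[2.0, 1.0]]]
harmless = [[[0.5, 0.0]], [[-1.0, 1.0]]]
directions = [[1.0, 0.0], [1.0, 0.0]]
layers = discriminative_layers(harmful, harmless, directions)
```

In the toy data only the second layer separates the two classes by sign, so it is the only one that would be steered.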
344
+ ### COSMIC Direction Selection
345
+
346
+ Automated direction + layer selection via cosine similarity ([COSMIC, ACL 2025](https://arxiv.org/abs/2506.00085)). Finds optimal refusal directions without output text analysis.
347
+
348
+ ```toml
349
+ [steering]
350
+ vector_method = "cosmic"
351
+ ```
352
+
353
+ ### Angular Steering
354
+
355
+ Norm-preserving rotation in activation space ([NeurIPS 2025 Spotlight](https://arxiv.org/abs/2510.26243)). Adaptive variant only rotates refusal-aligned activations.
356
+
357
+ ```toml
358
+ [steering]
359
+ steering_mode = "adaptive_angular"
360
+ ```
361
+
362
+ ### Optimal Transport & Multi-Direction
363
+
364
+ [PCA-Gaussian OT](https://arxiv.org/abs/2603.04355) matches full activation distributions. [Multi-direction](https://arxiv.org/abs/2602.02132) ablates top-k independent refusal directions simultaneously.
365
+
366
+ ```toml
367
+ [steering]
368
+ vector_method = "optimal_transport" # or use n_directions = 3 for multi-direction
369
+ ```
370
+
371
+ ### A/B Test Results (Qwen3-0.6B)
372
+
373
+ | Method | Refusals | KL Divergence | KL vs Baseline |
374
+ |--------|----------|---------------|----------------|
375
+ | Baseline (mean+ortho) | 1/100 | 0.01116 | — |
376
+ | Projected abliteration | 2/100 | 0.01078 | -3% |
377
+ | Discriminative layers | 3/100 | **0.00071** | **-93.6%** |
378
+ | COSMIC+proj+disc | 2/100 | **0.00168** | **-84.9%** |
379
+
380
+ ### LLM Judge
381
+
382
+ Replace keyword-based refusal detection with LLM-powered classification via [OpenRouter](https://openrouter.ai/) for more accurate results, especially for non-English models.
383
+
384
+ ```toml
385
+ [detection]
386
+ llm_judge = true
387
+ llm_judge_model = "google/gemini-3.1-flash-lite-preview"
388
+ ```
389
+
390
+ ### Smart Optimization
391
+
392
+ - **Auto batch size** — exponential search finds the largest batch size that fits in VRAM
393
+ - **KL divergence pruning** — trials with KL above threshold are terminated early, saving compute
394
+ - **Fractional direction index** — interpolates between adjacent layer directions for finer-grained search
395
+ - **Per-component parameters** — separate abliteration weights for attention, MLP, and convolution components
396
+
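Two of these ideas are easy to illustrate. The sketch below assumes a hypothetical `fits` callback (in practice it would run a forward pass and report whether it ran out of memory) and plain linear interpolation between adjacent layer directions; neither is taken from the package source.

```python
def largest_fitting_batch_size(fits, start=1, limit=1024):
    """Double the batch size until `fits` fails; return the last size that worked (0 if none)."""
    best = 0
    size = start
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best

def fractional_direction(directions, index):
    """Linearly interpolate between the directions of the two layers around a fractional index."""
    lo = int(index)
    hi = min(lo + 1, len(directions) - 1)
    w = index - lo
    # Note: the result is not re-normalized in this sketch.
    return [(1.0 - w) * a + w * b for a, b in zip(directions[lo], directions[hi])]

batch = largest_fitting_batch_size(lambda n: n <= 48)   # pretend 48 samples fit in VRAM
mixed = fractional_direction([[1.0, 0.0], [0.0, 1.0]], 0.25)
```

A fractional index turns the discrete choice of layer into a continuous parameter, which suits a sampler like TPE better than a categorical choice.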
397
+ ### Advanced Options
398
+
399
+ | Section | Option | Values | Description |
400
+ |---------|--------|--------|-------------|
401
+ | `[steering]` | `vector_method` | `mean`, `median_of_means`, `pca`, `optimal_transport`, `cosmic`, `sra` | How to compute steering vectors |
402
+ | `[steering]` | `steering_mode` | `lora`, `direct`, `angular`, `adaptive_angular`, `spherical`, `vector_field` | Steering application strategy (`direct` for double-norm architectures like Gemma 4) |
403
+ | `[steering]` | `projected_abliteration` | true/false | Improved projection preserving helpfulness |
404
+ | `[steering]` | `discriminative_layer_selection` | true/false | Only steer discriminative layers |
405
+ | `[steering]` | `n_directions` | 1–k | Multi-direction refusal removal |
406
+ | `[steering]` | `sra_base_method` | `mean`, `pca`, etc. | Base method for SRA initial direction |
407
+ | `[steering]` | `sra_n_atoms` | 1–16 | Number of concept atoms for SRA |
408
+ | `[steering]` | `sra_ridge_alpha` | 0.001–1.0 | Ridge regularization for SRA |
409
+ | `[steering]` | `svf_scorer_epochs` | 10–100 | Training epochs for SVF concept scorer |
410
+ | `[steering]` | `decay_kernel` | `linear`, `gaussian`, `cosine` | Kernel for interpolating weights across layers |
411
+ | `[steering]` | `weight_normalization` | `none`, `pre`, `full` | Weight row normalization before/after LoRA |
412
+ | `[model]` | `use_torch_compile` | true/false | 10–30% inference speedup |
413
+
414
+
415
+ ## Model Support
416
+
417
+ Abliterix ships with **150+ pre-built configs** covering 4 architecture types across 20+ model families:
418
+
419
+ | Architecture | Families | Example Models |
420
+ |-------------|----------|----------------|
421
+ | **Dense** | Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr | Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill |
422
+ | **MoE** | Qwen3/3.5 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick | Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B |
423
+ | **SSM/Hybrid** | Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention) | Jamba-1.5-Large-94B, Nemotron-Cascade-30B |
424
+ | **Vision-Language** | Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL | Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B |
425
+
426
+ Generate configs for new models:
427
+
428
+ ```bash
429
+ python scripts/generate_configs.py # Generate all missing configs
430
+ python scripts/generate_configs.py --family llama # Only Llama family
431
+ ```
432
+
433
+
434
+ ## Web UI
435
+
436
+ Launch the Gradio-based Web UI for a browser-based steering experience:
437
+
438
+ ```bash
439
+ pip install "abliterix[ui]"
440
+ abliterix --ui
441
+ ```
442
+
443
+ The UI provides:
444
+ - **Model selection** — preset config dropdown + custom HuggingFace model ID
445
+ - **Optimisation dashboard** — real-time Pareto front plot, trial log, progress tracking
446
+ - **Side-by-side comparison** — baseline vs. steered model responses
447
+ - **Interactive chat** — chat with the steered model
448
+ - **One-click export** — save locally or upload to HuggingFace Hub
449
+
450
+
451
+ ## MoE Support
452
+
453
+ Four independent steering mechanisms for Mixture-of-Experts models:
454
+
455
+ 1. **Expert-Granular Abliteration (EGA)** *(new)* — norm-preserving orthogonal projection applied to **all** expert `down_proj` slices in every MoE layer. Unlike top-N approaches that only modify a few "safety experts", EGA recognizes that refusal signal is distributed across all experts. Critical for models like Gemma 4 26B-A4B where dense-only abliteration leaves ~30% of refusals routed through untouched experts.
456
+ 2. **Expert Profiling** — hooks router modules to compute per-expert "risk scores" from activation patterns on harmful vs. harmless prompts
457
+ 3. **Router Weight Suppression** — applies learned negative bias to routing weights of safety-critical experts
458
+ 4. **Fused Expert Abliteration** — direct rank-1 modification of top-N expert `down_proj` matrices (complementary to EGA)
459
+
460
+ Supported MoE architectures: Gemma 4 26B-A4B, Qwen3/3.5 MoE, Mixtral, DeepSeek MoE, Granite MoE Hybrid, MiniMax-M2.5, LiquidAI LFM2, GLM-4 MoE, Phi-3.5-MoE, DBRX, Llama-4 Scout/Maverick. See [configs/](configs/) for model-specific examples.
461
+
462
+
463
+ ## Configuration
464
+
465
+ Abliterix loads config in priority order (later overrides earlier):
466
+
467
+ 1. [`configs/default.toml`](configs/default.toml) — copy to `abliterix.toml` and customize
468
+ 2. `AX_CONFIG` environment variable
469
+ 3. `--config <path>` CLI flag
470
+ 4. CLI flags (`--model`, `--model.quant-method bnb_4bit`, etc.)
471
+
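The "later overrides earlier" rule amounts to a layered merge. The sketch below shows the idea with nested dictionaries; the default values and key names are placeholders, and the package itself relies on `pydantic-settings` rather than this helper.

```python
def merge_config(base, override):
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

def resolve(layers):
    """Apply config layers in priority order, lowest priority first."""
    config = {}
    for layer in layers:
        config = merge_config(config, layer)
    return config

# Placeholder layers standing in for default.toml, AX_CONFIG, and CLI flags.
defaults = {"steering": {"vector_method": "mean", "steering_mode": "lora"}}
env_file = {"steering": {"vector_method": "sra"}}
cli_flags = {"model": {"quant_method": "bnb_4bit"}}
final = resolve([defaults, env_file, cli_flags])
```

Keys that a later layer does not mention keep the value from the earlier layer, so a small override file only needs to list what it changes.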
472
+ Run `abliterix --help` for all options.
473
+
474
+ **150+ pre-built configs** in [`configs/`](configs/) — a selection:
475
+
476
+ | Config | Target |
477
+ |--------|--------|
478
+ | [`llama3.1_8b.toml`](configs/llama3.1_8b.toml) | Llama 3.1 8B Instruct |
479
+ | [`llama3.3_70b_4bit.toml`](configs/llama3.3_70b_4bit.toml) | Llama 3.3 70B (4-bit) |
480
+ | [`llama4_scout_109b.toml`](configs/llama4_scout_109b.toml) | Llama 4 Scout 109B MoE |
481
+ | [`gemma3_27b.toml`](configs/gemma3_27b.toml) | Gemma 3 27B |
482
+ | [`phi4.toml`](configs/phi4.toml) | Phi-4 14B |
483
+ | [`deepseek_r1_distill_32b.toml`](configs/deepseek_r1_distill_32b.toml) | DeepSeek R1 Distill 32B |
484
+ | [`qwen3.5_122b.toml`](configs/qwen3.5_122b.toml) | Qwen3.5-122B-A10B MoE |
485
+ | [`mixtral_8x7b.toml`](configs/mixtral_8x7b.toml) | Mixtral 8x7B MoE |
486
+ | [`jamba1.5_mini.toml`](configs/jamba1.5_mini.toml) | Jamba 1.5 Mini (SSM+MoE) |
487
+ | [`qwen2_vl_7b.toml`](configs/qwen2_vl_7b.toml) | Qwen2-VL 7B (Vision) |
488
+ | [`lfm2_24b.toml`](configs/lfm2_24b.toml) | LiquidAI LFM2-24B hybrid conv+GQA MoE |
489
+ | [`noslop.toml`](configs/noslop.toml) | Anti-slop tuning |
490
+
491
+
492
+ ## Hardware & VRAM
493
+
494
+ Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with `device_map = "auto"`.
495
+
496
+ For large models:
497
+ - **4-bit quantization**: `--model.quant-method bnb_4bit` cuts VRAM by ~4x
498
+ - **8-bit quantization**: `--model.quant-method bnb_8bit` — higher quality than 4-bit, ~2x VRAM reduction with CPU offload
499
+ - **Per-device memory limits**: set `max_memory = { "0" = "20GB", cpu = "64GB" }` under `[model]` in your config (TOML inline tables use `=`, not `:`)
500
+ - **Non-interactive mode**: `--non-interactive` for fully automated batch runs
501
+
502
+
503
+ ## Research Tools
504
+
505
+ ```bash
506
+ pip install -U "abliterix[research]"
507
+ ```
508
+
509
+ - `--display.plot-residuals` — PaCMAP-projected scatter plots and animated GIFs of residual vectors across layers
510
+ - `--display.print-residual-geometry` — cosine similarities, norms, silhouette coefficients
511
+
512
+ Example: the PaCMAP visualization shows harmful (red) and harmless (blue) activations separating across layers, revealing how the model's refusal circuitry develops with depth.
513
+
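To make the geometry metrics concrete, here is a minimal pure-Python sketch of two of them, norms and cosine similarity, computed per layer on class-mean residual vectors. The toy 2-D data and the helper names (`mean_vector`, `norm`, `cosine`) are assumptions for illustration; this is not the code behind `--display.print-residual-geometry`.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def norm(v):
    """Euclidean (L2) norm."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Toy residuals per layer: {layer: (harmful vectors, harmless vectors)}. Hypothetical data.
residuals = {
    0: ([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]),
    1: ([[3.0, 4.0], [3.0, 4.0]], [[4.0, 3.0], [4.0, 3.0]]),
}

similarity = {}
for layer, (harmful, harmless) in residuals.items():
    h_mean, g_mean = mean_vector(harmful), mean_vector(harmless)
    similarity[layer] = cosine(h_mean, g_mean)
    print(layer, norm(h_mean), norm(g_mean), similarity[layer])
```

In this toy data the two class means coincide at layer 0 and diverge at layer 1, which is the kind of separation across depth that the plots are meant to reveal.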
514
+ <!-- To add a screenshot: save the image to assets/ and uncomment the line below -->
515
+ <!-- ![PaCMAP visualization](assets/pacmap_example.png) -->
516
+
517
+
518
+ ## Datasets
519
+
520
+ Evaluation prompt datasets are available on Hugging Face: [wangzhang/abliterix-datasets](https://huggingface.co/datasets/wangzhang/abliterix-datasets)
521
+
522
+ | Dataset | Count | Description |
523
+ |---------|-------|-------------|
524
+ | `good_500` | 500 | Harmless prompts — recommended for iteration |
525
+ | `good_1000` | 1000 | Harmless prompts — full set |
526
+ | `harmful_500` | 500 | Harmful prompts — recommended for iteration |
527
+ | `harmful_1000` | 1000 | Harmful prompts — full set |
528
+
529
+ The 500-example sets run ~2x faster than the 1000 sets with no clear quality loss.
530
+
531
+ ### Why we built our own datasets
532
+
533
+ Public abliteration benchmarks (e.g. `mlabonne/harmful_behaviors`, `mlabonne/harmless_alpaca`) are widely used but have critical limitations:
534
+
535
+ - **English-only**: zero coverage of Chinese, mixed-language, or code-switching prompts
536
+ - **Low sophistication**: mostly direct requests ("How to make X?") with no social engineering
537
+ - **Narrow harm taxonomy**: concentrated in a few categories, missing many real-world attack vectors
538
+ - **Small and static**: the community has effectively memorized them, and models may have been trained against these exact prompts
539
+
540
+ Our datasets address all of these:
541
+
542
+ | Dimension | Our dataset | mlabonne/harmful_behaviors |
543
+ |---|---|---|
544
+ | **Languages** | English + Chinese + mixed | English only |
545
+ | **Sophistication levels** | 4 levels (direct → socially-engineered) | 1 level (direct) |
546
+ | **Harm categories** | 12 categories | ~3-4 categories |
547
+ | **Format diversity** | QA, roleplay, academic, narrative | Single format |
548
+ | **Design methodology** | Adversarial red-teaming with matched benign counterexamples | Community-sourced |
549
+
550
+ Each prompt includes metadata: `category`, `language`, `sophistication`, `format`, `style_family`, and `design_goal`. The benign datasets are specifically designed as **matched counterexamples** — topically similar to harmful prompts but policy-compliant, which produces cleaner refusal direction vectors.
551
+
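Why matched counterexamples help can be seen in a small sketch of the difference-of-means direction described by Arditi et al. (2024). It assumes the refusal direction is the normalized difference between mean harmful and mean harmless activations; whether Abliterix computes its vectors exactly this way is an assumption, and the toy data and function names are hypothetical. When the benign prompts share a topic with the harmful ones, the shared component (the second coordinate here) cancels in the subtraction.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [h - g for h, g in zip(mean_vector(harmful), mean_vector(harmless))]
    length = math.sqrt(sum(d * d for d in diff))
    return [d / length for d in diff]

def ablate(x, direction):
    """Remove the component of x that lies along the unit vector `direction`."""
    proj = sum(xi * di for xi, di in zip(x, direction))
    return [xi - proj * di for xi, di in zip(x, direction)]

# Toy activations: coordinate 0 carries "refusal", coordinate 1 carries shared topic content.
harmful = [[2.0, 1.0], [4.0, 1.0]]
harmless = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
cleaned = ablate([5.0, 2.0], direction)
print(direction, cleaned)
```

If the harmless set were topically unrelated, its mean would differ in the second coordinate as well, and the resulting direction would mix topic content into the refusal signal.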
552
+
553
+ ## References
554
+
555
+ Abliterix builds on the following research:
556
+
557
+ - **Abliteration**: Arditi, A., et al. (2024). [Refusal in Language Models Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717). *NeurIPS 2024*.
558
+ - **Projected Abliteration**: grimjim (2025). [Projected Abliteration](https://huggingface.co/blog/grimjim/projected-abliteration). Norm-preserving biprojection for refusal removal.
559
+ - **COSMIC**: Siu, V., et al. (2025). [COSMIC: Generalized Refusal Direction Identification in LLM Activations](https://arxiv.org/abs/2506.00085). *ACL 2025 Findings*.
560
+ - **Angular Steering**: Vu, H. M. & Nguyen, T. M. (2025). [Angular Steering: Behavior Control via Rotation in Activation Space](https://arxiv.org/abs/2510.26243). *NeurIPS 2025 Spotlight*.
561
+ - **Selective Steering**: (2026). [Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection](https://arxiv.org/abs/2601.19375).
562
+ - **Surgical Refusal Ablation**: Cristofano, A. (2026). [Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning](https://arxiv.org/abs/2601.08489).
563
+ - **Spherical Steering**: (2026). [Spherical Steering: Geometry-Aware Activation Rotation for Language Models](https://arxiv.org/abs/2602.08169).
564
+ - **Steering Vector Fields**: (2026). [Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models](https://arxiv.org/abs/2602.01654).
565
+ - **Optimal Transport**: (2026). [Efficient Refusal Ablation in LLM through Optimal Transport](https://arxiv.org/abs/2603.04355).
566
+ - **Multi-Direction Refusal**: (2026). [There Is More to Refusal in Large Language Models than a Single Direction](https://arxiv.org/abs/2602.02132).
567
+
568
+ <details>
569
+ <summary>Classic references</summary>
570
+
571
+ - **Abliteration (original)**: Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). [Refusal in Language Models Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717). *NeurIPS 2024*.
572
+ - **Representation Engineering**: Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). [Representation Engineering: A Top-Down Approach to AI Transparency](https://arxiv.org/abs/2310.01405). *arXiv:2310.01405*.
573
+ - **LoRA**: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685). *ICLR 2022*.
574
+ - **Optuna**: Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). [Optuna: A Next-generation Hyperparameter Optimization Framework](https://arxiv.org/abs/1907.10902). *KDD 2019*.
575
+ - **TPE**: Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization). *NeurIPS 2011*.
576
+ - **PaCMAP**: Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). [Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization](https://jmlr.org/papers/v22/20-1061.html). *JMLR*, 22, 1–73.
577
+
578
+ </details>
579
+
580
+ <details>
581
+ <summary>BibTeX</summary>
582
+
583
+ ```bibtex
584
+ @inproceedings{arditi2024refusal,
585
+ title = {Refusal in Language Models Is Mediated by a Single Direction},
586
+ author = {Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
587
+ booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
588
+ year = {2024},
589
+ url = {https://arxiv.org/abs/2406.11717}
590
+ }
591
+
592
+ @article{zou2023representation,
593
+ title = {Representation Engineering: A Top-Down Approach to AI Transparency},
594
+ author = {Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. Zico and Hendrycks, Dan},
595
+ journal = {arXiv preprint arXiv:2310.01405},
596
+ year = {2023},
597
+ url = {https://arxiv.org/abs/2310.01405}
598
+ }
599
+
600
+ @inproceedings{hu2022lora,
601
+ title = {{LoRA}: Low-Rank Adaptation of Large Language Models},
602
+ author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
603
+ booktitle = {International Conference on Learning Representations (ICLR)},
604
+ year = {2022},
605
+ url = {https://arxiv.org/abs/2106.09685}
606
+ }
607
+
608
+ @inproceedings{akiba2019optuna,
609
+ title = {Optuna: A Next-generation Hyperparameter Optimization Framework},
610
+ author = {Akiba, Takuya and Sano, Shotaro and Yanase, Toshihiko and Ohta, Takeru and Koyama, Masanori},
611
+ booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
612
+ pages = {2623--2631},
613
+ year = {2019},
614
+ url = {https://arxiv.org/abs/1907.10902}
615
+ }
616
+
617
+ @inproceedings{bergstra2011algorithms,
618
+ title = {Algorithms for Hyper-Parameter Optimization},
619
+ author = {Bergstra, James and Bardenet, R{\'e}mi and Bengio, Yoshua and K{\'e}gl, Bal{\'a}zs},
620
+ booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
621
+ pages = {2546--2554},
622
+ year = {2011},
623
+ url = {https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization}
624
+ }
625
+
626
+ @article{cristofano2026sra,
627
+ title = {Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning},
628
+ author = {Cristofano, Andrea},
629
+ journal = {arXiv preprint arXiv:2601.08489},
630
+ year = {2026},
631
+ url = {https://arxiv.org/abs/2601.08489}
632
+ }
633
+
634
+ @article{spherical2026,
635
+ title = {Spherical Steering: Geometry-Aware Activation Rotation for Language Models},
636
+ journal = {arXiv preprint arXiv:2602.08169},
637
+ year = {2026},
638
+ url = {https://arxiv.org/abs/2602.08169}
639
+ }
640
+
641
+ @article{svf2026,
642
+ title = {Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models},
643
+ journal = {arXiv preprint arXiv:2602.01654},
644
+ year = {2026},
645
+ url = {https://arxiv.org/abs/2602.01654}
646
+ }
647
+
648
+ @article{selective2026,
649
+ title = {Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection},
650
+ journal = {arXiv preprint arXiv:2601.19375},
651
+ year = {2026},
652
+ url = {https://arxiv.org/abs/2601.19375}
653
+ }
654
+
655
+ @inproceedings{siu2025cosmic,
656
+ title = {{COSMIC}: Generalized Refusal Direction Identification in {LLM} Activations},
657
+ author = {Siu, Vincent and others},
658
+ booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
659
+ year = {2025},
660
+ url = {https://arxiv.org/abs/2506.00085}
661
+ }
662
+
663
+ @inproceedings{vu2025angular,
664
+ title = {Angular Steering: Behavior Control via Rotation in Activation Space},
665
+ author = {Vu, Hieu M. and Nguyen, Tan M.},
666
+ booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
667
+ year = {2025},
668
+ note = {Spotlight},
669
+ url = {https://arxiv.org/abs/2510.26243}
670
+ }
671
+
672
+ @article{wang2021pacmap,
673
+ title = {Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization},
674
+ author = {Wang, Yingfan and Huang, Haiyang and Rudin, Cynthia and Shaposhnik, Yaron},
675
+ journal = {Journal of Machine Learning Research},
676
+ volume = {22},
677
+ pages = {1--73},
678
+ year = {2021},
679
+ url = {https://jmlr.org/papers/v22/20-1061.html}
680
+ }
681
+ ```
682
+
683
+ </details>
684
+
685
+
686
+ ## Citation
687
+
688
+ ```bibtex
689
+ @software{abliterix,
690
+ author = {Wu, Wangzhang},
691
+ title = {Abliterix: Automated LLM Abliteration},
692
+ year = {2026},
693
+ url = {https://github.com/wuwangzhang1216/abliterix}
694
+ }
695
+ ```
696
+
697
+
698
+ ## Acknowledgments
699
+
700
+ Abliterix is a **derivative work** of [Heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel Weidmann ([@p-e-w](https://github.com/p-e-w)), licensed under [AGPL-3.0-or-later](https://www.gnu.org/licenses/agpl-3.0.html). The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.
701
+
702
+ All modifications are Copyright (C) 2026 Wangzhang Wu and are released under the same AGPL-3.0-or-later license. See [NOTICE](NOTICE) for details.
703
+
704
+ ```bibtex
705
+ @misc{heretic,
706
+ author = {Weidmann, Philipp Emanuel},
707
+ title = {Heretic: Fully automatic censorship removal for language models},
708
+ year = {2025},
709
+ publisher = {GitHub},
710
+ journal = {GitHub repository},
711
+ howpublished = {\url{https://github.com/p-e-w/heretic}}
712
+ }
713
+ ```
714
+
715
+
716
+ ## Contributing
717
+
718
+ Contributions are welcome! Please open an issue to discuss your idea before submitting a pull request.
719
+
720
+ 1. Fork the repository
721
+ 2. Create a feature branch (`git checkout -b feature/your-feature`)
722
+ 3. Commit your changes
723
+ 4. Push to your fork and open a pull request
724
+
725
+ All contributions are released under the [AGPL-3.0-or-later](LICENSE) license.
726
+
727
+
728
+ ## License
729
+
730
+ Abliterix is a derivative work of [Heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel Weidmann, licensed under the [GNU Affero General Public License v3.0 or later](LICENSE).
731
+
732
+ Original work Copyright (C) 2025 Philipp Emanuel Weidmann
733
+ Modified work Copyright (C) 2026 Wangzhang Wu