safety-compass 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (28) hide show
  1. safety_compass-0.1.0/.gitignore +24 -0
  2. safety_compass-0.1.0/LICENSE +21 -0
  3. safety_compass-0.1.0/PKG-INFO +374 -0
  4. safety_compass-0.1.0/README.md +334 -0
  5. safety_compass-0.1.0/pyproject.toml +87 -0
  6. safety_compass-0.1.0/results/README.md +21 -0
  7. safety_compass-0.1.0/scripts/analyze_behavior.py +164 -0
  8. safety_compass-0.1.0/scripts/analyze_experiments.py +256 -0
  9. safety_compass-0.1.0/scripts/extract_directions.py +203 -0
  10. safety_compass-0.1.0/scripts/prepare_contrastive_pairs.py +89 -0
  11. safety_compass-0.1.0/scripts/run_monitored_finetune.py +324 -0
  12. safety_compass-0.1.0/src/safety_compass/__init__.py +74 -0
  13. safety_compass-0.1.0/src/safety_compass/_scripts/__init__.py +0 -0
  14. safety_compass-0.1.0/src/safety_compass/behavioral.py +418 -0
  15. safety_compass-0.1.0/src/safety_compass/callback.py +78 -0
  16. safety_compass-0.1.0/src/safety_compass/cli.py +33 -0
  17. safety_compass-0.1.0/src/safety_compass/concept.py +288 -0
  18. safety_compass-0.1.0/src/safety_compass/config.py +216 -0
  19. safety_compass-0.1.0/src/safety_compass/data_sources/__init__.py +71 -0
  20. safety_compass-0.1.0/src/safety_compass/data_sources/deception.py +46 -0
  21. safety_compass-0.1.0/src/safety_compass/data_sources/refusal.py +55 -0
  22. safety_compass-0.1.0/src/safety_compass/data_sources/sycophancy.py +85 -0
  23. safety_compass-0.1.0/src/safety_compass/formatters.py +90 -0
  24. safety_compass-0.1.0/src/safety_compass/logger.py +74 -0
  25. safety_compass-0.1.0/src/safety_compass/monitor.py +342 -0
  26. safety_compass-0.1.0/src/safety_compass/py.typed +0 -0
  27. safety_compass-0.1.0/src/safety_compass/utils.py +129 -0
  28. safety_compass-0.1.0/src/safety_compass/viz.py +410 -0
@@ -0,0 +1,24 @@
1
+ !.github/
2
+ docs/
3
+ results/*
4
+ !results/.gitkeep
5
+ !results/README.md
6
+ !results/phase3/
7
+ !results/phase3/**
8
+ !results/phase4/
9
+ !results/phase4/**
10
+ results/phase4/exp*/phase4_exp*_adapter/
11
+ *.pt
12
+ *.pth
13
+ *.bin
14
+ *.safetensors
15
+ __pycache__/
16
+ *.pyc
17
+ *.egg-info/
18
+ dist/
19
+ build/
20
+ .venv/
21
+ .env
22
+ *.log
23
+ !results/**/*.log
24
+ .DS_Store
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Ayesha Imran, Muhammad Aaliyan
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,374 @@
1
+ Metadata-Version: 2.4
2
+ Name: safety-compass
3
+ Version: 0.1.0
4
+ Summary: Monitor safety-relevant concept directions during LLM fine-tuning
5
+ Project-URL: Homepage, https://github.com/Ayesha-Imr/safety-compass
6
+ Project-URL: Repository, https://github.com/Ayesha-Imr/safety-compass
7
+ Project-URL: Issues, https://github.com/Ayesha-Imr/safety-compass/issues
8
+ Author-email: Ayesha Imran <ayesha.i1505@gmail.com>, Muhammad Aaliyan <aaliyan1230@gmail.com>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: alignment,concept-directions,fine-tuning,interpretability,llm,monitoring,safety
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Python: >=3.9
22
+ Requires-Dist: matplotlib>=3.7
23
+ Requires-Dist: numpy>=1.24
24
+ Requires-Dist: pyyaml>=6.0
25
+ Requires-Dist: scikit-learn>=1.3
26
+ Requires-Dist: torch>=2.0
27
+ Requires-Dist: tqdm>=4.66
28
+ Requires-Dist: transformers>=4.40
29
+ Provides-Extra: data
30
+ Requires-Dist: datasets>=2.19; extra == 'data'
31
+ Requires-Dist: huggingface-hub>=0.23; extra == 'data'
32
+ Provides-Extra: dev
33
+ Requires-Dist: pytest>=7.0; extra == 'dev'
34
+ Requires-Dist: ruff>=0.3; extra == 'dev'
35
+ Provides-Extra: gpu
36
+ Requires-Dist: accelerate>=0.29; extra == 'gpu'
37
+ Requires-Dist: bitsandbytes>=0.43; extra == 'gpu'
38
+ Requires-Dist: peft>=0.10; extra == 'gpu'
39
+ Description-Content-Type: text/markdown
40
+
41
+ # Safety Compass
42
+
43
+ A Python toolkit that monitors how safety-relevant concept directions evolve inside a language model's activation space during fine-tuning.
44
+
45
+ Safety Compass uses [difference-in-means (DiM)](https://arxiv.org/abs/2310.01405) extraction to find directions in a model's hidden states that separate safety-relevant behaviors (e.g., "refuses harmful requests" vs. "complies with harmful requests"). It then tracks how those directions drift during any HuggingFace fine-tuning run, producing structured logs of geometric and functional degradation metrics at configurable intervals.
46
+
47
+ **Core research question:** During fine-tuning, do safety-relevant concept directions erode uniformly, or is there a consistent hierarchy of fragility?
48
+
49
+ ## Key Findings
50
+
51
+ We monitored three safety concepts -- refusal, sycophancy, and deception -- across three benign fine-tuning datasets (Alpaca, Dolly, Code Alpaca) on Qwen3-8B. The fragility hierarchy is consistent across all datasets:
52
+
53
+ **Cosine similarity to baseline direction** (1.0 = unchanged, 0.0 = completely different):
54
+
55
+ ![Cosine similarity drift during fine-tuning on Alpaca](results/phase3/exp1/cosine_drift.png)
56
+
57
+ *Refusal (blue) drops to ~0.35 within 50 steps. Sycophancy (orange) drifts moderately. Deception (green) barely moves. Dashed line = 0.95 significance threshold.*
58
+
59
+ All directions start at **1.0** before fine-tuning. The table shows how far each direction drifted during training (lowest point reached &rarr; where it settled at the end):
60
+
61
+ | Dataset | Refusal | Sycophancy | Deception |
62
+ |---------|---------|------------|-----------|
63
+ | Alpaca | 1.0 &rarr; **0.353** &rarr; 0.378 | 1.0 &rarr; **0.687** &rarr; 0.689 | 1.0 &rarr; **0.985** &rarr; 0.985 |
64
+ | Dolly | 1.0 &rarr; **0.369** &rarr; 0.439 | 1.0 &rarr; **0.644** &rarr; 0.662 | 1.0 &rarr; **0.963** &rarr; 0.967 |
65
+ | Code Alpaca | 1.0 &rarr; **0.338** &rarr; 0.352 | 1.0 &rarr; **0.762** &rarr; 0.786 | 1.0 &rarr; **0.996** &rarr; 0.997 |
66
+
67
+ *Format: start &rarr; **min** &rarr; final. Refusal drops to ~0.35 (65% rotation) within just 50 training steps, then partially recovers. Deception barely moves at all.*
68
+
69
+ **Behavioral validation** confirms that geometric drift predicts observable behavior change:
70
+
71
+ | Dataset | Concept | Behavior Change |
72
+ |---------|---------|-----------------|
73
+ | Alpaca | Refusal | Refused 25% fewer harmful requests after fine-tuning |
74
+ | Dolly | Sycophancy | Agreed with 30% more false premises |
75
+ | All 3 | Deception | Modest behavioral change despite geometric stability |
76
+
77
+ ![Geometric drift vs behavioral change](results/phase4/analysis/drift_vs_behavior_plot.png)
78
+
79
+ *Each point is one (dataset, concept) pair. Lower cosine (more drift) correlates with larger behavioral degradation. Refusal points cluster at the left with the most drift and behavior change; deception stays near 1.0.*
80
+
81
+ The refusal direction is consistently the most fragile safety concept, drifting significantly even during benign (non-adversarial) fine-tuning. This suggests refusal behavior is the first safety property at risk during any fine-tuning run.
82
+
83
+ ## Installation
84
+
85
+ ```bash
86
+ # Core (extraction + monitoring)
87
+ pip install safety-compass
88
+
89
+ # With GPU support (4-bit quantization, LoRA, accelerate)
90
+ pip install "safety-compass[gpu]"
91
+
92
+ # With data generation (HuggingFace datasets for contrastive pair creation)
93
+ pip install "safety-compass[data]"
94
+
95
+ # Everything
96
+ pip install "safety-compass[gpu,data,viz,dev]"
97
+ ```
98
+
99
+ <details>
100
+ <summary>Development install (from source)</summary>
101
+
102
+ ```bash
103
+ git clone https://github.com/Ayesha-Imr/safety-compass.git
104
+ cd safety-compass
105
+ pip install -e ".[dev]"
106
+ ```
107
+ </details>
108
+
109
+ ## Compatibility
110
+
111
+ **Fine-tuning methods:** Safety Compass works with any fine-tuning approach that uses the HuggingFace `Trainer` -- QLoRA, LoRA, full fine-tuning, or any other method. The callback only reads the model's hidden states at measurement time; it doesn't care how the weights are being updated.
112
+
113
+ **Models:** Any HuggingFace causal language model (`AutoModelForCausalLM`) that supports `output_hidden_states=True` and has a tokenizer with `apply_chat_template`. This covers most modern chat/instruct models (Qwen, Llama, Mistral, Gemma, etc.). You just need a model config YAML specifying `num_layers` and `hidden_dim` -- see `configs/models/` for examples.
114
+
115
+ **Hardware:** Extraction runs forward passes on contrastive pairs (~60 prompts), so it needs enough memory to hold the model + a small batch of activations. Our experiments used a Kaggle T4 (15GB VRAM) with 4-bit quantized Qwen3-8B. Smaller models or larger GPUs work without quantization.
116
+
117
+ ## Quickstart
118
+
119
+ Adding safety monitoring to an existing HuggingFace training script takes three steps:
120
+
121
+ ```python
122
+ from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
123
+ from safety_compass import SafetyCompassMonitor, SafetyCompassCallback
124
+
125
+ # Load your model and tokenizer as usual
126
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", ...)
127
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
128
+
129
+ # Step 1: Create a monitor from an experiment config.
130
+ # This loads concept definitions (which safety behaviors to track),
131
+ # model metadata (layer count, hidden dim), and monitoring settings.
132
+ monitor = SafetyCompassMonitor.from_config(
133
+ model=model,
134
+ tokenizer=tokenizer,
135
+ experiment_config="configs/experiments/alpaca_qlora.yaml",
136
+ )
137
+
138
+ # Step 2: Attach the callback to your Trainer.
139
+ # The callback extracts concept directions before training (baseline),
140
+ # then re-extracts and compares every `measure_every_n_steps` steps.
141
+ callback = SafetyCompassCallback(
142
+ monitor=monitor,
143
+ measure_every_n_steps=50,
144
+ log_file="drift_log.csv",
145
+ )
146
+
147
+ trainer = Trainer(
148
+ model=model,
149
+ args=TrainingArguments(...),
150
+ train_dataset=dataset,
151
+ callbacks=[callback],
152
+ )
153
+ trainer.train()
154
+
155
+ # Step 3: Results are written to drift_log.csv as training progresses.
156
+ # Each row contains: step, concept, cosine_to_baseline, auroc_fixed, auroc_current, ...
157
+ ```
158
+
159
+ For a complete end-to-end example including model loading with quantization and LoRA setup, see [`scripts/run_monitored_finetune.py`](scripts/run_monitored_finetune.py).
160
+
161
+ ## What It Measures
162
+
163
+ Every `measure_every_n_steps` steps, the callback re-extracts concept directions from the current model state and computes:
164
+
165
+ | Metric | What It Tells You |
166
+ |--------|-------------------|
167
+ | `cosine_to_baseline` | How much the direction has rotated from its pre-training position. Below 0.95 = meaningful drift. |
168
+ | `auroc_fixed` | Can the *original* baseline direction still classify held-out contrastive pairs? Tracks functional degradation. |
169
+ | `auroc_current` | Can a *freshly extracted* direction still classify? Should stay high if the concept is still linearly separable. |
170
+ | `direction_norm` | Magnitude of the raw difference-in-means vector. Large changes may indicate representational reorganization. |
171
+ | `cross_*_cosine` | Pairwise cosine between different concept directions. Rising values indicate concepts are becoming entangled. |
172
+
173
+ ![Metric heatmap across training steps](results/phase3/exp1/metric_heatmap.png)
174
+
175
+ *Example output from an Alpaca fine-tuning run. Each row is a metric for one concept; columns are training steps. Red indicates degradation from baseline.*
176
+
177
+ ## How It Works
178
+
179
+ ```
180
+ Contrastive Pairs Difference-in-Means Baseline Direction
181
+ (positive vs. --> Extract activation diff --> (unit vector at
182
+ negative examples) at specified layer best separating layer)
183
+ |
184
+ v
185
+ Training loop Periodic re-extraction Drift metrics
186
+ (your fine-tuning) --> every N steps, extract --> cosine similarity,
187
+ current direction AUROC on held-out pairs
188
+ ```
189
+
190
+ 1. **Before training**: The monitor extracts baseline directions using contrastive pairs -- matched prompts that differ only in the safety-relevant behavior. For example, for refusal: harmful requests vs. harmless requests with identical system prompts.
191
+
192
+ 2. **During training**: The callback periodically re-extracts directions from the current model state and compares them to the baselines.
193
+
194
+ 3. **Output**: A CSV log with one row per (step, concept) pair, plus optional W&B logging.
195
+
196
+ **Two pairing strategies are built in:**
197
+
198
+ - **[Arditi et al.](https://arxiv.org/abs/2406.11717)** (used for refusal): Same system prompt, different user queries. Isolates the model's response to harmful vs. harmless content.
199
+ - **[CAA (Panickssery et al.)](https://arxiv.org/abs/2312.06681)** (used for sycophancy, deception): Different system prompts, same user query. Isolates the effect of behavioral instructions.
200
+
201
+ ## Configuration
202
+
203
+ Safety Compass uses three layers of YAML configuration:
204
+
205
+ ### Experiment Config
206
+
207
+ The top-level config that ties everything together:
208
+
209
+ ```yaml
210
+ # configs/experiments/alpaca_qlora.yaml
211
+ seed: 42
212
+ model_config_file: configs/models/qwen3-8b.yaml
213
+
214
+ concepts:
215
+ - name: refusal
216
+ config_file: configs/concepts/refusal.yaml
217
+ best_layer: 31 # layer where this concept is most separable
218
+ - name: sycophancy
219
+ config_file: configs/concepts/sycophancy.yaml
220
+ best_layer: 18
221
+
222
+ monitor:
223
+ measure_every_n_steps: 50
224
+ include_cross_concept_cosines: true
225
+ output_csv: drift_log.csv
226
+
227
+ dataset:
228
+ name: tatsu-lab/alpaca
229
+ subset_size: 5000
230
+ max_seq_length: 512
231
+
232
+ # QLoRA and training hyperparameters (used by the fine-tuning script)
233
+ qlora:
234
+ r: 16
235
+ alpha: 32
236
+ target_modules: [q_proj, k_proj, v_proj, o_proj]
237
+
238
+ training:
239
+ num_train_epochs: 3
240
+ learning_rate: 0.0002
241
+ fp16: true
242
+ gradient_checkpointing: true
243
+ ```
244
+
245
+ ### Concept Config
246
+
247
+ Defines a single safety concept and its contrastive data:
248
+
249
+ ```yaml
250
+ # configs/concepts/refusal.yaml
251
+ name: refusal
252
+ pairing_strategy: arditi # or "caa"
253
+ contrastive_pairs_file: data/contrastive_pairs/refusal.jsonl
254
+ min_auroc: 0.80 # validation threshold for direction quality
255
+ ```
256
+
257
+ ### Model Config
258
+
259
+ Model-specific parameters for extraction:
260
+
261
+ ```yaml
262
+ # configs/models/qwen3-8b.yaml
263
+ model_name: Qwen/Qwen3-8B
264
+ num_layers: 36
265
+ hidden_dim: 4096
266
+ extraction_batch_size: 4
267
+ extraction_dtype: float16
268
+ quantization: nf4
269
+ ```
270
+
271
+ ## Adding Custom Concepts
272
+
273
+ You can monitor any concept that can be expressed as a contrast between two behaviors:
274
+
275
+ **1. Create contrastive pairs** as a JSONL file in `data/contrastive_pairs/`. Each line needs fields matching your pairing strategy:
276
+
277
+ For **[arditi](https://arxiv.org/abs/2406.11717)** (same system prompt, different queries):
278
+ ```json
279
+ {"system": "You are helpful.", "positive_query": "How do I bake bread?", "negative_query": "How do I pick a lock?", "split": "train"}
280
+ ```
281
+
282
+ For **[caa](https://arxiv.org/abs/2312.06681)** (different system prompts, same query):
283
+ ```json
284
+ {"user_query": "Is the earth flat?", "positive_system": "Be honest even if it's unpopular.", "negative_system": "Always agree with the user.", "split": "train"}
285
+ ```
286
+
287
+ Aim for 60 pairs (40 train, 20 val).
288
+
289
+ **2. Create a concept config** YAML in `configs/concepts/`:
290
+
291
+ ```yaml
292
+ name: my_concept
293
+ pairing_strategy: caa
294
+ contrastive_pairs_file: data/contrastive_pairs/my_concept.jsonl
295
+ min_auroc: 0.80
296
+ ```
297
+
298
+ **3. Validate** by running direction extraction:
299
+
300
+ ```bash
301
+ safety-compass-extract \
302
+ --experiment-config your_experiment.yaml \
303
+ --output-dir results/baselines/ \
304
+ --concepts my_concept
305
+ ```
306
+
307
+ A passing AUROC (>= 0.80) confirms the concept is linearly separable at the chosen layer.
308
+
309
+ **4. Register a data source** (optional): To auto-generate pairs from HuggingFace datasets, add a module to `src/safety_compass/data_sources/` following the existing pattern, then run `safety-compass-pairs`.
310
+
311
+ ## CLI & Scripts
312
+
313
+ After `pip install`, three CLI commands are available:
314
+
315
+ | Command | Purpose |
316
+ |---------|---------|
317
+ | `safety-compass-extract` | Extract baseline directions, validate AUROCs, save artifacts |
318
+ | `safety-compass-finetune` | Run a complete config-driven monitored fine-tuning session |
319
+ | `safety-compass-pairs` | Generate contrastive pairs from the data source registry |
320
+
321
+ Additional analysis scripts (run from the repo):
322
+
323
+ | Script | Purpose |
324
+ |--------|---------|
325
+ | `scripts/analyze_experiments.py` | Compare drift results across multiple experiments |
326
+ | `scripts/analyze_behavior.py` | Analyze behavioral evaluation results and plot drift-vs-behavior |
327
+
328
+ ## Interpreting Results
329
+
330
+ After a monitored fine-tuning run, `drift_log.csv` contains per-step measurements for each concept. Here's what the patterns mean:
331
+
332
+ - **Cosine drops below 0.95**: The concept's internal representation has shifted meaningfully. Below 0.70 indicates major geometric reorganization.
333
+ - **AUROC (fixed) stays high while cosine drops**: The concept has rotated in activation space but the original direction still classifies correctly. The model has reorganized but not lost the distinction.
334
+ - **AUROC (fixed) drops**: The original direction no longer separates positive/negative examples. This indicates functional degradation -- the safety behavior may be genuinely weakened.
335
+ - **Cross-concept cosines increase**: Different safety concepts are becoming more aligned (entangled), which may indicate broader representational collapse.
336
+ - **Direction norm changes significantly**: Large norm changes (>20%) alongside cosine drift suggest the concept is being actively reorganized, not just gradually rotating.
337
+
338
+ ## Contributing
339
+
340
+ Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.
341
+
342
+ Safety Compass is designed to be extensible. There are four main ways to contribute:
343
+
344
+ 1. **Add a new safety concept** -- create contrastive pairs + config YAML, validate AUROC >= 0.80
345
+ 2. **Add a new model config** -- test extraction on a new model architecture
346
+ 3. **Add a dataset formatter** -- enable monitoring during fine-tuning on new datasets
347
+ 4. **Run new experiments** -- test the fragility hierarchy on different models or training regimes
348
+
349
+ **Concept ideas we'd love to see investigated:**
350
+
351
+ - Toxicity
352
+ - Power-seeking
353
+ - Hallucination / faithfulness
354
+ - Corrigibility
355
+ - Bias (gender, racial)
356
+ - Instruction-following
357
+ - Helpfulness
358
+
359
+ Each concept is a self-contained contribution: create the contrastive pairs, validate on 1-2 models, submit the YAML + JSONL.
360
+
361
+ ## Citation
362
+
363
+ ```bibtex
364
+ @software{imran2025safetycompass,
365
+ title = {Safety Compass: Monitoring Safety-Relevant Concept Directions During LLM Fine-Tuning},
366
+ author = {Imran, Ayesha and Aaliyan, Muhammad},
367
+ url = {https://github.com/Ayesha-Imr/safety-compass},
368
+ year = {2025},
369
+ }
370
+ ```
371
+
372
+ ## License
373
+
374
+ [MIT](LICENSE)