lmprobe-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- lmprobe-0.1.0/.claude/settings.local.json +11 -0
- lmprobe-0.1.0/.gitignore +55 -0
- lmprobe-0.1.0/CLAUDE.md +286 -0
- lmprobe-0.1.0/LICENSE +21 -0
- lmprobe-0.1.0/PKG-INFO +215 -0
- lmprobe-0.1.0/README.md +183 -0
- lmprobe-0.1.0/docs/design/001-api-philosophy.md +245 -0
- lmprobe-0.1.0/docs/design/002-pooling-strategies.md +165 -0
- lmprobe-0.1.0/docs/design/003-layer-selection.md +183 -0
- lmprobe-0.1.0/docs/design/004-classifier-interface.md +280 -0
- lmprobe-0.1.0/pyproject.toml +70 -0
- lmprobe-0.1.0/src/lmprobe/__init__.py +23 -0
- lmprobe-0.1.0/src/lmprobe/cache.py +206 -0
- lmprobe-0.1.0/src/lmprobe/classifiers.py +147 -0
- lmprobe-0.1.0/src/lmprobe/extraction.py +293 -0
- lmprobe-0.1.0/src/lmprobe/pooling.py +299 -0
- lmprobe-0.1.0/src/lmprobe/probe.py +420 -0
- lmprobe-0.1.0/tests/conftest.py +18 -0
- lmprobe-0.1.0/tests/test_readme_example.py +162 -0
- lmprobe-0.1.0/tests/test_remote.py +135 -0
lmprobe-0.1.0/.gitignore
ADDED
@@ -0,0 +1,55 @@
# Virtual environments
.venv/
venv/
ENV/
env/

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/

# IDEs
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# lmprobe cache
.cache/

# Jupyter
.ipynb_checkpoints/

# Distribution
*.tar.gz
*.whl
lmprobe-0.1.0/CLAUDE.md
ADDED
@@ -0,0 +1,286 @@
# CLAUDE.md - lmprobe Development Guide

## Project Overview

`lmprobe` is a Python library for training linear probes on language model activations. The primary use case is AI safety monitoring — detecting deception, harmful intent, and other safety-relevant properties by analyzing model internals.

## Design Philosophy

- **sklearn-inspired API**: Users familiar with scikit-learn should feel at home. Use `fit()`, `predict()`, `predict_proba()`, `score()`.
- **Contrastive-first**: The primary training paradigm is contrastive (positive vs negative prompts), following the Representation Engineering literature.
- **Sensible defaults, full control**: Simple cases should be one-liners; complex cases should be fully configurable.
- **Separation of concerns**: Activation extraction, pooling, and classification are distinct stages that can be configured independently.

## Key Design Decisions

Detailed design documents live in `docs/design/`. Read these before making changes to core APIs:

| Doc | Topic | Read when... |
|-----|-------|--------------|
| [001-api-philosophy.md](docs/design/001-api-philosophy.md) | Core API design | Changing public interfaces |
| [002-pooling-strategies.md](docs/design/002-pooling-strategies.md) | Train vs inference pooling | Working on activation aggregation |
| [003-layer-selection.md](docs/design/003-layer-selection.md) | Layer indexing conventions | Working on layer extraction |
| [004-classifier-interface.md](docs/design/004-classifier-interface.md) | Classifier abstraction | Adding new classifier types |

## Architecture

```
User Prompts
      │
      ▼
┌─────────────────┐
│ ActivationCache │ ← Extracts & caches activations from LLM
└────────┬────────┘
         │ raw activations: (batch, seq_len, layers, hidden_dim)
         ▼
┌─────────────────┐
│ PoolingStrategy │ ← Aggregates across tokens (train vs inference can differ)
└────────┬────────┘
         │ pooled: (batch, layers, hidden_dim) or (batch, hidden_dim)
         ▼
┌─────────────────┐
│   Classifier    │ ← sklearn-compatible estimator
└────────┬────────┘
         │
         ▼
Predictions/Probabilities
```
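
The shape annotations in the diagram map directly onto tensor operations. A minimal sketch of the flow, assuming `last_token` pooling and multi-layer concatenation (the names here are illustrative, not the package's actual internals):

```python
import torch

def pipeline_sketch(raw: torch.Tensor) -> torch.Tensor:
    """Illustrate the shapes flowing through the three stages.

    raw: (batch, seq_len, layers, hidden_dim) activations, as produced
    by the extraction stage.
    """
    batch, seq_len, layers, hidden_dim = raw.shape

    # Pooling stage: aggregate across tokens, e.g. "last_token".
    pooled = raw[:, -1, :, :]              # (batch, layers, hidden_dim)

    # Multi-layer handling: concatenate along the hidden dimension.
    features = pooled.reshape(batch, layers * hidden_dim)

    return features                        # fed to the sklearn classifier

# e.g. 2 prompts, 10 tokens, 3 layers, hidden size 16
features = pipeline_sketch(torch.randn(2, 10, 3, 16))
assert features.shape == (2, 48)
```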

## Package Structure

```
lmprobe/
├── src/
│   └── lmprobe/
│       ├── __init__.py
│       ├── probe.py               # LinearProbe main class
│       ├── extraction.py          # Activation extraction via nnsight
│       ├── pooling.py             # Pooling strategies
│       ├── cache.py               # Activation caching
│       └── classifiers.py         # Built-in classifier factory
├── tests/
│   ├── conftest.py                # Shared fixtures (tiny model)
│   ├── test_readme_example.py     # NORTH STAR: README example must pass
│   ├── test_probe.py
│   ├── test_extraction.py
│   ├── test_pooling.py
│   └── test_cache.py
├── docs/
│   └── design/                    # Design decision documents
├── pyproject.toml
└── CLAUDE.md
```

## Critical Design Decisions

These decisions are **mandatory** and must not be changed without explicit discussion:

| Decision | Value | Rationale |
|----------|-------|-----------|
| Multi-layer handling | Always concatenate | Simple, captures cross-layer patterns |
| Activation caching | Always enabled | Remote/LLM inference is expensive |
| Package layout | `src/lmprobe/` | Standard Python packaging |
| nnsight for extraction | Required dependency | Supports remote execution |
| API key | `NNSIGHT_API_KEY` env var | Standard credential handling |
| Cache location | `~/.cache/lmprobe/` (or `LMPROBE_CACHE_DIR`) | XDG-style default |
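
The last row implies a small amount of resolution logic. A sketch of the XDG-style fallback, with `resolve_cache_dir` as a hypothetical helper name (the real logic lives in `cache.py`):

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    """Hypothetical helper: honor LMPROBE_CACHE_DIR, else ~/.cache/lmprobe/."""
    override = os.environ.get("LMPROBE_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "lmprobe"
```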

## Code Conventions

- Type hints on all public functions
- Docstrings in NumPy format
- Tests mirror source structure: `src/lmprobe/probe.py` → `tests/test_probe.py`
- Use `ruff` for linting, `black` for formatting

## Testing

**All tests must use a real language model.** Use `stas/tiny-random-llama-2` — a tiny Llama model with random weights designed for functional testing.

```python
# tests/conftest.py
import pytest

TEST_MODEL = "stas/tiny-random-llama-2"

@pytest.fixture
def tiny_model():
    """Tiny random Llama model for testing."""
    return TEST_MODEL
```

**Test requirements:**
- Tests must run without GPU (CPU-only)
- Tests must not require `NNSIGHT_API_KEY` (use `remote=False`)
- Tests should be fast (the tiny model's weights are only a few MB)
- Integration tests verify the full pipeline: extraction → pooling → classification

### Remote/NDIF Testing (TODO)

**Status: NOT YET TESTED**

The `remote=True` functionality uses nnsight to connect to NDIF (National Deep Inference Fabric), a US national research initiative. Remote testing has not been performed due to:

1. **Geographic restriction**: NDIF restricts access to US-based users only
2. **API key requirement**: Requires the `NNSIGHT_API_KEY` environment variable

**What needs testing:**
- `LinearProbe(..., remote=True)` connects successfully
- `probe.fit(..., remote=True)` extracts activations from remote models
- `probe.predict(..., remote=False)` override works (train remote, predict local)
- Large models (e.g., `meta-llama/Llama-3.1-70B-Instruct`) work via remote
- Error handling when `NNSIGHT_API_KEY` is missing/invalid
- Cache behavior with remote extractions

**To test when US-based:**
```bash
export NNSIGHT_API_KEY="your-key"
pytest tests/test_remote.py -v  # (test file to be created)
```

**Known considerations:**
- Remote execution may have different tensor handling (proxies vs direct tensors)
- The `extraction.py` code handles both cases with a `hasattr(act, "value")` check (see the sketch below)
- Network latency may affect batch processing strategies
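
A minimal sketch of that dual-path check, where `act` stands for whatever object a trace yields (illustrative, not the exact `extraction.py` code):

```python
def to_tensor(act):
    """Normalize an nnsight activation to a concrete tensor.

    Remote execution can yield a proxy whose concrete tensor sits in
    `.value`; local execution may hand back the tensor directly.
    """
    return act.value if hasattr(act, "value") else act
```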

```python
# Example test
from lmprobe import LinearProbe

def test_fit_predict_roundtrip(tiny_model):
    probe = LinearProbe(
        model=tiny_model,
        layers=-1,
        remote=False,
        random_state=42,
    )
    probe.fit(["positive example"], ["negative example"])
    predictions = probe.predict(["test input"])
    assert predictions.shape == (1,)
```

## Test-Driven Development

**This project uses test-driven development (TDD).** Write tests BEFORE implementation.

### The North Star Test

The **north star test** is `tests/test_readme_example.py`. It runs the exact code from README.md's "Example Usage" section. This test defines what "done" looks like:

```python
# tests/test_readme_example.py
"""
North Star Test: The README example must run exactly as documented.

This test runs the exact code from README.md. If this test passes,
the library's public API is working as advertised.
"""

def test_readme_example_runs(tiny_model):
    """The README example code runs without error."""
    from lmprobe import LinearProbe

    positive_prompts = [
        "Who wants to go for a walk?",
        "My tail is wagging with delight.",
        "Fetch the ball!",
        "Good boy!",
        "Slobbering, chewing, growling, barking.",
    ]

    negative_prompts = [
        "Enjoys lounging in the sun beam all day.",
        "Purring, stalking, pouncing, scratching.",
        "Uses a litterbox, throws sand all over the room.",
        "Tail raised, back arched, eyes alert, whiskers forward.",
    ]

    probe = LinearProbe(
        model=tiny_model,       # Use tiny model instead of Llama for tests
        layers=-1,              # Last layer (tiny model has few layers)
        pooling="last_token",
        classifier="logistic_regression",
        device="cpu",
        remote=False,
        random_state=42,
    )

    probe.fit(positive_prompts, negative_prompts)

    test_prompts = [
        "Arf! Arf! Let's go outside!",
        "Knocking things off the counter for sport.",
    ]
    predictions = probe.predict(test_prompts)
    probabilities = probe.predict_proba(test_prompts)

    # Shape assertions (values may vary with random weights)
    assert predictions.shape == (2,)
    assert probabilities.shape == (2, 2)

    # Score method works
    accuracy = probe.score(test_prompts, [1, 0])
    assert 0.0 <= accuracy <= 1.0
```

### TDD Workflow

1. **Write a failing test first** — Define expected behavior before implementation
2. **Run the test, confirm it fails** — Ensures the test is actually testing something
3. **Implement minimal code to pass** — Don't over-engineer
4. **Refactor if needed** — Clean up while tests are green
5. **Repeat**

### Test Priority Order

When implementing, make tests pass in this order:

1. `test_readme_example.py` — The north star (full integration)
2. `test_probe.py` — LinearProbe unit tests
3. `test_extraction.py` — Activation extraction tests
4. `test_pooling.py` — Pooling strategy tests
5. `test_cache.py` — Caching tests

### Running Tests

```bash
# Run all tests
pytest

# Run north star test only
pytest tests/test_readme_example.py -v

# Run with coverage
pytest --cov=lmprobe
```

## Quick Reference

```python
from lmprobe import LinearProbe

probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                         # int | list[int] | "all" | "middle"
    pooling="last_token",              # or override with train_pooling / inference_pooling
    classifier="logistic_regression",  # str | sklearn estimator
    device="auto",
    remote=False,                      # True for nnsight remote execution
    random_state=42,                   # Propagates to classifier for reproducibility
)

probe.fit(positive_prompts, negative_prompts)
predictions = probe.predict(new_prompts)

# Override remote at call time
predictions = probe.predict(new_prompts, remote=True)
```

## Common Tasks

### Adding a new pooling strategy
1. Read `docs/design/002-pooling-strategies.md`
2. Add the strategy to `src/lmprobe/pooling.py`
3. Register it in the `POOLING_STRATEGIES` dict (see the sketch after this list)
4. Add tests in `tests/test_pooling.py`
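
A sketch of what steps 2 and 3 might look like; the strategy function's signature and shapes are assumptions, and only the `POOLING_STRATEGIES` registry name comes from the list above:

```python
# In src/lmprobe/pooling.py (signature and shapes are assumed)
import torch

def median_pooling(acts: torch.Tensor) -> torch.Tensor:
    """Example new strategy: median across the token axis.

    acts: (batch, seq_len, hidden_dim) -> (batch, hidden_dim)
    """
    return acts.median(dim=1).values

POOLING_STRATEGIES["median"] = median_pooling  # registry dict from step 3
```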

### Supporting a new model architecture
1. Check whether transformers' `AutoModel` handles it automatically
2. If not, add architecture-specific extraction in `src/lmprobe/extraction.py`
3. Document any quirks in `docs/models/`
lmprobe-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Toast

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
lmprobe-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,215 @@
Metadata-Version: 2.4
Name: lmprobe
Version: 0.1.0
Summary: Train linear probes on language model activations for AI safety monitoring
Project-URL: Homepage, https://github.com/toast/lmprobe
Project-URL: Documentation, https://github.com/toast/lmprobe#readme
Project-URL: Repository, https://github.com/toast/lmprobe
Author: Toast
License-Expression: MIT
License-File: LICENSE
Keywords: ai-safety,interpretability,language-models,machine-learning,nlp,probing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: nnsight>=0.3
Requires-Dist: numpy>=1.20
Requires-Dist: scikit-learn>=1.0
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.30
Provides-Extra: dev
Requires-Dist: black>=23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# `lmprobe` Language Model Probe Library

This library supports the use of language model "activations" or "latents" to build text classifiers. The intent is to help detect and reduce misuse of AI: for example, chemical, biological, radiological and nuclear (CBRN) weapons development, social engineering at scale, and the development of novel cybersecurity attack vectors.

## Linear and Simple Models for LLMs

"Linear probes" have emerged as an effective and practical way to monitor large language model activity.

### Background

First introduced by [Alain & Bengio (2016)](https://arxiv.org/abs/1610.01644) as "thermometers" for measuring what neural networks learn at each layer, linear probes have since been refined through work on [probe design and selectivity](https://nlp.stanford.edu/~johnhew/interpreting-probes.html) and validated by evidence supporting the [linear representation hypothesis](https://www.neelnanda.io/mechanistic-interpretability/othello). The [Representation Engineering](https://arxiv.org/abs/2310.01405) framework (Zou et al., 2023) demonstrated that probes can monitor safety-relevant properties like honesty and power-seeking. Recent AI safety research has shown promising results: Anthropic's work on [detecting sleeper agents](https://www.anthropic.com/research/probes-catch-sleeper-agents) achieved >99% AUROC using simple linear classifiers, and Apollo Research's [strategic deception detection](https://arxiv.org/abs/2502.03407) work demonstrates that probes trained on simple contrast pairs can generalize to realistic scenarios like insider trading concealment and sandbagging on safety evaluations.

### `lmprobe` Use Cases

The goal of `lmprobe` is to make text classifiers for language models easy to build, experiment with, and deploy during inference. While much of the research has focused on complex emergent risky behavior, this library is intended for simpler use cases, such as detecting misuse of an AI chatbot by humans.

### Compatibility

By default, `lmprobe` uses Hugging Face and `nnsight` to manage models and extract latents during inference. However, the library is structured to modularize and isolate these aspects so that (ideally) frontier AI labs can extend the library for internal use on their bespoke inference systems.

### Installation

```
pip install lmprobe
```

### Environment Setup

For remote execution (large models via nnsight/NDIF):

```bash
export NNSIGHT_API_KEY="your-api-key-here"
```

### Example Usage

---

```python
from lmprobe import LinearProbe

positive_prompts = [  # positive class: "dog" without saying "dog"
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
    "Good boy!",
    "Slobbering, chewing, growling, barking.",
]

negative_prompts = [  # negative class: "cat" without saying "cat"
    "Enjoys lounging in the sun beam all day.",
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert, whiskers forward.",
]

# Configure the probe
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                         # int, list[int], "all", or "middle"
    pooling="last_token",              # applies to both train and inference
    classifier="logistic_regression",  # or pass an sklearn estimator
    device="auto",
    remote=False,                      # True for nnsight remote execution
    random_state=42,                   # for reproducibility
)

# Fit using contrastive prompts
probe.fit(positive_prompts, negative_prompts)

# Predict on new examples
test_prompts = [
    "Arf! Arf! Let's go outside!",
    "Knocking things off the counter for sport.",
]
predictions = probe.predict(test_prompts)          # [1, 0]
probabilities = probe.predict_proba(test_prompts)  # [[0.12, 0.88], [0.91, 0.09]]

# Evaluate
accuracy = probe.score(test_prompts, [1, 0])

# Save/load for deployment
probe.save("dog_vs_cat_probe.pkl")
loaded_probe = LinearProbe.load("dog_vs_cat_probe.pkl")
```

---

## Remote Execution for Large Models

Use `remote=True` to run inference on large models via nnsight's remote servers:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-70B-Instruct",
    layers="middle",
    remote=True,  # Requires NNSIGHT_API_KEY
)

probe.fit(positive_prompts, negative_prompts)

# Override remote per-call (e.g., train remote, predict local)
predictions = probe.predict(new_prompts, remote=False)
```

---

## Multi-Layer Probing

When selecting multiple layers, activations are **concatenated** along the hidden dimension:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],  # 3 layers × 4096 dims = 12,288-dim input to classifier
)
```

---

## Advanced: Different Train vs Inference Pooling

For real-time monitoring, train on a stable representation but score every token:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",     # base strategy
    inference_pooling="all",  # override: return per-token scores
)

probe.fit(positive_prompts, negative_prompts)

# Returns (batch, seq_len) - one score per token
token_scores = probe.predict_proba(["Wagging my tail happily!"])
```

For "flag if ANY token triggers" detection:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",     # base strategy
    inference_pooling="max",  # override: max score across tokens
)
```
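
Conceptually, `inference_pooling="max"` scores every token with the trained probe and keeps the highest score, so a single suspicious token is enough to flag the prompt. A standalone sketch of that reduction in NumPy (not the package's internals):

```python
import numpy as np

def any_token_flag(per_token_scores: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a prompt if ANY token's probe score crosses the threshold.

    per_token_scores: (seq_len,) per-token probabilities for the positive class.
    """
    return float(per_token_scores.max()) >= threshold

# One high-scoring token mid-sequence is enough to flag the prompt.
assert any_token_flag(np.array([0.1, 0.2, 0.9, 0.3])) is True
```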

---

## Configuration Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `str` | *required* | HuggingFace model ID or local path |
| `layers` | `int \| list[int] \| "all" \| "middle"` | `"middle"` | Which residual stream layers to probe |
| `pooling` | `str \| callable` | `"last_token"` | Token aggregation (train & inference) |
| `train_pooling` | `str \| callable` | — | Override pooling for `fit()` only |
| `inference_pooling` | `str \| callable` | — | Override pooling for `predict()` only |
| `classifier` | `str \| sklearn estimator` | `"logistic_regression"` | Classification model |
| `device` | `str` | `"auto"` | `"auto"`, `"cuda:0"`, `"cpu"` |
| `remote` | `bool` | `False` | Use nnsight remote execution (requires `NNSIGHT_API_KEY`) |
| `random_state` | `int \| None` | `None` | Random seed for reproducibility (propagates to classifier) |

### Pooling Strategies

| Strategy | Training | Inference | Description |
|----------|:--------:|:---------:|-------------|
| `"last_token"` | ✓ | ✓ | Final token activation (default, matches RepE literature) |
| `"mean"` | ✓ | ✓ | Mean across all tokens |
| `"first_token"` | ✓ | ✓ | First token (e.g., `[CLS]`) |
| `"all"` | ✓ | ✓ | Each token independently |
| `"max"` | | ✓ | Max score across tokens |
| `"min"` | | ✓ | Min score across tokens |

### Pooling Collision Rules

Explicit parameters override the base `pooling` value:

```python
# pooling="mean", train_pooling="last_token" → train=last_token, inference=mean
# pooling="mean", inference_pooling="max"    → train=mean, inference=max
```
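
Each phase falls back to the base value only when its own override is absent. A one-function sketch of that precedence (`resolve_pooling` is a hypothetical helper, not part of the package):

```python
def resolve_pooling(pooling, train_pooling=None, inference_pooling=None):
    """Per-phase override wins; otherwise fall back to the base `pooling`."""
    return {
        "train": train_pooling or pooling,
        "inference": inference_pooling or pooling,
    }

assert resolve_pooling("mean", train_pooling="last_token") == {
    "train": "last_token",
    "inference": "mean",
}
```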