lmprobe-0.1.0.tar.gz

@@ -0,0 +1,11 @@
+ {
+   "permissions": {
+     "allow": [
+       "WebSearch",
+       "Bash(pip install:*)",
+       "Bash(python -m pytest:*)",
+       "Bash(python:*)",
+       "Bash(git add:*)"
+     ]
+   }
+ }
lmprobe-0.1.0/.gitignore ADDED
@@ -0,0 +1,55 @@
+ # Virtual environments
+ .venv/
+ venv/
+ ENV/
+ env/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Testing
+ .pytest_cache/
+ .coverage
+ htmlcov/
+ .tox/
+ .nox/
+
+ # IDEs
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # lmprobe cache
+ .cache/
+
+ # Jupyter
+ .ipynb_checkpoints/
+
+ # Distribution
+ *.tar.gz
+ *.whl
lmprobe-0.1.0/CLAUDE.md ADDED
@@ -0,0 +1,286 @@
+ # CLAUDE.md - lmprobe Development Guide
+
+ ## Project Overview
+
+ `lmprobe` is a Python library for training linear probes on language model activations. The primary use case is AI safety monitoring — detecting deception, harmful intent, and other safety-relevant properties by analyzing model internals.
+
+ ## Design Philosophy
+
+ - **sklearn-inspired API**: Users familiar with scikit-learn should feel at home. Use `fit()`, `predict()`, `predict_proba()`, `score()`.
+ - **Contrastive-first**: The primary training paradigm is contrastive (positive vs negative prompts), following the Representation Engineering literature.
+ - **Sensible defaults, full control**: Simple cases should be one-liners; complex cases should be fully configurable.
+ - **Separation of concerns**: Activation extraction, pooling, and classification are distinct stages that can be configured independently.
+
+ ## Key Design Decisions
+
+ Detailed design documents live in `docs/design/`. Read these before making changes to core APIs:
+
+ | Doc | Topic | Read when... |
+ |-----|-------|--------------|
+ | [001-api-philosophy.md](docs/design/001-api-philosophy.md) | Core API design | Changing public interfaces |
+ | [002-pooling-strategies.md](docs/design/002-pooling-strategies.md) | Train vs inference pooling | Working on activation aggregation |
+ | [003-layer-selection.md](docs/design/003-layer-selection.md) | Layer indexing conventions | Working on layer extraction |
+ | [004-classifier-interface.md](docs/design/004-classifier-interface.md) | Classifier abstraction | Adding new classifier types |
+
+ ## Architecture
+
+ ```
+ User Prompts
+       │
+       ▼
+ ┌─────────────────┐
+ │ ActivationCache │ ← Extracts & caches activations from LLM
+ └────────┬────────┘
+          │ raw activations: (batch, seq_len, layers, hidden_dim)
+          ▼
+ ┌─────────────────┐
+ │ PoolingStrategy │ ← Aggregates across tokens (train vs inference can differ)
+ └────────┬────────┘
+          │ pooled: (batch, layers, hidden_dim) or (batch, hidden_dim)
+          ▼
+ ┌─────────────────┐
+ │   Classifier    │ ← sklearn-compatible estimator
+ └────────┬────────┘
+          │
+          ▼
+ Predictions/Probabilities
+ ```
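+
+ A minimal numpy sketch of the shape flow (illustrative only; the real pipeline runs through the classes above, and `ActivationCache` pulls tensors from the model rather than generating them):
+
+ ```python
+ import numpy as np
+
+ batch, seq_len, layers, hidden = 2, 7, 3, 16
+ raw = np.random.randn(batch, seq_len, layers, hidden)  # ActivationCache output
+ pooled = raw[:, -1, :, :]                              # "last_token" pooling
+ features = pooled.reshape(batch, layers * hidden)      # concatenate layers
+ # features now feed an sklearn-style classifier: clf.fit(features, labels)
+ ```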
+
+ ## Package Structure
+
+ ```
+ lmprobe/
+ ├── src/
+ │   └── lmprobe/
+ │       ├── __init__.py
+ │       ├── probe.py           # LinearProbe main class
+ │       ├── extraction.py      # Activation extraction via nnsight
+ │       ├── pooling.py         # Pooling strategies
+ │       ├── cache.py           # Activation caching
+ │       └── classifiers.py     # Built-in classifier factory
+ ├── tests/
+ │   ├── conftest.py            # Shared fixtures (tiny model)
+ │   ├── test_readme_example.py # NORTH STAR: README example must pass
+ │   ├── test_probe.py
+ │   ├── test_extraction.py
+ │   ├── test_pooling.py
+ │   └── test_cache.py
+ ├── docs/
+ │   └── design/                # Design decision documents
+ ├── pyproject.toml
+ └── CLAUDE.md
+ ```
+
+ ## Critical Design Decisions
+
+ These decisions are **mandatory** and must not be changed without explicit discussion:
+
+ | Decision | Value | Rationale |
+ |----------|-------|-----------|
+ | Multi-layer handling | Always concatenate | Simple, captures cross-layer patterns |
+ | Activation caching | Always enabled | Remote/LLM inference is expensive |
+ | Package layout | `src/lmprobe/` | Standard Python packaging |
+ | nnsight for extraction | Required dependency | Supports remote execution |
+ | API key | `NNSIGHT_API_KEY` env var | Standard credential handling |
+ | Cache location | `~/.cache/lmprobe/` (or `LMPROBE_CACHE_DIR`) | XDG-style default |
+
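+ A sketch of the intended cache-location resolution (the helper name is hypothetical; only the env var and default path come from the table above):
+
+ ```python
+ import os
+ from pathlib import Path
+
+ def default_cache_dir() -> Path:
+     """LMPROBE_CACHE_DIR wins if set; otherwise fall back to ~/.cache/lmprobe/."""
+     env = os.environ.get("LMPROBE_CACHE_DIR")
+     return Path(env) if env else Path.home() / ".cache" / "lmprobe"
+ ```
+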
+ ## Code Conventions
+
+ - Type hints on all public functions
+ - Docstrings in NumPy format
+ - Tests mirror source structure: `src/lmprobe/probe.py` → `tests/test_probe.py`
+ - Use `ruff` for linting, `black` for formatting
+
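+ For example, a public function following these conventions (a hypothetical helper, not in the codebase):
+
+ ```python
+ import torch
+
+ def pool_last_token(acts: torch.Tensor) -> torch.Tensor:
+     """Select the final token's activation.
+
+     Parameters
+     ----------
+     acts : torch.Tensor
+         Activations of shape (batch, seq_len, hidden_dim).
+
+     Returns
+     -------
+     torch.Tensor
+         Pooled activations of shape (batch, hidden_dim).
+     """
+     return acts[:, -1, :]
+ ```
+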
+ ## Testing
+
+ **All tests must use a real language model.** Use `stas/tiny-random-llama-2` — a tiny Llama model with random weights designed for functional testing.
+
+ ```python
+ # tests/conftest.py
+ import pytest
+
+ TEST_MODEL = "stas/tiny-random-llama-2"
+
+ @pytest.fixture
+ def tiny_model():
+     """Tiny random Llama model for testing."""
+     return TEST_MODEL
+ ```
+
+ **Test requirements:**
+ - Tests must run without GPU (CPU-only)
+ - Tests must not require `NNSIGHT_API_KEY` (use `remote=False`)
+ - Tests should be fast (the tiny model's weights are only a few MB)
+ - Integration tests verify the full pipeline: extraction → pooling → classification
+
+ ### Remote/NDIF Testing (TODO)
+
+ **Status: NOT YET TESTED**
+
+ The `remote=True` functionality uses nnsight to connect to NDIF (National Deep Inference Fabric), a US national research initiative. Remote testing has not been performed due to:
+
+ 1. **Geographic restriction**: NDIF restricts access to US-based users only
+ 2. **API key requirement**: Requires the `NNSIGHT_API_KEY` environment variable
+
+ **What needs testing:**
+ - `LinearProbe(..., remote=True)` connects successfully
+ - `probe.fit(..., remote=True)` extracts activations from remote models
+ - `probe.predict(..., remote=False)` override works (train remote, predict local)
+ - Large models (e.g., `meta-llama/Llama-3.1-70B-Instruct`) work via remote
+ - Error handling when `NNSIGHT_API_KEY` is missing/invalid
+ - Cache behavior with remote extractions
+
+ **To test when US-based:**
+ ```bash
+ export NNSIGHT_API_KEY="your-key"
+ pytest tests/test_remote.py -v  # (test file to be created)
+ ```
+
+ **Known considerations:**
+ - Remote execution may have different tensor handling (proxies vs direct tensors)
+ - The `extraction.py` code handles both cases with a `hasattr(act, "value")` check (see the sketch below)
+ - Network latency may affect batch processing strategies
+
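+ The pattern in question, roughly (the function and variable names here are illustrative, not the actual `extraction.py` code):
+
+ ```python
+ def unwrap(act):
+     """nnsight may return a proxy wrapping the tensor, or the tensor itself."""
+     return act.value if hasattr(act, "value") else act
+ ```
+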
+ ```python
+ # Example test
+ from lmprobe import LinearProbe
+
+ def test_fit_predict_roundtrip(tiny_model):
+     probe = LinearProbe(
+         model=tiny_model,
+         layers=-1,
+         remote=False,
+         random_state=42,
+     )
+     probe.fit(["positive example"], ["negative example"])
+     predictions = probe.predict(["test input"])
+     assert predictions.shape == (1,)
+ ```
+
+ ## Test-Driven Development
+
+ **This project uses test-driven development (TDD).** Write tests BEFORE implementation.
+
+ ### The North Star Test
+
+ The **north star test** is `tests/test_readme_example.py`. It runs the exact code from README.md's "Example Usage" section. This test defines what "done" looks like:
+
+ ```python
+ # tests/test_readme_example.py
+ """
+ North Star Test: The README example must run exactly as documented.
+
+ This test runs the exact code from README.md. If this test passes,
+ the library's public API is working as advertised.
+ """
+
+ def test_readme_example_runs(tiny_model):
+     """The README example code runs without error."""
+     from lmprobe import LinearProbe
+
+     positive_prompts = [
+         "Who wants to go for a walk?",
+         "My tail is wagging with delight.",
+         "Fetch the ball!",
+         "Good boy!",
+         "Slobbering, chewing, growling, barking.",
+     ]
+
+     negative_prompts = [
+         "Enjoys lounging in the sun beam all day.",
+         "Purring, stalking, pouncing, scratching.",
+         "Uses a litterbox, throws sand all over the room.",
+         "Tail raised, back arched, eyes alert, whiskers forward.",
+     ]
+
+     probe = LinearProbe(
+         model=tiny_model,       # Use tiny model instead of Llama for tests
+         layers=-1,              # Last layer (tiny model has few layers)
+         pooling="last_token",
+         classifier="logistic_regression",
+         device="cpu",
+         remote=False,
+         random_state=42,
+     )
+
+     probe.fit(positive_prompts, negative_prompts)
+
+     test_prompts = [
+         "Arf! Arf! Let's go outside!",
+         "Knocking things off the counter for sport.",
+     ]
+     predictions = probe.predict(test_prompts)
+     probabilities = probe.predict_proba(test_prompts)
+
+     # Shape assertions (values may vary with random weights)
+     assert predictions.shape == (2,)
+     assert probabilities.shape == (2, 2)
+
+     # Score method works
+     accuracy = probe.score(test_prompts, [1, 0])
+     assert 0.0 <= accuracy <= 1.0
+ ```
+
+ ### TDD Workflow
+
+ 1. **Write a failing test first** — Define expected behavior before implementation
+ 2. **Run the test, confirm it fails** — Ensures the test is actually testing something
+ 3. **Implement minimal code to pass** — Don't over-engineer
+ 4. **Refactor if needed** — Clean up while tests are green
+ 5. **Repeat**
+
+ ### Test Priority Order
+
+ When implementing, make tests pass in this order:
+
+ 1. `test_readme_example.py` — The north star (full integration)
+ 2. `test_probe.py` — LinearProbe unit tests
+ 3. `test_extraction.py` — Activation extraction tests
+ 4. `test_pooling.py` — Pooling strategy tests
+ 5. `test_cache.py` — Caching tests
+
+ ### Running Tests
+
+ ```bash
+ # Run all tests
+ pytest
+
+ # Run north star test only
+ pytest tests/test_readme_example.py -v
+
+ # Run with coverage
+ pytest --cov=lmprobe
+ ```
+
+ ## Quick Reference
+
+ ```python
+ from lmprobe import LinearProbe
+
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=16,                         # int | list[int] | "all" | "middle"
+     pooling="last_token",              # or override with train_pooling / inference_pooling
+     classifier="logistic_regression",  # str | sklearn estimator
+     device="auto",
+     remote=False,                      # True for nnsight remote execution
+     random_state=42,                   # Propagates to classifier for reproducibility
+ )
+
+ probe.fit(positive_prompts, negative_prompts)
+ predictions = probe.predict(new_prompts)
+
+ # Override remote at call time
+ predictions = probe.predict(new_prompts, remote=True)
+ ```
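+
+ Since `classifier` accepts any sklearn estimator (per the comment above), a custom configuration can be passed directly. A sketch; exactly which estimator attributes are required is defined in `docs/design/004-classifier-interface.md`:
+
+ ```python
+ from lmprobe import LinearProbe
+ from sklearn.linear_model import LogisticRegression
+
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=16,
+     classifier=LogisticRegression(C=0.1, max_iter=1000),  # stronger regularization
+ )
+ ```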
+
+ ## Common Tasks
+
+ ### Adding a new pooling strategy
+ 1. Read `docs/design/002-pooling-strategies.md`
+ 2. Add the strategy to `src/lmprobe/pooling.py`
+ 3. Register it in the `POOLING_STRATEGIES` dict
+ 4. Add tests in `tests/test_pooling.py` (see the sketch below)
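+
+ A sketch of steps 2 and 3 (the strategy name and signature are hypothetical; `POOLING_STRATEGIES` and the module path come from the steps above):
+
+ ```python
+ import torch
+
+ from lmprobe.pooling import POOLING_STRATEGIES
+
+ def mean_of_last_k(acts: torch.Tensor, k: int = 4) -> torch.Tensor:
+     """Average the final k token activations: (batch, seq_len, hidden) -> (batch, hidden)."""
+     return acts[:, -k:, :].mean(dim=1)
+
+ POOLING_STRATEGIES["mean_of_last_k"] = mean_of_last_k
+ ```
+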
+ ### Supporting a new model architecture
+ 1. Check whether transformers' `AutoModel` handles it automatically (see the check below)
+ 2. If not, add architecture-specific extraction in `src/lmprobe/extraction.py`
+ 3. Document any quirks in `docs/models/`
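+
+ For step 1, a quick sanity check (the model ID shown is the test model; any HF ID works):
+
+ ```python
+ from transformers import AutoModel
+
+ # If this loads without error, no architecture-specific extraction is needed.
+ model = AutoModel.from_pretrained("stas/tiny-random-llama-2")
+ ```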
lmprobe-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Toast
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
lmprobe-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,215 @@
+ Metadata-Version: 2.4
+ Name: lmprobe
+ Version: 0.1.0
+ Summary: Train linear probes on language model activations for AI safety monitoring
+ Project-URL: Homepage, https://github.com/toast/lmprobe
+ Project-URL: Documentation, https://github.com/toast/lmprobe#readme
+ Project-URL: Repository, https://github.com/toast/lmprobe
+ Author: Toast
+ License-Expression: MIT
+ License-File: LICENSE
+ Keywords: ai-safety,interpretability,language-models,machine-learning,nlp,probing
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.10
+ Requires-Dist: nnsight>=0.3
+ Requires-Dist: numpy>=1.20
+ Requires-Dist: scikit-learn>=1.0
+ Requires-Dist: torch>=2.0
+ Requires-Dist: transformers>=4.30
+ Provides-Extra: dev
+ Requires-Dist: black>=23.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0; extra == 'dev'
+ Requires-Dist: ruff>=0.1; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # `lmprobe` Language Model Probe Library
+ This library supports the use of language model "activations" or "latents" to build text classifiers. The intent is to help detect and reduce misuse of AI: for example, chemical, biological, radiological and nuclear (CBRN) weapons development, social engineering at scale, and the development of novel cybersecurity attack vectors.
+
+ ## Linear and Simple Models for LLMs
+ "Linear probes" have emerged as an effective and practical way to monitor large language model activity.
+
+ ### Background
+
+ First introduced by [Alain & Bengio (2016)](https://arxiv.org/abs/1610.01644) as "thermometers" for measuring what neural networks learn at each layer, linear probes have since been refined through work on [probe design and selectivity](https://nlp.stanford.edu/~johnhew/interpreting-probes.html) and validated by evidence supporting the [linear representation hypothesis](https://www.neelnanda.io/mechanistic-interpretability/othello). The [Representation Engineering](https://arxiv.org/abs/2310.01405) framework (Zou et al., 2023) demonstrated that probes can monitor safety-relevant properties like honesty and power-seeking. Recent AI safety research has shown promising results: Anthropic's work on [detecting sleeper agents](https://www.anthropic.com/research/probes-catch-sleeper-agents) achieved >99% AUROC using simple linear classifiers, and Apollo Research's [strategic deception detection](https://arxiv.org/abs/2502.03407) work demonstrates that probes trained on simple contrast pairs can generalize to realistic scenarios like insider trading concealment and sandbagging on safety evaluations.
+
+ ### `lmprobe` Use Cases
+
+ The goal of `lmprobe` is to make text classifiers for language models easy to build, experiment with, and deploy during inference. While much of the research has focused on complex emergent risky behavior, this library is intended for simpler use cases, such as detecting misuse of an AI chatbot by humans.
+
+ ### Compatibility
+
+ By default, `lmprobe` uses Hugging Face and `nnsight` to manage models and extract latents during inference. However, the library is structured to modularize and isolate these aspects so that (ideally) frontier AI labs can extend it for internal use on their bespoke inference systems.
+
+ ### Installation
+
+ ```bash
+ pip install lmprobe
+ ```
+
+ ### Environment Setup
+
+ For remote execution (large models via nnsight/NDIF):
+
+ ```bash
+ export NNSIGHT_API_KEY="your-api-key-here"
+ ```
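+
+ A minimal pre-flight check you might add before a remote run (illustrative; the library's own error handling for a missing key is listed as untested in CLAUDE.md):
+
+ ```python
+ import os
+
+ if not os.environ.get("NNSIGHT_API_KEY"):
+     raise RuntimeError("Set NNSIGHT_API_KEY before using remote=True")
+ ```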
+
+ ### Example Usage
+
+ ---
+
+ ```python
+ from lmprobe import LinearProbe
+
+ positive_prompts = [  # positive class: "dog" without saying "dog"
+     "Who wants to go for a walk?",
+     "My tail is wagging with delight.",
+     "Fetch the ball!",
+     "Good boy!",
+     "Slobbering, chewing, growling, barking.",
+ ]
+
+ negative_prompts = [  # negative class: "cat" without saying "cat"
+     "Enjoys lounging in the sun beam all day.",
+     "Purring, stalking, pouncing, scratching.",
+     "Uses a litterbox, throws sand all over the room.",
+     "Tail raised, back arched, eyes alert, whiskers forward.",
+ ]
+
+ # Configure the probe
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=16,                         # int, list[int], or "all"
+     pooling="last_token",              # applies to both train and inference
+     classifier="logistic_regression",  # or pass an sklearn estimator
+     device="auto",
+     remote=False,                      # True for nnsight remote execution
+     random_state=42,                   # for reproducibility
+ )
+
+ # Fit using contrastive prompts
+ probe.fit(positive_prompts, negative_prompts)
+
+ # Predict on new examples
+ test_prompts = [
+     "Arf! Arf! Let's go outside!",
+     "Knocking things off the counter for sport.",
+ ]
+ predictions = probe.predict(test_prompts)          # [1, 0]
+ probabilities = probe.predict_proba(test_prompts)  # [[0.12, 0.88], [0.91, 0.09]]
+
+ # Evaluate
+ accuracy = probe.score(test_prompts, [1, 0])
+
+ # Save/load for deployment
+ probe.save("dog_vs_cat_probe.pkl")
+ loaded_probe = LinearProbe.load("dog_vs_cat_probe.pkl")
+ ```
+
+ ---
+
+ ## Remote Execution for Large Models
+
+ Use `remote=True` to run inference on large models via nnsight's remote servers:
+
+ ```python
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-70B-Instruct",
+     layers="middle",
+     remote=True,  # Requires NNSIGHT_API_KEY
+ )
+
+ probe.fit(positive_prompts, negative_prompts)
+
+ # Override remote per-call (e.g., train remote, predict local)
+ predictions = probe.predict(new_prompts, remote=False)
+ ```
+
+ ---
+
+ ## Multi-Layer Probing
+
+ When selecting multiple layers, activations are **concatenated** along the hidden dimension:
+
+ ```python
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=[14, 15, 16],  # 3 layers × 4096 dims = 12,288-dim input to classifier
+ )
+ ```
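+
+ Equivalently, in numpy terms (an illustration of the concatenation rule, not library code):
+
+ ```python
+ import numpy as np
+
+ per_layer = [np.random.randn(8, 4096) for _ in range(3)]  # (batch, hidden) per layer
+ features = np.concatenate(per_layer, axis=-1)             # shape (8, 12288)
+ ```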
+
+ ---
+
+ ## Advanced: Different Train vs Inference Pooling
+
+ For real-time monitoring, train on a stable representation but score every token:
+
+ ```python
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=16,
+     pooling="last_token",     # base strategy
+     inference_pooling="all",  # override: return per-token scores
+ )
+
+ probe.fit(positive_prompts, negative_prompts)
+
+ # Returns (batch, seq_len) - one score per token
+ token_scores = probe.predict_proba(["Wagging my tail happily!"])
+ ```
+
+ For "flag if ANY token triggers" detection:
+
+ ```python
+ probe = LinearProbe(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     layers=16,
+     pooling="last_token",     # base strategy
+     inference_pooling="max",  # override: max score across tokens
+ )
+ ```
+
+ ---
+
+ ## Configuration Reference
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `model` | `str` | *required* | HuggingFace model ID or local path |
+ | `layers` | `int \| list[int] \| "all" \| "middle"` | `"middle"` | Which residual stream layers to probe |
+ | `pooling` | `str \| callable` | `"last_token"` | Token aggregation (train & inference) |
+ | `train_pooling` | `str \| callable` | — | Override pooling for `fit()` only |
+ | `inference_pooling` | `str \| callable` | — | Override pooling for `predict()` only |
+ | `classifier` | `str \| sklearn estimator` | `"logistic_regression"` | Classification model |
+ | `device` | `str` | `"auto"` | `"auto"`, `"cuda:0"`, `"cpu"` |
+ | `remote` | `bool` | `False` | Use nnsight remote execution (requires `NNSIGHT_API_KEY`) |
+ | `random_state` | `int \| None` | `None` | Random seed for reproducibility (propagates to classifier) |
+
+ ### Pooling Strategies
+
+ | Strategy | Training | Inference | Description |
+ |----------|:--------:|:---------:|-------------|
+ | `"last_token"` | ✓ | ✓ | Final token activation (default, matches RepE literature) |
+ | `"mean"` | ✓ | ✓ | Mean across all tokens |
+ | `"first_token"` | ✓ | ✓ | First token (e.g., `[CLS]`) |
+ | `"all"` | ✓ | ✓ | Each token independently |
+ | `"max"` | | ✓ | Max score across tokens |
+ | `"min"` | | ✓ | Min score across tokens |
+
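+ Roughly, the first four strategies pool *activations* before classification, while `"max"`/`"min"` pool per-token *scores* after classification, which is why they are inference-only. An illustrative sketch (not library code):
+
+ ```python
+ import numpy as np
+
+ acts = np.random.randn(2, 7, 16)     # (batch, seq_len, hidden_dim)
+ last = acts[:, -1, :]                # "last_token" -> (batch, hidden_dim)
+ mean = acts.mean(axis=1)             # "mean"       -> (batch, hidden_dim)
+
+ token_scores = np.random.rand(2, 7)  # "all": one score per token
+ flagged = token_scores.max(axis=1)   # "max": flag if ANY token triggers
+ ```
+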
+ ### Pooling Collision Rules
+
+ Explicit parameters override the base `pooling` value:
+
+ ```python
+ # pooling="mean", train_pooling="last_token" → train=last_token, inference=mean
+ # pooling="mean", inference_pooling="max"    → train=mean, inference=max
+ ```
+
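+ The precedence can be stated in one line (a hypothetical helper illustrating the rule above, not the library's actual code):
+
+ ```python
+ def resolve_pooling(pooling, train_pooling=None, inference_pooling=None):
+     """Return (train, inference) pooling, with explicit overrides winning."""
+     return (train_pooling or pooling, inference_pooling or pooling)
+ ```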