difflayers 0.1.0__tar.gz → 0.1.1__tar.gz
- difflayers-0.1.1/PKG-INFO +765 -0
- difflayers-0.1.1/README.md +731 -0
- difflayers-0.1.1/difflayers.egg-info/PKG-INFO +765 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/pyproject.toml +1 -1
- {difflayers-0.1.0 → difflayers-0.1.1}/setup.py +1 -1
- difflayers-0.1.0/PKG-INFO +0 -210
- difflayers-0.1.0/README.md +0 -176
- difflayers-0.1.0/difflayers.egg-info/PKG-INFO +0 -210
- {difflayers-0.1.0 → difflayers-0.1.1}/LICENSE +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/__init__.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/activation.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/attention_operator.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/auxiliary/__init__.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/auxiliary/data.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/diffused_attention.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/diffusion.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/dynamics_engine.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/functional.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/__init__.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/build_graph.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/builder.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/laplacian.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/laplacian_builder.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/transformer.py +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/SOURCES.txt +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/dependency_links.txt +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/not-zip-safe +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/requires.txt +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/top_level.txt +0 -0
- {difflayers-0.1.0 → difflayers-0.1.1}/setup.cfg +0 -0
|
@@ -0,0 +1,765 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: difflayers
|
|
3
|
+
Version: 0.1.1
|
|
4
|
+
Summary: difflayers: Diffusion-Augmented Hopfield Networks
|
|
5
|
+
Home-page: https://github.com/hopfileds/hopfield-layers
|
|
6
|
+
Author: Priyam Ghosh
|
|
7
|
+
Author-email: Priyam Ghosh <priyamghosh9753@gmail.com>
|
|
8
|
+
License: BSD
|
|
9
|
+
Project-URL: Homepage, https://github.com/hopfileds/hopfield-layers
|
|
10
|
+
Project-URL: Repository, https://github.com/hopfileds/hopfield-layers
|
|
11
|
+
Project-URL: Bug Tracker, https://github.com/hopfileds/hopfield-layers/issues
|
|
12
|
+
Keywords: hopfield networks,deep learning,attention,diffusion,graph
|
|
13
|
+
Classifier: Development Status :: 3 - Alpha
|
|
14
|
+
Classifier: Intended Audience :: Science/Research
|
|
15
|
+
Classifier: License :: OSI Approved :: BSD License
|
|
16
|
+
Classifier: Programming Language :: Python :: 3
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.8
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
22
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
23
|
+
Classifier: Operating System :: OS Independent
|
|
24
|
+
Requires-Python: >=3.8
|
|
25
|
+
Description-Content-Type: text/markdown
|
|
26
|
+
License-File: LICENSE
|
|
27
|
+
Requires-Dist: torch>=1.9.0
|
|
28
|
+
Requires-Dist: numpy>=1.20.0
|
|
29
|
+
Requires-Dist: scipy>=1.7.0
|
|
30
|
+
Dynamic: author
|
|
31
|
+
Dynamic: home-page
|
|
32
|
+
Dynamic: license-file
|
|
33
|
+
Dynamic: requires-python
|
|
34
|
+
|
|
35
|
+
# difflayers — Diffusion-Augmented Hopfield Networks
|
|
36
|
+
|
|
37
|
+
<p align="center">
|
|
38
|
+
<a href="https://pypi.org/project/difflayers/"><img src="https://img.shields.io/pypi/v/difflayers?color=blue&label=PyPI" alt="PyPI"></a>
|
|
39
|
+
<a href="https://pypi.org/project/difflayers/"><img src="https://img.shields.io/pypi/pyversions/difflayers" alt="Python Versions"></a>
|
|
40
|
+
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-%E2%89%A51.9-orange" alt="PyTorch"></a>
|
|
41
|
+
<a href="LICENSE"><img src="https://img.shields.io/badge/license-BSD-green" alt="License"></a>
|
|
42
|
+
</p>
|
|
43
|
+
|
|
44
|
+
**difflayers** is a PyTorch library that extends modern continuous Hopfield networks with graph-based Laplacian diffusion, turning associative memory layers into structure-aware retrievers. At its core sits the **Diffusion-Augmented Hopfield Network (DAHN)** — a drop-in upgrade to standard Hopfield attention that pre-smooths patterns over a learned kNN graph before every association step, suppressing spurious retrievals and sharpening metastable energy minima.
|
|
45
|
+
|
|
46
|
+
The library ships the full original Hopfield layer suite (`Hopfield`, `HopfieldPooling`, `HopfieldLayer`) plus the DAHN extensions (`DiffusedHopfield`, four diffusion operators, a graph-construction pipeline, and a dynamical memory engine) — all under a single, clean API.
|
|
47
|
+
|
|
48
|
+
---
|
|
49
|
+
|
|
50
|
+
## Table of Contents
|
|
51
|
+
|
|
52
|
+
1. [Background](#background)
|
|
53
|
+
2. [What DAHN Adds](#what-dahn-adds)
|
|
54
|
+
3. [Architecture Overview](#architecture-overview)
|
|
55
|
+
4. [Installation](#installation)
|
|
56
|
+
5. [Quick Start](#quick-start)
|
|
57
|
+
6. [Core Modules](#core-modules)
|
|
58
|
+
- [Hopfield](#hopfield)
|
|
59
|
+
- [HopfieldPooling](#hopfieldpooling)
|
|
60
|
+
- [HopfieldLayer](#hopfieldlayer)
|
|
61
|
+
- [DiffusedHopfield](#diffusedhopfield)
|
|
62
|
+
7. [Diffusion Modes](#diffusion-modes)
|
|
63
|
+
8. [DiffusionConfig Reference](#diffusionconfig-reference)
|
|
64
|
+
9. [Graph Pipeline](#graph-pipeline)
|
|
65
|
+
10. [Advanced Usage](#advanced-usage)
|
|
66
|
+
11. [Transformer Integration](#transformer-integration)
|
|
67
|
+
12. [Example Notebooks](#example-notebooks)
|
|
68
|
+
13. [Running Experiments](#running-experiments)
|
|
69
|
+
14. [API Reference](#api-reference)
|
|
70
|
+
15. [Complexity Guide](#complexity-guide)
|
|
71
|
+
16. [Background Paper](#background-paper)
|
|
72
|
+
17. [Disclaimer](#disclaimer)
|
|
73
|
+
18. [License](#license)
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## Background
|
|
78
|
+
|
|
79
|
+
Modern Hopfield networks with continuous states were introduced in [Ramsauer et al. (2020)](https://arxiv.org/abs/2008.02217), where it was shown that the transformer **attention mechanism is exactly the update rule of a continuous Hopfield network**. This re-framing unlocks exponential storage capacity, single-step convergence, and a clean energy-based interpretation of deep attention.
|
|
80
|
+
|
|
81
|
+
The energy function of a continuous Hopfield network is:
|
|
82
|
+
|
|
83
|
+
$$E = -\text{lse}(\beta, X \xi) + \frac{1}{2}\xi^\top \xi + \frac{1}{\beta}\log N + \frac{1}{2}M^2$$
|
|
84
|
+
|
|
85
|
+
where $\text{lse}(\beta, z) = \frac{1}{\beta}\log\sum_i e^{\beta z_i}$ is the log-sum-exp, $\xi$ is the state pattern (query), $X$ holds the stored patterns (keys) as rows, $\beta$ is the inverse temperature, $N$ is the number of stored patterns, and $M = \max_i \lVert x_i \rVert$ is the largest stored-pattern norm.
|
|
86
|
+
|
|
87
|
+
Energy minimization via one synchronous update yields the familiar softmax attention:
|
|
88
|
+
|
|
89
|
+
$$\xi^{\text{new}} = X^\top \text{softmax}(\beta X \xi)$$
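To make the update rule concrete, here is a minimal, dependency-free sketch of the energy and one synchronous update on a toy memory of three one-hot patterns. It is not the library's implementation: the helper names (`dot`, `softmax`, `energy`, `hopfield_update`) are illustrative assumptions, and the stored patterns are kept as rows so that $X\xi$ is a vector of $N$ scores.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    # Subtract the max for numerical stability.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def energy(X, xi, beta):
    # E = -lse(beta, X xi) + 0.5 xi^T xi + log(N)/beta + 0.5 M^2
    scores = [beta * dot(x, xi) for x in X]
    m = max(scores)
    lse = (m + math.log(sum(math.exp(s - m) for s in scores))) / beta
    M = max(math.sqrt(dot(x, x)) for x in X)  # largest stored-pattern norm
    return -lse + 0.5 * dot(xi, xi) + math.log(len(X)) / beta + 0.5 * M * M

def hopfield_update(X, xi, beta):
    # xi_new = X^T softmax(beta * X xi)
    p = softmax([beta * dot(x, xi) for x in X])
    return [sum(p[i] * X[i][j] for i in range(len(X))) for j in range(len(xi))]

X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy stored patterns (rows)
xi = [0.9, 0.1, 0.0]                                      # noisy query near pattern 0
xi_new = hopfield_update(X, xi, beta=8.0)
print(xi_new, energy(X, xi, 8.0), energy(X, xi_new, 8.0))
```

On this toy memory a single update moves the query most of the way to the first stored pattern, and the energy of the updated state is no higher than that of the query.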
|
|
90
|
+
|
|
91
|
+
The network can store **exponentially many patterns** (in the dimension $d$), converges in **one update step**, and has exponentially small retrieval errors — properties not shared by classical binary Hopfield networks.
|
|
92
|
+
|
|
93
|
+
Three classes of fixed points (energy minima) arise naturally:
|
|
94
|
+
|
|
95
|
+
| Fixed-point type | Regime | Behaviour |
|---|---|---|
| **Global averaging** | Low $\beta$ | Retrieves a weighted average of all patterns |
| **Metastable states** | Medium $\beta$ | Retrieves a subset of patterns, analogous to multi-head attention |
| **Single-pattern storage** | High $\beta$ | Sharply retrieves one stored pattern |
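The three regimes in the table can be seen directly in the softmax weights. The sketch below is a toy illustration in plain Python: the similarity scores and the three $\beta$ values are made-up assumptions chosen to show each regime, not values used by the library.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

# Similarities of one query to four stored patterns: two near-ties, two far away.
scores = [1.0, 0.9, 0.1, 0.0]

# Low, medium and high inverse temperature.
weights = {beta: softmax([beta * s for s in scores]) for beta in (0.1, 5.0, 100.0)}
for beta, p in weights.items():
    print(beta, [round(v, 3) for v in p], round(entropy(p), 3))
```

At the low setting the weights are close to uniform, at the medium setting almost all mass sits on the two near-tied patterns, and at the high setting one pattern takes essentially all of it.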
|
|
100
|
+
|
|
101
|
+
---
|
|
102
|
+
|
|
103
|
+
## What DAHN Adds
|
|
104
|
+
|
|
105
|
+
Standard Hopfield attention treats every stored pattern as equally reachable from any query. In high-noise or high-density memory scenarios, the attention distribution spreads over spurious neighbours, degrading retrieval accuracy.
|
|
106
|
+
|
|
107
|
+
**DAHN** addresses this by building a $k$-nearest-neighbour graph over the pattern set and pre-smoothing patterns with the graph Laplacian before every association step. The dynamics loop is:
|
|
108
|
+
|
|
109
|
+
$$\text{for } t = 1, \ldots, T:$$
|
|
110
|
+
$$K' = \underbrace{(I - \eta L)}_{\text{diffusion}} K, \quad Q' = (I - \eta L) Q \quad \text{(optional)}$$
|
|
111
|
+
$$\text{output} = \text{softmax}(\beta \, Q' {K'}^\top) \, V$$
|
|
112
|
+
|
|
113
|
+
where $L$ is the (optionally symmetric-normalized) graph Laplacian of the kNN similarity graph over $K$, and $\eta$ is the diffusion strength. This smoothing:
|
|
114
|
+
|
|
115
|
+
- **Clusters** related patterns before retrieval, reducing inter-cluster interference
|
|
116
|
+
- **Sharpens** metastable energy minima, improving single-pattern retrieval accuracy under noise
|
|
117
|
+
- **Preserves** the Hopfield energy landscape (diffusion decreases the energy, never creates new spurious minima)
|
|
118
|
+
- **Scales** gracefully: with `FactoredDiffusion` and sparse adjacency the full loop costs $O(kNd)$ per step
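The diffuse-then-attend idea above can be sketched in plain Python on a three-pattern memory. This is an illustrative sketch under stated assumptions, not the library's code: the Laplacian is written by hand rather than built from a kNN graph, only the keys are diffused, the `T` diffusion steps are all applied before a single attention pass, and the helper names (`matmul`, `diffuse`, `attend`, `dirichlet`) are hypothetical.

```python
import math

def matmul(M, K):
    # (N x N) @ (N x d) on nested lists.
    return [[sum(M[i][n] * K[n][j] for n in range(len(K))) for j in range(len(K[0]))]
            for i in range(len(M))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def diffuse(K, L, eta):
    # K' = (I - eta * L) K
    N = len(K)
    D = [[(1.0 if i == j else 0.0) - eta * L[i][j] for j in range(N)] for i in range(N)]
    return matmul(D, K)

def attend(Q, K, V, beta):
    # softmax(beta * Q K^T) V, one query row at a time.
    out = []
    for q in Q:
        p = softmax([beta * sum(a * b for a, b in zip(q, k)) for k in K])
        out.append([sum(p[n] * V[n][j] for n in range(len(V))) for j in range(len(V[0]))])
    return out

def dirichlet(K, L):
    # trace(K^T L K): small when neighbouring patterns are similar.
    LK = matmul(L, K)
    return sum(K[i][j] * LK[i][j] for i in range(len(K)) for j in range(len(K[0])))

# Hand-written graph: patterns 0 and 1 are neighbours, pattern 2 is isolated.
L = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # L = D - A
K = [[1.0, 0.2], [0.8, 0.0], [0.0, 1.0]]
V = K
Q = [[0.9, 0.1]]

K_s = K
for _ in range(3):  # T = 3 diffusion steps
    K_s = diffuse(K_s, L, eta=0.1)
out = attend(Q, K_s, V, beta=4.0)
print(K_s, out, dirichlet(K, L), dirichlet(K_s, L))
```

Diffusion pulls the two neighbouring keys toward each other while leaving their sum and the isolated key unchanged, which is the clustering effect described in the first bullet.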
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
## Architecture Overview
|
|
123
|
+
|
|
124
|
+
```
|
|
125
|
+
difflayers/
|
|
126
|
+
│
|
|
127
|
+
├── __init__.py # Public API — 18 exported names
|
|
128
|
+
│
|
|
129
|
+
├── activation.py # HopfieldCore (multi-head Hopfield attention kernel)
|
|
130
|
+
├── functional.py # hopfield_core_forward (low-level functional API)
|
|
131
|
+
├── transformer.py # HopfieldEncoderLayer, HopfieldDecoderLayer
|
|
132
|
+
│
|
|
133
|
+
├── diffused_attention.py # DiffusedHopfield ← DAHN entry point
|
|
134
|
+
├── diffusion.py # DiffusionOperator ABC + 4 concrete strategies
|
|
135
|
+
│ # SimpleDiffusion, IterativeDiffusion,
|
|
136
|
+
│ # SpectralDiffusion, FactoredDiffusion
|
|
137
|
+
├── dynamics_engine.py # DiffusionConfig, GraphCache, DynamicsEngine,
|
|
138
|
+
│ # EnergyTracker
|
|
139
|
+
├── attention_operator.py # AttentionOperator (dense / graph-constrained)
|
|
140
|
+
│
|
|
141
|
+
├── graph/
|
|
142
|
+
│ ├── build_graph.py # build_similarity_matrix, build_knn_graph
|
|
143
|
+
│ ├── laplacian.py # compute_laplacian, compute_normalized_laplacian
|
|
144
|
+
│ ├── builder.py # GraphBuilder (fluent graph-construction API)
|
|
145
|
+
│ └── laplacian_builder.py # LaplacianBuilder
|
|
146
|
+
│
|
|
147
|
+
└── auxiliary/
|
|
148
|
+
└── data.py # LookupTableDataset
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
---
|
|
152
|
+
|
|
153
|
+
## Installation
|
|
154
|
+
|
|
155
|
+
### From PyPI (recommended)
|
|
156
|
+
|
|
157
|
+
```bash
|
|
158
|
+
pip install difflayers
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
### From source
|
|
162
|
+
|
|
163
|
+
```bash
|
|
164
|
+
git clone https://github.com/Prigoistic/mha-layers.git
|
|
165
|
+
cd mha-layers
|
|
166
|
+
pip install -e .
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
### Dependencies
|
|
170
|
+
|
|
171
|
+
| Package | Minimum version |
|
|
172
|
+
|---|---|
|
|
173
|
+
| Python | 3.8 |
|
|
174
|
+
| PyTorch | 1.9.0 |
|
|
175
|
+
| NumPy | 1.20.0 |
|
|
176
|
+
| SciPy | 1.7.0 |
|
|
177
|
+
|
|
178
|
+
For the example notebooks, install the extra requirements:
|
|
179
|
+
|
|
180
|
+
```bash
|
|
181
|
+
pip install -r examples/requirements.txt
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
---
|
|
185
|
+
|
|
186
|
+
## Quick Start
|
|
187
|
+
|
|
188
|
+
```python
|
|
189
|
+
import torch
|
|
190
|
+
from difflayers import Hopfield, HopfieldPooling, HopfieldLayer, DiffusedHopfield
|
|
191
|
+
|
|
192
|
+
# ------------------------------------------------------------------
|
|
193
|
+
# 1. Standard Hopfield attention (query x stored-pattern lookup)
|
|
194
|
+
# ------------------------------------------------------------------
|
|
195
|
+
hopfield = Hopfield(input_size=64, num_heads=4, batch_first=True)
|
|
196
|
+
|
|
197
|
+
queries = torch.randn(8, 10, 64) # (batch, query_len, d)
|
|
198
|
+
stored = torch.randn(8, 50, 64) # (batch, memory_size, d)
|
|
199
|
+
projections = torch.randn(8, 50, 64) # (batch, memory_size, d)
|
|
200
|
+
|
|
201
|
+
output = hopfield((stored, queries, projections))
|
|
202
|
+
# output: (8, 10, 64)
|
|
203
|
+
|
|
204
|
+
# ------------------------------------------------------------------
|
|
205
|
+
# 2. Hopfield pooling (sequence -> fixed-size embedding)
|
|
206
|
+
# ------------------------------------------------------------------
|
|
207
|
+
pooling = HopfieldPooling(input_size=64, num_heads=1, batch_first=True)
|
|
208
|
+
sequence = torch.randn(8, 100, 64)
|
|
209
|
+
pooled = pooling(sequence)
|
|
210
|
+
# pooled: (8, 1, 64) — one trained state-pattern queries over the sequence
|
|
211
|
+
|
|
212
|
+
# ------------------------------------------------------------------
|
|
213
|
+
# 3. Hopfield lookup (static trainable memory)
|
|
214
|
+
# ------------------------------------------------------------------
|
|
215
|
+
lookup = HopfieldLayer(input_size=64, num_pattern_repetitions=32)
|
|
216
|
+
query = torch.randn(8, 10, 64)
|
|
217
|
+
result = lookup(query)
|
|
218
|
+
# result: (8, 10, 64)
|
|
219
|
+
|
|
220
|
+
# ------------------------------------------------------------------
|
|
221
|
+
# 4. DiffusedHopfield (graph-diffusion augmented retrieval)
|
|
222
|
+
# ------------------------------------------------------------------
|
|
223
|
+
dh = DiffusedHopfield(
|
|
224
|
+
input_size=64,
|
|
225
|
+
num_heads=4,
|
|
226
|
+
batch_first=True,
|
|
227
|
+
eta=0.1, # diffusion strength eta
|
|
228
|
+
k_neighbors=8, # kNN graph degree
|
|
229
|
+
diffusion_mode="factored", # O(kNd) — fastest
|
|
230
|
+
diffusion_steps=3, # T iterations of diffuse -> attend
|
|
231
|
+
diffuse_key=True, # smooth stored patterns
|
|
232
|
+
diffuse_query=False, # optionally also smooth queries
|
|
233
|
+
)
|
|
234
|
+
output = dh((stored, queries, projections))
|
|
235
|
+
# output: (8, 10, 64) — same shape, sharper retrieval
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
---
|
|
239
|
+
|
|
240
|
+
## Core Modules
|
|
241
|
+
|
|
242
|
+
### Hopfield
|
|
243
|
+
|
|
244
|
+
The base continuous Hopfield attention layer. A direct PyTorch-compatible re-implementation of multi-head attention whose weights are derived from the Hopfield energy update rule rather than learned linear projections.
|
|
245
|
+
|
|
246
|
+
```python
|
|
247
|
+
from difflayers import Hopfield
|
|
248
|
+
|
|
249
|
+
hopfield = Hopfield(
|
|
250
|
+
input_size=128, # depth of state (query) patterns
|
|
251
|
+
hidden_size=64, # depth of the association (Hopfield) space
|
|
252
|
+
output_size=128, # depth of the output projection
|
|
253
|
+
num_heads=8, # parallel association heads
|
|
254
|
+
scaling=None, # beta; auto-set to 1/sqrt(head_dim) if None
|
|
255
|
+
update_steps_max=0, # 0 = one synchronous update (default/recommended)
|
|
256
|
+
update_steps_eps=1e-4, # convergence threshold for iterative updates
|
|
257
|
+
normalize_stored_pattern=True, # LayerNorm on keys
|
|
258
|
+
normalize_state_pattern=True, # LayerNorm on queries
|
|
259
|
+
batch_first=True,
|
|
260
|
+
dropout=0.1,
|
|
261
|
+
)
|
|
262
|
+
```
|
|
263
|
+
|
|
264
|
+
**Key parameters:**
|
|
265
|
+
|
|
266
|
+
| Parameter | Type | Default | Description |
|
|
267
|
+
|---|---|---|---|
|
|
268
|
+
| `input_size` | `int` | `None` | Feature depth of state (query) patterns |
|
|
269
|
+
| `hidden_size` | `int` | `None` | Hopfield association space depth; defaults to `input_size` |
|
|
270
|
+
| `output_size` | `int` | `None` | Output projection depth; defaults to `input_size` |
|
|
271
|
+
| `num_heads` | `int` | `1` | Parallel association heads |
|
|
272
|
+
| `scaling` | `float` | `None` | Inverse temperature beta; `None` => 1/sqrt(d_head) |
|
|
273
|
+
| `update_steps_max` | `int` | `0` | Max synchronous update iterations (`None` = run to convergence) |
|
|
274
|
+
| `batch_first` | `bool` | `True` | Input layout: `(batch, seq, d)` when `True`, `(seq, batch, d)` when `False` |
|
|
275
|
+
| `stored_pattern_as_static` | `bool` | `False` | Freeze stored patterns (no gradient through keys) |
|
|
276
|
+
| `disable_out_projection` | `bool` | `False` | Skip the final linear projection (useful for retrieval tasks) |
|
|
277
|
+
|
|
278
|
+
---
|
|
279
|
+
|
|
280
|
+
### HopfieldPooling
|
|
281
|
+
|
|
282
|
+
Replaces traditional pooling (mean, max, attention-based) with a Hopfield-energy-based alternative. A single **trainable state pattern** acts as the query and computes softmax weights over the input sequence, producing a fixed-size summary vector regardless of input length.
|
|
283
|
+
|
|
284
|
+
```python
import torch
from difflayers import HopfieldPooling

pooling = HopfieldPooling(
    input_size=128,
    num_heads=4,
    batch_first=True,
    dropout=0.1,
)

# Collapse a variable-length sequence to a single vector
batch, seq_len = 8, 100  # example sizes
sequence = torch.randn(batch, seq_len, 128)
pooled = pooling(sequence)  # (batch, 1, 128)
```
|
|
298
|
+
|
|
299
|
+
Useful anywhere you need a **permutation-invariant** sequence summarisation — bag-of-words classification, set encoding, immune repertoire profiling, etc.
|
|
300
|
+
|
|
301
|
+
---
|
|
302
|
+
|
|
303
|
+
### HopfieldLayer
|
|
304
|
+
|
|
305
|
+
A trainable, input-independent lookup table. One or more **stored patterns** and their **projections** are learned parameters; given a query, the layer retrieves the most energy-aligned stored vector — acting like a content-addressable memory with learned slots.
|
|
306
|
+
|
|
307
|
+
```python
import torch
from difflayers import HopfieldLayer

lookup = HopfieldLayer(
    input_size=128,
    num_pattern_repetitions=64,  # number of learned memory slots
    batch_first=True,
)

batch, seq_len = 8, 10  # example sizes
query = torch.randn(batch, seq_len, 128)
result = lookup(query)  # (batch, seq_len, 128)
```
|
|
319
|
+
|
|
320
|
+
This is distinct from `Hopfield` in that the memory contents are **learned parameters**, not runtime inputs — suitable for slot-attention, prototype networks, or any scenario where memory is fixed at training time.
|
|
321
|
+
|
|
322
|
+
---
|
|
323
|
+
|
|
324
|
+
### DiffusedHopfield
|
|
325
|
+
|
|
326
|
+
The DAHN module. A full drop-in replacement for `Hopfield` that augments the association with a graph-diffusion pre-processing step. Internally it builds a kNN cosine-similarity graph over the stored patterns, constructs the graph Laplacian, and runs a configurable diffusion-attention loop.
|
|
327
|
+
|
|
328
|
+
```python
|
|
329
|
+
from difflayers import DiffusedHopfield
|
|
330
|
+
|
|
331
|
+
dh = DiffusedHopfield(
|
|
332
|
+
# --- All standard Hopfield arguments are accepted ---
|
|
333
|
+
input_size=128,
|
|
334
|
+
num_heads=4,
|
|
335
|
+
batch_first=True,
|
|
336
|
+
scaling=1.0,
|
|
337
|
+
|
|
338
|
+
# --- DAHN-specific arguments ---
|
|
339
|
+
eta=0.1, # diffusion strength eta in (0, 0.5)
|
|
340
|
+
k_neighbors=8, # kNN graph degree
|
|
341
|
+
diffusion_mode="factored", # "factored" | "simple" | "iterative" | "spectral"
|
|
342
|
+
diffusion_steps=3, # T (ignored by "simple"; used by iterative/spectral)
|
|
343
|
+
use_normalized_laplacian=True, # symmetric-normalised L (recommended)
|
|
344
|
+
diffuse_key=True, # smooth stored patterns (keys)
|
|
345
|
+
diffuse_query=False, # optionally smooth query patterns too
|
|
346
|
+
use_sparse=False, # sparse adjacency for O(kN) memory
|
|
347
|
+
use_logit_diffusion=False, # also smooth post-softmax attention weights
|
|
348
|
+
logit_eta=None, # eta for logit diffusion; defaults to eta
|
|
349
|
+
adaptive_eta=False, # scale eta by attention entropy at runtime
|
|
350
|
+
cache_graph=True, # reuse graph across forward passes
|
|
351
|
+
energy_stop_tol=0.0, # early-stop on |Delta E| < tol (0 = disabled)
|
|
352
|
+
)
|
|
353
|
+
```
|
|
354
|
+
|
|
355
|
+
The forward signature is identical to `Hopfield`:
|
|
356
|
+
|
|
357
|
+
```python
|
|
358
|
+
output = dh((stored_patterns, state_patterns, pattern_projections))
|
|
359
|
+
# or with masking
|
|
360
|
+
output = dh((stored_patterns, state_patterns, pattern_projections),
|
|
361
|
+
stored_pattern_padding_mask=mask)
|
|
362
|
+
```
|
|
363
|
+
|
|
364
|
+
---
|
|
365
|
+
|
|
366
|
+
## Diffusion Modes
|
|
367
|
+
|
|
368
|
+
Four diffusion strategies are available, trading off speed, memory, and smoothing quality:
|
|
369
|
+
|
|
370
|
+
### `"factored"` *(default — recommended)*
|
|
371
|
+
|
|
372
|
+
```
|
|
373
|
+
x' = (1 - eta * deg) * x + eta * W @ x
|
|
374
|
+
```
|
|
375
|
+
|
|
376
|
+
Never forms the full Laplacian matrix. Stores only the sparse adjacency `W` and degree vector `deg`. Each step costs `O(kNd)` in time and `O(kN)` in memory. Best for large N and sparse graphs.
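The factored formula is algebraically the same as applying `(I - eta * L)` when `L = D - W` is the unnormalised Laplacian. The sketch below checks this on a three-node graph in plain Python. The adjacency-list layout and the function names `factored_step` and `dense_step` are assumptions for illustration; how the library stores `W`, or how it treats the normalised Laplacian in this mode, is not shown in this README.

```python
def factored_step(X, neighbors, eta):
    # neighbors[i] is a list of (j, w_ij) pairs: the sparse adjacency W.
    out = []
    for i, row in enumerate(X):
        deg = sum(w for _, w in neighbors[i])
        new = [(1.0 - eta * deg) * v for v in row]   # (1 - eta * deg) * x
        for j, w in neighbors[i]:
            for c in range(len(row)):
                new[c] += eta * w * X[j][c]          # + eta * W @ x
        out.append(new)
    return out

def dense_step(X, neighbors, eta):
    # Reference: build L = D - W explicitly, then apply (I - eta * L).
    N = len(X)
    W = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j, w in neighbors[i]:
            W[i][j] = w
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(N)] for i in range(N)]
    return [[sum(((1.0 if i == n else 0.0) - eta * L[i][n]) * X[n][c] for n in range(N))
             for c in range(len(X[0]))] for i in range(N)]

# Weighted path graph 0 - 1 - 2.
neighbors = [[(1, 0.5)], [(0, 0.5), (2, 1.0)], [(1, 1.0)]]
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
a = factored_step(X, neighbors, eta=0.1)
b = dense_step(X, neighbors, eta=0.1)
print(a, b)
```

The factored version touches only the stored edges, which is where the `O(kNd)` cost comes from; the dense version does `N` multiply-adds per output entry.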
|
|
377
|
+
|
|
378
|
+
### `"simple"`
|
|
379
|
+
|
|
380
|
+
```
|
|
381
|
+
x' = (I - eta * L) @ x
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
One explicit Euler step of heat diffusion. Forms `D = I - eta*L` once and applies it. Cost: `O(N^2 * d)` per step.
|
|
385
|
+
|
|
386
|
+
### `"iterative"`
|
|
387
|
+
|
|
388
|
+
```
|
|
389
|
+
x' = (I - eta * L)^T @ x    # ^T is the T-th matrix power (T = diffusion_steps), not a transpose
|
|
390
|
+
```
|
|
391
|
+
|
|
392
|
+
Applies the same operator `D` repeatedly for `T` steps (`diffusion_steps`). Provides deeper smoothing at the cost of `T * O(N^2 * d)`. Includes a numerical guard against divergence.
|
|
393
|
+
|
|
394
|
+
### `"spectral"`
|
|
395
|
+
|
|
396
|
+
```
|
|
397
|
+
x' = U @ diag(exp(-eta * lambda)) @ U.T @ x
|
|
398
|
+
```
|
|
399
|
+
|
|
400
|
+
Exact heat-kernel diffusion via eigendecomposition of `L`. Precomputes `U` and `lambda` once (`O(N^3)`), then applies the diagonal filter in `O(N^2 d)` per call, since both `U.T @ x` and the final product with `U` are N x N by N x d multiplications. Most accurate smoothing; not suitable for large N.

| Mode | Precompute | Per-step | Memory | Best for |
|---|---|---|---|---|
| `factored` | O(N^2) build kNN | O(kNd) | O(kN) | Large N, production |
| `simple` | O(N^2) build D | O(N^2 d) | O(N^2) | Moderate N, one-shot |
| `iterative` | O(N^2) build D | O(T * N^2 d) | O(N^2) | Deep smoothing |
| `spectral` | O(N^3) eigen | O(N^2 d) | O(N^2) | Small N, exact kernel |
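One way to compare the `iterative` and `spectral` modes is through the gain each applies to a Laplacian eigenmode with eigenvalue `lambda`. The sketch below is a worked arithmetic example, not library code. It assumes the two modes are compared at the same total diffusion time `eta * T`, whereas the `spectral` formula above is written with `eta` alone, and the function names are hypothetical.

```python
import math

def iterative_response(lam, eta, T):
    # Gain applied to an eigenmode by the T-th power of (I - eta * L).
    return (1.0 - eta * lam) ** T

def heat_response(lam, eta, T):
    # Gain applied by the exact heat kernel exp(-eta * T * L).
    return math.exp(-eta * T * lam)

eta, T = 0.1, 3
for lam in (0.0, 0.5, 1.0, 2.0):  # normalised-Laplacian eigenvalues lie in [0, 2]
    print(lam, round(iterative_response(lam, eta, T), 4), round(heat_response(lam, eta, T), 4))
```

Both filters leave the constant mode (`lambda = 0`) untouched and damp high-frequency modes; for small `eta * lambda` the explicit Euler steps track the exact kernel closely and damp slightly more.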
|
|
408
|
+
|
|
409
|
+
---
|
|
410
|
+
|
|
411
|
+
## DiffusionConfig Reference
|
|
412
|
+
|
|
413
|
+
`DiffusionConfig` is a frozen dataclass that bundles all diffusion hyperparameters. You can pass one explicitly to `DiffusedHopfield`, or let the constructor build it from keyword arguments.
|
|
414
|
+
|
|
415
|
+
```python
|
|
416
|
+
from difflayers import DiffusionConfig
|
|
417
|
+
|
|
418
|
+
cfg = DiffusionConfig(
|
|
419
|
+
eta=0.1,
|
|
420
|
+
beta=1.0,
|
|
421
|
+
steps=3,
|
|
422
|
+
diffusion_mode="factored",
|
|
423
|
+
attention_mode="dense", # "dense" | "graph"
|
|
424
|
+
k_neighbors=5,
|
|
425
|
+
use_normalized_laplacian=True,
|
|
426
|
+
use_sparse=False,
|
|
427
|
+
diffuse_key=True,
|
|
428
|
+
diffuse_query=False,
|
|
429
|
+
use_logit_diffusion=False,
|
|
430
|
+
logit_eta=None,
|
|
431
|
+
adaptive_eta=False,
|
|
432
|
+
adaptive_temperature=5.0,
|
|
433
|
+
adaptive_threshold=1.0,
|
|
434
|
+
cache_graph=True,
|
|
435
|
+
energy_stop_tol=0.0,
|
|
436
|
+
)
|
|
437
|
+
```
|
|
438
|
+
|
|
439
|
+
| Field | Type | Default | Description |
|
|
440
|
+
|---|---|---|---|
|
|
441
|
+
| `eta` | `float` | `0.1` | Diffusion strength. For normalised L use eta < 0.5 |
|
|
442
|
+
| `beta` | `float` | `1.0` | Hopfield scaling / inverse temperature |
|
|
443
|
+
| `steps` | `int` | `3` | Number of diffuse->attend iterations |
|
|
444
|
+
| `diffusion_mode` | `str` | `"factored"` | One of `"factored"`, `"simple"`, `"iterative"`, `"spectral"` |
|
|
445
|
+
| `attention_mode` | `str` | `"dense"` | `"dense"` (full O(N^2)) or `"graph"` (kNN-constrained O(kN)) |
|
|
446
|
+
| `k_neighbors` | `int` | `5` | Number of nearest neighbours in the similarity graph |
|
|
447
|
+
| `use_normalized_laplacian` | `bool` | `True` | Symmetric-normalised L; eigenvalues in [0, 2] |
|
|
448
|
+
| `use_sparse` | `bool` | `False` | Store adjacency as `sparse_coo` for O(kN) memory |
|
|
449
|
+
| `diffuse_key` | `bool` | `True` | Smooth stored patterns (keys) before attention |
|
|
450
|
+
| `diffuse_query` | `bool` | `False` | Smooth state patterns (queries) before attention |
|
|
451
|
+
| `use_logit_diffusion` | `bool` | `False` | Smooth post-softmax attention weights over the key graph |
|
|
452
|
+
| `logit_eta` | `float\|None` | `None` | Separate eta for logit diffusion; falls back to `eta` |
|
|
453
|
+
| `adaptive_eta` | `bool` | `False` | Scale eta by attention entropy (high-entropy -> more diffusion) |
|
|
454
|
+
| `cache_graph` | `bool` | `True` | Re-use built graph across forward passes |
|
|
455
|
+
| `energy_stop_tol` | `float` | `0.0` | Early-stop if abs(Delta E) < tol per step; 0 disables |
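Because the config is described as a frozen dataclass, instances cannot be modified in place; variants are created by copying. The sketch below shows that behaviour with a hypothetical stand-in class, `ToyDiffusionConfig`, which carries only three of the documented fields. It uses the standard-library `dataclasses` module and does not import or reproduce the real `DiffusionConfig`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class ToyDiffusionConfig:
    # Hypothetical stand-in with three of the documented fields and defaults.
    eta: float = 0.1
    steps: int = 3
    diffusion_mode: str = "factored"

base = ToyDiffusionConfig()

try:
    base.eta = 0.2          # frozen: attribute assignment raises
    mutated = True
except FrozenInstanceError:
    mutated = False

# Hyperparameter sweep: each replace() returns a new object, base is untouched.
sweep = [replace(base, eta=e) for e in (0.05, 0.1, 0.2)]
print(mutated, base, sweep)
```

Frozen instances are also hashable, so a config like this can serve as a dictionary key, for example when caching one graph per configuration.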
|
|
456
|
+
|
|
457
|
+
---
|
|
458
|
+
|
|
459
|
+
## Graph Pipeline
|
|
460
|
+
|
|
461
|
+
The graph pipeline under `difflayers.graph` can be used standalone to build Laplacians for any downstream use:
|
|
462
|
+
|
|
463
|
+
```python
|
|
464
|
+
import torch
|
|
465
|
+
from difflayers.graph.build_graph import build_similarity_matrix, build_knn_graph
|
|
466
|
+
from difflayers.graph.laplacian import compute_laplacian, compute_normalized_laplacian
|
|
467
|
+
from difflayers.graph.builder import GraphBuilder
|
|
468
|
+
|
|
469
|
+
# --- Manual pipeline ---
|
|
470
|
+
X = torch.randn(100, 64) # 100 patterns, 64-dim
|
|
471
|
+
|
|
472
|
+
S = build_similarity_matrix(X) # (100, 100) cosine similarity
|
|
473
|
+
A = build_knn_graph(S, k=8, as_sparse=False) # (100, 100) symmetric kNN adjacency
|
|
474
|
+
L = compute_normalized_laplacian(A) # (100, 100) symmetric-normalised Laplacian
|
|
475
|
+
|
|
476
|
+
# --- Fluent builder API ---
|
|
477
|
+
graph = (
|
|
478
|
+
GraphBuilder(X)
|
|
479
|
+
.cosine_similarity()
|
|
480
|
+
.knn(k=8, sparse=True)
|
|
481
|
+
.normalized_laplacian()
|
|
482
|
+
.build()
|
|
483
|
+
)
|
|
484
|
+
# graph.L — Laplacian
|
|
485
|
+
# graph.W — adjacency
|
|
486
|
+
# graph.deg — degree vector
|
|
487
|
+
```
|
|
488
|
+
|
|
489
|
+
**`build_similarity_matrix(X)`**
Computes pairwise cosine similarities, clamps negatives to zero, and zeros the diagonal (no self-loops). Complexity: O(N^2 d).

**`build_knn_graph(S, k, as_sparse)`**
Sparsifies the similarity matrix by keeping only the top-k neighbours per node, then symmetrises. When `as_sparse=True`, returns a sparse COO tensor (`torch.sparse_coo_tensor`) for O(kN) downstream products.

**`compute_laplacian(A)`**
Unnormalised Laplacian L = D - A, where D = diag(A * 1) is the degree matrix. Eigenvalues in [0, 2 d_max], where d_max is the largest degree.

**`compute_normalized_laplacian(A)`**
Symmetric normalised Laplacian L_sym = D^{-1/2} (D - A) D^{-1/2}. Eigenvalues in [0, 2]. Isolated nodes are handled safely. **Recommended** for diffusion because the fixed eigenvalue bound makes the stable range of eta independent of the input.
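The steps described above can be sketched in plain Python on four 2-D patterns. This is a reading of the descriptions, not the library's code: the function names are hypothetical, symmetrising with an elementwise maximum is an assumption (the text only says the graph is symmetrised), and isolated nodes are given a zero row as one possible safe handling.

```python
import math

def similarity_matrix(X):
    # Cosine similarity, negatives clamped to zero, zero diagonal (no self-loops).
    norms = [math.sqrt(sum(v * v for v in x)) or 1.0 for x in X]
    N = len(X)
    S = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j:
                cos = sum(a * b for a, b in zip(X[i], X[j])) / (norms[i] * norms[j])
                S[i][j] = max(cos, 0.0)
    return S

def knn_graph(S, k):
    # Keep each node's k largest similarities, then symmetrise (assumed: elementwise max).
    N = len(S)
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        top = sorted(range(N), key=lambda j: S[i][j], reverse=True)[:k]
        for j in top:
            A[i][j] = S[i][j]
    return [[max(A[i][j], A[j][i]) for j in range(N)] for i in range(N)]

def normalized_laplacian(A):
    # L_sym = I - D^{-1/2} A D^{-1/2}; nodes with degree 0 get an all-zero row.
    N = len(A)
    deg = [sum(row) for row in A]
    inv = [1.0 / math.sqrt(d) if d > 0.0 else 0.0 for d in deg]
    return [[(1.0 if i == j and deg[i] > 0.0 else 0.0) - inv[i] * A[i][j] * inv[j]
             for j in range(N)] for i in range(N)]

# Two tight pairs of patterns: (0, 1) near the x-axis, (2, 3) near the y-axis.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
L = normalized_laplacian(knn_graph(similarity_matrix(X), k=1))
for row in L:
    print([round(v, 3) for v in row])
```

With `k=1` each pattern links only to its partner, so the Laplacian splits into two 2 x 2 blocks with eigenvalues 0 and 2, the ends of the [0, 2] range quoted above.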
|
|
500
|
+
|
|
501
|
+
---
|
|
502
|
+
|
|
503
|
+
## Advanced Usage
|
|
504
|
+
|
|
505
|
+
### Static retrieval (no learned projections)
|
|
506
|
+
|
|
507
|
+
Useful for direct content-addressable memory benchmarks:
|
|
508
|
+
|
|
509
|
+
```python
|
|
510
|
+
model = DiffusedHopfield(
|
|
511
|
+
input_size=None,
|
|
512
|
+
stored_pattern_as_static=True,
|
|
513
|
+
state_pattern_as_static=True,
|
|
514
|
+
pattern_projection_as_static=True,
|
|
515
|
+
disable_out_projection=True,
|
|
516
|
+
normalize_stored_pattern=False,
|
|
517
|
+
normalize_state_pattern=False,
|
|
518
|
+
normalize_pattern_projection=False,
|
|
519
|
+
normalize_stored_pattern_affine=False,
|
|
520
|
+
normalize_state_pattern_affine=False,
|
|
521
|
+
normalize_pattern_projection_affine=False,
|
|
522
|
+
batch_first=True,
|
|
523
|
+
scaling=4.0,
|
|
524
|
+
eta=0.15,
|
|
525
|
+
k_neighbors=10,
|
|
526
|
+
diffusion_mode="iterative",
|
|
527
|
+
diffusion_steps=5,
|
|
528
|
+
diffuse_key=True,
|
|
529
|
+
)
|
|
530
|
+
```
|
|
531
|
+
|
|
532
|
+
### Ablation: diffuse only queries, only keys, or both
|
|
533
|
+
|
|
534
|
+
```python
|
|
535
|
+
# Only diffuse keys (strongest effect; default)
|
|
536
|
+
dh_k = DiffusedHopfield(input_size=64, diffuse_key=True, diffuse_query=False, eta=0.1)
|
|
537
|
+
|
|
538
|
+
# Only diffuse queries (useful when queries are noisy)
|
|
539
|
+
dh_q = DiffusedHopfield(input_size=64, diffuse_key=False, diffuse_query=True, eta=0.1)
|
|
540
|
+
|
|
541
|
+
# Diffuse both
|
|
542
|
+
dh_both = DiffusedHopfield(input_size=64, diffuse_key=True, diffuse_query=True, eta=0.1)
|
|
543
|
+
```
|
|
544
|
+
|
|
545
|
+
### Logit-level diffusion
|
|
546
|
+
|
|
547
|
+
Smooth the post-softmax attention weights over the key graph:
|
|
548
|
+
|
|
549
|
+
```python
|
|
550
|
+
dh = DiffusedHopfield(
|
|
551
|
+
input_size=64,
|
|
552
|
+
diffuse_key=True,
|
|
553
|
+
use_logit_diffusion=True,
|
|
554
|
+
logit_eta=0.05, # usually smaller than pattern-level eta
|
|
555
|
+
)
|
|
556
|
+
```
|
|
557
|
+
|
|
558
|
+
### Adaptive diffusion strength
|
|
559
|
+
|
|
560
|
+
Scale eta automatically by attention entropy — high-entropy (uncertain) distributions receive more smoothing:
|
|
561
|
+
|
|
562
|
+
```python
|
|
563
|
+
dh = DiffusedHopfield(
|
|
564
|
+
input_size=64,
|
|
565
|
+
adaptive_eta=True,
|
|
566
|
+
eta=0.2, # maximum eta
|
|
567
|
+
adaptive_temperature=5.0,
|
|
568
|
+
adaptive_threshold=1.0, # entropy midpoint for sigmoid gate
|
|
569
|
+
)
|
|
570
|
+
```
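As a rough illustration of how an entropy-driven gate could behave, the sketch below scales a maximum `eta` by a sigmoid of the attention entropy. The exact formula is an assumption inferred from the parameter names and comments above (`adaptive_temperature` as the sigmoid slope, `adaptive_threshold` as its midpoint); the library's actual rule may differ, and `attention_entropy` and `adaptive_eta` are hypothetical helper names.

```python
import math

def attention_entropy(p):
    # Shannon entropy (in nats) of one attention distribution.
    return -sum(v * math.log(v) for v in p if v > 0.0)

def adaptive_eta(p, eta_max, temperature=5.0, threshold=1.0):
    # Assumed gate: eta_max * sigmoid(temperature * (H(p) - threshold)).
    h = attention_entropy(p)
    return eta_max / (1.0 + math.exp(-temperature * (h - threshold)))

peaked = [0.97, 0.01, 0.01, 0.01]    # confident retrieval, low entropy
uniform = [0.25, 0.25, 0.25, 0.25]   # uncertain retrieval, entropy = log(4)

print(adaptive_eta(peaked, eta_max=0.2), adaptive_eta(uniform, eta_max=0.2))
```

Under this assumed gate a confident attention row receives almost no extra smoothing, while a uniform row receives most of the maximum `eta`.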
|
|
571
|
+
|
|
572
|
+
### DynamicsEngine + EnergyTracker (low-level API)
|
|
573
|
+
|
|
574
|
+
```python
|
|
575
|
+
from difflayers import DiffusionConfig, DynamicsEngine, EnergyTracker, GraphCache
|
|
576
|
+
from difflayers.diffusion import FactoredDiffusion
|
|
577
|
+
from difflayers.attention_operator import AttentionOperator
|
|
578
|
+
|
|
579
|
+
cfg = DiffusionConfig(eta=0.1, steps=5, k_neighbors=8)
|
|
580
|
+
|
|
581
|
+
# Build graph once
|
|
582
|
+
cache = GraphCache(cfg)
|
|
583
|
+
graph = cache.get(patterns) # builds kNN + Laplacian; cached on repeated calls
|
|
584
|
+
|
|
585
|
+
# Build operators
|
|
586
|
+
diffusion_op = FactoredDiffusion(graph.W, graph.deg, cfg.eta)
|
|
587
|
+
attn_op = AttentionOperator(beta=cfg.beta, mode=cfg.attention_mode)
|
|
588
|
+
|
|
589
|
+
# Run the dynamics loop
|
|
590
|
+
engine = DynamicsEngine(diffusion_op, attn_op, cfg)
|
|
591
|
+
tracker = EnergyTracker(enabled=True)
|
|
592
|
+
|
|
593
|
+
Q_out, K_out = engine.run(Q, K, V, tracker=tracker)
|
|
594
|
+
|
|
595
|
+
print(tracker.energies) # list of Hopfield energy per step
|
|
596
|
+
```

---

## Transformer Integration

`difflayers` provides Hopfield-based encoder and decoder layers that slot directly into standard transformer architectures:

```python
from difflayers import HopfieldEncoderLayer, HopfieldDecoderLayer
import torch.nn as nn

encoder = nn.TransformerEncoder(
    encoder_layer=HopfieldEncoderLayer(
        d_model=512,
        nhead=8,
        dim_feedforward=2048,
        dropout=0.1,
        batch_first=True,
    ),
    num_layers=6,
)

decoder = nn.TransformerDecoder(
    decoder_layer=HopfieldDecoderLayer(
        d_model=512,
        nhead=8,
        dim_feedforward=2048,
        dropout=0.1,
        batch_first=True,
    ),
    num_layers=6,
)
```

`HopfieldEncoderLayer` and `HopfieldDecoderLayer` are direct drop-in replacements for PyTorch's built-in transformer layers, with the attention kernel replaced by the Hopfield update rule.

---

## Example Notebooks

The [examples/](examples/) directory contains three fully worked demonstrations. Install dependencies first:

```bash
pip install -r examples/requirements.txt
```

### [Bit Pattern Set](examples/bit_pattern/bit_pattern_demo.ipynb)

A binary classification task in the Multiple Instance Learning (MIL) setting. Each bag contains bit-pattern instances (sequences of 0s and 1s); positive bags have specific class-defining patterns injected that are absent in negative bags. The notebook shows that `Hopfield`, `HopfieldPooling`, and `HopfieldLayer` all learn to filter bags for the discriminative patterns with high accuracy, even as bag size and noise increase.
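
To make the bag structure concrete, the following sketch builds one positive and one negative bag. It is an illustration under assumptions, not the notebook's data pipeline: the single target pattern, the bag size, the bit width, and the `make_bag` helper are all hypothetical.

```python
import random


def make_bag(rng, positive, target, bag_size=8, n_bits=8):
    """Build one bag of random bit patterns; only positive bags contain `target`."""
    bag = []
    while len(bag) < bag_size:
        inst = tuple(rng.randint(0, 1) for _ in range(n_bits))
        if inst != target:  # keep the target out of the random background
            bag.append(inst)
    if positive:
        bag[rng.randrange(bag_size)] = target  # inject the class-defining pattern
    return bag


rng = random.Random(0)
target = (1, 1, 0, 0, 1, 1, 0, 0)
pos_bag = make_bag(rng, positive=True, target=target)
neg_bag = make_bag(rng, positive=False, target=target)

print(target in pos_bag, target in neg_bag)
```

The label depends only on whether the target appears anywhere in the bag, so a model has to attend to the one informative instance and ignore the rest.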

### [Latch Sequence Set](examples/latch_sequence/latch_sequence_demo.ipynb)

A long-term dependency task. A sequence begins with symbol **A** or **B**; after a variable delay, the model must output the corresponding symbol. The Hopfield layer concentrates attention sharply on the first position of the sequence, capturing the dependency without positional encoding.

### [Attention-based Deep MIL (MNIST Bags)](examples/mnist_bags/mnist_bags_demo.ipynb)

A canonical MIL benchmark from [Ilse et al. (2018)](https://arxiv.org/abs/1802.04712). Each bag is a collection of 28x28 MNIST images; a bag is positive if it contains a target digit, negative otherwise. The notebook benchmarks Hopfield-based pooling against classic attention-MIL and demonstrates strong accuracy even with large bag sizes.

---

## Running Experiments

All experiments are in [src/experiments/](src/experiments/) and write results to [results/](results/).

```bash
# Full ablation study (diffuse Q only / K only / both vs. none)
python -m src.experiments.ablation

# Benchmark diffusion modes (factored, simple, iterative, spectral)
python -m src.experiments.benchmark

# Noise robustness sweep
python -m src.experiments.noise_robustness

# Steps sweep (T = 1 ... 10)
python -m src.experiments.steps_sweep

# Mode comparison (standard Hopfield vs. DiffusedHopfield)
python -m src.experiments.mode_comparison

# Logit vs. feature-level diffusion comparison
python -m src.experiments.logit_vs_feature

# Attention head analysis
python -m src.experiments.attention_analysis
```

---

## API Reference

All public names exported from `difflayers`:

| Name | Type | Description |
|---|---|---|
| `Hopfield` | `nn.Module` | Base continuous Hopfield attention layer |
| `HopfieldPooling` | `nn.Module` | Hopfield-based pooling with a trainable query |
| `HopfieldLayer` | `nn.Module` | Trainable static-memory lookup layer |
| `HopfieldCore` | `nn.Module` | Low-level multi-head Hopfield kernel |
| `DiffusedHopfield` | `nn.Module` | DAHN: graph-diffusion augmented Hopfield |
| `HopfieldEncoderLayer` | `nn.Module` | Transformer encoder layer with Hopfield attention |
| `HopfieldDecoderLayer` | `nn.Module` | Transformer decoder layer with Hopfield attention |
| `DiffusionOperator` | `ABC` | Abstract base for diffusion strategies |
| `SimpleDiffusion` | `DiffusionOperator` | One-step explicit Euler diffusion |
| `IterativeDiffusion` | `DiffusionOperator` | T-step iterative diffusion |
| `SpectralDiffusion` | `DiffusionOperator` | Exact heat-kernel via eigendecomposition |
| `FactoredDiffusion` | `DiffusionOperator` | Laplacian-free O(kNd) factored form |
| `apply_diffusion` | `function` | Functional API for a single diffusion call |
| `DiffusionConfig` | `dataclass` | Unified serialisable config for DAHN |
| `GraphCache` | `class` | Builds and caches the kNN graph + Laplacian |
| `DynamicsEngine` | `class` | Orchestrates the diffuse->attend loop |
| `EnergyTracker` | `class` | Per-step Hopfield energy logging + early-stop |
| `GraphBuilder` | `class` | Fluent graph-construction API |

---

## Complexity Guide

| Operation | Time | Memory | Notes |
|---|---|---|---|
| Build similarity matrix | O(N^2 d) | O(N^2) | `build_similarity_matrix` |
| Build kNN graph (dense) | O(N^2) | O(N^2) | `build_knn_graph` |
| Build kNN graph (sparse) | O(N^2) | O(kN) | `as_sparse=True` |
| Laplacian (dense) | O(N^2) | O(N^2) | |
| `FactoredDiffusion` step | O(kNd) | O(kN) | Recommended for large N |
| `SimpleDiffusion` step | O(N^2 d) | O(N^2) | |
| `IterativeDiffusion` T steps | O(T N^2 d) | O(N^2) | |
| `SpectralDiffusion` precompute | O(N^3) | O(N^2) | Eigendecomposition |
| `SpectralDiffusion` apply | O(N^2) | O(N^2) | Per forward pass |
| Dense Hopfield attention | O(N^2 d) | O(N^2) | `attention_mode="dense"` |
| Graph-constrained attention | O(kNd) | O(kN) | `attention_mode="graph"` |
| Full DAHN (factored + dense) | O(T kNd + N^2 d) | O(N^2) | Typical configuration |
| Full DAHN (factored + graph) | O(T kNd) | O(kN) | Fully sparse end-to-end |

N = number of patterns, d = feature dimension, k = kNN degree, T = diffusion steps.
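
The gap between the O(N^2 d) and O(kNd) rows comes from never materialising the N x N Laplacian. The sketch below compares the two on a toy graph. It assumes the unnormalised Laplacian L = D - W and one explicit Euler step X <- X - eta * L X; the library's `FactoredDiffusion` may normalise differently, and `dense_step` and `factored_step` are hypothetical helpers, not part of `difflayers`.

```python
def dense_step(W, X, eta):
    """X - eta * L X with L = D - W built explicitly: O(N^2 d) time."""
    n, d = len(X), len(X[0])
    deg = [sum(row) for row in W]
    L = [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    return [
        [X[i][c] - eta * sum(L[i][j] * X[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]


def factored_step(neighbors, X, eta):
    """Same update from neighbour lists only: O(k N d) time, no N x N matrix."""
    out = []
    for i, nbrs in enumerate(neighbors):
        deg_i = sum(w for _, w in nbrs)
        row = []
        for c in range(len(X[0])):
            wx = sum(w * X[j][c] for j, w in nbrs)  # (W X)[i][c] over k neighbours
            row.append(X[i][c] - eta * (deg_i * X[i][c] - wx))
        out.append(row)
    return out


# Path graph 0 - 1 - 2 - 3 with unit edge weights.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
neighbors = [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in W]
X = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

dense = dense_step(W, X, eta=0.1)
factored = factored_step(neighbors, X, eta=0.1)
print(dense)
print(factored)
```

Both functions compute the same update; the factored version only touches each node's k neighbours, which is where the O(kNd) time and O(kN) memory figures come from.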

---

## Background Paper

The Hopfield attention foundation is described in:

> **Hopfield Networks is All You Need**
> Hubert Ramsauer, Bernhard Schaefl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber,
> Markus Holzleitner, Milena Pavlovic, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp,
> Gunter Klambauer, Johannes Brandstetter, Sepp Hochreiter
> *ICLR 2021*, [arxiv.org/abs/2008.02217](https://arxiv.org/abs/2008.02217)

A detailed companion blog post covering the theoretical background is available at
[ml-jku.github.io/hopfield-layers](https://ml-jku.github.io/hopfield-layers/).

---

## Disclaimer

Parts of this implementation are based on [PyTorch v1.6.0](https://github.com/pytorch/pytorch/tree/v1.6.0) and extended for the Hopfield/DAHN setting:

| Module | Based on |
|---|---|
| [`difflayers/activation.py`: `HopfieldCore`](difflayers/activation.py) | [`torch.nn.MultiheadAttention`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/activation.py#L771) |
| [`difflayers/functional.py`: `hopfield_core_forward`](difflayers/functional.py) | [`torch.nn.functional.multi_head_attention_forward`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/functional.py#L3854) |
| [`difflayers/transformer.py`: `HopfieldEncoderLayer`](difflayers/transformer.py) | [`torch.nn.TransformerEncoderLayer`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/transformer.py#L241) |
| [`difflayers/transformer.py`: `HopfieldDecoderLayer`](difflayers/transformer.py) | [`torch.nn.TransformerDecoderLayer`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/transformer.py#L303) |

---

## License

BSD-style license; see [LICENSE](LICENSE).