difflayers 0.1.0__tar.gz → 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. difflayers-0.1.1/PKG-INFO +765 -0
  2. difflayers-0.1.1/README.md +731 -0
  3. difflayers-0.1.1/difflayers.egg-info/PKG-INFO +765 -0
  4. {difflayers-0.1.0 → difflayers-0.1.1}/pyproject.toml +1 -1
  5. {difflayers-0.1.0 → difflayers-0.1.1}/setup.py +1 -1
  6. difflayers-0.1.0/PKG-INFO +0 -210
  7. difflayers-0.1.0/README.md +0 -176
  8. difflayers-0.1.0/difflayers.egg-info/PKG-INFO +0 -210
  9. {difflayers-0.1.0 → difflayers-0.1.1}/LICENSE +0 -0
  10. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/__init__.py +0 -0
  11. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/activation.py +0 -0
  12. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/attention_operator.py +0 -0
  13. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/auxiliary/__init__.py +0 -0
  14. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/auxiliary/data.py +0 -0
  15. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/diffused_attention.py +0 -0
  16. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/diffusion.py +0 -0
  17. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/dynamics_engine.py +0 -0
  18. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/functional.py +0 -0
  19. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/__init__.py +0 -0
  20. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/build_graph.py +0 -0
  21. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/builder.py +0 -0
  22. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/laplacian.py +0 -0
  23. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/graph/laplacian_builder.py +0 -0
  24. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers/transformer.py +0 -0
  25. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/SOURCES.txt +0 -0
  26. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/dependency_links.txt +0 -0
  27. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/not-zip-safe +0 -0
  28. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/requires.txt +0 -0
  29. {difflayers-0.1.0 → difflayers-0.1.1}/difflayers.egg-info/top_level.txt +0 -0
  30. {difflayers-0.1.0 → difflayers-0.1.1}/setup.cfg +0 -0
@@ -0,0 +1,765 @@
1
+ Metadata-Version: 2.4
2
+ Name: difflayers
3
+ Version: 0.1.1
4
+ Summary: difflayers: Diffusion-Augmented Hopfield Networks
5
+ Home-page: https://github.com/hopfileds/hopfield-layers
6
+ Author: Priyam Ghosh
7
+ Author-email: Priyam Ghosh <priyamghosh9753@gmail.com>
8
+ License: BSD
9
+ Project-URL: Homepage, https://github.com/hopfileds/hopfield-layers
10
+ Project-URL: Repository, https://github.com/hopfileds/hopfield-layers
11
+ Project-URL: Bug Tracker, https://github.com/hopfileds/hopfield-layers/issues
12
+ Keywords: hopfield networks,deep learning,attention,diffusion,graph
13
+ Classifier: Development Status :: 3 - Alpha
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: License :: OSI Approved :: BSD License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.8
18
+ Classifier: Programming Language :: Python :: 3.9
19
+ Classifier: Programming Language :: Python :: 3.10
20
+ Classifier: Programming Language :: Python :: 3.11
21
+ Classifier: Programming Language :: Python :: 3.12
22
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
23
+ Classifier: Operating System :: OS Independent
24
+ Requires-Python: >=3.8
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Requires-Dist: torch>=1.9.0
28
+ Requires-Dist: numpy>=1.20.0
29
+ Requires-Dist: scipy>=1.7.0
30
+ Dynamic: author
31
+ Dynamic: home-page
32
+ Dynamic: license-file
33
+ Dynamic: requires-python
34
+
35
+ # difflayers — Diffusion-Augmented Hopfield Networks
36
+
37
+ <p align="center">
38
+ <a href="https://pypi.org/project/difflayers/"><img src="https://img.shields.io/pypi/v/difflayers?color=blue&label=PyPI" alt="PyPI"></a>
39
+ <a href="https://pypi.org/project/difflayers/"><img src="https://img.shields.io/pypi/pyversions/difflayers" alt="Python Versions"></a>
40
+ <a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-%E2%89%A51.9-orange" alt="PyTorch"></a>
41
+ <a href="LICENSE"><img src="https://img.shields.io/badge/license-BSD-green" alt="License"></a>
42
+ </p>
43
+
44
+ **difflayers** is a PyTorch library that extends modern continuous Hopfield networks with graph-based Laplacian diffusion, turning associative memory layers into structure-aware retrievers. At its core sits the **Diffusion-Augmented Hopfield Network (DAHN)** — a drop-in upgrade to standard Hopfield attention that pre-smooths patterns over a learned kNN graph before every association step, suppressing spurious retrievals and sharpening metastable energy minima.
45
+
46
+ The library ships the full original Hopfield layer suite (`Hopfield`, `HopfieldPooling`, `HopfieldLayer`) plus the DAHN extensions (`DiffusedHopfield`, four diffusion operators, a graph-construction pipeline, and a dynamical memory engine) — all under a single, clean API.
47
+
48
+ ---
49
+
50
+ ## Table of Contents
51
+
52
+ 1. [Background](#background)
53
+ 2. [What DAHN Adds](#what-dahn-adds)
54
+ 3. [Architecture Overview](#architecture-overview)
55
+ 4. [Installation](#installation)
56
+ 5. [Quick Start](#quick-start)
57
+ 6. [Core Modules](#core-modules)
58
+ - [Hopfield](#hopfield)
59
+ - [HopfieldPooling](#hopfieldpooling)
60
+ - [HopfieldLayer](#hopfieldlayer)
61
+ - [DiffusedHopfield](#diffusedhopfield)
62
+ 7. [Diffusion Modes](#diffusion-modes)
63
+ 8. [DiffusionConfig Reference](#diffusionconfig-reference)
64
+ 9. [Graph Pipeline](#graph-pipeline)
65
+ 10. [Advanced Usage](#advanced-usage)
66
+ 11. [Transformer Integration](#transformer-integration)
67
+ 12. [Example Notebooks](#example-notebooks)
68
+ 13. [Running Experiments](#running-experiments)
69
+ 14. [API Reference](#api-reference)
70
+ 15. [Complexity Guide](#complexity-guide)
71
+ 16. [Background Paper](#background-paper)
72
+ 17. [Disclaimer](#disclaimer)
73
+ 18. [License](#license)
74
+
75
+ ---
76
+
77
+ ## Background
78
+
79
+ Modern Hopfield networks with continuous states were introduced in [Ramsauer et al. (2020)](https://arxiv.org/abs/2008.02217), where it was shown that the transformer **attention mechanism is exactly the update rule of a continuous Hopfield network**. This re-framing unlocks exponential storage capacity, single-step convergence, and a clean energy-based interpretation of deep attention.
80
+
81
+ The energy function of a continuous Hopfield network is:
82
+
83
+ $$E = -\text{lse}(\beta, X \xi) + \frac{1}{2}\xi^\top \xi + \frac{1}{\beta}\log N + \frac{1}{2}M^2$$
84
+
85
+ where $\text{lse}(\beta, z) = \frac{1}{\beta}\log\sum_i e^{\beta z_i}$ is the log-sum-exp function, $\xi$ is the state pattern (query), $X$ holds the stored patterns (keys) as rows, $\beta$ is the inverse temperature, $N$ is the number of stored patterns, and $M = \max_i \lVert x_i \rVert$ is the largest norm among the stored patterns.
86
+
87
+ Energy minimization via one synchronous update yields the familiar softmax attention:
88
+
89
+ $$\xi^{\text{new}} = X^\top \text{softmax}(\beta X \xi)$$
90
+
91
+ The network can store **exponentially many patterns** (in the dimension $d$), converges in **one update step**, and has exponentially small retrieval errors — properties not shared by classical binary Hopfield networks.
92
+
93
+ Three classes of fixed points (energy minima) arise naturally:
94
+
95
+ | Fixed-point type | Regime | Behaviour |
96
+ |---|---|---|
97
+ | **Global averaging** | Low $\beta$ | Retrieves a weighted average of all patterns |
98
+ | **Metastable states** | Medium $\beta$ | Retrieves a subset of patterns — analogous to multi-head attention |
99
+ | **Single-pattern storage** | High $\beta$ | Sharply retrieves one stored pattern |
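The update rule, the energy, and the low and high $\beta$ regimes can be checked on a toy memory. The sketch below is dependency-free and does not use `difflayers`. The helper names (`hopfield_update`, `energy`) and the three one-hot patterns are illustrative assumptions, and $X$ is taken to hold the stored patterns as rows.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, beta):
    """Softmax of beta * scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(X, xi, beta):
    """One synchronous update: xi_new = X^T softmax(beta * X xi)."""
    weights = softmax([dot(x, xi) for x in X], beta)
    xi_new = [sum(w * x[j] for w, x in zip(weights, X)) for j in range(len(xi))]
    return xi_new, weights

def energy(X, xi, beta):
    """E = -lse(beta, X xi) + 0.5 xi^T xi + log(N) / beta + 0.5 M^2."""
    scores = [dot(x, xi) for x in X]
    m = max(scores)
    lse = m + math.log(sum(math.exp(beta * (s - m)) for s in scores)) / beta
    M = max(math.sqrt(dot(x, x)) for x in X)  # largest stored-pattern norm
    return -lse + 0.5 * dot(xi, xi) + math.log(len(X)) / beta + 0.5 * M * M

X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # stored patterns (rows)
xi = [0.9, 0.2, 0.1]  # noisy query, closest to the first pattern

_, w_low = hopfield_update(X, xi, beta=0.01)         # low beta: near-uniform weights
xi_high, w_high = hopfield_update(X, xi, beta=50.0)  # high beta: one pattern dominates
e_before = energy(X, xi, beta=50.0)
e_after = energy(X, xi_high, beta=50.0)              # the update does not raise the energy
```

With a low `beta` the weights are close to uniform (global averaging). With a high `beta` almost all weight falls on the nearest stored pattern, and the energy after the update is no higher than before.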
100
+
101
+ ---
102
+
103
+ ## What DAHN Adds
104
+
105
+ Standard Hopfield attention treats every stored pattern as equally reachable from any query. In high-noise or high-density memory scenarios, the attention distribution spreads over spurious neighbours, degrading retrieval accuracy.
106
+
107
+ **DAHN** addresses this by building a $k$-nearest-neighbour graph over the pattern set and pre-smoothing patterns with the graph Laplacian before every association step. The dynamics loop is:
108
+
109
+ $$\text{for } t = 1, \ldots, T:$$
110
+ $$K' = \underbrace{(I - \eta L)}_{\text{diffusion}} K, \quad Q' = (I - \eta L) Q \quad \text{(optional)}$$
111
+ $$\text{output} = \text{softmax}(\beta \, Q' {K'}^\top) \, V$$
112
+
113
+ where $L$ is the (optionally symmetric-normalized) graph Laplacian of the kNN similarity graph over $K$, and $\eta$ is the diffusion strength. This smoothing:
114
+
115
+ - **Clusters** related patterns before retrieval, reducing inter-cluster interference
116
+ - **Sharpens** metastable energy minima, improving single-pattern retrieval accuracy under noise
117
+ - **Preserves** the Hopfield energy landscape (diffusion decreases the energy, never creates new spurious minima)
118
+ - **Scales** gracefully: with `FactoredDiffusion` and sparse adjacency the full loop costs $O(kNd)$ per step
119
+
120
+ ---
121
+
122
+ ## Architecture Overview
123
+
124
+ ```
125
+ difflayers/
126
+
127
+ ├── __init__.py # Public API — 18 exported names
128
+
129
+ ├── activation.py # HopfieldCore (multi-head Hopfield attention kernel)
130
+ ├── functional.py # hopfield_core_forward (low-level functional API)
131
+ ├── transformer.py # HopfieldEncoderLayer, HopfieldDecoderLayer
132
+
133
+ ├── diffused_attention.py # DiffusedHopfield ← DAHN entry point
134
+ ├── diffusion.py # DiffusionOperator ABC + 4 concrete strategies
135
+ │ # SimpleDiffusion, IterativeDiffusion,
136
+ │ # SpectralDiffusion, FactoredDiffusion
137
+ ├── dynamics_engine.py # DiffusionConfig, GraphCache, DynamicsEngine,
138
+ │ # EnergyTracker
139
+ ├── attention_operator.py # AttentionOperator (dense / graph-constrained)
140
+
141
+ ├── graph/
142
+ │ ├── build_graph.py # build_similarity_matrix, build_knn_graph
143
+ │ ├── laplacian.py # compute_laplacian, compute_normalized_laplacian
144
+ │ ├── builder.py # GraphBuilder (fluent graph-construction API)
145
+ │ └── laplacian_builder.py # LaplacianBuilder
146
+
147
+ └── auxiliary/
148
+ └── data.py # LookupTableDataset
149
+ ```
150
+
151
+ ---
152
+
153
+ ## Installation
154
+
155
+ ### From PyPI (recommended)
156
+
157
+ ```bash
158
+ pip install difflayers
159
+ ```
160
+
161
+ ### From source
162
+
163
+ ```bash
164
+ git clone https://github.com/Prigoistic/mha-layers.git
165
+ cd mha-layers
166
+ pip install -e .
167
+ ```
168
+
169
+ ### Dependencies
170
+
171
+ | Package | Minimum version |
172
+ |---|---|
173
+ | Python | 3.8 |
174
+ | PyTorch | 1.9.0 |
175
+ | NumPy | 1.20.0 |
176
+ | SciPy | 1.7.0 |
177
+
178
+ For the example notebooks, install the extra requirements:
179
+
180
+ ```bash
181
+ pip install -r examples/requirements.txt
182
+ ```
183
+
184
+ ---
185
+
186
+ ## Quick Start
187
+
188
+ ```python
189
+ import torch
190
+ from difflayers import Hopfield, HopfieldPooling, HopfieldLayer, DiffusedHopfield
191
+
192
+ # ------------------------------------------------------------------
193
+ # 1. Standard Hopfield attention (query x stored-pattern lookup)
194
+ # ------------------------------------------------------------------
195
+ hopfield = Hopfield(input_size=64, num_heads=4, batch_first=True)
196
+
197
+ queries = torch.randn(8, 10, 64) # (batch, query_len, d)
198
+ stored = torch.randn(8, 50, 64) # (batch, memory_size, d)
199
+ projections = torch.randn(8, 50, 64) # (batch, memory_size, d)
200
+
201
+ output = hopfield((stored, queries, projections))
202
+ # output: (8, 10, 64)
203
+
204
+ # ------------------------------------------------------------------
205
+ # 2. Hopfield pooling (sequence -> fixed-size embedding)
206
+ # ------------------------------------------------------------------
207
+ pooling = HopfieldPooling(input_size=64, num_heads=1, batch_first=True)
208
+ sequence = torch.randn(8, 100, 64)
209
+ pooled = pooling(sequence)
210
+ # pooled: (8, 1, 64) — one trained state-pattern queries over the sequence
211
+
212
+ # ------------------------------------------------------------------
213
+ # 3. Hopfield lookup (static trainable memory)
214
+ # ------------------------------------------------------------------
215
+ lookup = HopfieldLayer(input_size=64, num_pattern_repetitions=32)
216
+ query = torch.randn(8, 10, 64)
217
+ result = lookup(query)
218
+ # result: (8, 10, 64)
219
+
220
+ # ------------------------------------------------------------------
221
+ # 4. DiffusedHopfield (graph-diffusion augmented retrieval)
222
+ # ------------------------------------------------------------------
223
+ dh = DiffusedHopfield(
224
+ input_size=64,
225
+ num_heads=4,
226
+ batch_first=True,
227
+ eta=0.1, # diffusion strength eta
228
+ k_neighbors=8, # kNN graph degree
229
+ diffusion_mode="factored", # O(kNd) — fastest
230
+ diffusion_steps=3, # T iterations of diffuse -> attend
231
+ diffuse_key=True, # smooth stored patterns
232
+ diffuse_query=False, # optionally also smooth queries
233
+ )
234
+ output = dh((stored, queries, projections))
235
+ # output: (8, 10, 64) — same shape, sharper retrieval
236
+ ```
237
+
238
+ ---
239
+
240
+ ## Core Modules
241
+
242
+ ### Hopfield
243
+
244
+ The base continuous Hopfield attention layer. A PyTorch-compatible re-implementation of multi-head attention in which the attention weights follow from the Hopfield energy update rule.
245
+
246
+ ```python
247
+ from difflayers import Hopfield
248
+
249
+ hopfield = Hopfield(
250
+ input_size=128, # depth of state (query) patterns
251
+ hidden_size=64, # depth of the association (Hopfield) space
252
+ output_size=128, # depth of the output projection
253
+ num_heads=8, # parallel association heads
254
+ scaling=None, # beta; auto-set to 1/sqrt(head_dim) if None
255
+ update_steps_max=0, # 0 = one synchronous update (default/recommended)
256
+ update_steps_eps=1e-4, # convergence threshold for iterative updates
257
+ normalize_stored_pattern=True, # LayerNorm on keys
258
+ normalize_state_pattern=True, # LayerNorm on queries
259
+ batch_first=True,
260
+ dropout=0.1,
261
+ )
262
+ ```
263
+
264
+ **Key parameters:**
265
+
266
+ | Parameter | Type | Default | Description |
267
+ |---|---|---|---|
268
+ | `input_size` | `int` | `None` | Feature depth of state (query) patterns |
269
+ | `hidden_size` | `int` | `None` | Hopfield association space depth; defaults to `input_size` |
270
+ | `output_size` | `int` | `None` | Output projection depth; defaults to `input_size` |
271
+ | `num_heads` | `int` | `1` | Parallel association heads |
272
+ | `scaling` | `float` | `None` | Inverse temperature beta; `None` => 1/sqrt(d_head) |
273
+ | `update_steps_max` | `int` | `0` | Max synchronous update iterations (`None` = run to convergence) |
274
+ | `batch_first` | `bool` | `True` | Input layout: `(batch, seq, d)` when `True`, `(seq, batch, d)` when `False` |
275
+ | `stored_pattern_as_static` | `bool` | `False` | Treat stored patterns as static: use them as given, without the learned key projection |
276
+ | `disable_out_projection` | `bool` | `False` | Skip the final linear projection (useful for retrieval tasks) |
277
+
278
+ ---
279
+
280
+ ### HopfieldPooling
281
+
282
+ Replaces traditional pooling (mean, max, attention-based) with a Hopfield-energy-based alternative. A single **trainable state pattern** acts as the query and computes softmax weights over the input sequence, producing a fixed-size summary vector regardless of input length.
283
+
284
+ ```python
285
+ from difflayers import HopfieldPooling
286
+
287
+ pooling = HopfieldPooling(
288
+ input_size=128,
289
+ num_heads=4,
290
+ batch_first=True,
291
+ dropout=0.1,
292
+ )
293
+
294
+ # Collapse a variable-length sequence to a single vector
295
+ sequence = torch.randn(8, 100, 128) # (batch=8, seq_len=100, d=128); assumes `import torch`
296
+ pooled = pooling(sequence) # (batch, 1, 128)
297
+ ```
298
+
299
+ Useful anywhere you need **permutation-invariant** sequence summarisation: bag-of-words classification, set encoding, immune repertoire profiling, etc.
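Permutation invariance follows from the pooled vector being a softmax-weighted sum over the sequence. The sketch below illustrates this in plain Python, under simplifying assumptions: the function `hopfield_pool` is hypothetical, the fixed `query` stands in for the trainable state pattern, and the projections, normalisation, and multiple heads that `HopfieldPooling` may apply are left out.

```python
import math
import random

def hopfield_pool(sequence, query, beta=1.0):
    """Pool a list of vectors into one vector with softmax(beta * <query, x>) weights."""
    scores = [beta * sum(q * v for q, v in zip(query, x)) for x in sequence]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * x[j] for e, x in zip(exps, sequence))
            for j in range(len(query))]

random.seed(0)
query = [0.5, -1.0, 0.25]  # stands in for the trainable state pattern
sequence = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]

pooled = hopfield_pool(sequence, query)

# Shuffling the sequence changes the order of the terms in the sum, not the sum itself.
shuffled = sequence[:]
random.shuffle(shuffled)
pooled_shuffled = hopfield_pool(shuffled, query)
```

The two results agree up to floating-point rounding, and the output length depends only on the feature dimension, not on the sequence length.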
300
+
301
+ ---
302
+
303
+ ### HopfieldLayer
304
+
305
+ A trainable, input-independent lookup table. One or more **stored patterns** and their **projections** are learned parameters; given a query, the layer retrieves the most energy-aligned stored vector — acting like a content-addressable memory with learned slots.
306
+
307
+ ```python
308
+ from difflayers import HopfieldLayer
309
+
310
+ lookup = HopfieldLayer(
311
+ input_size=128,
312
+ num_pattern_repetitions=64, # number of learned memory slots
313
+ batch_first=True,
314
+ )
315
+
316
+ query = torch.randn(8, 10, 128) # (batch=8, seq_len=10, d=128); assumes `import torch`
317
+ result = lookup(query) # (batch, seq_len, 128)
318
+ ```
319
+
320
+ This is distinct from `Hopfield` in that the memory contents are **learned parameters**, not runtime inputs — suitable for slot-attention, prototype networks, or any scenario where memory is fixed at training time.
321
+
322
+ ---
323
+
324
+ ### DiffusedHopfield
325
+
326
+ The DAHN module. A full drop-in replacement for `Hopfield` that augments the association with a graph-diffusion pre-processing step. Internally it builds a kNN cosine-similarity graph over the stored patterns, constructs the graph Laplacian, and runs a configurable diffusion-attention loop.
327
+
328
+ ```python
329
+ from difflayers import DiffusedHopfield
330
+
331
+ dh = DiffusedHopfield(
332
+ # --- All standard Hopfield arguments are accepted ---
333
+ input_size=128,
334
+ num_heads=4,
335
+ batch_first=True,
336
+ scaling=1.0,
337
+
338
+ # --- DAHN-specific arguments ---
339
+ eta=0.1, # diffusion strength eta in (0, 0.5)
340
+ k_neighbors=8, # kNN graph degree
341
+ diffusion_mode="factored", # "factored" | "simple" | "iterative" | "spectral"
342
+ diffusion_steps=3, # T (ignored by "simple"; used by iterative/spectral)
343
+ use_normalized_laplacian=True, # symmetric-normalised L (recommended)
344
+ diffuse_key=True, # smooth stored patterns (keys)
345
+ diffuse_query=False, # optionally smooth query patterns too
346
+ use_sparse=False, # sparse adjacency for O(kN) memory
347
+ use_logit_diffusion=False, # also smooth post-softmax attention weights
348
+ logit_eta=None, # eta for logit diffusion; defaults to eta
349
+ adaptive_eta=False, # scale eta by attention entropy at runtime
350
+ cache_graph=True, # reuse graph across forward passes
351
+ energy_stop_tol=0.0, # early-stop on |Delta E| < tol (0 = disabled)
352
+ )
353
+ ```
354
+
355
+ The forward signature is identical to `Hopfield`:
356
+
357
+ ```python
358
+ output = dh((stored_patterns, state_patterns, pattern_projections))
359
+ # or with masking
360
+ output = dh((stored_patterns, state_patterns, pattern_projections),
361
+ stored_pattern_padding_mask=mask)
362
+ ```
363
+
364
+ ---
365
+
366
+ ## Diffusion Modes
367
+
368
+ Four diffusion strategies are available, trading off speed, memory, and smoothing quality:
369
+
370
+ ### `"factored"` *(default — recommended)*
371
+
372
+ ```
373
+ x' = (1 - eta * deg) * x + eta * W @ x
374
+ ```
375
+
376
+ Never forms the full Laplacian matrix. Stores only the sparse adjacency `W` and degree vector `deg`. Each step costs `O(kNd)` in time and `O(kN)` in memory. Best for large N and sparse graphs.
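A small check that the factored form matches an explicit Laplacian step is sketched below in plain Python. It assumes the unnormalised Laplacian `L = D - W` and an adjacency-list layout for `W`. The function names are illustrative and are not the library's `FactoredDiffusion` implementation.

```python
def factored_step(x, neighbors, deg, eta):
    """x' = (1 - eta * deg) * x + eta * W @ x, with W given as an adjacency list."""
    out = []
    for i, row in enumerate(x):
        new_row = [(1.0 - eta * deg[i]) * v for v in row]
        for j, w in neighbors[i]:          # only the k stored neighbours are visited
            for c, v in enumerate(x[j]):
                new_row[c] += eta * w * v
        out.append(new_row)
    return out

def dense_step(x, W, eta):
    """Reference: x' = (I - eta * L) @ x with the dense Laplacian L = D - W."""
    n = len(W)
    deg = [sum(row) for row in W]
    L = [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    return [
        [x[i][c] - eta * sum(L[i][j] * x[j][c] for j in range(n))
         for c in range(len(x[0]))]
        for i in range(n)
    ]

# A 4-node path graph 0 - 1 - 2 - 3 with unit edge weights.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
neighbors = [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in W]
deg = [sum(row) for row in W]
x = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

fast = factored_step(x, neighbors, deg, eta=0.1)
slow = dense_step(x, W, eta=0.1)
```

Both paths give the same result. The factored one touches only the stored edges, which is where the `O(kNd)` cost comes from. For a symmetric `W` the step also preserves each feature column's sum, because the Laplacian's columns sum to zero.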
377
+
378
+ ### `"simple"`
379
+
380
+ ```
381
+ x' = (I - eta * L) @ x
382
+ ```
383
+
384
+ One explicit Euler step of heat diffusion. Forms `D = I - eta*L` once and applies it. Cost: `O(N^2 * d)` per step.
385
+
386
+ ### `"iterative"`
387
+
388
+ ```
389
+ x' = (I - eta * L)^T @ x    # ^T is the T-th matrix power (T = diffusion_steps), not a transpose
390
+ ```
391
+
392
+ Applies the same operator `D` repeatedly for `T` steps (`diffusion_steps`). Provides deeper smoothing at the cost of `T * O(N^2 * d)`. Includes a numerical guard against divergence.
393
+
394
+ ### `"spectral"`
395
+
396
+ ```
397
+ x' = U @ diag(exp(-eta * lambda)) @ U.T @ x
398
+ ```
399
+
400
+ Exact heat-kernel diffusion via eigendecomposition of `L`. Precomputes `U` and `lambda` once (`O(N^3)`), then applies the filter in `O(N^2 d)` per call, since `x` has `d` feature columns. Most accurate smoothing; not suitable for large N.
401
+
402
+ | Mode | Precompute | Per-step | Memory | Best for |
403
+ |---|---|---|---|---|
404
+ | `factored` | O(N^2) build kNN | O(kNd) | O(kN) | Large N, production |
405
+ | `simple` | O(N^2) build D | O(N^2 d) | O(N^2) | Moderate N, one-shot |
406
+ | `iterative` | O(N^2) build D | O(T * N^2 d) | O(N^2) | Deep smoothing |
407
+ | `spectral` | O(N^3) eigen | O(N^2 d) | O(N^2) | Small N, exact kernel |
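The modes can also be compared by their effect on a single Laplacian eigenvalue `lambda`, since each one scales the matching eigen-component of `x` by a scalar gain. The sketch below is plain Python with illustrative function names. It is read off the formulas above, not taken from the library's code.

```python
import math

def simple_filter(lam, eta):
    """Gain of one explicit Euler step (I - eta * L) on an eigenvalue lam of L."""
    return 1.0 - eta * lam

def iterative_filter(lam, eta, steps):
    """Gain of `steps` repeated Euler steps: (1 - eta * lam) ** steps."""
    return (1.0 - eta * lam) ** steps

def spectral_filter(lam, eta):
    """Gain of the heat kernel exp(-eta * L) on the same eigenvalue."""
    return math.exp(-eta * lam)

eta, steps = 0.1, 3
eigenvalues = [0.0, 0.5, 1.0, 2.0]  # the normalised Laplacian spectrum lies in [0, 2]
rows = [
    (lam,
     simple_filter(lam, eta),
     iterative_filter(lam, eta, steps),
     spectral_filter(lam, eta))
    for lam in eigenvalues
]
for lam, s, it, sp in rows:
    print(f"lambda={lam:.1f}  simple={s:.4f}  iterative={it:.4f}  spectral={sp:.4f}")
```

All three leave the `lambda = 0` component untouched and damp high-frequency components more strongly. The Euler gain `1 - eta * lambda` is the first-order approximation of `exp(-eta * lambda)`, and with `eta < 0.5` it stays positive over the whole `[0, 2]` range.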
408
+
409
+ ---
410
+
411
+ ## DiffusionConfig Reference
412
+
413
+ `DiffusionConfig` is a frozen dataclass that bundles all diffusion hyperparameters. You can pass one explicitly to `DiffusedHopfield`, or let the constructor build it from keyword arguments.
414
+
415
+ ```python
416
+ from difflayers import DiffusionConfig
417
+
418
+ cfg = DiffusionConfig(
419
+ eta=0.1,
420
+ beta=1.0,
421
+ steps=3,
422
+ diffusion_mode="factored",
423
+ attention_mode="dense", # "dense" | "graph"
424
+ k_neighbors=5,
425
+ use_normalized_laplacian=True,
426
+ use_sparse=False,
427
+ diffuse_key=True,
428
+ diffuse_query=False,
429
+ use_logit_diffusion=False,
430
+ logit_eta=None,
431
+ adaptive_eta=False,
432
+ adaptive_temperature=5.0,
433
+ adaptive_threshold=1.0,
434
+ cache_graph=True,
435
+ energy_stop_tol=0.0,
436
+ )
437
+ ```
438
+
439
+ | Field | Type | Default | Description |
440
+ |---|---|---|---|
441
+ | `eta` | `float` | `0.1` | Diffusion strength. For normalised L use eta < 0.5 |
442
+ | `beta` | `float` | `1.0` | Hopfield scaling / inverse temperature |
443
+ | `steps` | `int` | `3` | Number of diffuse->attend iterations |
444
+ | `diffusion_mode` | `str` | `"factored"` | One of `"factored"`, `"simple"`, `"iterative"`, `"spectral"` |
445
+ | `attention_mode` | `str` | `"dense"` | `"dense"` (full O(N^2)) or `"graph"` (kNN-constrained O(kN)) |
446
+ | `k_neighbors` | `int` | `5` | Number of nearest neighbours in the similarity graph |
447
+ | `use_normalized_laplacian` | `bool` | `True` | Symmetric-normalised L; eigenvalues in [0, 2] |
448
+ | `use_sparse` | `bool` | `False` | Store adjacency as `sparse_coo` for O(kN) memory |
449
+ | `diffuse_key` | `bool` | `True` | Smooth stored patterns (keys) before attention |
450
+ | `diffuse_query` | `bool` | `False` | Smooth state patterns (queries) before attention |
451
+ | `use_logit_diffusion` | `bool` | `False` | Smooth post-softmax attention weights over the key graph |
452
+ | `logit_eta` | `float\|None` | `None` | Separate eta for logit diffusion; falls back to `eta` |
453
+ | `adaptive_eta` | `bool` | `False` | Scale eta by attention entropy (high-entropy -> more diffusion) |
454
+ | `cache_graph` | `bool` | `True` | Re-use built graph across forward passes |
455
+ | `energy_stop_tol` | `float` | `0.0` | Early-stop if abs(Delta E) < tol per step; 0 disables |
456
+
457
+ ---
458
+
459
+ ## Graph Pipeline
460
+
461
+ The graph pipeline under `difflayers.graph` can be used standalone to build Laplacians for any downstream use:
462
+
463
+ ```python
464
+ import torch
465
+ from difflayers.graph.build_graph import build_similarity_matrix, build_knn_graph
466
+ from difflayers.graph.laplacian import compute_laplacian, compute_normalized_laplacian
467
+ from difflayers.graph.builder import GraphBuilder
468
+
469
+ # --- Manual pipeline ---
470
+ X = torch.randn(100, 64) # 100 patterns, 64-dim
471
+
472
+ S = build_similarity_matrix(X) # (100, 100) cosine similarity
473
+ A = build_knn_graph(S, k=8, as_sparse=False) # (100, 100) symmetric kNN adjacency
474
+ L = compute_normalized_laplacian(A) # (100, 100) symmetric-normalised Laplacian
475
+
476
+ # --- Fluent builder API ---
477
+ graph = (
478
+ GraphBuilder(X)
479
+ .cosine_similarity()
480
+ .knn(k=8, sparse=True)
481
+ .normalized_laplacian()
482
+ .build()
483
+ )
484
+ # graph.L — Laplacian
485
+ # graph.W — adjacency
486
+ # graph.deg — degree vector
487
+ ```
488
+
489
+ **`build_similarity_matrix(X)`**
490
+ Computes pairwise cosine similarities, clamps negatives to zero, and zeros the diagonal (no self-loops). Complexity: O(N^2 d).
491
+
492
+ **`build_knn_graph(S, k, as_sparse)`**
493
+ Sparsifies the similarity matrix by keeping only the top-k neighbours per node, then symmetrises. When `as_sparse=True`, returns `torch.sparse_coo_tensor` for O(kN) downstream products.
494
+
495
+ **`compute_laplacian(A)`**
496
+ Unnormalised Laplacian L = D - A, where D = diag(A * 1). Eigenvalues in [0, 2 d_max], where d_max is the largest node degree.
497
+
498
+ **`compute_normalized_laplacian(A)`**
499
+ Symmetric normalised Laplacian L_sym = D^{-1/2} (D - A) D^{-1/2}. Eigenvalues in [0, 2]. Isolated nodes are handled safely. **Recommended** for diffusion because the fixed eigenvalue bound makes the stable range of eta independent of the input.
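The steps described above can be sketched without any dependencies to make the data flow concrete. This is an illustrative re-implementation under stated assumptions, not the code in `difflayers.graph`. In particular, symmetrising with an element-wise maximum and giving isolated nodes an all-zero row are guesses at reasonable behaviour.

```python
import math

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity, negatives clamped to 0, diagonal zeroed."""
    norms = [math.sqrt(sum(v * v for v in x)) or 1.0 for x in X]
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                cos = sum(a * b for a, b in zip(X[i], X[j])) / (norms[i] * norms[j])
                S[i][j] = max(cos, 0.0)
    return S

def knn_graph(S, k):
    """Keep the top-k entries per row, then symmetrise (assumed: element-wise max)."""
    n = len(S)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        top = sorted(range(n), key=lambda j: S[i][j], reverse=True)[:k]
        for j in top:
            A[i][j] = S[i][j]
    return [[max(A[i][j], A[j][i]) for j in range(n)] for i in range(n)]

def normalized_laplacian(A):
    """L_sym = I - D^{-1/2} A D^{-1/2}; isolated nodes get an all-zero row."""
    n = len(A)
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [
        [(1.0 if i == j and deg[i] > 0 else 0.0) - inv_sqrt[i] * A[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]

# Two tight clusters of 2-D patterns.
X = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
S = cosine_similarity_matrix(X)
A = knn_graph(S, k=2)
L = normalized_laplacian(A)
```

With `k=2` each node links only to the two members of its own cluster, so the adjacency is block-diagonal and diffusion mixes patterns within a cluster but not across clusters.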
500
+
501
+ ---
502
+
503
+ ## Advanced Usage
504
+
505
+ ### Static retrieval (no learned projections)
506
+
507
+ Useful for direct content-addressable memory benchmarks:
508
+
509
+ ```python
510
+ model = DiffusedHopfield(
511
+ input_size=None,
512
+ stored_pattern_as_static=True,
513
+ state_pattern_as_static=True,
514
+ pattern_projection_as_static=True,
515
+ disable_out_projection=True,
516
+ normalize_stored_pattern=False,
517
+ normalize_state_pattern=False,
518
+ normalize_pattern_projection=False,
519
+ normalize_stored_pattern_affine=False,
520
+ normalize_state_pattern_affine=False,
521
+ normalize_pattern_projection_affine=False,
522
+ batch_first=True,
523
+ scaling=4.0,
524
+ eta=0.15,
525
+ k_neighbors=10,
526
+ diffusion_mode="iterative",
527
+ diffusion_steps=5,
528
+ diffuse_key=True,
529
+ )
530
+ ```
531
+
532
+ ### Ablation: diffuse only queries, only keys, or both
533
+
534
+ ```python
535
+ # Only diffuse keys (strongest effect; default)
536
+ dh_k = DiffusedHopfield(input_size=64, diffuse_key=True, diffuse_query=False, eta=0.1)
537
+
538
+ # Only diffuse queries (useful when queries are noisy)
539
+ dh_q = DiffusedHopfield(input_size=64, diffuse_key=False, diffuse_query=True, eta=0.1)
540
+
541
+ # Diffuse both
542
+ dh_both = DiffusedHopfield(input_size=64, diffuse_key=True, diffuse_query=True, eta=0.1)
543
+ ```
544
+
545
+ ### Logit-level diffusion
546
+
547
+ Smooth the post-softmax attention weights over the key graph:
548
+
549
+ ```python
550
+ dh = DiffusedHopfield(
551
+ input_size=64,
552
+ diffuse_key=True,
553
+ use_logit_diffusion=True,
554
+ logit_eta=0.05, # usually smaller than pattern-level eta
555
+ )
556
+ ```
557
+
558
+ ### Adaptive diffusion strength
559
+
560
+ Scale eta automatically by attention entropy — high-entropy (uncertain) distributions receive more smoothing:
561
+
562
+ ```python
563
+ dh = DiffusedHopfield(
564
+ input_size=64,
565
+ adaptive_eta=True,
566
+ eta=0.2, # maximum eta
567
+ adaptive_temperature=5.0,
568
+ adaptive_threshold=1.0, # entropy midpoint for sigmoid gate
569
+ )
570
+ ```
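One plausible reading of these three parameters is a sigmoid gate on the attention entropy. The sketch below is a guess at that form, based on the parameter names and comments above. The exact formula used by `difflayers` is not stated here, so `adaptive_eta` and its gate should be treated as hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def adaptive_eta(weights, eta_max, temperature, threshold):
    """Hypothetical gate: eta_max * sigmoid(temperature * (entropy - threshold))."""
    entropy = attention_entropy(weights)
    gate = 1.0 / (1.0 + math.exp(-temperature * (entropy - threshold)))
    return eta_max * gate

peaked = [0.97, 0.01, 0.01, 0.01]   # confident retrieval, low entropy
uniform = [0.25, 0.25, 0.25, 0.25]  # uncertain retrieval, entropy = log(4)

eta_peaked = adaptive_eta(peaked, eta_max=0.2, temperature=5.0, threshold=1.0)
eta_uniform = adaptive_eta(uniform, eta_max=0.2, temperature=5.0, threshold=1.0)
```

Under this assumed gate, the peaked distribution receives almost no diffusion while the uniform one approaches the maximum `eta`. A larger `adaptive_temperature` makes the transition around the threshold sharper.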
571
+
572
+ ### DynamicsEngine + EnergyTracker (low-level API)
573
+
574
+ ```python
575
+ from difflayers import DiffusionConfig, DynamicsEngine, EnergyTracker, GraphCache
576
+ from difflayers.diffusion import FactoredDiffusion
577
+ from difflayers.attention_operator import AttentionOperator
578
+
579
+ cfg = DiffusionConfig(eta=0.1, steps=5, k_neighbors=8)
580
+
581
+ # Build graph once
582
+ cache = GraphCache(cfg)
583
+ graph = cache.get(patterns) # builds kNN + Laplacian; cached on repeated calls
584
+
585
+ # Build operators
586
+ diffusion_op = FactoredDiffusion(graph.W, graph.deg, cfg.eta)
587
+ attn_op = AttentionOperator(beta=cfg.beta, mode=cfg.attention_mode)
588
+
589
+ # Run the dynamics loop
590
+ engine = DynamicsEngine(diffusion_op, attn_op, cfg)
591
+ tracker = EnergyTracker(enabled=True)
592
+
593
+ Q_out, K_out = engine.run(Q, K, V, tracker=tracker)
594
+
595
+ print(tracker.energies) # list of Hopfield energy per step
596
+ ```
597
+
598
+ ---
599
+
600
+ ## Transformer Integration
601
+
602
+ `difflayers` provides Hopfield-based encoder and decoder layers that slot directly into standard transformer architectures:
603
+
604
+ ```python
605
+ from difflayers import HopfieldEncoderLayer, HopfieldDecoderLayer
606
+ import torch.nn as nn
607
+
608
+ encoder = nn.TransformerEncoder(
609
+ encoder_layer=HopfieldEncoderLayer(
610
+ d_model=512,
611
+ nhead=8,
612
+ dim_feedforward=2048,
613
+ dropout=0.1,
614
+ batch_first=True,
615
+ ),
616
+ num_layers=6,
617
+ )
618
+
619
+ decoder = nn.TransformerDecoder(
620
+ decoder_layer=HopfieldDecoderLayer(
621
+ d_model=512,
622
+ nhead=8,
623
+ dim_feedforward=2048,
624
+ dropout=0.1,
625
+ batch_first=True,
626
+ ),
627
+ num_layers=6,
628
+ )
629
+ ```
630
+
631
+ `HopfieldEncoderLayer` and `HopfieldDecoderLayer` are direct drop-in replacements for PyTorch's built-in transformer layers, with the attention kernel replaced by the Hopfield update rule.
632
+
633
+ ---
634
+
635
+ ## Example Notebooks
636
+
637
+ The [examples/](examples/) directory contains three fully worked demonstrations. Install dependencies first:
638
+
639
+ ```bash
640
+ pip install -r examples/requirements.txt
641
+ ```
642
+
643
+ ### [Bit Pattern Set](examples/bit_pattern/bit_pattern_demo.ipynb)
644
+
645
+ A binary classification task in the Multiple Instance Learning (MIL) setting. Each bag contains bit-pattern instances (sequences of 0s and 1s); positive bags have specific class-defining patterns injected that are absent in negative bags. The notebook shows that `Hopfield`, `HopfieldPooling`, and `HopfieldLayer` all learn to filter bags for the discriminative patterns with high accuracy, even as bag size and noise increase.
646
+
647
+ ### [Latch Sequence Set](examples/latch_sequence/latch_sequence_demo.ipynb)
648
+
649
+ A long-term dependency task. A sequence begins with symbol **A** or **B**; after a variable delay, the model must output the corresponding symbol. The Hopfield layer concentrates attention sharply on the first position of the sequence, capturing the dependency without positional encoding.
650
+
651
+ ### [Attention-based Deep MIL (MNIST Bags)](examples/mnist_bags/mnist_bags_demo.ipynb)
652
+
653
+ A canonical MIL benchmark from [Ilse & Tomczak (2018)](https://arxiv.org/abs/1802.04712). Each bag is a collection of 28x28 MNIST images; a bag is positive if it contains a target digit, negative otherwise. The notebook benchmarks Hopfield-based pooling against classic attention-MIL and demonstrates strong accuracy even with large bag sizes.
654
+
655
+ ---
656
+
657
+ ## Running Experiments
658
+
659
+ All experiments are in [src/experiments/](src/experiments/) and write results to [results/](results/).
660
+
661
+ ```bash
662
+ # Full ablation study (diffuse Q only / K only / both vs. none)
663
+ python -m src.experiments.ablation
664
+
665
+ # Benchmark diffusion modes (factored, simple, iterative, spectral)
666
+ python -m src.experiments.benchmark
667
+
668
+ # Noise robustness sweep
669
+ python -m src.experiments.noise_robustness
670
+
671
+ # Steps sweep (T = 1 ... 10)
672
+ python -m src.experiments.steps_sweep
673
+
674
+ # Mode comparison (standard Hopfield vs. DiffusedHopfield)
675
+ python -m src.experiments.mode_comparison
676
+
677
+ # Logit vs. feature-level diffusion comparison
678
+ python -m src.experiments.logit_vs_feature
679
+
680
+ # Attention head analysis
681
+ python -m src.experiments.attention_analysis
682
+ ```
683
+
684
+ ---
685
+
686
+ ## API Reference
687
+
688
+ All public names exported from `difflayers`:
689
+
690
+ | Name | Type | Description |
+ |---|---|---|
+ | `Hopfield` | `nn.Module` | Base continuous Hopfield attention layer |
+ | `HopfieldPooling` | `nn.Module` | Hopfield-based pooling with a trainable query |
+ | `HopfieldLayer` | `nn.Module` | Trainable static-memory lookup layer |
+ | `HopfieldCore` | `nn.Module` | Low-level multi-head Hopfield kernel |
+ | `DiffusedHopfield` | `nn.Module` | DAHN: graph-diffusion augmented Hopfield |
+ | `HopfieldEncoderLayer` | `nn.Module` | Transformer encoder layer with Hopfield attention |
+ | `HopfieldDecoderLayer` | `nn.Module` | Transformer decoder layer with Hopfield attention |
+ | `DiffusionOperator` | `ABC` | Abstract base for diffusion strategies |
+ | `SimpleDiffusion` | `DiffusionOperator` | One-step explicit Euler diffusion |
+ | `IterativeDiffusion` | `DiffusionOperator` | T-step iterative diffusion |
+ | `SpectralDiffusion` | `DiffusionOperator` | Exact heat-kernel via eigendecomposition |
+ | `FactoredDiffusion` | `DiffusionOperator` | Laplacian-free O(kNd) factored form |
+ | `apply_diffusion` | `function` | Functional API for a single diffusion call |
+ | `DiffusionConfig` | `dataclass` | Unified serialisable config for DAHN |
+ | `GraphCache` | `class` | Builds and caches the kNN graph + Laplacian |
+ | `DynamicsEngine` | `class` | Orchestrates the diffuse->attend loop |
+ | `EnergyTracker` | `class` | Per-step Hopfield energy logging + early-stop |
+ | `GraphBuilder` | `class` | Fluent graph-construction API |
710
+
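The "one-step explicit Euler" and "T-step iterative" descriptions in the table refer to a standard discretisation of the graph heat equation dx/dt = -L x. The sketch below shows that update on a toy path graph in plain Python. It does not use the `difflayers` classes or their signatures; the step size `alpha`, the path graph, and the scalar node features are assumptions chosen to keep the example small.

```python
def path_laplacian(n):
    """Unnormalised Laplacian L = D - A of a path graph with n nodes."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def euler_step(lap, x, alpha):
    """One explicit Euler step: x <- x - alpha * L x."""
    n = len(x)
    return [x[i] - alpha * sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]


def iterate(lap, x, alpha, steps):
    """Apply `steps` Euler steps in sequence (the T-step iterative variant)."""
    for _ in range(steps):
        x = euler_step(lap, x, alpha)
    return x


lap = path_laplacian(5)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # all mass on the first node
x1 = euler_step(lap, x0, alpha=0.25)
x5 = iterate(lap, x0, alpha=0.25, steps=5)
print(x1)
print([round(v, 4) for v in x5])
```

Each step moves mass to neighbouring nodes while preserving the total, so repeated steps smooth the signal over the graph. Explicit Euler is only stable when `alpha` is small relative to the largest Laplacian eigenvalue; the exact heat kernel avoids that restriction at the cost of an eigendecomposition.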
711
+ ---
712
+
713
+ ## Complexity Guide
714
+
715
+ | Operation | Time | Memory | Notes |
+ |---|---|---|---|
+ | Build similarity matrix | O(N^2 d) | O(N^2) | `build_similarity_matrix` |
+ | Build kNN graph (dense) | O(N^2) | O(N^2) | `build_knn_graph` |
+ | Build kNN graph (sparse) | O(N^2) | O(kN) | `as_sparse=True` |
+ | Laplacian (dense) | O(N^2) | O(N^2) | |
+ | `FactoredDiffusion` step | O(kNd) | O(kN) | Recommended for large N |
+ | `SimpleDiffusion` step | O(N^2 d) | O(N^2) | |
+ | `IterativeDiffusion` T steps | O(T N^2 d) | O(N^2) | |
+ | `SpectralDiffusion` precompute | O(N^3) | O(N^2) | Eigendecomposition |
+ | `SpectralDiffusion` apply | O(N^2 d) | O(N^2) | Per forward pass |
+ | Dense Hopfield attention | O(N^2 d) | O(N^2) | `attention_mode="dense"` |
+ | Graph-constrained attention | O(kNd) | O(kN) | `attention_mode="graph"` |
+ | Full DAHN (factored + dense) | O(T kNd + N^2 d) | O(N^2) | Typical configuration |
+ | Full DAHN (factored + graph) | O(T kNd) | O(kN) | Fully sparse end-to-end |
730
+
731
+ N = number of patterns, d = feature dimension, k = kNN degree, T = diffusion steps.
732
+
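As a rough worked example of the two "Full DAHN" rows, the helper below plugs concrete sizes into the leading-order terms. It counts multiply-adds only up to constant factors, so the numbers indicate scaling rather than measured runtime; `dahn_cost` and the chosen sizes are illustrative assumptions.

```python
def dahn_cost(n, d, k, steps, attention_mode):
    """Leading-order operation count: factored diffusion plus one attention pass."""
    diffusion = steps * k * n * d  # T factored diffusion steps, O(kNd) each
    if attention_mode == "dense":
        attention = n * n * d  # all-pairs attention
    elif attention_mode == "graph":
        attention = k * n * d  # attention restricted to kNN edges
    else:
        raise ValueError(f"unknown attention_mode: {attention_mode!r}")
    return diffusion + attention


n, d, k, steps = 10_000, 64, 16, 4
dense = dahn_cost(n, d, k, steps, "dense")
graph = dahn_cost(n, d, k, steps, "graph")
print(f"dense: {dense:,}")
print(f"graph: {graph:,}")
print(f"ratio: {dense / graph:.1f}")
```

With these sizes the dense configuration comes to 6,440,960,000 operations against 51,200,000 for the graph-constrained one, a factor of about 126, and the N^2 d attention term accounts for almost all of the dense cost. The memory gap is similar in kind: N^2 = 100,000,000 attention entries versus kN = 160,000 edges.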
733
+ ---
734
+
735
+ ## Background Paper
736
+
737
+ The Hopfield attention foundation is described in:
738
+
739
+ > **Hopfield Networks is All You Need**
740
+ > Hubert Ramsauer, Bernhard Schaefl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber,
741
+ > Markus Holzleitner, Milena Pavlovic, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp,
742
+ > Gunter Klambauer, Johannes Brandstetter, Sepp Hochreiter
743
+ > *ICLR 2021* — [arxiv.org/abs/2008.02217](https://arxiv.org/abs/2008.02217)
744
+
745
+ A detailed companion blog post covering the theoretical background is available at
746
+ [ml-jku.github.io/hopfield-layers](https://ml-jku.github.io/hopfield-layers/).
747
+
748
+ ---
749
+
750
+ ## Disclaimer
751
+
752
+ Parts of this implementation are based on [PyTorch v1.6.0](https://github.com/pytorch/pytorch/tree/v1.6.0) and extended for the Hopfield/DAHN setting:
753
+
754
+ | Module | Based on |
+ |---|---|
+ | [`difflayers/activation.py`: `HopfieldCore`](difflayers/activation.py) | [`torch.nn.MultiheadAttention`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/activation.py#L771) |
+ | [`difflayers/functional.py`: `hopfield_core_forward`](difflayers/functional.py) | [`torch.nn.functional.multi_head_attention_forward`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/functional.py#L3854) |
+ | [`difflayers/transformer.py`: `HopfieldEncoderLayer`](difflayers/transformer.py) | [`torch.nn.TransformerEncoderLayer`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/transformer.py#L241) |
+ | [`difflayers/transformer.py`: `HopfieldDecoderLayer`](difflayers/transformer.py) | [`torch.nn.TransformerDecoderLayer`](https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/nn/modules/transformer.py#L303) |
760
+
761
+ ---
762
+
763
+ ## License
764
+
765
+ BSD-style license — see [LICENSE](LICENSE).