hicache-pp 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Krishi Attri
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,299 @@
1
+ Metadata-Version: 2.4
2
+ Name: hicache-pp
3
+ Version: 0.1.0
4
+ Summary: Training-free diffusion inference acceleration via exponential (DMD/Prony) velocity forecasting
5
+ Author: Krishi Attri
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/Archerkattri/hicache-plus-plus
8
+ Project-URL: Repository, https://github.com/Archerkattri/hicache-plus-plus
9
+ Keywords: diffusion,flow-matching,inference-acceleration,feature-caching,dmd,prony,hicache,taylorseer
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Science/Research
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Operating System :: OS Independent
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
+ Requires-Python: >=3.9
21
+ Description-Content-Type: text/markdown
22
+ License-File: LICENSE
23
+ Requires-Dist: torch
24
+ Dynamic: license-file
25
+
26
+ <div align="center">
27
+
28
+ # HiCache++
29
+
30
+ **Training-free diffusion inference acceleration by *exponential* velocity forecasting.**
31
+
32
+ *A drop-in upgrade to TaylorSeer / HiCache: replace the polynomial feature-cache basis with
33
+ a Dynamic-Mode-Decomposition (Prony) **exponential** basis — exact on the class diffusion
34
+ features actually live in, so it stays lossless at larger skip intervals than the polynomial.*
35
+
36
+ ![training&#8209;free](https://img.shields.io/badge/training--free-%E2%9C%93-2e8f5c)
37
+ &nbsp;![PyTorch](https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white)
38
+ &nbsp;![CPU tests](https://img.shields.io/badge/CPU%20tests-passing-2e8f5c)
39
+ &nbsp;![license MIT](https://img.shields.io/badge/license-MIT-2e6db0)
40
+ &nbsp;![python](https://img.shields.io/badge/python-%E2%89%A53.9-3776ab)
41
+
42
+ </div>
43
+
44
+ ## When to use this repo
45
+
46
+ These repos are **complementary accelerators, not competing solutions** — each speeds up a *different*
47
+ base generator, and the `+` / `++` suffix is a **method choice**, not a rival product. Pick by
48
+ **(1) which base model you run**, then **(2) which forecast basis you want**:
49
+
50
+ | base generator | `+` = HiCache (Hermite) | `++` = HiCache++ (DMD) |
51
+ |---|---|---|
52
+ | Hunyuan3D-2.1 | `hunyuan2.1-plus` | `hunyuan2.1-plus-plus` |
53
+ | Hunyuan3D-2 mini | `hunyuan2-plus` | `hunyuan2-plus-plus` |
54
+ | SAM 3D Objects | `sam3d-plus` | `sam3d-plus-plus` |
55
+ | Fast-SAM3D | `fastsam3d-plus` | `fastsam3d-plus-plus` |
56
+ | DiT-XL/2 (ImageNet) | `dit-plus` | `dit-plus-plus` |
57
+ | TRELLIS (v1) | `faster-trellis` | `faster-trellis-plus-plus` |
58
+ | TRELLIS.2-4B (v2) | `hermit-trellis2` | `hermit-trellis2-plus-plus` |
59
+
60
+ - **`+` (HiCache / scaled-Hermite):** the *published* polynomial velocity-forecast basis — conservative, reproduces the HiCache paper. Use it to deploy the established method.
61
+ - **`++` (HiCache++ / DMD exponential):** our Dynamic-Mode-Decomposition basis — *the same near-lossless quality at wider skip intervals*, where the polynomial diverges. Use it when you push the cache interval for more speed.
62
+ - **standalone / model-agnostic:** [`hicache-plus-plus`](https://github.com/Archerkattri/hicache-plus-plus) — the forecaster itself, to add DMD caching to *your own* diffusion/flow model.
63
+ - **`fast-trellis2`** = the TaylorSeer baseline fork (the upstream "Fast" accel) — the v2 reference point, not a HiCache variant.
64
+
65
+ > **This repo:** `hicache-plus-plus` — the **standalone HiCache++ forecaster** (DMD/Prony exponential velocity cache) + the Hermite baseline — model-agnostic; the per-model integrations are the sibling repos above.
66
+
67
+ ---
68
+
69
+ ## TL;DR
70
+
71
+ On a flow-matching / diffusion denoise loop you can skip the network on most steps and
72
+ *forecast* the velocity from cached anchors. The state of the art (TaylorSeer, HiCache)
73
+ forecasts with a **polynomial** basis (monomial / scaled-Hermite). But a diffusion feature
74
+ trajectory is the solution of a near-linear feature-ODE whose **exact** solution class is a
75
+ sum of (damped/oscillatory) **exponentials** — not polynomials. Polynomials diverge under
76
+ extrapolation, which is exactly why every polynomial cache caps out at a modest skip.
77
+
78
+ **HiCache++** forecasts with **Dynamic Mode Decomposition** (Schmid 2010) — the
79
+ SVD-regularised generalisation of **Prony's method** (1795): identify the linear propagator
80
+ `A` from raw velocity snapshots (`F_{t+1} ≈ A F_t`), eigendecompose it once, and predict any
81
+ (fractional) horizon `k` by eigenvalue powers:
82
+
83
+ ```
84
+ F_{t+k} ≈ Φ (λ**k ⊙ b), b = Φ⁺ F_t
85
+ ```
86
+
87
+ It is **exact on exponential trajectories** (the solution class) — the property polynomials
88
+ lack — so it holds quality at skip intervals where Hermite/Taylor drift.
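A minimal scalar sketch of the eigenvalue-power forecast above, assuming a single real-valued signal with two poles (classical Prony rather than the SVD-regularised vector DMD the package uses). The helper names `prony2_fit` and `prony2_eval` are hypothetical and are not part of `hicache_pp`; the point is only that four uniform samples pin down a conjugate pole pair, after which any fractional horizon follows from pole powers.

```python
import cmath
import math


def prony2_fit(y):
    """Fit y_t = a1*l1**t + a2*l2**t to four uniform samples y[0..3]."""
    y0, y1, y2, y3 = y
    # Linear prediction: y[t+2] = c1*y[t+1] + c0*y[t], solved by Cramer's rule.
    det = y1 * y1 - y0 * y2
    if abs(det) < 1e-14:
        raise ValueError("samples do not determine two poles")
    c1 = (y1 * y2 - y0 * y3) / det
    c0 = (y1 * y3 - y2 * y2) / det
    # The poles are the roots of z**2 - c1*z - c0 = 0.
    root = cmath.sqrt(c1 * c1 + 4.0 * c0)
    l1, l2 = (c1 + root) / 2.0, (c1 - root) / 2.0
    # Amplitudes from y0 = a1 + a2 and y1 = a1*l1 + a2*l2
    # (assumes distinct poles; a repeated pole would divide by zero here).
    a1 = (y1 - y0 * l2) / (l1 - l2)
    a2 = y0 - a1
    return (l1, l2), (a1, a2)


def prony2_eval(poles, amps, t):
    """Evaluate the fitted model at a (possibly fractional) time t."""
    return sum(a * l ** t for a, l in zip(amps, poles)).real


# Damped oscillation sampled at t = 0, 1, 2, 3 (illustrative values).
samples = [0.9 ** t * math.cos(0.7 * t) for t in range(4)]
poles, amps = prony2_fit(samples)
forecast = prony2_eval(poles, amps, 5.5)          # 2.5 steps past the window
truth = 0.9 ** 5.5 * math.cos(0.7 * 5.5)
print(abs(forecast - truth))
```

The forecast at the fractional time 5.5 agrees with the closed form up to floating-point round-off, because the test signal lies inside the two-exponential class. Degenerate samples or noisy data would need the regularisation this sketch omits.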
89
+
90
+ > **Headline:** on Hunyuan3D-2.1, as the skip interval grows the polynomial (Hermite) decays
91
+ > fast — 0.88 → 0.74 → 0.38 at interval 3 / 5 / 6 — while the exponential (DMD) holds: 0.85 →
92
+ > 0.86 → 0.62 (baseline 0.91). **DMD's lead grows with the skip — +0.13 at i5, +0.24 at i6** —
93
+ > the exponential basis is what extends the lossless skip range.
94
+
95
+ ---
96
+
97
+ ## How it compares
98
+
99
+ Every modern feature cache skips the network on most steps and *forecasts* the velocity;
100
+ they differ in the **basis** used to extrapolate. The basis is what sets the skip ceiling,
101
+ because a diffusion feature trajectory is (locally) a sum of exponentials, not a polynomial:
102
+
103
+ | Method | Forecast basis | Exact on the feature-ODE class | Extrapolation | Max lossless skip\* |
104
+ |---|---|:--:|:--:|:--:|
105
+ | TaylorSeer | monomial (Taylor) | ✗ | diverges | small |
106
+ | **HiCache** | scaled-Hermite | ✗ | drifts | interval&#8209;3 |
107
+ | FoCa · Padé · Chebyshev | rational / orthogonal poly | ✗ | drifts | small–moderate |
108
+ | **HiCache++** _(this work)_ | **exponential (DMD / Prony)** | **✓ exact** | **bounded, correct asymptotics** | **interval&#8209;5–6** |
109
+
110
+ <sub>\*measured on Hunyuan3D-2.1 / SAM3D-slat (see Results). A polynomial basis is only a
111
+ local truncation of the exponential, so it is accurate for a tiny skip and diverges as the
112
+ horizon grows; the exponential basis *is* the exact solution class, so it stays lossless
113
+ further out — and DMD admits *fractional* horizons, so it forecasts sub-steps between
114
+ compute steps exactly.</sub>
115
+
116
+ ---
117
+
118
+ ## Why exponentials (the math)
119
+
120
+ A diffusion/flow-matching sampler integrates `dx/dt = v_θ(x, t)`. Across timesteps the
121
+ cached feature `F_t` (the CFG-combined velocity) evolves under a slowly-varying, near-linear
122
+ operator. The exact solution of a linear ODE `Ḟ = M F` is `F_t = Σ_j a_j e^{μ_j t}` — a sum
123
+ of exponentials with poles `μ_j` (damped if `Re μ_j < 0`, oscillatory if `Im μ_j ≠ 0`).
124
+
125
+ - **Polynomial basis** (Taylor monomials, Hermite): a *local* Taylor truncation of that
126
+ exponential. Accurate for a tiny skip, **diverges** as the horizon grows → modest skip cap.
127
+ - **Exponential basis** (DMD / Prony): the *exact* function class. Fit the poles `λ_j = e^{μ_j Δ}`
128
+ from snapshots and extrapolate with bounded, correct asymptotics.
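A small numerical illustration of the two bullets above, assuming a single decaying exponential sampled at four uniform steps. The cubic Lagrange extrapolant stands in for a polynomial cache (it is not TaylorSeer's or HiCache's actual forecaster), and the geometric-ratio forecast stands in for a one-pole exponential fit; the decay rate and horizons are illustrative choices.

```python
import math


def lagrange_extrapolate(ts, ys, t):
    """Evaluate the polynomial through (ts, ys) at time t."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(ts, ys)):
        weight = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += yi * weight
    return total


decay = 0.3                                   # assumed pole: mu = -0.3
ts = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(-decay * t) for t in ts]
ratio = ys[1] / ys[0]                         # lambda = e^{mu * Delta}, Delta = 1

poly_err, exp_err = [], []
for horizon in (1, 2, 4, 6, 8):
    t = ts[-1] + horizon
    truth = math.exp(-decay * t)
    poly = lagrange_extrapolate(ts, ys, t)    # polynomial basis
    expo = ys[-1] * ratio ** horizon          # exponential basis
    poly_err.append(abs(poly - truth) / abs(truth))
    exp_err.append(abs(expo - truth) / abs(truth))

for h, p, e in zip((1, 2, 4, 6, 8), poly_err, exp_err):
    print(f"H={h}: polynomial {p:.1e}, exponential {e:.1e}")
```

On this toy signal the polynomial's relative error grows with every extra step of horizon, while the exponential forecast stays at round-off level. The absolute numbers are specific to this sketch and are not the microbenchmark figures reported under Results.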
129
+
130
+ **The ≥4-snapshot floor.** A *real-valued* trajectory spends **two** real degrees of freedom
131
+ on every **complex** pole (a conjugate pair `r e^{±iω}` → `r^t cos ωt, r^t sin ωt`). So even a
132
+ single oscillatory mode needs rank 3 to identify, i.e. **3 snapshot-pairs = 4 snapshots**. With
133
+ only 2 pairs the fit aliases (empirically ~2e-1 error vs ~5e-9 at 3 pairs). Below the floor (or
134
+ across a non-uniform window) HiCache++ falls back to the Hermite forecast for warm-up.
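The fallback rule can be pictured as a small guard. This is a hypothetical helper (`choose_backend` is not a `hicache_pp` function), assuming the cached window is described by the integer step indices at which the network was actually run.

```python
def choose_backend(anchor_steps, min_snapshots=4):
    """Return "dmd" when the cached window can support DMD, else "hermite".

    anchor_steps: step indices of the cached compute steps, oldest first.
    """
    if len(anchor_steps) < min_snapshots:
        return "hermite"                      # below the snapshot floor
    gaps = {b - a for a, b in zip(anchor_steps, anchor_steps[1:])}
    if len(gaps) != 1:
        return "hermite"                      # non-uniform window
    return "dmd"


print(choose_backend([0, 5, 10]))             # too few snapshots
print(choose_backend([0, 5, 10, 15]))         # uniform window of four
print(choose_backend([0, 1, 2, 3, 8]))        # warm-up steps mixed with a skip
```

The third call shows why a warm-up phase followed by a wider skip can still trigger the fallback: the window is long enough but its spacing is not uniform.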
135
+
136
+ ---
137
+
138
+ ## Results (A/B, geometry-preserving)
139
+
140
+ All accelerators are *training-free and geometry-preserving*; the right A/B is **how far the
141
+ output drifts from the uncached/baseline geometry vs how much faster it runs**.
142
+
143
+ ### Mechanism — controlled, no model
144
+
145
+ Forecasting `H` steps past an 8-step cached window on synthetic trajectories from the exact
146
+ feature-ODE class — three forecast bases, rel. L2 error (↓):
147
+
148
+ | basis | H=1 | H=2 | H=4 | H=6 | H=8 |
149
+ |---|---:|---:|---:|---:|---:|
150
+ | TaylorSeer (polynomial) | 1.5e-2 | 8.0e-2 | 6.2e-1 | 2.3e0 | 6.5e0 |
151
+ | Padé / FoCa (rational) | 4.9e-2 | 1.1e-1 | 2.4e-1 | 5.3e-1 | 1.2e0 |
152
+ | **HiCache++ (exponential)** | **4.7e-9** | **1.4e-8** | **5.3e-8** | **1.2e-7** | **2.2e-7** |
153
+
154
+ The exponential basis is **exact** to numerical precision (4.7e-9 at `H=1`, rising only to 2.2e-7 at `H=8`); the polynomial **diverges**, and the
155
+ rational (Padé / FoCa) improves on it but still diverges, leaving both roughly 6 to 8 orders of magnitude behind DMD,
156
+ and under noise the rational basis turns fragile (Froissart poles). That gap *is* the skip ceiling.
157
+ Reproduce: `python benchmarks/forecast_microbench.py`.
158
+
159
+ ### Hunyuan3D-2.1 (flat DiT velocities) — Toys4K F-score@0.05
160
+
161
+ Excludes `ball_000` (a sphere — Go-ICP alignment is rotationally degenerate on it; two runs
162
+ otherwise agree to ±0.01). Speedup is solo / uncontended.
163
+
164
+ | interval | Hermite (HiCache) | **DMD (HiCache++)** | speedup |
165
+ |---:|---:|---:|---:|
166
+ | baseline (uncached) | 0.911 | 0.911 | 1.00× |
167
+ | i3 | **0.876** | 0.852 | 1.72× |
168
+ | i4 | 0.776 | **0.827** | 1.80× |
169
+ | **i5** | 0.735 | **0.860** | 1.79× |
170
+ | i6 | 0.375 | **0.616** | ~2.0× |
171
+
172
+ DMD degrades *gracefully* where Hermite collapses, and its lead grows with the interval. On the
173
+ **deployed Hunyuan3D-2-mini**, DMD is **exactly lossless at i5** (0.794 = baseline 0.794).
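The source does not spell out how F-score@0.05 is computed. The sketch below assumes the common point-cloud definition (precision and recall of points lying within a distance threshold of the other set, combined harmonically) and assumes that alignment and normalisation, such as the Go-ICP step mentioned above, have already happened. `f_score` is a hypothetical helper, not the evaluation script behind the table.

```python
import math


def f_score(pred, ref, tau=0.05):
    """Harmonic mean of precision and recall at distance threshold tau."""

    def frac_within(src, dst):
        # Fraction of src points whose nearest dst point is within tau.
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tau)
        return hits / len(src)

    precision = frac_within(pred, ref)
    recall = frac_within(ref, pred)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy point sets: one predicted point sits 0.2 away from its reference point.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pred = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (0.0, 1.0, 0.2), (0.0, 0.0, 1.0)]
score = f_score(pred, ref)
print(score)
```

In the toy case three of four points match in each direction, so precision, recall and the F-score are all 0.75. The brute-force nearest-neighbour search is only suitable for small sets; a real evaluation would use a spatial index.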
174
+
175
+ ### SAM3D (PyTree velocities, slat FlowMatching) — real weights, F1 vs baseline
176
+
177
+ | config | speedup | CD_vs_base | F1_vs_base |
178
+ |---|---:|---:|---:|
179
+ | vanilla | 1.00× | 0.000 | **1.000** |
180
+ | HiCache i3 | 1.44× | 0.013 | **1.000** |
181
+ | DMD i5 | 1.47× | 0.013 | **1.000** |
182
+ | **DMD i6** | **1.56×** | 0.013 | **1.000** |
183
+
184
+ Both are geometry-lossless (F1=1.000); **DMD stays lossless to interval-6**, where it gives the
185
+ best speedup — past Hermite's lossless i3.
186
+
187
+ ### Fast-SAM3D (SS-stage TaylorSeer)
188
+ Hermite ≈ Taylor (a wash): both run the same stride-3 schedule, so the basis swap doesn't
189
+ change latency — TaylorSeer caching (the default) is what gives the ~3×, not the basis.
190
+
191
+ ### TRELLIS v1 (sparse-structure stage) — Toys4K F-score@0.05, n=31
192
+ Swapping *only* the SS forecast basis Hermite→DMD in `faster-trellis` (same carved-hybrid schedule):
193
+
194
+ | variant | F@0.05 | speedup | vs vanilla |
195
+ |---|---:|---:|---:|
196
+ | vanilla (uncached) | 0.839 | 1.00× | — |
197
+ | HiCache (Hermite) | 0.825 | 2.82× | −0.014 |
198
+ | **HiCache++ (DMD)** | **0.829** | **2.76×** | **−0.010** |
199
+
200
+ At the deployed ~interval-3 (2.8×), DMD is the most lossless accelerator (beats Hermite by about +0.004, per the table above,
201
+ at matched speed); the margin widens at higher intervals. The same holds on **TRELLIS.2-4B (v2)** —
202
+ DMD ties Hermite at the deployed interval and pulls **+0.03–0.04 F-score ahead at intervals 3–4**
203
+ (see [`hermit-trellis2-plus-plus`](https://github.com/Archerkattri/hermit-trellis2-plus-plus#results)).
204
+ *(The DiT-XL/2 ImageNet FID-vs-latency table is still in progress.)*
205
+
206
+ ---
207
+
208
+ ## Install / use
209
+
210
+ ```python
211
+ import torch
212
+ from hicache_pp import hicache_init, hicache_decide, hicache_update_derivatives, hicache_forecast
213
+ from hicache_pp import dmd_update_snapshots, dmd_forecast_state # the exponential forecaster
214
+
215
+ # in your denoise loop (flat tensor velocities):
216
+ state = hicache_init(num_steps=N, interval=5, first_enhance=4, backend="dmd", history=6)
217
+ for i, t in enumerate(timesteps):
218
+ if hicache_decide(state) == "forecast":
219
+ v = dmd_forecast_state(state) # skip the network — forecast the velocity
220
+ state["step"] += 1
221
+ else:
222
+ v = model(x, t, ...) # the expensive forward
223
+ hicache_update_derivatives(state, v.detach())
224
+ dmd_update_snapshots(state, v.detach(), state["history"])
225
+ state["step"] += 1
226
+ x = scheduler.step(v, t, x)
227
+ ```
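The exact schedule inside `hicache_decide` is not shown here. As a rough illustration of how `interval` and `first_enhance` could translate into network calls, the sketch below assumes that the first `first_enhance` steps always run the network and that every `interval`-th step after that does too. `compute_steps` is a hypothetical helper written for this sketch, and the step count of 50 is an illustrative value.

```python
def compute_steps(num_steps, interval, first_enhance):
    """Step indices that run the network under the assumed schedule."""
    steps = []
    for i in range(num_steps):
        warm_up = i < first_enhance
        on_grid = i >= first_enhance and (i - first_enhance) % interval == 0
        if warm_up or on_grid:
            steps.append(i)
    return steps


steps = compute_steps(num_steps=50, interval=5, first_enhance=4)
call_reduction = 50 / len(steps)
print(len(steps), round(call_reduction, 2))
```

Under this assumed schedule, 50 steps at interval 5 need 14 forward passes, so network calls drop by about 3.6×. Measured wall-clock speedups also depend on costs outside the cached network, so they need not match this ratio.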
228
+
229
+ For **PyTree / structured** velocities (e.g. SAM3D), use `hicache_pp.tree` — the same API but
230
+ tree-aware (`hicache_forecast_tree`, `dmd_forecast_tree`, plus tree Adaptive-CFG).
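To illustrate what "tree-aware" means, the sketch below applies a leaf-wise forecast over nested dicts, lists and tuples. It assumes plain Python floats as leaves and a toy linear extrapolation; `tree_map` here is a hypothetical stand-in, not the `hicache_pp.tree` implementation, and the `shape` / `pose` keys are made up for the example.

```python
def tree_map(fn, *trees):
    """Apply fn leaf-wise across trees that share one nested structure."""
    first = trees[0]
    if isinstance(first, dict):
        return {k: tree_map(fn, *(t[k] for t in trees)) for k in first}
    if isinstance(first, (list, tuple)):
        return type(first)(tree_map(fn, *leaves) for leaves in zip(*trees))
    return fn(*trees)                         # leaf: apply the forecast rule


# Two cached velocity "trees" from consecutive compute steps.
prev = {"shape": [1.0, 2.0], "pose": (0.5,)}
curr = {"shape": [1.5, 2.5], "pose": (0.25,)}

# Toy forecast: continue the last difference by one step on every leaf.
forecast = tree_map(lambda a, b: b + (b - a), prev, curr)
print(forecast)
```

Any per-leaf forecaster can be dropped in place of the lambda, which is the design idea behind a tree-aware API: the structure is traversed once and the numerical rule stays unaware of it.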
231
+
232
+ See [`integrations/`](integrations/) for the exact wiring into Hunyuan3D-2.1, Hunyuan3D-2-mini,
233
+ SAM3D and Fast-SAM3D, [`benchmarks/`](benchmarks/) for the controlled forecast microbenchmark,
234
+ and [`results/`](results/) for the full tables.
235
+
236
+ ---
237
+
238
+ ## Tests
239
+
240
+ ```bash
241
+ python -m hicache_pp.hermite # Hermite basis + schedule (CPU, no GPU/model)
242
+ python -m hicache_pp.dmd # DMD exact-on-exponential + ≥4-snapshot floor
243
+ python -m hicache_pp.tree # tree-aware Hermite + DMD + Adaptive-CFG
244
+ python tests/run_tests.py # all of the above
245
+ ```
246
+
247
+ ---
248
+
249
+ ## Lineage & attribution
250
+
251
+ - **TaylorSeer** — feature caching with a monomial (Taylor) basis.
252
+ - **HiCache** (arXiv:2508.16984) — the scaled-Hermite polynomial upgrade. `hicache_pp.hermite`
253
+ is a clean reimplementation.
254
+ - **HiCache++ (this work)** — the **DMD/Prony exponential** forecaster (`hicache_pp.dmd`). DMD
255
+ (Schmid 2010) / Prony (1795) / Matrix-Pencil (Hua–Sarkar 1990) are classical spectral
256
+ estimation; their application to **diffusion feature caching** is, to our knowledge, new.
257
+ - **Adaptive-CFG** (Adaptive Guidance, arXiv:2312.12487) — composable uncond-skip, included in
258
+ the tree module.
259
+
260
+ ## Citation
261
+
262
+ If you use this library, please cite HiCache++ (this work) and the methods it builds on:
263
+
264
+ ```bibtex
265
+ @misc{hicachepp2026,
266
+ title = {HiCache++: Training-free Diffusion Inference Acceleration via Exponential (DMD/Prony) Velocity Forecasting},
267
+ author = {Attri, Krishi},
268
+ year = {2026},
269
+ note = {https://github.com/Archerkattri/hicache-plus-plus}
270
+ }
271
+
272
+ @misc{hicache2025,
273
+ title = {HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial Feature Forecasting},
274
+ eprint = {2508.16984}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, year = {2025}
275
+ }
276
+
277
+ @misc{taylorseer2025,
278
+ title = {From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers},
279
+ eprint = {2503.06923}, archivePrefix = {arXiv}, year = {2025}
280
+ }
281
+
282
+ @article{schmid2010dmd,
283
+ title = {Dynamic mode decomposition of numerical and experimental data},
284
+ author = {Schmid, Peter J.},
285
+ journal = {Journal of Fluid Mechanics}, volume = {656}, pages = {5--28}, year = {2010}
286
+ }
287
+
288
+ @article{hua1990matrixpencil,
289
+ title = {Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise},
290
+ author = {Hua, Yingbo and Sarkar, Tapan K.},
291
+ journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
292
+ volume = {38}, number = {5}, pages = {814--824}, year = {1990}
293
+ }
294
+
295
+ @misc{adaptiveguidance2023,
296
+ title = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
297
+ eprint = {2312.12487}, archivePrefix = {arXiv}, year = {2023}
298
+ }
299
+ ```
@@ -0,0 +1,274 @@
1
+ <div align="center">
2
+
3
+ # HiCache++
4
+
5
+ **Training-free diffusion inference acceleration by *exponential* velocity forecasting.**
6
+
7
+ *A drop-in upgrade to TaylorSeer / HiCache: replace the polynomial feature-cache basis with
8
+ a Dynamic-Mode-Decomposition (Prony) **exponential** basis — exact on the class diffusion
9
+ features actually live in, so it stays lossless at larger skip intervals than the polynomial.*
10
+
11
+ ![training&#8209;free](https://img.shields.io/badge/training--free-%E2%9C%93-2e8f5c)
12
+ &nbsp;![PyTorch](https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white)
13
+ &nbsp;![CPU tests](https://img.shields.io/badge/CPU%20tests-passing-2e8f5c)
14
+ &nbsp;![license MIT](https://img.shields.io/badge/license-MIT-2e6db0)
15
+ &nbsp;![python](https://img.shields.io/badge/python-%E2%89%A53.9-3776ab)
16
+
17
+ </div>
18
+
19
+ ## When to use this repo
20
+
21
+ These repos are **complementary accelerators, not competing solutions** — each speeds up a *different*
22
+ base generator, and the `+` / `++` suffix is a **method choice**, not a rival product. Pick by
23
+ **(1) which base model you run**, then **(2) which forecast basis you want**:
24
+
25
+ | base generator | `+` = HiCache (Hermite) | `++` = HiCache++ (DMD) |
26
+ |---|---|---|
27
+ | Hunyuan3D-2.1 | `hunyuan2.1-plus` | `hunyuan2.1-plus-plus` |
28
+ | Hunyuan3D-2 mini | `hunyuan2-plus` | `hunyuan2-plus-plus` |
29
+ | SAM 3D Objects | `sam3d-plus` | `sam3d-plus-plus` |
30
+ | Fast-SAM3D | `fastsam3d-plus` | `fastsam3d-plus-plus` |
31
+ | DiT-XL/2 (ImageNet) | `dit-plus` | `dit-plus-plus` |
32
+ | TRELLIS (v1) | `faster-trellis` | `faster-trellis-plus-plus` |
33
+ | TRELLIS.2-4B (v2) | `hermit-trellis2` | `hermit-trellis2-plus-plus` |
34
+
35
+ - **`+` (HiCache / scaled-Hermite):** the *published* polynomial velocity-forecast basis — conservative, reproduces the HiCache paper. Use it to deploy the established method.
36
+ - **`++` (HiCache++ / DMD exponential):** our Dynamic-Mode-Decomposition basis — *the same near-lossless quality at wider skip intervals*, where the polynomial diverges. Use it when you push the cache interval for more speed.
37
+ - **standalone / model-agnostic:** [`hicache-plus-plus`](https://github.com/Archerkattri/hicache-plus-plus) — the forecaster itself, to add DMD caching to *your own* diffusion/flow model.
38
+ - **`fast-trellis2`** = the TaylorSeer baseline fork (the upstream "Fast" accel) — the v2 reference point, not a HiCache variant.
39
+
40
+ > **This repo:** `hicache-plus-plus` — the **standalone HiCache++ forecaster** (DMD/Prony exponential velocity cache) + the Hermite baseline — model-agnostic; the per-model integrations are the sibling repos above.
41
+
42
+ ---
43
+
44
+ ## TL;DR
45
+
46
+ On a flow-matching / diffusion denoise loop you can skip the network on most steps and
47
+ *forecast* the velocity from cached anchors. The state of the art (TaylorSeer, HiCache)
48
+ forecasts with a **polynomial** basis (monomial / scaled-Hermite). But a diffusion feature
49
+ trajectory is the solution of a near-linear feature-ODE whose **exact** solution class is a
50
+ sum of (damped/oscillatory) **exponentials** — not polynomials. Polynomials diverge under
51
+ extrapolation, which is exactly why every polynomial cache caps out at a modest skip.
52
+
53
+ **HiCache++** forecasts with **Dynamic Mode Decomposition** (Schmid 2010) — the
54
+ SVD-regularised generalisation of **Prony's method** (1795): identify the linear propagator
55
+ `A` from raw velocity snapshots (`F_{t+1} ≈ A F_t`), eigendecompose it once, and predict any
56
+ (fractional) horizon `k` by eigenvalue powers:
57
+
58
+ ```
59
+ F_{t+k} ≈ Φ (λ**k ⊙ b), b = Φ⁺ F_t
60
+ ```
61
+
62
+ It is **exact on exponential trajectories** (the solution class) — the property polynomials
63
+ lack — so it holds quality at skip intervals where Hermite/Taylor drift.
64
+
65
+ > **Headline:** on Hunyuan3D-2.1, as the skip interval grows the polynomial (Hermite) decays
66
+ > fast — 0.88 → 0.74 → 0.38 at interval 3 / 5 / 6 — while the exponential (DMD) holds: 0.85 →
67
+ > 0.86 → 0.62 (baseline 0.91). **DMD's lead grows with the skip — +0.13 at i5, +0.24 at i6** —
68
+ > the exponential basis is what extends the lossless skip range.
69
+
70
+ ---
71
+
72
+ ## How it compares
73
+
74
+ Every modern feature cache skips the network on most steps and *forecasts* the velocity;
75
+ they differ in the **basis** used to extrapolate. The basis is what sets the skip ceiling,
76
+ because a diffusion feature trajectory is (locally) a sum of exponentials, not a polynomial:
77
+
78
+ | Method | Forecast basis | Exact on the feature-ODE class | Extrapolation | Max lossless skip\* |
79
+ |---|---|:--:|:--:|:--:|
80
+ | TaylorSeer | monomial (Taylor) | ✗ | diverges | small |
81
+ | **HiCache** | scaled-Hermite | ✗ | drifts | interval&#8209;3 |
82
+ | FoCa · Padé · Chebyshev | rational / orthogonal poly | ✗ | drifts | small–moderate |
83
+ | **HiCache++** _(this work)_ | **exponential (DMD / Prony)** | **✓ exact** | **bounded, correct asymptotics** | **interval&#8209;5–6** |
84
+
85
+ <sub>\*measured on Hunyuan3D-2.1 / SAM3D-slat (see Results). A polynomial basis is only a
86
+ local truncation of the exponential, so it is accurate for a tiny skip and diverges as the
87
+ horizon grows; the exponential basis *is* the exact solution class, so it stays lossless
88
+ further out — and DMD admits *fractional* horizons, so it forecasts sub-steps between
89
+ compute steps exactly.</sub>
90
+
91
+ ---
92
+
93
+ ## Why exponentials (the math)
94
+
95
+ A diffusion/flow-matching sampler integrates `dx/dt = v_θ(x, t)`. Across timesteps the
96
+ cached feature `F_t` (the CFG-combined velocity) evolves under a slowly-varying, near-linear
97
+ operator. The exact solution of a linear ODE `Ḟ = M F` is `F_t = Σ_j a_j e^{μ_j t}` — a sum
98
+ of exponentials with poles `μ_j` (damped if `Re μ_j < 0`, oscillatory if `Im μ_j ≠ 0`).
99
+
100
+ - **Polynomial basis** (Taylor monomials, Hermite): a *local* Taylor truncation of that
101
+ exponential. Accurate for a tiny skip, **diverges** as the horizon grows → modest skip cap.
102
+ - **Exponential basis** (DMD / Prony): the *exact* function class. Fit the poles `λ_j = e^{μ_j Δ}`
103
+ from snapshots and extrapolate with bounded, correct asymptotics.
104
+
105
+ **The ≥4-snapshot floor.** A *real-valued* trajectory spends **two** real degrees of freedom
106
+ on every **complex** pole (a conjugate pair `r e^{±iω}` → `r^t cos ωt, r^t sin ωt`). So even a
107
+ single oscillatory mode needs rank 3 to identify, i.e. **3 snapshot-pairs = 4 snapshots**. With
108
+ only 2 pairs the fit aliases (empirically ~2e-1 error vs ~5e-9 at 3 pairs). Below the floor (or
109
+ across a non-uniform window) HiCache++ falls back to the Hermite forecast for warm-up.
110
+
111
+ ---
112
+
113
+ ## Results (A/B, geometry-preserving)
114
+
115
+ All accelerators are *training-free and geometry-preserving*; the right A/B is **how far the
116
+ output drifts from the uncached/baseline geometry vs how much faster it runs**.
117
+
118
+ ### Mechanism — controlled, no model
119
+
120
+ Forecasting `H` steps past an 8-step cached window on synthetic trajectories from the exact
121
+ feature-ODE class — three forecast bases, rel. L2 error (↓):
122
+
123
+ | basis | H=1 | H=2 | H=4 | H=6 | H=8 |
124
+ |---|---:|---:|---:|---:|---:|
125
+ | TaylorSeer (polynomial) | 1.5e-2 | 8.0e-2 | 6.2e-1 | 2.3e0 | 6.5e0 |
126
+ | Padé / FoCa (rational) | 4.9e-2 | 1.1e-1 | 2.4e-1 | 5.3e-1 | 1.2e0 |
127
+ | **HiCache++ (exponential)** | **4.7e-9** | **1.4e-8** | **5.3e-8** | **1.2e-7** | **2.2e-7** |
128
+
129
+ The exponential basis is **exact** to numerical precision (4.7e-9 at `H=1`, rising only to 2.2e-7 at `H=8`); the polynomial **diverges**, and the
130
+ rational (Padé / FoCa) improves on it but still diverges, leaving both roughly 6 to 8 orders of magnitude behind DMD,
131
+ and under noise the rational basis turns fragile (Froissart poles). That gap *is* the skip ceiling.
132
+ Reproduce: `python benchmarks/forecast_microbench.py`.
133
+
134
+ ### Hunyuan3D-2.1 (flat DiT velocities) — Toys4K F-score@0.05
135
+
136
+ Excludes `ball_000` (a sphere — Go-ICP alignment is rotationally degenerate on it; two runs
137
+ otherwise agree to ±0.01). Speedup is solo / uncontended.
138
+
139
+ | interval | Hermite (HiCache) | **DMD (HiCache++)** | speedup |
140
+ |---:|---:|---:|---:|
141
+ | baseline (uncached) | 0.911 | 0.911 | 1.00× |
142
+ | i3 | **0.876** | 0.852 | 1.72× |
143
+ | i4 | 0.776 | **0.827** | 1.80× |
144
+ | **i5** | 0.735 | **0.860** | 1.79× |
145
+ | i6 | 0.375 | **0.616** | ~2.0× |
146
+
147
+ DMD degrades *gracefully* where Hermite collapses, and its lead grows with the interval. On the
148
+ **deployed Hunyuan3D-2-mini**, DMD is **exactly lossless at i5** (0.794 = baseline 0.794).
149
+
150
+ ### SAM3D (PyTree velocities, slat FlowMatching) — real weights, F1 vs baseline
151
+
152
+ | config | speedup | CD_vs_base | F1_vs_base |
153
+ |---|---:|---:|---:|
154
+ | vanilla | 1.00× | 0.000 | **1.000** |
155
+ | HiCache i3 | 1.44× | 0.013 | **1.000** |
156
+ | DMD i5 | 1.47× | 0.013 | **1.000** |
157
+ | **DMD i6** | **1.56×** | 0.013 | **1.000** |
158
+
159
+ Both are geometry-lossless (F1=1.000); **DMD stays lossless to interval-6**, where it gives the
160
+ best speedup — past Hermite's lossless i3.
161
+
162
+ ### Fast-SAM3D (SS-stage TaylorSeer)
163
+ Hermite ≈ Taylor (a wash): both run the same stride-3 schedule, so the basis swap doesn't
164
+ change latency — TaylorSeer caching (the default) is what gives the ~3×, not the basis.
165
+
166
+ ### TRELLIS v1 (sparse-structure stage) — Toys4K F-score@0.05, n=31
167
+ Swapping *only* the SS forecast basis Hermite→DMD in `faster-trellis` (same carved-hybrid schedule):
168
+
169
+ | variant | F@0.05 | speedup | vs vanilla |
170
+ |---|---:|---:|---:|
171
+ | vanilla (uncached) | 0.839 | 1.00× | — |
172
+ | HiCache (Hermite) | 0.825 | 2.82× | −0.014 |
173
+ | **HiCache++ (DMD)** | **0.829** | **2.76×** | **−0.010** |
174
+
175
+ At the deployed ~interval-3 (2.8×), DMD is the most lossless accelerator (beats Hermite by about +0.004, per the table above,
176
+ at matched speed); the margin widens at higher intervals. The same holds on **TRELLIS.2-4B (v2)** —
177
+ DMD ties Hermite at the deployed interval and pulls **+0.03–0.04 F-score ahead at intervals 3–4**
178
+ (see [`hermit-trellis2-plus-plus`](https://github.com/Archerkattri/hermit-trellis2-plus-plus#results)).
179
+ *(The DiT-XL/2 ImageNet FID-vs-latency table is still in progress.)*
180
+
181
+ ---
182
+
183
+ ## Install / use
184
+
185
+ ```python
186
+ import torch
187
+ from hicache_pp import hicache_init, hicache_decide, hicache_update_derivatives, hicache_forecast
188
+ from hicache_pp import dmd_update_snapshots, dmd_forecast_state # the exponential forecaster
189
+
190
+ # in your denoise loop (flat tensor velocities):
191
+ state = hicache_init(num_steps=N, interval=5, first_enhance=4, backend="dmd", history=6)
192
+ for i, t in enumerate(timesteps):
193
+ if hicache_decide(state) == "forecast":
194
+ v = dmd_forecast_state(state) # skip the network — forecast the velocity
195
+ state["step"] += 1
196
+ else:
197
+ v = model(x, t, ...) # the expensive forward
198
+ hicache_update_derivatives(state, v.detach())
199
+ dmd_update_snapshots(state, v.detach(), state["history"])
200
+ state["step"] += 1
201
+ x = scheduler.step(v, t, x)
202
+ ```
203
+
204
+ For **PyTree / structured** velocities (e.g. SAM3D), use `hicache_pp.tree` — the same API but
205
+ tree-aware (`hicache_forecast_tree`, `dmd_forecast_tree`, plus tree Adaptive-CFG).
206
+
207
+ See [`integrations/`](integrations/) for the exact wiring into Hunyuan3D-2.1, Hunyuan3D-2-mini,
208
+ SAM3D and Fast-SAM3D, [`benchmarks/`](benchmarks/) for the controlled forecast microbenchmark,
209
+ and [`results/`](results/) for the full tables.
210
+
211
+ ---
212
+
213
+ ## Tests
214
+
215
+ ```bash
216
+ python -m hicache_pp.hermite # Hermite basis + schedule (CPU, no GPU/model)
217
+ python -m hicache_pp.dmd # DMD exact-on-exponential + ≥4-snapshot floor
218
+ python -m hicache_pp.tree # tree-aware Hermite + DMD + Adaptive-CFG
219
+ python tests/run_tests.py # all of the above
220
+ ```
221
+
222
+ ---
223
+
224
+ ## Lineage & attribution
225
+
226
+ - **TaylorSeer** — feature caching with a monomial (Taylor) basis.
227
+ - **HiCache** (arXiv:2508.16984) — the scaled-Hermite polynomial upgrade. `hicache_pp.hermite`
228
+ is a clean reimplementation.
229
+ - **HiCache++ (this work)** — the **DMD/Prony exponential** forecaster (`hicache_pp.dmd`). DMD
230
+ (Schmid 2010) / Prony (1795) / Matrix-Pencil (Hua–Sarkar 1990) are classical spectral
231
+ estimation; their application to **diffusion feature caching** is, to our knowledge, new.
232
+ - **Adaptive-CFG** (Adaptive Guidance, arXiv:2312.12487) — composable uncond-skip, included in
233
+ the tree module.
234
+
235
+ ## Citation
236
+
237
+ If you use this library, please cite HiCache++ (this work) and the methods it builds on:
238
+
239
+ ```bibtex
240
+ @misc{hicachepp2026,
241
+ title = {HiCache++: Training-free Diffusion Inference Acceleration via Exponential (DMD/Prony) Velocity Forecasting},
242
+ author = {Attri, Krishi},
243
+ year = {2026},
244
+ note = {https://github.com/Archerkattri/hicache-plus-plus}
245
+ }
246
+
247
+ @misc{hicache2025,
248
+ title = {HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial Feature Forecasting},
249
+ eprint = {2508.16984}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, year = {2025}
250
+ }
251
+
252
+ @misc{taylorseer2025,
253
+ title = {From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers},
254
+ eprint = {2503.06923}, archivePrefix = {arXiv}, year = {2025}
255
+ }
256
+
257
+ @article{schmid2010dmd,
258
+ title = {Dynamic mode decomposition of numerical and experimental data},
259
+ author = {Schmid, Peter J.},
260
+ journal = {Journal of Fluid Mechanics}, volume = {656}, pages = {5--28}, year = {2010}
261
+ }
262
+
263
+ @article{hua1990matrixpencil,
264
+ title = {Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise},
265
+ author = {Hua, Yingbo and Sarkar, Tapan K.},
266
+ journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
267
+ volume = {38}, number = {5}, pages = {814--824}, year = {1990}
268
+ }
269
+
270
+ @misc{adaptiveguidance2023,
271
+ title = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
272
+ eprint = {2312.12487}, archivePrefix = {arXiv}, year = {2023}
273
+ }
274
+ ```
@@ -0,0 +1,37 @@
1
+ """HiCache++ — training-free diffusion inference acceleration via velocity forecasting.
2
+
3
+ Two drop-in forecasters for a flow-matching / diffusion denoise loop. On a *skipped*
4
+ sampling step, instead of running the (expensive) network you forecast the CFG-combined
5
+ velocity from cached anchors at the recent *compute* steps:
6
+
7
+ * **hermite** — HiCache (dual-scaled physicist's Hermite polynomial; arXiv:2508.16984).
8
+ The polynomial basis. Generalises TaylorSeer (monomial) with bounded high-order terms.
9
+
10
+ * **dmd** — HiCache++ (Dynamic Mode Decomposition / Prony). The EXPONENTIAL basis. A
11
+ diffusion feature trajectory solves a near-linear feature-ODE whose exact solution
12
+ class is a sum of (damped/oscillatory) exponentials — *not* polynomials. DMD (Schmid
13
+ 2010), the SVD-regularised generalisation of Prony's method (1795), identifies the
14
+ linear propagator from raw velocity snapshots and advances it by eigenvalue powers,
15
+ so it is *exact* on that class where the polynomial drifts. This lets it stay lossless
16
+ at larger skip intervals than the polynomial — the failure mode that caps HiCache.
17
+
18
+ Flat-tensor velocities (e.g. Hunyuan3D DiT): use ``hermite`` + ``dmd``.
19
+ PyTree / structured velocities (e.g. SAM3D): use ``tree`` (tree-aware Hermite + DMD
20
+ + Adaptive-CFG).
21
+ """
22
+ from . import hermite, dmd, tree
23
+
24
+ # flat-tensor API
25
+ from .hermite import (
26
+ hicache_init, hicache_decide, hicache_update_derivatives, hicache_forecast,
27
+ physicists_hermite, scaled_hermite,
28
+ )
29
+ from .dmd import dmd_forecast, dmd_update_snapshots, dmd_forecast_state
30
+
31
+ __all__ = [
32
+ "hermite", "dmd", "tree",
33
+ "hicache_init", "hicache_decide", "hicache_update_derivatives", "hicache_forecast",
34
+ "physicists_hermite", "scaled_hermite",
35
+ "dmd_forecast", "dmd_update_snapshots", "dmd_forecast_state",
36
+ ]
37
+ __version__ = "0.1.0"