PyPI - cache-dit - Versions diffs - 0.1.8__tar.gz → 0.2.1__tar.gz - Mend

cache-dit 0.1.8tar.gz → 0.2.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (110)

{cache_dit-0.1.8 → cache_dit-0.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cache_dit
-Version: 0.1.8
+Version: 0.2.1
 Summary: 🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
 Author: DefTruth, vipshop.com, etc.
 Maintainer: DefTruth, vipshop.com, etc
@@ -44,31 +44,18 @@ Dynamic: requires-python
       <img src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg >
       <img src=https://static.pepy.tech/badge/cache-dit >
       <img src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg >
-      <img src=https://img.shields.io/badge/Release-v0.1.8-brightgreen.svg >
+      <img src=https://img.shields.io/badge/Release-v0.2.1-brightgreen.svg >
  </div>
   <p align="center">
     DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT <br>offers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥
   </p>
-  <p align="center">
-  <h3> 🔥Supported Models🔥</h2>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: 🔜DBCache, 🔜DBPrune, ✔️FBCache🔥</a> <br> <br>
-  <b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
-  </p>
 </div>
+## 👋 Highlight
-<!--
-## 🎉Supported Models
-<div id="supported"></div>
-- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀Wan2.1**](https://github.com/vipshop/cache-dit/raw/main/examples): *🔜DBCache, 🔜DBPrune, ✔️FBCache*
--->
+<div id="reference"></div>
+The **CacheDiT** codebase is adapted from [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). Special thanks to their excellent work! The **FBCache** support for Mochi, FLUX.1, CogVideoX, Wan2.1, and HunyuanVideo is directly adapted from the original [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache).
 ## 🤗 Introduction
@@ -110,6 +97,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 **DBPrune**: We have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then **dynamically prunes** blocks during inference by computing the L1 distance between previous hidden states. When a block is pruned, its output is approximated using the cached residuals.
+<div align="center">
+  <p align="center">
+    DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+  </p>
+</div>
 |Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
@@ -117,11 +110,11 @@ These case studies demonstrate that even with relatively high thresholds (such a
 <div align="center">
   <p align="center">
-    DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+    <h3>🔥 Context Parallelism and Torch Compile</h3>
   </p>
-</div>
+</div>
-**CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. Moreover, **CacheDiT** are designed to work compatibly with `torch.compile`. You can easily use CacheDiT with torch.compile to further achieve a better performance.
+Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. By the way, CacheDiT is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance.
 <div align="center">
   <p align="center">
@@ -131,11 +124,16 @@ These case studies demonstrate that even with relatively high thresholds (such a
 |Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|+compile:20.43s|16.25s|14.12s|13.41s|12s|8.86s|
+|+compile:20.43s|16.25s|14.12s|13.41s|12.00s|8.86s|
 |+L20x4:7.75s|6.62s|6.03s|5.81s|5.24s|3.93s|
 |<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
+<div align="center">
+  <p align="center">
+    <b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
+  </p>
+</div>
 ## ©️Citations
 ```BibTeX
@@ -148,17 +146,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 }
 ```
-## 👋Reference
-<div id="reference"></div>
-The **CacheDiT** codebase was adapted from FBCache's implementation at the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). We would like to express our sincere gratitude for this excellent work!
 ## 📖Contents
 <div id="contents"></div>
 - [⚙️Installation](#️installation)
+- [🔥Supported Models](#supported)
 - [⚡️Dual Block Cache](#dbcache)
 - [🎉First Block Cache](#fbcache)
 - [⚡️Dynamic Block Prune](#dbprune)
@@ -182,6 +175,30 @@ Or you can install the latest develop version from GitHub:
 pip3 install git+https://github.com/vipshop/cache-dit.git
 ```
+## 🔥Supported Models
+<div id="supported"></div>
+- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀CogVideoX1.5](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀Wan2.1](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀HunyuanVideo](https://github.com/vipshop/cache-dit/raw/main/examples)
+<!--
+<p align="center">
+  <h4> 🔥Supported Models🔥</h4>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX1.5</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀HunyuanVideo</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+</p>
+-->
 ## ⚡️DBCache: Dual Block Cache
 <div id="dbcache"></div>
@@ -339,6 +356,9 @@ cache_options = {
 apply_cache_on_pipe(pipe, **cache_options)
 ```
+> [!Important]
+> Please note that for GPUs with lower VRAM, DBPrune may not be suitable for use on video DiTs, as it caches the hidden states and residuals of each block, leading to higher GPU memory requirements. In such cases, please use DBCache, which only caches the hidden states and residuals of 2 blocks.
 <div align="center">
   <p align="center">
     DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
@@ -396,26 +416,12 @@ Then, run the python test script with `torchrun`:
 ```bash
 torchrun --nproc_per_node=4 parallel_cache.py
 ```
-<!--
-<div align="center">
-  <p align="center">
-    DBPrune, <b> L20x4 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-  </p>
-</div>
-|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|+L20x4:8.54s|7.20s|6.61s|6.09s|5.54s|4.22s|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px>|
--->
 ## 🔥Torch Compile
 <div id="compile"></div>
-**CacheDiT** are designed to work compatibly with `torch.compile`. You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
+By the way, **CacheDiT** is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
 ```python
 apply_cache_on_pipe(
@@ -430,22 +436,6 @@ torch._dynamo.config.recompile_limit = 96  # default is 8
 torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
 ```
-<!--
-<div align="center">
-  <p align="center">
-  DBPrune + <b>torch.compile</b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-  </p>
-</div>
-|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.8s|19.4s|16.8s|15.9s|14.2s|10.6s|
-|+compile:20.4s|16.5s|14.1s|13.4s|12s|8.8s|
-|+L20x4:7.7s|6.6s|6.0s|5.8s|5.2s|3.9s|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
--->
 ## 👋Contribute
 <div id="contribute"></div>
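
The DBPrune L20x1 timing table in the hunks above lists wall-clock times only. As a rough aid for reading it, the sketch below converts those times into speedup factors relative to the 24.85s baseline. The numbers are copied from the table; the names `baseline_s`, `timings_s`, and `speedup` are illustrative and not part of cache-dit.

```python
# Wall-clock times copied from the DBPrune L20x1 table (seconds).
baseline_s = 24.85
timings_s = {
    "Pruned(24%)": 19.43,
    "Pruned(35%)": 16.82,
    "Pruned(38%)": 15.95,
    "Pruned(45%)": 14.24,
    "Pruned(60%)": 10.66,
}


def speedup(baseline: float, elapsed: float) -> float:
    """Speedup factor: how many times faster than the baseline run."""
    return baseline / elapsed


# Round to two decimals for display.
speedups = {
    name: round(speedup(baseline_s, elapsed), 2)
    for name, elapsed in timings_s.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The same function applies to the `+compile` and `+L20x4` rows, as long as each row is compared against its own baseline column.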

{cache_dit-0.1.8 → cache_dit-0.2.1}/README.md RENAMED

@@ -9,31 +9,18 @@
       <img src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg >
       <img src=https://static.pepy.tech/badge/cache-dit >
       <img src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg >
-      <img src=https://img.shields.io/badge/Release-v0.1.8-brightgreen.svg >
+      <img src=https://img.shields.io/badge/Release-v0.2.1-brightgreen.svg >
  </div>
   <p align="center">
     DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT <br>offers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥
   </p>
-  <p align="center">
-  <h3> 🔥Supported Models🔥</h2>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
-  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: 🔜DBCache, 🔜DBPrune, ✔️FBCache🔥</a> <br> <br>
-  <b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
-  </p>
 </div>
+## 👋 Highlight
-<!--
-## 🎉Supported Models
-<div id="supported"></div>
-- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/examples): *✔️DBCache, ✔️DBPrune, ✔️FBCache*
-- [🚀Wan2.1**](https://github.com/vipshop/cache-dit/raw/main/examples): *🔜DBCache, 🔜DBPrune, ✔️FBCache*
--->
+<div id="reference"></div>
+The **CacheDiT** codebase is adapted from [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). Special thanks to their excellent work! The **FBCache** support for Mochi, FLUX.1, CogVideoX, Wan2.1, and HunyuanVideo is directly adapted from the original [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache).
 ## 🤗 Introduction
@@ -75,6 +62,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 **DBPrune**: We have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then **dynamically prunes** blocks during inference by computing the L1 distance between previous hidden states. When a block is pruned, its output is approximated using the cached residuals.
+<div align="center">
+  <p align="center">
+    DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+  </p>
+</div>
 |Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
@@ -82,11 +75,11 @@ These case studies demonstrate that even with relatively high thresholds (such a
 <div align="center">
   <p align="center">
-    DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+    <h3>🔥 Context Parallelism and Torch Compile</h3>
   </p>
-</div>
+</div>
-**CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. Moreover, **CacheDiT** are designed to work compatibly with `torch.compile`. You can easily use CacheDiT with torch.compile to further achieve a better performance.
+Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. By the way, CacheDiT is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance.
 <div align="center">
   <p align="center">
@@ -96,11 +89,16 @@ These case studies demonstrate that even with relatively high thresholds (such a
 |Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|+compile:20.43s|16.25s|14.12s|13.41s|12s|8.86s|
+|+compile:20.43s|16.25s|14.12s|13.41s|12.00s|8.86s|
 |+L20x4:7.75s|6.62s|6.03s|5.81s|5.24s|3.93s|
 |<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
+<div align="center">
+  <p align="center">
+    <b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
+  </p>
+</div>
 ## ©️Citations
 ```BibTeX
@@ -113,17 +111,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 }
 ```
-## 👋Reference
-<div id="reference"></div>
-The **CacheDiT** codebase was adapted from FBCache's implementation at the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). We would like to express our sincere gratitude for this excellent work!
 ## 📖Contents
 <div id="contents"></div>
 - [⚙️Installation](#️installation)
+- [🔥Supported Models](#supported)
 - [⚡️Dual Block Cache](#dbcache)
 - [🎉First Block Cache](#fbcache)
 - [⚡️Dynamic Block Prune](#dbprune)
@@ -147,6 +140,30 @@ Or you can install the latest develop version from GitHub:
 pip3 install git+https://github.com/vipshop/cache-dit.git
 ```
+## 🔥Supported Models
+<div id="supported"></div>
+- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀CogVideoX1.5](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀Wan2.1](https://github.com/vipshop/cache-dit/raw/main/examples)
+- [🚀HunyuanVideo](https://github.com/vipshop/cache-dit/raw/main/examples)
+<!--
+<p align="center">
+  <h4> 🔥Supported Models🔥</h4>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX1.5</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+  <a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀HunyuanVideo</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+</p>
+-->
 ## ⚡️DBCache: Dual Block Cache
 <div id="dbcache"></div>
@@ -304,6 +321,9 @@ cache_options = {
 apply_cache_on_pipe(pipe, **cache_options)
 ```
+> [!Important]
+> Please note that for GPUs with lower VRAM, DBPrune may not be suitable for use on video DiTs, as it caches the hidden states and residuals of each block, leading to higher GPU memory requirements. In such cases, please use DBCache, which only caches the hidden states and residuals of 2 blocks.
 <div align="center">
   <p align="center">
     DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
@@ -361,26 +381,12 @@ Then, run the python test script with `torchrun`:
 ```bash
 torchrun --nproc_per_node=4 parallel_cache.py
 ```
-<!--
-<div align="center">
-  <p align="center">
-    DBPrune, <b> L20x4 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-  </p>
-</div>
-|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|+L20x4:8.54s|7.20s|6.61s|6.09s|5.54s|4.22s|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px>|
--->
 ## 🔥Torch Compile
 <div id="compile"></div>
-**CacheDiT** are designed to work compatibly with `torch.compile`. You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
+By the way, **CacheDiT** is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
 ```python
 apply_cache_on_pipe(
@@ -395,22 +401,6 @@ torch._dynamo.config.recompile_limit = 96  # default is 8
 torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
 ```
-<!--
-<div align="center">
-  <p align="center">
-  DBPrune + <b>torch.compile</b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-  </p>
-</div>
-|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|+L20x1:24.8s|19.4s|16.8s|15.9s|14.2s|10.6s|
-|+compile:20.4s|16.5s|14.1s|13.4s|12s|8.8s|
-|+L20x4:7.7s|6.6s|6.0s|5.8s|5.2s|3.9s|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
--->
 ## 👋Contribute
 <div id="contribute"></div>
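
The README hunks above describe DBPrune in prose: cache each block's hidden states and residuals, prune a block when the L1 distance to the previously seen hidden states is small, and approximate the pruned block's output from the cached residual. A minimal sketch of that idea follows. It is not cache-dit's implementation: `BlockCache`, `relative_l1`, `forward_block`, and `threshold` are hypothetical names, plain Python lists stand in for tensors, and normalizing the L1 distance by the previous magnitude is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class BlockCache:
    """Hypothetical per-block cache: last computed input and its residual."""

    prev_input: Optional[Vector] = None
    residual: Optional[Vector] = None


def relative_l1(current: Vector, previous: Vector) -> float:
    """L1 distance between two vectors, normalized by the previous magnitude."""
    numerator = sum(abs(c - p) for c, p in zip(current, previous))
    denominator = sum(abs(p) for p in previous) or 1.0
    return numerator / denominator


def forward_block(
    hidden: Vector,
    run_block: Callable[[Vector], Vector],
    cache: BlockCache,
    threshold: float,
) -> Tuple[Vector, bool]:
    """Return (output, pruned). Prune when the input barely changed."""
    if (
        cache.prev_input is not None
        and cache.residual is not None
        and relative_l1(hidden, cache.prev_input) < threshold
    ):
        # Pruned: skip the block and reuse the cached residual.
        return [h + r for h, r in zip(hidden, cache.residual)], True

    # Not pruned: run the block and refresh the cache.
    output = run_block(hidden)
    cache.prev_input = list(hidden)
    cache.residual = [o - h for o, h in zip(output, hidden)]
    return output, False
```

Because every block keeps its own `BlockCache`, memory grows with the number of blocks, which is consistent with the VRAM note the diff adds for video DiTs.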

{cache_dit-0.1.8 → cache_dit-0.2.1}/examples/.gitignore RENAMED

@@ -165,3 +165,4 @@ report*.html
 .DS_Store
 *.png
+*.mp4

cache_dit-0.2.1/examples/README.md ADDED

@@ -0,0 +1,45 @@
+# Examples for CacheDiT
+## Install requirements
+```bash
+pip3 install -r requirements.txt
+```
+## Run examples
+- FLUX.1-dev
+```bash
+python3 run_flux.py
+```
+- FLUX.1-Fill-dev
+```bash
+python3 run_flux_fill.py
+```
+- CogVideoX
+```bash
+python3 run_cogvideox.py
+```
+- Wan2.1
+```bash
+python3 run_wan.py
+```
+- Mochi
+```bash
+python3 run_mochi.py
+```
+- HunyuanVideo
+```bash
+python3 run_hunyuan_video.py
+```

cache_dit-0.2.1/examples/requirements.txt ADDED

@@ -0,0 +1,4 @@
+imageio-ffmpeg
+# wan currently requires installing from source
+diffusers @ git+https://github.com/huggingface/diffusers
+ftfy

{cache_dit-0.1.8 → cache_dit-0.2.1}/examples/run_cogvideox.py RENAMED

@@ -1,14 +1,33 @@
 import os
 import torch
-from diffusers import CogVideoXPipeline
 from diffusers.utils import export_to_video
+from diffusers import CogVideoXPipeline, AutoencoderKLCogVideoX
 from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+model_id = os.environ.get("COGVIDEOX_DIR", "THUDM/CogVideoX-5b")
+def is_cogvideox_1_5():
+    return "CogVideoX1.5" in model_id or "THUDM/CogVideoX1.5" in model_id
+def get_gpu_memory_in_gib():
+    if not torch.cuda.is_available():
+        return 0
+    try:
+        total_memory_bytes = torch.cuda.get_device_properties(
+            torch.cuda.current_device(),
+        ).total_memory
+        total_memory_gib = total_memory_bytes / (1024**3)
+        return int(total_memory_gib)
+    except Exception:
+        return 0
 pipe = CogVideoXPipeline.from_pretrained(
-    os.environ.get(
-        "COGVIDEOX_DIR",
-        "THUDM/CogVideoX-5b",
-    ),
+    model_id,
     torch_dtype=torch.bfloat16,
 ).to("cuda")
@@ -17,6 +36,8 @@ cache_options = CacheType.default_options(CacheType.DBCache)
 apply_cache_on_pipe(pipe, **cache_options)
+pipe.enable_model_cpu_offload()
+assert isinstance(pipe.vae, AutoencoderKLCogVideoX)  # enable type check for IDE
 pipe.vae.enable_slicing()
 pipe.vae.enable_tiling()
@@ -37,7 +58,12 @@ video = pipe(
     prompt=prompt,
     num_videos_per_prompt=1,
     num_inference_steps=50,
-    num_frames=49,
+    num_frames=(
+        # Avoid OOM for CogVideoX1.5 model on 48GB GPU
+        16
+        if (is_cogvideox_1_5() and get_gpu_memory_in_gib() < 48)
+        else 49
+    ),
     guidance_scale=6,
     generator=torch.Generator("cuda").manual_seed(0),
 ).frames[0]
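
The `num_frames` selection added in this hunk depends on two inputs: whether the model id names CogVideoX1.5, and the GPU memory in whole GiB. The sketch below restates that rule as a pure function so it can be checked without a GPU. `pick_num_frames` and its parameters are hypothetical names, and the model id strings in the usage lines are examples only. Two observations follow from the diffed code: the second substring test in `is_cogvideox_1_5()` is implied by the first, and because `get_gpu_memory_in_gib()` truncates with `int()`, a card that reports slightly less than 48 GiB takes the reduced-frame path.

```python
def pick_num_frames(
    model_id: str,
    gpu_memory_gib: int,
    threshold_gib: int = 48,
    reduced: int = 16,
    default: int = 49,
) -> int:
    """Mirror of the rule in the hunk: fewer frames for CogVideoX1.5 on smaller GPUs."""
    is_1_5 = "CogVideoX1.5" in model_id
    if is_1_5 and gpu_memory_gib < threshold_gib:
        return reduced
    return default


# Example inputs (illustrative model ids and memory sizes).
print(pick_num_frames("THUDM/CogVideoX1.5-5B", 47))
print(pick_num_frames("THUDM/CogVideoX-5b", 24))
```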

cache_dit-0.2.1/examples/run_hunyuan_video.py ADDED

@@ -0,0 +1,75 @@
+# Adapted from: https://github.com/chengzeyi/ParaAttention/blob/main/first_block_cache_examples/run_hunyuan_video.py
+import os
+import torch
+from diffusers.utils import export_to_video
+from diffusers import (
+    HunyuanVideoPipeline,
+    HunyuanVideoTransformer3DModel,
+    AutoencoderKLHunyuanVideo,
+)
+from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+model_id = os.environ.get("HUNYUAN_DIR", "tencent/HunyuanVideo")
+def get_gpu_memory_in_gib():
+    if not torch.cuda.is_available():
+        return 0
+    try:
+        total_memory_bytes = torch.cuda.get_device_properties(
+            torch.cuda.current_device(),
+        ).total_memory
+        total_memory_gib = total_memory_bytes / (1024**3)
+        return int(total_memory_gib)
+    except Exception:
+        return 0
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    model_id,
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16,
+    revision="refs/pr/18",
+)
+pipe = HunyuanVideoPipeline.from_pretrained(
+    model_id,
+    transformer=transformer,
+    torch_dtype=torch.float16,
+    revision="refs/pr/18",
+).to("cuda")
+# Default options, F8B8, good balance between performance and precision
+apply_cache_on_pipe(pipe, **CacheType.default_options(CacheType.DBCache))
+assert isinstance(
+    pipe.vae, AutoencoderKLHunyuanVideo
+)  # enable type check for IDE
+# Enable memory savings
+pipe.enable_model_cpu_offload()
+if get_gpu_memory_in_gib() <= 48:
+    pipe.vae.enable_tiling(
+        # Make it runnable on GPUs with 48GB memory
+        tile_sample_min_height=128,
+        tile_sample_stride_height=96,
+        tile_sample_min_width=128,
+        tile_sample_stride_width=96,
+        tile_sample_min_num_frames=32,
+        tile_sample_stride_num_frames=24,
+    )
+else:
+    pipe.vae.enable_tiling()
+output = pipe(
+    prompt="A cat walks on the grass, realistic",
+    height=720,
+    width=1280,
+    num_frames=129,
+    num_inference_steps=30,
+).frames[0]
+print("Saving video to hunyuan_video.mp4")
+export_to_video(output, "hunyuan_video.mp4", fps=15)

{cache_dit-0.1.8 → cache_dit-0.2.1}/examples/run_wan.py RENAMED

@@ -1,6 +1,7 @@
 import os
 import torch
-from diffusers import WanPipeline
+import diffusers
+from diffusers import WanPipeline, AutoencoderKLWan
 from diffusers.utils import export_to_video
 from diffusers.schedulers.scheduling_unipc_multistep import (
     UniPCMultistepScheduler,
@@ -27,11 +28,24 @@ if hasattr(pipe, "scheduler") and pipe.scheduler is not None:
 pipe.to("cuda")
-apply_cache_on_pipe(pipe, **CacheType.default_options(CacheType.FBCache))
+# Default options, F8B8, good balance between performance and precision
+apply_cache_on_pipe(pipe, **CacheType.default_options(CacheType.DBCache))
 # Enable memory savings
 pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
+# Wan currently requires installing diffusers from source
+assert isinstance(pipe.vae, AutoencoderKLWan)  # enable type check for IDE
+if diffusers.__version__ >= "0.34.0.dev0":
+    pipe.vae.enable_tiling()
+    pipe.vae.enable_slicing()
+else:
+    print(
+        "Wan pipeline requires diffusers version >= 0.34.0.dev0 "
+        "for vae tiling and slicing, please install diffusers "
+        "from source."
+    )
 video = pipe(
     prompt=(
@@ -39,8 +53,8 @@ video = pipe(
         "flying past in the background, hyperrealistic"
     ),
     negative_prompt="",
-    height=480,
-    width=832,
+    height=height,
+    width=width,
     num_frames=81,
     num_inference_steps=30,
 ).frames[0]
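
The version gate added in this hunk compares `diffusers.__version__` to `"0.34.0.dev0"` as strings, and string comparison is lexicographic: a hypothetical `"0.100.0"` would sort below `"0.34.0.dev0"` because `"1" < "3"`. The sketch below shows one standard-library way to compare the numeric release components instead. `release_tuple` and `at_least` are hypothetical helpers, not part of diffusers or cache-dit, and they deliberately ignore pre-release suffixes such as `.dev0`. Where the `packaging` library is available, `packaging.version.Version` provides full PEP 440 ordering and is likely the better choice.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release components, e.g. '0.34.0.dev0' -> (0, 34, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """True if the numeric release of `version` is >= `minimum` (zero-padded)."""
    release = release_tuple(version)
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


# Lexicographic string comparison versus numeric comparison.
print("0.100.0" >= "0.34.0.dev0")
print(at_least("0.100.0", (0, 34, 0)))
```

With such a helper, the gate in the example script could read `at_least(diffusers.__version__, (0, 34, 0))`, which also accepts the `0.34.0.dev0` source build the script targets.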

{cache_dit-0.1.8 → cache_dit-0.2.1}/setup.py RENAMED

@@ -66,6 +66,7 @@ setup(
             "expecttest",
             "hypothesis",
             "transformers",
+            # "diffusers @ git+https://github.com/huggingface/diffusers",  # wan currently requires installing from source
             "diffusers",
             "accelerate",
             "peft",

{cache_dit-0.1.8 → cache_dit-0.2.1}/src/cache_dit/_version.py RENAMED

@@ -17,5 +17,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
-__version__ = version = '0.1.8'
-__version_tuple__ = version_tuple = (0, 1, 8)
+__version__ = version = '0.2.1'
+__version_tuple__ = version_tuple = (0, 2, 1)
