cache-dit 0.1.7__tar.gz → 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of cache-dit has been flagged as possibly problematic.

Files changed (111)
  1. {cache_dit-0.1.7 → cache_dit-0.2.0}/PKG-INFO +45 -40
  2. {cache_dit-0.1.7 → cache_dit-0.2.0}/README.md +44 -39
  3. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png +0 -0
  4. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png +0 -0
  5. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png +0 -0
  6. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png +0 -0
  7. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.05_P41.6_T12.70s.png +0 -0
  8. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png +0 -0
  9. cache_dit-0.2.0/assets/U0_C1_DBPRUNE_F8B8_R0.08_P23.1_T16.14s.png +0 -0
  10. cache_dit-0.2.0/assets/U0_C1_NONE_R0.08_S0_T20.43s.png +0 -0
  11. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.03_P27.3_T6.62s.png +0 -0
  12. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.03_P27.3_T6.63s.png +0 -0
  13. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.045_P38.2_T5.81s.png +0 -0
  14. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.045_P38.2_T5.82s.png +0 -0
  15. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.04_P34.6_T6.06s.png +0 -0
  16. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.04_P34.6_T6.07s.png +0 -0
  17. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.04_P34.6_T6.08s.png +0 -0
  18. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.055_P45.1_T5.27s.png +0 -0
  19. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.055_P45.1_T5.28s.png +0 -0
  20. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.2_P59.5_T3.95s.png +0 -0
  21. cache_dit-0.2.0/assets/U4_C1_DBPRUNE_F1B0_R0.2_P59.5_T3.96s.png +0 -0
  22. cache_dit-0.2.0/assets/U4_C1_NONE_R0.08_S0_T7.78s.png +0 -0
  23. cache_dit-0.2.0/assets/U4_C1_NONE_R0.08_S0_T7.79s.png +0 -0
  24. {cache_dit-0.1.7 → cache_dit-0.2.0}/bench/bench.py +22 -6
  25. {cache_dit-0.1.7 → cache_dit-0.2.0}/examples/.gitignore +1 -1
  26. cache_dit-0.2.0/examples/README.md +45 -0
  27. cache_dit-0.2.0/examples/data/cup.png +0 -0
  28. cache_dit-0.2.0/examples/data/cup_mask.png +0 -0
  29. cache_dit-0.2.0/examples/requirements.txt +4 -0
  30. cache_dit-0.2.0/examples/run_cogvideox.py +72 -0
  31. {cache_dit-0.1.7 → cache_dit-0.2.0}/examples/run_flux.py +5 -1
  32. cache_dit-0.2.0/examples/run_flux_fill.py +32 -0
  33. cache_dit-0.2.0/examples/run_hunyuan_video.py +75 -0
  34. {cache_dit-0.1.7 → cache_dit-0.2.0}/examples/run_mochi.py +9 -2
  35. cache_dit-0.2.0/examples/run_wan.py +54 -0
  36. {cache_dit-0.1.7 → cache_dit-0.2.0}/setup.py +1 -0
  37. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/_version.py +2 -2
  38. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/__init__.py +8 -0
  39. cache_dit-0.2.0/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/hunyuan_video.py +295 -0
  40. cache_dit-0.2.0/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/wan.py +99 -0
  41. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/__init__.py +12 -4
  42. cache_dit-0.2.0/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/hunyuan_video.py +295 -0
  43. cache_dit-0.2.0/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/wan.py +99 -0
  44. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/prune_context.py +2 -2
  45. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/__init__.py +4 -0
  46. cache_dit-0.2.0/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/hunyuan_video.py +295 -0
  47. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/wan.py +2 -2
  48. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit.egg-info/PKG-INFO +45 -40
  49. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit.egg-info/SOURCES.txt +33 -0
  50. cache_dit-0.1.7/examples/run_cogvideox.py +0 -30
  51. {cache_dit-0.1.7 → cache_dit-0.2.0}/.github/workflows/issue.yml +0 -0
  52. {cache_dit-0.1.7 → cache_dit-0.2.0}/.gitignore +0 -0
  53. {cache_dit-0.1.7 → cache_dit-0.2.0}/.pre-commit-config.yaml +0 -0
  54. {cache_dit-0.1.7 → cache_dit-0.2.0}/CONTRIBUTE.md +0 -0
  55. {cache_dit-0.1.7 → cache_dit-0.2.0}/LICENSE +0 -0
  56. {cache_dit-0.1.7 → cache_dit-0.2.0}/MANIFEST.in +0 -0
  57. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F12B12S4_R0.2_S16.png +0 -0
  58. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F12B16S4_R0.08_S6.png +0 -0
  59. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F16B16S2_R0.2_S14.png +0 -0
  60. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F16B16S4_R0.2_S13.png +0 -0
  61. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F1B0S1_R0.08_S11.png +0 -0
  62. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F1B0S1_R0.2_S19.png +0 -0
  63. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F8B0S2_R0.12_S12.png +0 -0
  64. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F8B16S1_R0.2_S18.png +0 -0
  65. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F8B8S1_R0.08_S9.png +0 -0
  66. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F8B8S1_R0.12_S12.png +0 -0
  67. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCACHE_F8B8S1_R0.15_S15.png +0 -0
  68. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBCache.png +0 -0
  69. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png +0 -0
  70. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png +0 -0
  71. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png +0 -0
  72. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png +0 -0
  73. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.07_P52.3_T12.53s.png +0 -0
  74. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.08_P52.4_T12.52s.png +0 -0
  75. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.09_P59.2_T10.81s.png +0 -0
  76. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.12_P59.5_T10.76s.png +0 -0
  77. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.12_P63.0_T9.90s.png +0 -0
  78. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.1_P62.8_T9.95s.png +0 -0
  79. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png +0 -0
  80. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/DBPRUNE_F1B0_R0.3_P63.1_T9.79s.png +0 -0
  81. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/NONE_R0.08_S0.png +0 -0
  82. {cache_dit-0.1.7 → cache_dit-0.2.0}/assets/cache-dit.png +0 -0
  83. {cache_dit-0.1.7 → cache_dit-0.2.0}/bench/.gitignore +0 -0
  84. {cache_dit-0.1.7 → cache_dit-0.2.0}/docs/.gitignore +0 -0
  85. {cache_dit-0.1.7 → cache_dit-0.2.0}/pyproject.toml +0 -0
  86. {cache_dit-0.1.7 → cache_dit-0.2.0}/pytest.ini +0 -0
  87. {cache_dit-0.1.7 → cache_dit-0.2.0}/requirements.txt +0 -0
  88. {cache_dit-0.1.7 → cache_dit-0.2.0}/setup.cfg +0 -0
  89. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/__init__.py +0 -0
  90. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/__init__.py +0 -0
  91. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/__init__.py +0 -0
  92. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/cache_context.py +0 -0
  93. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/cogvideox.py +0 -0
  94. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/flux.py +0 -0
  95. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/mochi.py +0 -0
  96. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/__init__.py +0 -0
  97. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/cogvideox.py +0 -0
  98. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/flux.py +0 -0
  99. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/mochi.py +0 -0
  100. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/__init__.py +0 -0
  101. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/cache_context.py +0 -0
  102. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/cogvideox.py +0 -0
  103. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/flux.py +0 -0
  104. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/mochi.py +0 -0
  105. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/taylorseer.py +0 -0
  106. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/cache_factory/utils.py +0 -0
  107. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/logger.py +0 -0
  108. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit/primitives.py +0 -0
  109. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit.egg-info/dependency_links.txt +0 -0
  110. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit.egg-info/requires.txt +0 -0
  111. {cache_dit-0.1.7 → cache_dit-0.2.0}/src/cache_dit.egg-info/top_level.txt +0 -0
--- cache_dit-0.1.7/PKG-INFO
+++ cache_dit-0.2.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cache_dit
-Version: 0.1.7
+Version: 0.2.0
 Summary: 🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
 Author: DefTruth, vipshop.com, etc.
 Maintainer: DefTruth, vipshop.com, etc
@@ -35,7 +35,7 @@ Dynamic: requires-python
 
 <div align="center">
 <p align="center">
-<h3>🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration <br>Toolbox for Diffusion Transformers</h3>
+<h2>🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration <br>Toolbox for Diffusion Transformers</h2>
 </p>
 <img src=https://github.com/vipshop/cache-dit/raw/main/assets/cache-dit.png >
 <div align='center'>
@@ -44,13 +44,28 @@ Dynamic: requires-python
 <img src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg >
 <img src=https://static.pepy.tech/badge/cache-dit >
 <img src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg >
-<img src=https://img.shields.io/badge/Release-v0.1.7-brightgreen.svg >
+<img src=https://img.shields.io/badge/Release-v0.2.0-brightgreen.svg >
 </div>
 <p align="center">
 DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT <br>offers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥
 </p>
+<p align="center">
+<h4> 🔥Supported Models🔥</h4>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX1.5</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀HunyuanVideo</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+</p>
 </div>
 
+## 👋 Highlight
+
+<div id="reference"></div>
+
+The **CacheDiT** codebase is adapted from [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). Special thanks to their excellent work! The **FBCache** support for Mochi, FLUX.1, CogVideoX, Wan2.1, and HunyuanVideo is directly adapted from the original [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache).
+
 ## 🤗 Introduction
 
 <div align="center">
@@ -91,6 +106,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 
 **DBPrune**: We have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then **dynamically prunes** blocks during inference by computing the L1 distance between previous hidden states. When a block is pruned, its output is approximated using the cached residuals.
 
+<div align="center">
+<p align="center">
+DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+</p>
+</div>
+
 |Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
@@ -98,15 +119,29 @@ These case studies demonstrate that even with relatively high thresholds (such a
 
 <div align="center">
 <p align="center">
-DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+<h3>🔥 Context Parallelism and Torch Compile</h3>
+</p>
+</div>
+
+Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. By the way, CacheDiT is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance.
+
+<div align="center">
+<p align="center">
+DBPrune + <b>torch.compile + context parallelism</b> <br>Steps: 28, "A cat holding a sign that says hello world with complex background"
 </p>
 </div>
 
-Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference.
+|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|+compile:20.43s|16.25s|14.12s|13.41s|12.00s|8.86s|
+|+L20x4:7.75s|6.62s|6.03s|5.81s|5.24s|3.93s|
+|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
 
-<p align="center">
-♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️
-</p>
+<div align="center">
+<p align="center">
+<b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
+</p>
+</div>
 
 ## ©️Citations
 
@@ -120,12 +155,6 @@ Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand w
 }
 ```
 
-## 👋Reference
-
-<div id="reference"></div>
-
-The **CacheDiT** codebase was adapted from FBCache's implementation at the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). We would like to express our sincere gratitude for this excellent work!
-
 ## 📖Contents
 
 <div id="contents"></div>
@@ -136,11 +165,9 @@ The **CacheDiT** codebase was adapted from FBCache's implementation at the [Para
 - [⚡️Dynamic Block Prune](#dbprune)
 - [🎉Context Parallelism](#context-parallelism)
 - [🔥Torch Compile](#compile)
-- [🎉Supported Models](#supported)
 - [👋Contribute](#contribute)
 - [©️License](#license)
 
-
 ## ⚙️Installation
 
 <div id="installation"></div>
@@ -371,23 +398,11 @@ Then, run the python test script with `torchrun`:
 torchrun --nproc_per_node=4 parallel_cache.py
 ```
 
-<div align="center">
-<p align="center">
-DBPrune, <b> L20x4 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-</p>
-</div>
-
-|Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|8.54s (L20x4)|7.20s (L20x4)|6.61s (L20x4)|6.09s (L20x4)|5.54s (L20x4)|4.22s (L20x4)|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px>|
-
 ## 🔥Torch Compile
 
 <div id="compile"></div>
 
-**CacheDiT** are designed to work compatibly with `torch.compile`. For example:
+By the way, **CacheDiT** is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
 
 ```python
 apply_cache_on_pipe(
@@ -396,21 +411,11 @@ apply_cache_on_pipe(
 # Compile the Transformer module
 pipe.transformer = torch.compile(pipe.transformer)
 ```
-However, users intending to use **CacheDiT** for DiT with **dynamic input shapes** should consider increasing the **recompile** **limit** of `torch._dynamo` to achieve better performance.
-
+However, users intending to use **CacheDiT** for DiT with **dynamic input shapes** should consider increasing the **recompile** **limit** of `torch._dynamo`. Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
 ```python
 torch._dynamo.config.recompile_limit = 96 # default is 8
 torch._dynamo.config.accumulated_recompile_limit = 2048 # default is 256
 ```
-Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
-
-## 🎉Supported Models
-
-<div id="supported"></div>
-
-- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
-- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
-- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
 
 ## 👋Contribute
 <div id="contribute"></div>
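
The DBPrune description in the diff above (cache each block's hidden states and residuals, prune by L1 distance between successive inputs, approximate pruned outputs from cached residuals) reduces to a short decision rule. The sketch below is illustrative only, not the cache-dit implementation; `maybe_prune_block`, the cache layout, and the 0.08 threshold are all hypothetical.

```python
import torch


def maybe_prune_block(block, hidden_states, cache, threshold=0.08):
    # Illustrative DBPrune-style rule: skip the block when its input has
    # barely moved since the previous step (relative L1 distance).
    prev_input = cache.get("prev_input")
    if prev_input is not None:
        diff = (hidden_states - prev_input).abs().mean()
        if diff / (prev_input.abs().mean() + 1e-8) < threshold:
            # Pruned: approximate the block output with the cached residual.
            return hidden_states + cache["residual"]
    # Not pruned: run the block and refresh the cached input/residual.
    output = block(hidden_states)
    cache["prev_input"] = hidden_states.detach()
    cache["residual"] = (output - hidden_states).detach()
    return output
```
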
--- cache_dit-0.1.7/README.md
+++ cache_dit-0.2.0/README.md
@@ -1,6 +1,6 @@
 <div align="center">
 <p align="center">
-<h3>🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration <br>Toolbox for Diffusion Transformers</h3>
+<h2>🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration <br>Toolbox for Diffusion Transformers</h2>
 </p>
 <img src=https://github.com/vipshop/cache-dit/raw/main/assets/cache-dit.png >
 <div align='center'>
@@ -9,13 +9,28 @@
 <img src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg >
 <img src=https://static.pepy.tech/badge/cache-dit >
 <img src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg >
-<img src=https://img.shields.io/badge/Release-v0.1.7-brightgreen.svg >
+<img src=https://img.shields.io/badge/Release-v0.2.0-brightgreen.svg >
 </div>
 <p align="center">
 DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT <br>offers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥
 </p>
+<p align="center">
+<h4> 🔥Supported Models🔥</h4>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀FLUX.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Mochi</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀CogVideoX1.5</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀Wan2.1</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+<a href=https://github.com/vipshop/cache-dit/raw/main/examples> <b>🚀HunyuanVideo</b>: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥</a> <br>
+</p>
 </div>
 
+## 👋 Highlight
+
+<div id="reference"></div>
+
+The **CacheDiT** codebase is adapted from [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). Special thanks to their excellent work! The **FBCache** support for Mochi, FLUX.1, CogVideoX, Wan2.1, and HunyuanVideo is directly adapted from the original [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache).
+
 ## 🤗 Introduction
 
 <div align="center">
@@ -56,6 +71,12 @@ These case studies demonstrate that even with relatively high thresholds (such a
 
 **DBPrune**: We have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then **dynamically prunes** blocks during inference by computing the L1 distance between previous hidden states. When a block is pruned, its output is approximated using the cached residuals.
 
+<div align="center">
+<p align="center">
+DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+</p>
+</div>
+
 |Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
@@ -63,15 +84,29 @@ These case studies demonstrate that even with relatively high thresholds (such a
 
 <div align="center">
 <p align="center">
-DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+<h3>🔥 Context Parallelism and Torch Compile</h3>
+</p>
+</div>
+
+Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. By the way, CacheDiT is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance.
+
+<div align="center">
+<p align="center">
+DBPrune + <b>torch.compile + context parallelism</b> <br>Steps: 28, "A cat holding a sign that says hello world with complex background"
 </p>
 </div>
 
-Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference.
+|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|+compile:20.43s|16.25s|14.12s|13.41s|12.00s|8.86s|
+|+L20x4:7.75s|6.62s|6.03s|5.81s|5.24s|3.93s|
+|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px>|
 
-<p align="center">
-♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️
-</p>
+<div align="center">
+<p align="center">
+<b>♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️</b>
+</p>
+</div>
 
 ## ©️Citations
 
@@ -85,12 +120,6 @@ Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand w
 }
 ```
 
-## 👋Reference
-
-<div id="reference"></div>
-
-The **CacheDiT** codebase was adapted from FBCache's implementation at the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). We would like to express our sincere gratitude for this excellent work!
-
 ## 📖Contents
 
 <div id="contents"></div>
@@ -101,11 +130,9 @@ The **CacheDiT** codebase was adapted from FBCache's implementation at the [Para
 - [⚡️Dynamic Block Prune](#dbprune)
 - [🎉Context Parallelism](#context-parallelism)
 - [🔥Torch Compile](#compile)
-- [🎉Supported Models](#supported)
 - [👋Contribute](#contribute)
 - [©️License](#license)
 
-
 ## ⚙️Installation
 
 <div id="installation"></div>
@@ -336,23 +363,11 @@ Then, run the python test script with `torchrun`:
 torchrun --nproc_per_node=4 parallel_cache.py
 ```
 
-<div align="center">
-<p align="center">
-DBPrune, <b> L20x4 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
-</p>
-</div>
-
-|Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
-|8.54s (L20x4)|7.20s (L20x4)|6.61s (L20x4)|6.09s (L20x4)|5.54s (L20x4)|4.22s (L20x4)|
-|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px>|
-
 ## 🔥Torch Compile
 
 <div id="compile"></div>
 
-**CacheDiT** are designed to work compatibly with `torch.compile`. For example:
+By the way, **CacheDiT** is designed to work compatibly with **torch.compile.** You can easily use CacheDiT with torch.compile to further achieve a better performance. For example:
 
 ```python
 apply_cache_on_pipe(
@@ -361,21 +376,11 @@ apply_cache_on_pipe(
 # Compile the Transformer module
 pipe.transformer = torch.compile(pipe.transformer)
 ```
-However, users intending to use **CacheDiT** for DiT with **dynamic input shapes** should consider increasing the **recompile** **limit** of `torch._dynamo` to achieve better performance.
-
+However, users intending to use **CacheDiT** for DiT with **dynamic input shapes** should consider increasing the **recompile** **limit** of `torch._dynamo`. Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
 ```python
 torch._dynamo.config.recompile_limit = 96 # default is 8
 torch._dynamo.config.accumulated_recompile_limit = 2048 # default is 256
 ```
-Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
-
-## 🎉Supported Models
-
-<div id="supported"></div>
-
-- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
-- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
-- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters)
 
 ## 👋Contribute
 <div id="contribute"></div>
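
Taken together, the README changes above spell out the intended v0.2.0 usage pattern: apply a cache, raise the dynamo recompile limits, then compile. A minimal end-to-end sketch using only APIs shown in this diff (`apply_cache_on_pipe`, `CacheType.default_options`); the prompt and step count follow the README's benchmark settings, and the output filename is arbitrary:

```python
import torch
from diffusers import FluxPipeline
from cache_dit.cache_factory import apply_cache_on_pipe, CacheType

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# DBCache with the default F8B8 options, as in the bundled examples.
apply_cache_on_pipe(pipe, **CacheType.default_options(CacheType.DBCache))

# Per the README: raise recompile limits before compiling so DiT workloads
# with dynamic input shapes do not fall back to eager mode.
torch._dynamo.config.recompile_limit = 96  # default is 8
torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
pipe.transformer = torch.compile(pipe.transformer)

image = pipe(
    "A cat holding a sign that says hello world with complex background",
    num_inference_steps=28,
).images[0]
image.save("flux-dbcache-compile.png")
```
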
--- cache_dit-0.1.7/bench/bench.py
+++ cache_dit-0.2.0/bench/bench.py
@@ -3,7 +3,7 @@ import argparse
 import torch
 import time
 
-from diffusers import FluxPipeline
+from diffusers import FluxPipeline, FluxTransformer2DModel
 from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
 from cache_dit.logger import init_logger
 
@@ -110,6 +110,7 @@ def get_cache_options(cache_type: CacheType, args: argparse.Namespace):
     return cache_options, cache_type_str
 
 
+@torch.no_grad()
 def main():
     args = get_args()
     logger.info(f"Arguments: {args}")
@@ -119,7 +120,9 @@
     try:
         import torch.distributed as dist
         from para_attn.context_parallel import init_context_parallel_mesh
-        from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
+        from para_attn.context_parallel.diffusers_adapters import (
+            parallelize_pipe,
+        )
 
         # Initialize distributed process group
         dist.init_process_group()
@@ -133,9 +136,10 @@
         ).to("cuda")
 
         parallelize_pipe(
-            pipe, mesh=init_context_parallel_mesh(
+            pipe,
+            mesh=init_context_parallel_mesh(
                 pipe.device.type, max_ulysses_dim_size=args.ulysses
-            )
+            ),
         )
     except ImportError as e:
         logger.error(
@@ -148,7 +152,7 @@
         pipe = FluxPipeline.from_pretrained(
             os.environ.get("FLUX_DIR", "black-forest-labs/FLUX.1-dev"),
             torch_dtype=torch.bfloat16,
-        ).to("cuda")
+        ).to("cuda")
 
     cache_options, cache_type = get_cache_options(args.cache, args)
 
@@ -165,7 +169,18 @@
         torch._dynamo.config.accumulated_recompile_limit = (
            2048 # default is 256
        )
-        pipe.transformer = torch.compile(pipe.transformer, mode="default")
+        if isinstance(pipe.transformer, FluxTransformer2DModel):
+            logger.warning(
+                "Only compile transformer blocks not the whole model "
+                "for FluxTransformer2DModel to keep higher precision."
+            )
+            for module in pipe.transformer.transformer_blocks:
+                module.compile()
+            for module in pipe.transformer.single_transformer_blocks:
+                module.compile()
+        else:
+            logger.info("Compiling the transformer with default mode.")
+            pipe.transformer = torch.compile(pipe.transformer, mode="default")
 
     all_times = []
     cached_stepes = 0
@@ -238,6 +253,7 @@
 
     if args.ulysses is not None:
         import torch.distributed as dist
+
         dist.destroy_process_group()
         logger.info("Distributed process group destroyed.")
 
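
The bench.py hunk above switches FLUX.1 to per-block compilation, with the stated rationale that compiling transformer blocks individually keeps higher precision than compiling the whole model. The same pattern in isolation, assuming `pipe` is an already-loaded `FluxPipeline`:

```python
import torch
from diffusers import FluxTransformer2DModel

if isinstance(pipe.transformer, FluxTransformer2DModel):
    # Compile block-by-block instead of wrapping the whole transformer;
    # per the bench script this keeps higher precision for FLUX.1.
    for block in pipe.transformer.transformer_blocks:
        block.compile()
    for block in pipe.transformer.single_transformer_blocks:
        block.compile()
else:
    # Fall back to whole-module compilation for other DiT architectures.
    pipe.transformer = torch.compile(pipe.transformer, mode="default")
```
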
--- cache_dit-0.1.7/examples/.gitignore
+++ cache_dit-0.2.0/examples/.gitignore
@@ -164,5 +164,5 @@ _version.py
 report*.html
 
 .DS_Store
-
 *.png
+*.mp4
--- /dev/null
+++ cache_dit-0.2.0/examples/README.md
@@ -0,0 +1,45 @@
+# Examples for CacheDiT
+
+## Install requirements
+
+```bash
+pip3 install -r requirements.txt
+```
+
+## Run examples
+
+- FLUX.1-dev
+
+```bash
+python3 run_flux.py
+```
+
+- FLUX.1-Fill-dev
+
+```bash
+python3 run_flux_fill.py
+```
+
+- CogVideoX
+
+```bash
+python3 run_cogvideox.py
+```
+
+- Wan2.1
+
+```bash
+python3 run_wan.py
+```
+
+- Mochi
+
+```bash
+python3 run_mochi.py
+```
+
+- HunyuanVideo
+
+```bash
+python3 run_hunyuan_video.py
+```
Binary files added: examples/data/cup.png, examples/data/cup_mask.png
--- /dev/null
+++ cache_dit-0.2.0/examples/requirements.txt
@@ -0,0 +1,4 @@
+imageio-ffmpeg
+# wan currently requires installing from source
+diffusers @ git+https://github.com/huggingface/diffusers
+ftfy
--- /dev/null
+++ cache_dit-0.2.0/examples/run_cogvideox.py
@@ -0,0 +1,72 @@
+import os
+import torch
+from diffusers.utils import export_to_video
+from diffusers import CogVideoXPipeline, AutoencoderKLCogVideoX
+from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+
+
+model_id = os.environ.get("COGVIDEOX_DIR", "THUDM/CogVideoX-5b")
+
+
+def is_cogvideox_1_5():
+    return "CogVideoX1.5" in model_id or "THUDM/CogVideoX1.5" in model_id
+
+
+def get_gpu_memory_in_gib():
+    if not torch.cuda.is_available():
+        return 0
+
+    try:
+        total_memory_bytes = torch.cuda.get_device_properties(
+            torch.cuda.current_device(),
+        ).total_memory
+        total_memory_gib = total_memory_bytes / (1024**3)
+        return int(total_memory_gib)
+    except Exception:
+        return 0
+
+
+pipe = CogVideoXPipeline.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+# Default options, F8B8, good balance between performance and precision
+cache_options = CacheType.default_options(CacheType.DBCache)
+
+apply_cache_on_pipe(pipe, **cache_options)
+
+pipe.enable_model_cpu_offload()
+assert isinstance(pipe.vae, AutoencoderKLCogVideoX)  # enable type check for IDE
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+
+prompt = (
+    "A panda, dressed in a small, red jacket and a tiny hat, "
+    "sits on a wooden stool in a serene bamboo forest. The "
+    "panda's fluffy paws strum a miniature acoustic guitar, "
+    "producing soft, melodic tunes. Nearby, a few other pandas "
+    "gather, watching curiously and some clapping in rhythm. "
+    "Sunlight filters through the tall bamboo, casting a gentle "
+    "glow on the scene. The panda's face is expressive, showing "
+    "concentration and joy as it plays. The background includes "
+    "a small, flowing stream and vibrant green foliage, enhancing "
+    "the peaceful and magical atmosphere of this unique musical "
+    "performance."
+)
+video = pipe(
+    prompt=prompt,
+    num_videos_per_prompt=1,
+    num_inference_steps=50,
+    num_frames=(
+        # Avoid OOM for CogVideoX1.5 model on 48GB GPU
+        16
+        if (is_cogvideox_1_5() and get_gpu_memory_in_gib() < 48)
+        else 49
+    ),
+    guidance_scale=6,
+    generator=torch.Generator("cuda").manual_seed(0),
+).frames[0]
+
+print("Saving video to cogvideox.mp4")
+export_to_video(video, "cogvideox.mp4", fps=8)
--- cache_dit-0.1.7/examples/run_flux.py
+++ cache_dit-0.2.0/examples/run_flux.py
@@ -1,9 +1,13 @@
+import os
 import torch
 from diffusers import FluxPipeline
 from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
 
 pipe = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    os.environ.get(
+        "FLUX_DIR",
+        "black-forest-labs/FLUX.1-dev",
+    ),
     torch_dtype=torch.bfloat16,
 ).to("cuda")
 
--- /dev/null
+++ cache_dit-0.2.0/examples/run_flux_fill.py
@@ -0,0 +1,32 @@
+import os
+import torch
+from diffusers import FluxFillPipeline
+from diffusers.utils import load_image
+from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+
+pipe = FluxFillPipeline.from_pretrained(
+    os.environ.get(
+        "FLUX_FILL_DIR",
+        "black-forest-labs/FLUX.1-Fill-dev",
+    ),
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+
+# Default options, F8B8, good balance between performance and precision
+cache_options = CacheType.default_options(CacheType.DBCache)
+
+apply_cache_on_pipe(pipe, **cache_options)
+
+image = pipe(
+    prompt="a white paper cup",
+    image=load_image("data/cup.png"),
+    mask_image=load_image("data/cup_mask.png"),
+    guidance_scale=30,
+    num_inference_steps=28,
+    max_sequence_length=512,
+    generator=torch.Generator("cuda").manual_seed(0),
+).images[0]
+
+print("Saving image to flux-fill.png")
+image.save("flux-fill.png")
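
This release also ships new DBCache/DBPrune/FBCache adapters for Wan2.1 and HunyuanVideo (the wan.py and hunyuan_video.py adapter files in the list above), plus run_wan.py and run_hunyuan_video.py examples whose bodies are not shown in this diff. Below is a hypothetical Wan2.1 sketch following the same pattern as the examples above; the checkpoint id, prompt, and fps are assumptions, and diffusers must be installed from source per examples/requirements.txt:

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video
from cache_dit.cache_factory import apply_cache_on_pipe, CacheType

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

# Default options, F8B8, good balance between performance and precision
apply_cache_on_pipe(pipe, **CacheType.default_options(CacheType.DBCache))

video = pipe(
    prompt="A cat walks on the grass, realistic",
    num_inference_steps=50,
).frames[0]

print("Saving video to wan.mp4")
export_to_video(video, "wan.mp4", fps=16)
```
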