cache-dit 0.1.3__tar.gz → 0.1.6__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of cache-dit might be problematic.

Files changed (77)
  1. {cache_dit-0.1.3 → cache_dit-0.1.6}/PKG-INFO +96 -21
  2. {cache_dit-0.1.3 → cache_dit-0.1.6}/README.md +92 -17
  3. {cache_dit-0.1.3 → cache_dit-0.1.6}/bench/bench.py +50 -13
  4. cache_dit-0.1.6/examples/run_cogvideox.py +30 -0
  5. cache_dit-0.1.6/examples/run_mochi.py +25 -0
  6. {cache_dit-0.1.3 → cache_dit-0.1.6}/requirements.txt +3 -3
  7. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/_version.py +2 -2
  8. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/prune_context.py +11 -3
  9. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit.egg-info/PKG-INFO +96 -21
  10. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit.egg-info/SOURCES.txt +2 -0
  11. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit.egg-info/requires.txt +3 -3
  12. {cache_dit-0.1.3 → cache_dit-0.1.6}/.github/workflows/issue.yml +0 -0
  13. {cache_dit-0.1.3 → cache_dit-0.1.6}/.gitignore +0 -0
  14. {cache_dit-0.1.3 → cache_dit-0.1.6}/.pre-commit-config.yaml +0 -0
  15. {cache_dit-0.1.3 → cache_dit-0.1.6}/CONTRIBUTE.md +0 -0
  16. {cache_dit-0.1.3 → cache_dit-0.1.6}/LICENSE +0 -0
  17. {cache_dit-0.1.3 → cache_dit-0.1.6}/MANIFEST.in +0 -0
  18. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F12B12S4_R0.2_S16.png +0 -0
  19. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F12B16S4_R0.08_S6.png +0 -0
  20. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F16B16S2_R0.2_S14.png +0 -0
  21. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F16B16S4_R0.2_S13.png +0 -0
  22. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F1B0S1_R0.08_S11.png +0 -0
  23. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F1B0S1_R0.2_S19.png +0 -0
  24. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F8B0S2_R0.12_S12.png +0 -0
  25. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F8B16S1_R0.2_S18.png +0 -0
  26. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F8B8S1_R0.08_S9.png +0 -0
  27. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F8B8S1_R0.12_S12.png +0 -0
  28. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCACHE_F8B8S1_R0.15_S15.png +0 -0
  29. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBCache.png +0 -0
  30. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png +0 -0
  31. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png +0 -0
  32. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png +0 -0
  33. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png +0 -0
  34. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.07_P52.3_T12.53s.png +0 -0
  35. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.08_P52.4_T12.52s.png +0 -0
  36. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.09_P59.2_T10.81s.png +0 -0
  37. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.12_P59.5_T10.76s.png +0 -0
  38. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.12_P63.0_T9.90s.png +0 -0
  39. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.1_P62.8_T9.95s.png +0 -0
  40. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png +0 -0
  41. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/DBPRUNE_F1B0_R0.3_P63.1_T9.79s.png +0 -0
  42. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/NONE_R0.08_S0.png +0 -0
  43. {cache_dit-0.1.3 → cache_dit-0.1.6}/assets/cache-dit.png +0 -0
  44. {cache_dit-0.1.3 → cache_dit-0.1.6}/bench/.gitignore +0 -0
  45. {cache_dit-0.1.3 → cache_dit-0.1.6}/docs/.gitignore +0 -0
  46. {cache_dit-0.1.3 → cache_dit-0.1.6}/examples/.gitignore +0 -0
  47. {cache_dit-0.1.3 → cache_dit-0.1.6}/examples/run_flux.py +0 -0
  48. {cache_dit-0.1.3 → cache_dit-0.1.6}/pyproject.toml +0 -0
  49. {cache_dit-0.1.3 → cache_dit-0.1.6}/pytest.ini +0 -0
  50. {cache_dit-0.1.3 → cache_dit-0.1.6}/setup.cfg +0 -0
  51. {cache_dit-0.1.3 → cache_dit-0.1.6}/setup.py +0 -0
  52. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/__init__.py +0 -0
  53. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/__init__.py +0 -0
  54. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/__init__.py +0 -0
  55. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/cache_context.py +0 -0
  56. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/__init__.py +0 -0
  57. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/cogvideox.py +0 -0
  58. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/flux.py +0 -0
  59. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dual_block_cache/diffusers_adapters/mochi.py +0 -0
  60. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/__init__.py +0 -0
  61. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/__init__.py +0 -0
  62. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/cogvideox.py +0 -0
  63. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/flux.py +0 -0
  64. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/dynamic_block_prune/diffusers_adapters/mochi.py +0 -0
  65. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/__init__.py +0 -0
  66. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/cache_context.py +0 -0
  67. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/__init__.py +0 -0
  68. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/cogvideox.py +0 -0
  69. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/flux.py +0 -0
  70. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/mochi.py +0 -0
  71. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/first_block_cache/diffusers_adapters/wan.py +0 -0
  72. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/taylorseer.py +0 -0
  73. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/cache_factory/utils.py +0 -0
  74. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/logger.py +0 -0
  75. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit/primitives.py +0 -0
  76. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit.egg-info/dependency_links.txt +0 -0
  77. {cache_dit-0.1.3 → cache_dit-0.1.6}/src/cache_dit.egg-info/top_level.txt +0 -0
--- cache_dit-0.1.3/PKG-INFO
+++ cache_dit-0.1.6/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cache_dit
-Version: 0.1.3
+Version: 0.1.6
 Summary: 🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
 Author: DefTruth, vipshop.com, etc.
 Maintainer: DefTruth, vipshop.com, etc
@@ -10,9 +10,9 @@ Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: packaging
-Requires-Dist: torch
-Requires-Dist: transformers
-Requires-Dist: diffusers
+Requires-Dist: torch>=2.5.1
+Requires-Dist: transformers>=4.51.3
+Requires-Dist: diffusers>=0.33.1
 Provides-Extra: all
 Provides-Extra: dev
 Requires-Dist: pre-commit; extra == "dev"
@@ -44,10 +44,10 @@ Dynamic: requires-python
 <img src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg >
 <img src=https://static.pepy.tech/badge/cache-dit >
 <img src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg >
-<img src=https://img.shields.io/badge/Release-v0.1.3-brightgreen.svg >
+<img src=https://img.shields.io/badge/Release-v0.1.6-brightgreen.svg >
 </div>
 <p align="center">
-DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT provides <br>a series of training-free, UNet-style cache accelerators for DiT: DBCache, DBPrune, FBCache, etc.
+DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT <br>offers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥
 </p>
 </div>

@@ -55,7 +55,7 @@ Dynamic: requires-python

 <div align="center">
 <p align="center">
-<h3>DBCache: Dual Block Caching for Diffusion Transformers</h3>
+<h3>🔥 DBCache: Dual Block Caching for Diffusion Transformers</h3>
 </p>
 </div>

@@ -77,7 +77,7 @@ Dynamic: requires-python

 <div align="center">
 <p align="center">
-DBCache, <b> L20x4 </b>, Steps: 20, case to show the texture recovery ability of DBCache
+DBCache, <b> L20x4 </b>, Steps: 20, case to show the texture recovery ability of DBCache
 </p>
 </div>

@@ -85,7 +85,7 @@ These case studies demonstrate that even with relatively high thresholds (such a

 <div align="center">
 <p align="center">
-<h3>DBPrune: Dynamic Block Prune with Residual Caching</h3>
+<h3>🔥 DBPrune: Dynamic Block Prune with Residual Caching</h3>
 </p>
 </div>

@@ -102,10 +102,10 @@ These case studies demonstrate that even with relatively high thresholds (such a
 </p>
 </div>

-Moreover, both DBCache and DBPrune are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference.
+Moreover, **CacheDiT** are **plug-and-play** solutions that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference.

 <p align="center">
-♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️
+♥️ Please consider to leave a ⭐️ Star to support us ~ ♥️
 </p>

 ## ©️Citations
@@ -135,7 +135,7 @@ The **CacheDiT** codebase was adapted from FBCache's implementation at the [Para
 - [🎉First Block Cache](#fbcache)
 - [⚡️Dynamic Block Prune](#dbprune)
 - [🎉Context Parallelism](#context-parallelism)
-- [⚡️Torch Compile](#compile)
+- [🔥Torch Compile](#compile)
 - [🎉Supported Models](#supported)
 - [👋Contribute](#contribute)
 - [©️License](#license)
@@ -167,7 +167,7 @@ pip3 install git+https://github.com/vipshop/cache-dit.git
 - **Fn**: Specifies that DBCache uses the **first n** Transformer blocks to fit the information at time step t, enabling the calculation of a more stable L1 diff and delivering more accurate information to subsequent blocks.
 - **Bn**: Further fuses approximate information in the **last n** Transformer blocks to enhance prediction accuracy. These blocks act as an auto-scaler for approximate hidden states that use residual cache.
 - **warmup_steps**: (default: 0) DBCache does not apply the caching strategy when the number of running steps is less than or equal to this value, ensuring the model sufficiently learns basic features during warmup.
-- **max_cached_steps**: (default: -1) DBCache disables the caching strategy when the running steps exceed this value to prevent precision degradation.
+- **max_cached_steps**: (default: -1) DBCache disables the caching strategy when the previous cached steps exceed this value to prevent precision degradation.
 - **residual_diff_threshold**: The value of residual diff threshold, a higher value leads to faster performance at the cost of lower precision.

 For a good balance between performance and precision, DBCache is configured by default with **F8B8**, 8 warmup steps, and unlimited cached steps.
@@ -210,6 +210,17 @@ cache_options = {
 }
 ```

+<div align="center">
+<p align="center">
+DBCache, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+</p>
+</div>
+
+|Baseline(L20x1)|F1B0 (0.08)|F1B0 (0.20)|F8B8 (0.15)|F12B12 (0.20)|F16B16 (0.20)|
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|24.85s|15.59s|8.58s|15.41s|15.11s|17.74s|
+|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.08_S11.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.2_S19.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F8B8S1_R0.15_S15.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F12B12S4_R0.2_S16.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F16B16S4_R0.2_S13.png width=105px>|
+
 ## 🎉FBCache: First Block Cache

 <div id="fbcache"></div>
@@ -261,12 +272,47 @@ pipe = FluxPipeline.from_pretrained(
     torch_dtype=torch.bfloat16,
 ).to("cuda")

-# Using DBPrune
+# Using DBPrune with default options
 cache_options = CacheType.default_options(CacheType.DBPrune)

 apply_cache_on_pipe(pipe, **cache_options)
 ```

+We have also brought the designs from DBCache to DBPrune to make it a more general and customizable block prune algorithm. You can specify the values of **Fn** and **Bn** for higher precision, or set up the non-prune blocks list **non_prune_blocks_ids** to avoid aggressive pruning. For example:
+
+```python
+# Custom options for DBPrune
+cache_options = {
+    "cache_type": CacheType.DBPrune,
+    "residual_diff_threshold": 0.05,
+    # Never prune the first `Fn` and last `Bn` blocks.
+    "Fn_compute_blocks": 8,  # default 1
+    "Bn_compute_blocks": 8,  # default 0
+    "warmup_steps": 8,  # default -1
+    # Disables the pruning strategy when the previous
+    # pruned steps greater than this value.
+    "max_pruned_steps": 12,  # default, -1 means no limit
+    # Enable dynamic prune threshold within step, higher
+    # `max_dynamic_prune_threshold` value may introduce a more
+    # ageressive pruning strategy.
+    "enable_dynamic_prune_threshold": True,
+    "max_dynamic_prune_threshold": 2 * 0.05,
+    # (New thresh) = mean(previous_block_diffs_within_step) * 1.25
+    # (New thresh) = ((New thresh) if (New thresh) <
+    # max_dynamic_prune_threshold else residual_diff_threshold)
+    "dynamic_prune_threshold_relax_ratio": 1.25,
+    # The step interval to update residual cache. For example,
+    # 2: means the update steps will be [0, 2, 4, ...].
+    "residual_cache_update_interval": 1,
+    # You can set non-prune blocks to avoid ageressive pruning.
+    # For example, FLUX.1 has 19 + 38 blocks, so we can set it
+    # to 0, 2, 4, ..., 56, etc.
+    "non_prune_blocks_ids": [],
+}
+
+apply_cache_on_pipe(pipe, **cache_options)
+```
+
 <div align="center">
 <p align="center">
 DBPrune, <b> L20x1 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
@@ -288,14 +334,19 @@ apply_cache_on_pipe(pipe, **cache_options)
 pip3 install para-attn # or install `para-attn` from sources.
 ```

-Then, you can run **DBCache** with **Context Parallelism** on 4 GPUs:
+Then, you can run **DBCache** or **DBPrune** with **Context Parallelism** on 4 GPUs:

 ```python
+import torch.distributed as dist
 from diffusers import FluxPipeline
 from para_attn.context_parallel import init_context_parallel_mesh
 from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
 from cache_dit.cache_factory import apply_cache_on_pipe, CacheType

+# Init distributed process group
+dist.init_process_group()
+torch.cuda.set_device(dist.get_rank())
+
 pipe = FluxPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     torch_dtype=torch.bfloat16,
@@ -308,13 +359,31 @@ parallelize_pipe(
     )
 )

-# DBCache with F8B8 from this library
+# DBPrune with default options from this library
 apply_cache_on_pipe(
-    pipe, **CacheType.default_options(CacheType.DBCache)
+    pipe, **CacheType.default_options(CacheType.DBPrune)
 )
+
+dist.destroy_process_group()
+```
+Then, run the python test script with `torchrun`:
+```bash
+torchrun --nproc_per_node=4 parallel_cache.py
 ```

-## ⚡️Torch Compile
+<div align="center">
+<p align="center">
+DBPrune, <b> L20x4 </b>, Steps: 28, "A cat holding a sign that says hello world with complex background"
+</p>
+</div>
+
+|Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|
+|8.54s (L20x4)|7.20s (L20x4)|6.61s (L20x4)|6.09s (L20x4)|5.54s (L20x4)|4.22s (L20x4)|
+|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px> | <img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px>|<img src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px>|
+
+## 🔥Torch Compile

 <div id="compile"></div>

@@ -322,7 +391,7 @@ apply_cache_on_pipe(
 ```python
 apply_cache_on_pipe(
-    pipe, **CacheType.default_options(CacheType.DBCache)
+    pipe, **CacheType.default_options(CacheType.DBPrune)
 )
 # Compile the Transformer module
 pipe.transformer = torch.compile(pipe.transformer)
@@ -333,7 +402,13 @@ However, users intending to use **CacheDiT** for DiT with **dynamic input shapes
 torch._dynamo.config.recompile_limit = 96 # default is 8
 torch._dynamo.config.accumulated_recompile_limit = 2048 # default is 256
 ```
-Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
+Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode. Here is the case of **DBPrune + torch.compile**.
+
+<div align="center">
+<p align="center">
+DBPrune + compile, Steps: 28, "A cat holding a sign that says hello world with complex background"
+</p>
+</div>

 ## 🎉Supported Models

@@ -346,7 +421,7 @@ Otherwise, the recompile_limit error may be triggered, causing the module to fal
 ## 👋Contribute
 <div id="contribute"></div>

-How to contribute? Star this repo or check [CONTRIBUTE.md](./CONTRIBUTE.md).
+How to contribute? Star ⭐️ this repo to support us or check [CONTRIBUTE.md](./CONTRIBUTE.md).

 ## ©️License
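For quick reference, the DBCache knobs described above can be wired up end to end. A minimal sketch: the `cache_type`, `warmup_steps`, `max_cached_steps`, and `residual_diff_threshold` keys come straight from the option bullets in the diff, while the `Fn_compute_blocks`/`Bn_compute_blocks` names are assumed by analogy with the DBPrune dict shown above, so treat the exact schema as illustrative rather than confirmed.

```python
import torch
from diffusers import FluxPipeline
from cache_dit.cache_factory import apply_cache_on_pipe, CacheType

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# DBCache with the documented F8B8 default shape, spelled out explicitly.
cache_options = {
    "cache_type": CacheType.DBCache,
    "warmup_steps": 8,        # no caching during the first 8 steps
    "max_cached_steps": -1,   # -1 means unlimited cached steps
    # Assumed key names, mirroring the DBPrune options above:
    "Fn_compute_blocks": 8,   # always compute the first 8 blocks
    "Bn_compute_blocks": 8,   # always compute the last 8 blocks
    "residual_diff_threshold": 0.12,
}
apply_cache_on_pipe(pipe, **cache_options)

image = pipe(
    "A cat holding a sign that says hello world with complex background",
    num_inference_steps=28,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("dbcache_f8b8.png")
```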
--- cache_dit-0.1.3/README.md
+++ cache_dit-0.1.6/README.md
(README.md picks up the same content changes as the PKG-INFO diff above, minus the packaging-metadata hunks; only the offsets differ. Hunks: @@ -9,10 +9,10 @@, @@ -20,7 +20,7 @@, @@ -42,7 +42,7 @@, @@ -50,7 +50,7 @@, @@ -67,10 +67,10 @@, @@ -100,7 +100,7 @@, @@ -132,7 +132,7 @@, @@ -175,6 +175,17 @@, @@ -226,12 +237,47 @@, @@ -253,14 +299,19 @@, @@ -273,13 +324,31 @@, @@ -287,7 +356,7 @@, @@ -298,7 +367,13 @@, @@ -311,7 +386,7 @@.)
--- cache_dit-0.1.3/bench/bench.py
+++ cache_dit-0.1.6/bench/bench.py
@@ -16,6 +16,7 @@ def get_args() -> argparse.ArgumentParser:
     # General arguments
     parser.add_argument("--steps", type=int, default=28)
     parser.add_argument("--repeats", type=int, default=2)
+    parser.add_argument("--seed", type=int, default=0)
     parser.add_argument("--cache", type=str, default=None)
     parser.add_argument("--alter", action="store_true", default=False)
     parser.add_argument("--l1-diff", action="store_true", default=False)
@@ -26,12 +27,9 @@ def get_args() -> argparse.ArgumentParser:
     parser.add_argument("--warmup-steps", type=int, default=0)
     parser.add_argument("--max-cached-steps", type=int, default=-1)
     parser.add_argument("--max-pruned-steps", type=int, default=-1)
-    parser.add_argument("--seed", type=int, default=0)
-    parser.add_argument(
-        "--compile",
-        action="store_true",
-        default=False,
-    )
+    parser.add_argument("--ulysses", type=int, default=None)
+    parser.add_argument("--compile", action="store_true", default=False)
+    parser.add_argument("--gen-device", type=str, default="cuda")
     return parser.parse_args()


@@ -116,10 +114,41 @@ def main():
     args = get_args()
     logger.info(f"Arguments: {args}")

-    pipe = FluxPipeline.from_pretrained(
-        os.environ.get("FLUX_DIR", "black-forest-labs/FLUX.1-dev"),
-        torch_dtype=torch.bfloat16,
-    ).to("cuda")
+    # Context Parallel from ParaAttention
+    if args.ulysses is not None:
+        try:
+            import torch.distributed as dist
+            from para_attn.context_parallel import init_context_parallel_mesh
+            from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
+
+            # Initialize distributed process group
+            dist.init_process_group()
+            torch.cuda.set_device(dist.get_rank())
+
+            logger.info(f"Ulysses: {args.ulysses}")
+
+            pipe = FluxPipeline.from_pretrained(
+                os.environ.get("FLUX_DIR", "black-forest-labs/FLUX.1-dev"),
+                torch_dtype=torch.bfloat16,
+            ).to("cuda")
+
+            parallelize_pipe(
+                pipe, mesh=init_context_parallel_mesh(
+                    pipe.device.type, max_ulysses_dim_size=args.ulysses
+                )
+            )
+        except ImportError as e:
+            logger.error(
+                "para-attn is not installed, please install it "
+                "with `pip install para-attn.`"
+            )
+            args.ulysses = None
+            raise e
+    else:
+        pipe = FluxPipeline.from_pretrained(
+            os.environ.get("FLUX_DIR", "black-forest-labs/FLUX.1-dev"),
+            torch_dtype=torch.bfloat16,
+        ).to("cuda")

     cache_options, cache_type = get_cache_options(args.cache, args)

@@ -149,7 +178,7 @@ def main():
         image = pipe(
             "A cat holding a sign that says hello world with complex background",
             num_inference_steps=args.steps,
-            generator=torch.Generator("cuda").manual_seed(args.seed),
+            generator=torch.Generator(args.gen_device).manual_seed(args.seed),
         ).images[0]
         end = time.time()
         all_times.append(end - start)
@@ -191,19 +220,27 @@ def main():
         f"Actual Blocks: {actual_blocks}\n"
         f"Pruned Blocks: {pruned_blocks}"
     )
+    ulysses = 0 if args.ulysses is None else args.ulysses
     if len(actual_blocks) > 0:
         save_name = (
-            f"{cache_type}_R{args.rdt}_P{pruned_ratio:.1f}_"
+            f"U{ulysses}_C{int(args.compile)}_{cache_type}_"
+            f"R{args.rdt}_P{pruned_ratio:.1f}_"
             f"T{mean_time:.2f}s.png"
         )
     else:
         save_name = (
-            f"{cache_type}_R{args.rdt}_S{cached_stepes}_"
+            f"U{ulysses}_C{int(args.compile)}_{cache_type}_"
+            f"R{args.rdt}_S{cached_stepes}_"
             f"T{mean_time:.2f}s.png"
        )
     image.save(save_name)
     logger.info(f"Image saved as {save_name}")

+    if args.ulysses is not None:
+        import torch.distributed as dist
+        dist.destroy_process_group()
+        logger.info("Distributed process group destroyed.")
+

 if __name__ == "__main__":
     main()
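Given the reworked flags, a plausible way to drive the benchmark looks like this (the `--cache` value and the `--rdt` flag are assumptions inferred from `get_cache_options(args.cache, args)` and the `args.rdt` reference in the save-name logic; they are not spelled out in these hunks):

```bash
# Single-GPU benchmark run (flag values illustrative)
python3 bench.py --steps 28 --repeats 2 --seed 0 \
    --cache DBPRUNE --rdt 0.05 --gen-device cuda

# With --ulysses set, the script takes the para-attn branch above
# and calls dist.init_process_group(), so launch it under torchrun:
torchrun --nproc_per_node=4 bench.py --ulysses 4 --cache DBPRUNE --compile
```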
--- /dev/null
+++ cache_dit-0.1.6/examples/run_cogvideox.py
@@ -0,0 +1,30 @@
+import torch
+from diffusers import CogVideoXPipeline
+from diffusers.utils import export_to_video
+from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+
+pipe = CogVideoXPipeline.from_pretrained(
+    "THUDM/CogVideoX-5b",
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+# Default options, F8B8, good balance between performance and precision
+cache_options = CacheType.default_options(CacheType.DBCache)
+
+apply_cache_on_pipe(pipe, **cache_options)
+
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+
+prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+video = pipe(
+    prompt=prompt,
+    num_videos_per_prompt=1,
+    num_inference_steps=50,
+    num_frames=49,
+    guidance_scale=6,
+    generator=torch.Generator("cuda").manual_seed(0),
+).frames[0]
+
+print("Saving video to cogvideox.mp4")
+export_to_video(video, "cogvideox.mp4", fps=8)
--- /dev/null
+++ cache_dit-0.1.6/examples/run_mochi.py
@@ -0,0 +1,25 @@
+import torch
+from diffusers import MochiPipeline
+from diffusers.utils import export_to_video
+from cache_dit.cache_factory import apply_cache_on_pipe, CacheType
+
+pipe = MochiPipeline.from_pretrained(
+    "genmo/mochi-1-preview",
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+# Default options, F8B8, good balance between performance and precision
+cache_options = CacheType.default_options(CacheType.DBCache)
+
+apply_cache_on_pipe(pipe, **cache_options)
+
+pipe.enable_vae_tiling()
+
+prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
+video = pipe(
+    prompt,
+    num_frames=84,
+).frames[0]
+
+print("Saving video to mochi.mp4")
+export_to_video(video, "mochi.mp4", fps=30)
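Both new example scripts are self-contained, so assuming the referenced checkpoints can be downloaded (or are already cached locally), they should run directly:

```bash
python3 examples/run_cogvideox.py   # writes cogvideox.mp4 (49 frames, 8 fps)
python3 examples/run_mochi.py       # writes mochi.mp4 (84 frames, 30 fps)
```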
--- cache_dit-0.1.3/requirements.txt
+++ cache_dit-0.1.6/requirements.txt
@@ -1,6 +1,6 @@
 # Example requirement, can be anything that pip knows
 # install with `pip install -r requirements.txt`, and make sure that CI does the same
 packaging
-torch
-transformers
-diffusers
+torch>=2.5.1 # torch 2.7.0 is preferred, but 2.5.1 is the minimum required version
+transformers>=4.51.3
+diffusers>=0.33.1
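The previously unpinned dependencies now carry version floors. To install the same minimums outside of requirements.txt:

```bash
pip3 install "torch>=2.5.1" "transformers>=4.51.3" "diffusers>=0.33.1"
```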
--- cache_dit-0.1.3/src/cache_dit/_version.py
+++ cache_dit-0.1.6/src/cache_dit/_version.py
@@ -17,5 +17,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE

-__version__ = version = '0.1.3'
-__version_tuple__ = version_tuple = (0, 1, 3)
+__version__ = version = '0.1.6'
+__version_tuple__ = version_tuple = (0, 1, 6)
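A quick sanity check that the bump took effect, reading only fields this module defines:

```bash
python3 -c "from cache_dit._version import __version__, version_tuple; print(__version__, version_tuple)"
# expected: 0.1.6 (0, 1, 6)
```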
--- cache_dit-0.1.3/src/cache_dit/cache_factory/dynamic_block_prune/prune_context.py
+++ cache_dit-0.1.6/src/cache_dit/cache_factory/dynamic_block_prune/prune_context.py
@@ -562,7 +562,7 @@ class DBPrunedTransformerBlocks(torch.nn.Module):
         torch._dynamo.graph_break()

         add_pruned_block(self.pruned_blocks_step)
-        add_actual_block(self._num_transformer_blocks)
+        add_actual_block(self.num_transformer_blocks)
         patch_pruned_stats(self.transformer)

         return (
@@ -577,7 +577,7 @@ class DBPrunedTransformerBlocks(torch.nn.Module):

     @property
     @torch.compiler.disable
-    def _num_transformer_blocks(self):
+    def num_transformer_blocks(self):
         # Total number of transformer blocks, including single transformer blocks.
         num_blocks = len(self.transformer_blocks)
         if self.single_transformer_blocks is not None:
@@ -597,7 +597,7 @@ class DBPrunedTransformerBlocks(torch.nn.Module):
     @torch.compiler.disable
     def _non_prune_blocks_ids(self):
         # Never prune the first `Fn` and last `Bn` blocks.
-        num_blocks = self._num_transformer_blocks
+        num_blocks = self.num_transformer_blocks
         Fn_compute_blocks_ = (
             Fn_compute_blocks()
             if Fn_compute_blocks() < num_blocks
@@ -627,6 +627,10 @@ class DBPrunedTransformerBlocks(torch.nn.Module):
         ]
         return sorted(non_prune_blocks_ids)

+    # @torch.compile(dynamic=True)
+    # mark this function as compile with dynamic=True will
+    # cause precision degradate, so, we choose to disable it
+    # now, until we find a better solution or fixed the bug.
     @torch.compiler.disable
     def _compute_single_hidden_states_residual(
         self,
@@ -663,6 +667,10 @@ class DBPrunedTransformerBlocks(torch.nn.Module):
             single_encoder_hidden_states_residual,
         )

+    # @torch.compile(dynamic=True)
+    # mark this function as compile with dynamic=True will
+    # cause precision degradate, so, we choose to disable it
+    # now, until we find a better solution or fixed the bug.
     @torch.compiler.disable
     def _split_single_hidden_states(
         self,
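The rename from `_num_transformer_blocks` to `num_transformer_blocks` promotes the block count to the class's public surface, while `@torch.compiler.disable` keeps this Python-level bookkeeping out of compiled graphs. A stripped-down sketch of just that pattern (the real `DBPrunedTransformerBlocks` carries far more state; this mirrors only the property shown in the hunk):

```python
import torch


class BlocksWrapper(torch.nn.Module):
    """Minimal sketch: count blocks across both block lists."""

    def __init__(self, transformer_blocks, single_transformer_blocks=None):
        super().__init__()
        self.transformer_blocks = torch.nn.ModuleList(transformer_blocks)
        self.single_transformer_blocks = (
            torch.nn.ModuleList(single_transformer_blocks)
            if single_transformer_blocks is not None
            else None
        )

    @property
    @torch.compiler.disable  # keep the count out of torch.compile graphs
    def num_transformer_blocks(self):
        # Total number of transformer blocks, including single transformer blocks.
        num_blocks = len(self.transformer_blocks)
        if self.single_transformer_blocks is not None:
            num_blocks += len(self.single_transformer_blocks)
        return num_blocks


wrapper = BlocksWrapper([torch.nn.Identity()] * 19, [torch.nn.Identity()] * 38)
assert wrapper.num_transformer_blocks == 57  # FLUX.1's 19 + 38 block layout
```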
--- cache_dit-0.1.3/src/cache_dit.egg-info/PKG-INFO
+++ cache_dit-0.1.6/src/cache_dit.egg-info/PKG-INFO
(The egg-info copy changes identically to the top-level PKG-INFO diff above: same hunks, same offsets, same content.)
--- cache_dit-0.1.3/src/cache_dit.egg-info/SOURCES.txt
+++ cache_dit-0.1.6/src/cache_dit.egg-info/SOURCES.txt
@@ -40,7 +40,9 @@ bench/.gitignore
 bench/bench.py
 docs/.gitignore
 examples/.gitignore
+examples/run_cogvideox.py
 examples/run_flux.py
+examples/run_mochi.py
 src/cache_dit/__init__.py
 src/cache_dit/_version.py
 src/cache_dit/logger.py
--- cache_dit-0.1.3/src/cache_dit.egg-info/requires.txt
+++ cache_dit-0.1.6/src/cache_dit.egg-info/requires.txt
@@ -1,7 +1,7 @@
 packaging
-torch
-transformers
-diffusers
+torch>=2.5.1
+transformers>=4.51.3
+diffusers>=0.33.1

 [all]