PyPI - diffsynth - version diff: 2.0.11 (tar.gz) → 2.0.12 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (187)

{diffsynth-2.0.11 → diffsynth-2.0.12}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: diffsynth
-Version: 2.0.11
+Version: 2.0.12
 Summary: Enjoy the magic of Diffusion models!
 Author: ModelScope Team
 License: Apache-2.0
@@ -33,6 +33,14 @@ Requires-Dist: torch==2.7.1+cpu; extra == "npu"
 Requires-Dist: torch-npu==2.7.1; extra == "npu"
 Requires-Dist: torchvision==0.22.1+cpu; extra == "npu"
 Provides-Extra: audio
+Requires-Dist: av; extra == "audio"
 Requires-Dist: torchaudio; extra == "audio"
 Requires-Dist: torchcodec; extra == "audio"
+Requires-Dist: librosa; extra == "audio"
+Provides-Extra: all
+Requires-Dist: av; extra == "all"
+Requires-Dist: torchaudio; extra == "all"
+Requires-Dist: torchcodec; extra == "all"
+Requires-Dist: librosa; extra == "all"
+Requires-Dist: streamlit; extra == "all"
 Dynamic: license-file
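
The hunk above extends the `audio` extra and introduces an `all` extra. A minimal sketch of how the grouping can be checked, assuming the `Requires-Dist` lines are copied into a string; `extras_from_metadata` is a hypothetical helper name, and the marker handling is a simple regex rather than a full PEP 508 parser.

```python
import re
from collections import defaultdict
from email.parser import Parser

# Excerpt of the 2.0.12 PKG-INFO lines shown in the hunk above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: diffsynth
Version: 2.0.12
Provides-Extra: audio
Requires-Dist: av; extra == "audio"
Requires-Dist: torchaudio; extra == "audio"
Requires-Dist: torchcodec; extra == "audio"
Requires-Dist: librosa; extra == "audio"
Provides-Extra: all
Requires-Dist: av; extra == "all"
Requires-Dist: torchaudio; extra == "all"
Requires-Dist: torchcodec; extra == "all"
Requires-Dist: librosa; extra == "all"
Requires-Dist: streamlit; extra == "all"
"""

# Matches markers of the form: extra == "name"
EXTRA_MARKER = re.compile(r'extra\s*==\s*"([^"]+)"')


def extras_from_metadata(text: str) -> dict:
    """Group Requires-Dist entries by the extra named in their marker (hypothetical helper)."""
    message = Parser().parsestr(text)  # PKG-INFO uses RFC 822 style headers
    grouped = defaultdict(list)
    for entry in message.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        match = EXTRA_MARKER.search(marker)
        if match:
            grouped[match.group(1)].append(requirement.strip())
    return dict(grouped)


extras = extras_from_metadata(PKG_INFO)
print(extras["audio"])
print(sorted(set(extras["all"]) - set(extras["audio"])))
```

In this excerpt, `all` is the `audio` set plus `streamlit`. With pip's standard extras syntax, the new group would be requested as `pip install "diffsynth[all]"`.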

{diffsynth-2.0.11 → diffsynth-2.0.12}/README.md RENAMED

@@ -34,6 +34,10 @@ We believe that a well-developed open-source code framework can lower the thresh
 > Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher) and [mi804](https://github.com/mi804). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand.
+- **May 18, 2026** Added **CPU Offload Training** support. By moving model weights layer-by-layer between CPU and GPU, it significantly reduces GPU VRAM usage during training, enabling LoRA training of large models even on consumer-grade GPUs, compatible with all models. Simply add `--enable_model_cpu_offload` to your training command to enable (currently supports single-GPU training only). For details, see the [documentation](/docs/en/Training/Offload_Training.md).
+- **May 14, 2026** HiDream-O1-Image open-sourced, welcome a new member to the image model family! Support includes text-to-image generation, image editing, low VRAM inference, and training capabilities. For details, please refer to the [documentation](/docs/en/Model_Details/HiDream-O1-Image.md) and [example code](/examples/hidream_o1_image/).
 - **April 28, 2026** 🔥 We are excited to announce the release of **Diffusion Templates**, a plugin framework designed for Diffusion models that significantly lowers the barrier to training controllable generative models. Let's explore this cutting-edge technology together!
     * Open-source code: [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio)
     * Technical report: [arXiv](https://arxiv.org/abs/2604.24351)
@@ -884,6 +888,68 @@ Example code for JoyAI-Image is available at: [/examples/joyai_image/](/examples
 </details>
+#### HiDream-O1-Image: [/docs/en/Model_Details/HiDream-O1-Image.md](/docs/en/Model_Details/HiDream-O1-Image.md)
+<details>
+<summary>Quick Start</summary>
+Running the following code will quickly load the [HiDream-ai/HiDream-O1-Image](https://modelscope.cn/HiDream-ai/HiDream-O1-Image) model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.
+```python
+from diffsynth.pipelines.hidream_o1_image import HiDreamO1ImagePipeline
+from diffsynth.core.loader.config import ModelConfig
+import torch
+vram_config = {
+    "offload_dtype": torch.bfloat16,
+    "offload_device": "cpu",
+    "onload_dtype": torch.bfloat16,
+    "onload_device": "cpu",
+    "preparing_dtype": torch.bfloat16,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+pipe = HiDreamO1ImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="model-*.safetensors", **vram_config),
+    ],
+    processor_config=ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="./"),
+    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
+)
+image = pipe(
+    prompt="medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room.",
+    negative_prompt=" ",
+    cfg_scale=4.0,
+    height=2048,
+    width=2048,
+    seed=42,
+    num_inference_steps=50,
+)
+image.save("image.jpg")
+```
+</details>
+<details>
+<summary>Examples</summary>
+Example code for HiDream-O1-Image is available at: [/examples/hidream_o1_image/](/examples/hidream_o1_image/)
+| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
+|-|-|-|-|-|-|-|
+|[HiDream-ai/HiDream-O1-Image](https://modelscope.cn/HiDream-ai/HiDream-O1-Image)|[code](/examples/hidream_o1_image/model_inference/HiDream-O1-Image.py)|[code](/examples/hidream_o1_image/model_inference_low_vram/HiDream-O1-Image.py)|[code](/examples/hidream_o1_image/model_training/full/HiDream-O1-Image.sh)|[code](/examples/hidream_o1_image/model_training/validate_full/HiDream-O1-Image.py)|[code](/examples/hidream_o1_image/model_training/lora/HiDream-O1-Image.sh)|[code](/examples/hidream_o1_image/model_training/validate_lora/HiDream-O1-Image.py)|
+|[HiDream-ai/HiDream-O1-Image-Dev](https://modelscope.cn/HiDream-ai/HiDream-O1-Image-Dev)|[code](/examples/hidream_o1_image/model_inference/HiDream-O1-Image-Dev.py)|[code](/examples/hidream_o1_image/model_inference_low_vram/HiDream-O1-Image-Dev.py)|[code](/examples/hidream_o1_image/model_training/full/HiDream-O1-Image-Dev.sh)|[code](/examples/hidream_o1_image/model_training/validate_full/HiDream-O1-Image-Dev.py)|[code](/examples/hidream_o1_image/model_training/lora/HiDream-O1-Image-Dev.sh)|[code](/examples/hidream_o1_image/model_training/validate_lora/HiDream-O1-Image-Dev.py)|
+</details>
 ### Video Synthesis
 https://github.com/user-attachments/assets/1d66ae74-3b02-40a9-acc3-ea95fc039314
@@ -1158,8 +1224,8 @@ Example code for Wan is available at: [/examples/wanvideo/](/examples/wanvideo/)
 |[PAI/Wan2.2-Fun-A14B-Control-Camera](https://modelscope.cn/models/PAI/Wan2.2-Fun-A14B-Control-Camera)|`control_camera_video`, `input_image`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference/Wan2.2-Fun-A14B-Control-Camera.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference_low_vram/Wan2.2-Fun-A14B-Control-Camera.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/full/Wan2.2-Fun-A14B-Control-Camera.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_full/Wan2.2-Fun-A14B-Control-Camera.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/lora/Wan2.2-Fun-A14B-Control-Camera.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_lora/Wan2.2-Fun-A14B-Control-Camera.py)|
 |[openmoss/MOVA-360p](https://modelscope.cn/models/openmoss/MOVA-360p)|`input_image`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_inference/MOVA-360p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_inference_low_vram/MOVA-360p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/full/MOVA-360P-I2AV.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/validate_full/MOVA-360p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/lora/MOVA-360P-I2AV.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/validate_lora/MOVA-360p-I2AV.py)|
 |[openmoss/MOVA-720p](https://modelscope.cn/models/openmoss/MOVA-720p)|`input_image`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_inference/MOVA-720p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_inference_low_vram/MOVA-720p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/full/MOVA-720P-I2AV.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/validate_full/MOVA-720p-I2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/lora/MOVA-720P-I2AV.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/mova/model_training/validate_lora/MOVA-720p-I2AV.py)|
-|[Wan-AI/WanToDance-14B (global model)](https://modelscope.cn/models/Wan-AI/WanToDance-14B)|`wantodance_music_path`, `wantodance_reference_image`, `wantodance_fps`, `wantodance_keyframes`, `wantodance_keyframes_mask`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference/WanToDance-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference_low_vram/WanToDance-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/full/WanToDance-14B-global.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_full/WanToDance-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/lora/WanToDance-14B-global.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_lora/WanToDance-14B-global.py)|
-|[Wan-AI/WanToDance-14B (local model)](https://modelscope.cn/models/Wan-AI/WanToDance-14B)|`wantodance_music_path`, `wantodance_reference_image`, `wantodance_fps`, `wantodance_keyframes`, `wantodance_keyframes_mask`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference/WanToDance-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference_low_vram/WanToDance-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/full/WanToDance-14B-local.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_full/WanToDance-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/lora/WanToDance-14B-local.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_lora/WanToDance-14B-local.py)|
+|[Wan-AI/Wan2.2-Dancer-14B (global model)](https://modelscope.cn/models/Wan-AI/Wan2.2-Dancer-14B)|`wantodance_music_path`, `wantodance_reference_image`, `wantodance_fps`, `wantodance_keyframes`, `wantodance_keyframes_mask`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference/Wan2.2-Dancer-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference_low_vram/Wan2.2-Dancer-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/full/Wan2.2-Dancer-14B-global.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_full/Wan2.2-Dancer-14B-global.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/lora/Wan2.2-Dancer-14B-global.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_lora/Wan2.2-Dancer-14B-global.py)|
+|[Wan-AI/Wan2.2-Dancer-14B (local model)](https://modelscope.cn/models/Wan-AI/Wan2.2-Dancer-14B)|`wantodance_music_path`, `wantodance_reference_image`, `wantodance_fps`, `wantodance_keyframes`, `wantodance_keyframes_mask`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference/Wan2.2-Dancer-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_inference_low_vram/Wan2.2-Dancer-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/full/Wan2.2-Dancer-14B-local.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_full/Wan2.2-Dancer-14B-local.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/lora/Wan2.2-Dancer-14B-local.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/model_training/validate_lora/Wan2.2-Dancer-14B-local.py)|
 </details>

{diffsynth-2.0.11 → diffsynth-2.0.12}/diffsynth/configs/model_configs.py RENAMED

@@ -309,7 +309,7 @@ wan_series = [
         "state_dict_converter": "diffsynth.utils.state_dict_converters.wans2v_audio_encoder.WanS2VAudioEncoderStateDictConverter",
     },
     {
-        # Example: ModelConfig(model_id="Wan-AI/WanToDance-14B", origin_file_pattern="global_model.safetensors")
+        # Example: ModelConfig(model_id="Wan-AI/Wan2.2-Dancer-14B", origin_file_pattern="global_model.safetensors")
         "model_hash": "eb18873fc0ba77b541eb7b62dbcd2059",
         "model_name": "wan_video_dit",
         "model_class": "diffsynth.models.wan_video_dit.WanModel",
@@ -833,20 +833,6 @@ ltx2_series = [
         "extra_kwargs": {"decoder_version": "ltx-2.3"},
         "state_dict_converter": "diffsynth.utils.state_dict_converters.ltx2_video_vae.LTX2VideoDecoderStateDictConverter",
     },
-    {
-        # Example: ModelConfig(model_id="DiffSynth-Studio/LTX-2.3-Repackage", origin_file_pattern="audio_vocoder.safetensors")
-        "model_hash": "7d7823dde8f1ea0b50fb07ac329dd4cb",
-        "model_name": "ltx2_audio_vae_decoder",
-        "model_class": "diffsynth.models.ltx2_audio_vae.LTX2AudioDecoder",
-        "state_dict_converter": "diffsynth.utils.state_dict_converters.ltx2_audio_vae.LTX2AudioDecoderStateDictConverter",
-    },
-    {
-        # Example: ModelConfig(model_id="DiffSynth-Studio/LTX-2.3-Repackage", origin_file_pattern="audio_vae_encoder.safetensors")
-        "model_hash": "29338f3b95e7e312a3460a482e4f4554",
-        "model_name": "ltx2_audio_vae_encoder",
-        "model_class": "diffsynth.models.ltx2_audio_vae.LTX2AudioEncoder",
-        "state_dict_converter": "diffsynth.utils.state_dict_converters.ltx2_audio_vae.LTX2AudioEncoderStateDictConverter",
-    },
     {
         # Example: ModelConfig(model_id="DiffSynth-Studio/LTX-2.3-Repackage", origin_file_pattern="audio_vocoder.safetensors")
         "model_hash": "cd436c99e69ec5c80f050f0944f02a15",
@@ -1040,7 +1026,16 @@ ace_step_series = [
     },
 ]
+hidream_o1_image_series = [
+    {
+        # Example: ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="model-*.safetensors")
+        "model_hash": "58a7c1073d79556bfc61e05e6061b771",
+        "model_name": "hidream_o1_image_dit",
+        "model_class": "diffsynth.models.hidream_o1_image_dit.HiDreamO1ImageModel",
+    },
+]
 MODEL_CONFIGS = (
     stable_diffusion_xl_series + stable_diffusion_series + qwen_image_series + wan_series + flux_series + flux2_series + ernie_image_series
-    + z_image_series + ltx2_series + anima_series + mova_series + joyai_image_series + ace_step_series
+    + z_image_series + ltx2_series + anima_series + mova_series + joyai_image_series + ace_step_series + hidream_o1_image_series
 )
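
The last hunk above registers the new model by appending `hidream_o1_image_series` to the `MODEL_CONFIGS` concatenation. A minimal sketch of why that step matters, assuming entries are looked up by `model_hash`; this diff does not show how diffsynth actually consumes `MODEL_CONFIGS`, so `find_configs` and the placeholder entry are hypothetical.

```python
# Hypothetical stand-in for one of the existing series lists.
ace_step_series = [
    {"model_hash": "0" * 32, "model_name": "placeholder_model"},  # placeholder, not from the package
]

# Entry copied from the hunk above.
hidream_o1_image_series = [
    {
        "model_hash": "58a7c1073d79556bfc61e05e6061b771",
        "model_name": "hidream_o1_image_dit",
        "model_class": "diffsynth.models.hidream_o1_image_dit.HiDreamO1ImageModel",
    },
]

# A series only becomes visible to lookups once it is part of the concatenation.
MODEL_CONFIGS = ace_step_series + hidream_o1_image_series


def find_configs(model_hash: str) -> list:
    """Return every registered entry with the given hash (hypothetical lookup)."""
    return [config for config in MODEL_CONFIGS if config["model_hash"] == model_hash]


matches = find_configs("58a7c1073d79556bfc61e05e6061b771")
print(matches[0]["model_name"])
```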

{diffsynth-2.0.11 → diffsynth-2.0.12}/diffsynth/configs/vram_management_module_maps.py RENAMED

@@ -327,7 +327,7 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
     "diffsynth.models.ace_step_tokenizer.AceStepTokenizer": {
         "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
         "torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "vector_quantize_pytorch.ResidualFSQ": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "diffsynth.models.ace_step_residual_fsq.ResidualFSQ": "diffsynth.core.vram.layers.AutoWrappedModule",
         "transformers.models.qwen3.modeling_qwen3.Qwen3RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
         "transformers.models.qwen3.modeling_qwen3.Qwen3MLP": "diffsynth.core.vram.layers.AutoWrappedModule",
         "transformers.models.qwen3.modeling_qwen3.Qwen3RotaryEmbedding": "diffsynth.core.vram.layers.AutoWrappedModule",
@@ -372,6 +372,14 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
         "diffsynth.models.stable_diffusion_text_encoder.CLIPAttention": "diffsynth.core.vram.layers.AutoWrappedModule",
         "diffsynth.models.stable_diffusion_xl_text_encoder.CLIPTextModelWithProjection": "diffsynth.core.vram.layers.AutoWrappedModule",
     },
+    "diffsynth.models.hidream_o1_image_dit.HiDreamO1ImageModel": {
+        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
+        "torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "torch.nn.Conv3d": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "diffsynth.models.hidream_o1_image_dit.Qwen3VLTextRMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "diffsynth.models.hidream_o1_image_dit.Qwen3VLVisionModel": "diffsynth.core.vram.layers.AutoWrappedModule",
+    },
 }
 def QwenImageTextEncoder_Module_Map_Updater():

{diffsynth-2.0.11 → diffsynth-2.0.12}/diffsynth/core/__init__.py RENAMED

@@ -4,3 +4,4 @@ from .gradient import *
 from .loader import *
 from .vram import *
 from .device import *
+from .offload_training import *

{diffsynth-2.0.11 → diffsynth-2.0.12}/diffsynth/core/attention/attention.py RENAMED

@@ -63,10 +63,10 @@ def rearrange_out(out: torch.Tensor, out_pattern="b n s d", required_out_pattern
     return out
-def torch_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, attn_mask=None, scale=None):
+def torch_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, attn_mask=None, scale=None, is_causal=False):
     required_in_pattern, required_out_pattern= "b n s d", "b n s d"
     q, k, v = rearrange_qkv(q, k, v, q_pattern, k_pattern, v_pattern, required_in_pattern, dims)
-    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask, scale=scale)
+    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask, scale=scale, is_causal=is_causal)
     out = rearrange_out(out, out_pattern, required_out_pattern, dims)
     return out
@@ -81,10 +81,10 @@ def flash_attention_3(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_patte
     return out
-def flash_attention_2(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, scale=None):
+def flash_attention_2(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, scale=None, is_causal=False):
     required_in_pattern, required_out_pattern= "b s n d", "b s n d"
     q, k, v = rearrange_qkv(q, k, v, q_pattern, k_pattern, v_pattern, required_in_pattern, dims)
-    out = flash_attn.flash_attn_func(q, k, v, softmax_scale=scale)
+    out = flash_attn.flash_attn_func(q, k, v, softmax_scale=scale, causal=is_causal)
     out = rearrange_out(out, out_pattern, required_out_pattern, dims)
     return out
@@ -105,17 +105,17 @@ def xformers_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_patt
     return out
-def attention_forward(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, attn_mask=None, scale=None, compatibility_mode=False):
+def attention_forward(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n s d", k_pattern="b n s d", v_pattern="b n s d", out_pattern="b n s d", dims=None, attn_mask=None, scale=None, is_causal=False, compatibility_mode=False):
     if compatibility_mode or (attn_mask is not None):
-        return torch_sdpa(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, attn_mask=attn_mask, scale=scale)
+        return torch_sdpa(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, attn_mask=attn_mask, scale=scale, is_causal=is_causal)
     else:
         if ATTENTION_IMPLEMENTATION == "flash_attention_3":
             return flash_attention_3(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale)
         elif ATTENTION_IMPLEMENTATION == "flash_attention_2":
-            return flash_attention_2(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale)
+            return flash_attention_2(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale, is_causal=is_causal)
         elif ATTENTION_IMPLEMENTATION == "sage_attention":
             return sage_attention(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale)
         elif ATTENTION_IMPLEMENTATION == "xformers":
             return xformers_attention(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale)
         else:
-            return torch_sdpa(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale)
+            return torch_sdpa(q, k, v, q_pattern, k_pattern, v_pattern, out_pattern, dims, scale=scale, is_causal=is_causal)
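
The hunks above thread an `is_causal` flag through `torch_sdpa`, `flash_attention_2`, and `attention_forward`. As a rough illustration of what the flag means, here is a pure-Python, single-head sketch of scaled dot-product attention on small lists. It is not the library's implementation; the function name and the sample values are assumptions made for this example.

```python
import math


def attention(q, k, v, is_causal=False):
    """Single-head scaled dot-product attention on lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        # With is_causal=True, query i only attends to keys 0..i.
        visible = range(i + 1) if is_causal else range(len(k))
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, visible)) / total
            for d in range(len(v[0]))
        ])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
causal = attention(q, k, v, is_causal=True)
full = attention(q, k, v, is_causal=False)
print(causal[0], full[0])
```

Under the causal mask the first query sees only the first key, so its output is the first value row, while the last query sees every key in both modes. Two things are worth checking against the hunk: only the `torch_sdpa` and `flash_attention_2` branches of `attention_forward` receive `is_causal`, and PyTorch's documentation for `scaled_dot_product_attention` states that `attn_mask` and `is_causal` should not be set together.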

{diffsynth-2.0.11 → diffsynth-2.0.12}/diffsynth/core/data/operators.py RENAMED

@@ -2,8 +2,6 @@ import math, warnings
 import torch, torchvision, imageio, os
 import imageio.v3 as iio
 from PIL import Image
-import torchaudio
-from diffsynth.utils.data.audio import read_audio
 class DataProcessingPipeline:
@@ -249,9 +247,11 @@ class ToAbsolutePath(DataProcessingOperator):
 class LoadAudio(DataProcessingOperator):
     def __init__(self, sr=16000):
         self.sr = sr
-    def __call__(self, data: str):
         import librosa
-        input_audio, sample_rate = librosa.load(data, sr=self.sr)
+        self.audio_loader = librosa.load
+    def __call__(self, data: str):
+        input_audio, sample_rate = self.audio_loader(data, sr=self.sr)
         return input_audio
@@ -259,13 +259,15 @@ class LoadAudioWithTorchaudio(DataProcessingOperator, FrameSamplerByRateMixin):
     def __init__(self, num_frames=121, time_division_factor=8, time_division_remainder=1, frame_rate=24, fix_frame_rate=True):
         FrameSamplerByRateMixin.__init__(self, num_frames, time_division_factor, time_division_remainder, frame_rate, fix_frame_rate)
+        import torchaudio
+        self.audio_loader = torchaudio.load
     def __call__(self, data: str):
         try:
             reader = self.get_reader(data)
             num_frames = self.get_num_frames(reader)
             duration = num_frames / self.frame_rate
-            waveform, sample_rate = torchaudio.load(data)
+            waveform, sample_rate = self.audio_loader(data)
             target_samples = int(duration * sample_rate)
             current_samples = waveform.shape[-1]
             if current_samples > target_samples:
@@ -285,10 +287,12 @@ class LoadPureAudioWithTorchaudio(DataProcessingOperator):
         self.target_sample_rate = target_sample_rate
         self.target_duration = target_duration
         self.resample = True if target_sample_rate is not None else False
+        from diffsynth.utils.data.audio import read_audio
+        self.audio_loader = read_audio
     def __call__(self, data: str):
         try:
-            waveform, sample_rate = read_audio(data, resample=self.resample, resample_rate=self.target_sample_rate)
+            waveform, sample_rate = self.audio_loader(data, resample=self.resample, resample_rate=self.target_sample_rate)
             if self.target_duration is not None:
                 target_samples = int(self.target_duration * sample_rate)
                 current_samples = waveform.shape[-1]
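
The hunks above move the `torchaudio`, `librosa`, and `read_audio` imports from module level into each operator's `__init__`, so the optional audio dependencies are only needed when an audio operator is constructed. A minimal sketch of the same pattern using only the standard library; `LazyLoader` is a hypothetical class, and `json.loads` stands in for `librosa.load` or `torchaudio.load`.

```python
import importlib


class LazyLoader:
    """Resolve an optional dependency when the operator is constructed (hypothetical)."""

    def __init__(self, module_name: str, attr: str):
        # A missing dependency raises ImportError here, not when the
        # surrounding module is imported.
        module = importlib.import_module(module_name)
        self.loader = getattr(module, attr)

    def __call__(self, *args, **kwargs):
        return self.loader(*args, **kwargs)


# Standard-library stand-in for an audio loading function.
json_loader = LazyLoader("json", "loads")
parsed = json_loader('{"sr": 16000}')

try:
    LazyLoader("module_that_is_not_installed_xyz", "load")
    missing_dependency_detected = False
except ImportError:
    missing_dependency_detected = True

print(parsed, missing_dependency_detected)
```

The trade-off is that an absent dependency is reported later, at construction time, in exchange for the rest of the module staying importable without the `audio` extra.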

diffsynth-2.0.12/diffsynth/core/offload_training/__init__.py ADDED

@@ -0,0 +1 @@
+from .manager import OffloadTrainingManager
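
The docstring of `manager.py`, which follows, describes a per-module hook lifecycle with gradient checkpointing: offload after the first forward, keep weights loaded after the recomputed forward, offload after backward. The sketch below is a pure-Python simulation of that ordering which records events instead of moving tensors. `HookLifecycleSim` and its `on_gpu` bookkeeping are assumptions made for illustration; the real offloader classes are not shown in this part of the diff.

```python
class HookLifecycleSim:
    """Pure-Python stand-in for the per-module hooks; records events only."""

    def __init__(self):
        self.events = []
        self.in_recompute = set()  # modules whose first forward already ran
        self.on_gpu = set()        # assumed bookkeeping: modules currently loaded

    def _onload(self, name):
        if name not in self.on_gpu:
            self.on_gpu.add(name)
            self.events.append(f"load:{name}")

    def _offload(self, name):
        if name in self.on_gpu:
            self.on_gpu.discard(name)
            self.events.append(f"offload:{name}")

    def forward(self, name):
        self._onload(name)             # forward_pre hook
        if name in self.in_recompute:  # forward hook during recompute: keep loaded
            return
        self.in_recompute.add(name)    # first forward: mark, then offload
        self._offload(name)

    def backward(self, name):
        self._onload(name)             # backward_pre hook (no-op if still loaded)
        self._offload(name)            # backward hook

    def after_backward(self):
        self.in_recompute.clear()      # reset for the next training step


sim = HookLifecycleSim()
sim.forward("block0")   # first forward
sim.forward("block0")   # recomputed forward during backward
sim.backward("block0")  # backward pass
sim.after_backward()
print(sim.events)
```

In this simulation the weights are loaded twice and offloaded twice per step, and the recomputed forward leaves them in place so the backward pass can reuse them.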

diffsynth-2.0.12/diffsynth/core/offload_training/manager.py ADDED

@@ -0,0 +1,177 @@
+"""
+Layer offloading for training — hook-based CPU offload.
+Hook lifecycle per module:
+No checkpointing:
+  forward_pre(load→GPU) → forward() → forward_hook(offload)
+  backward_pre(load→GPU) → backward() → backward_hook(offload)
+With checkpointing (use_reentrant=False):
+  First forward:
+    forward_pre(load→GPU) → forward() → forward_hook(offload, mark in_recompute)
+  Recomputing forward (during backward):
+    forward_pre(load→GPU) → forward() → forward_hook(in_recompute=True → keep GPU)
+  Backward:
+    backward_pre(load→GPU) → backward() → backward_hook(offload)
+"""
+import torch
+import torch.nn as nn
+import warnings
+from .offloader import StaticParamOffloader, TrainableParamOffloader, AlwaysOnGPUParamOffloader, BufferOffloader
+from .memory_buffer import PinnedArenaPool, BaseBufferPool
+warnings.filterwarnings("ignore", message="Full backward hook is firing when gradients are computed with respect to module outputs")
+def has_parameters(module: nn.Module) -> bool:
+    return len(list(module.parameters())) > 0
+def count_parameters(module: nn.Module) -> int:
+    return sum(p.numel() for p in module.parameters())
+def is_leaf_module(module: nn.Module) -> bool:
+    return len(list(module.children())) == 0
+class UnitWiseParamManager:
+    def __init__(self, model: nn.Module, target_device: torch.device, enable_optimizer_cpu_offload: bool = False, params: list = None, buffers: list = None, memory_buffer: BaseBufferPool = None):
+        self.model = model
+        self.target_device = target_device
+        self.param_offloaders = {}
+        for param in (model.parameters() if params is None else params):
+            if not param.requires_grad:
+                self.param_offloaders[id(param)] = StaticParamOffloader(param, target_device, memory_buffer=memory_buffer)
+            else:
+                if enable_optimizer_cpu_offload:
+                    self.param_offloaders[id(param)] = TrainableParamOffloader(param, target_device)
+                else:
+                    self.param_offloaders[id(param)] = AlwaysOnGPUParamOffloader(param, target_device)
+        if buffers is not None and len(buffers) > 0:
+            for mod, buf_name, buf in buffers:
+                self.param_offloaders[id(buf)] = BufferOffloader(mod, buf_name, buf, target_device, memory_buffer=memory_buffer)
+    def move_gradients_to_cpu(self):
+        for offloader in self.param_offloaders.values():
+            offloader.offload_grad()
+    def onload_module(self, module: nn.Module):
+        for param in module.parameters(recurse=False):
+            if id(param) in self.param_offloaders:
+                self.param_offloaders[id(param)].onload()
+        for name, buf in module.named_buffers(recurse=False):
+            if id(buf) in self.param_offloaders:
+                self.param_offloaders[id(buf)].onload()
+    def offload_module(self, module: nn.Module):
+        for param in module.parameters(recurse=False):
+            if id(param) in self.param_offloaders:
+                self.param_offloaders[id(param)].offload()
+        for name, buf in module.named_buffers(recurse=False):
+            if id(buf) in self.param_offloaders:
+                self.param_offloaders[id(buf)].offload()
+class UnitWiseHookManager:
+    def __init__(self, model: nn.Module, target_device: torch.device, enable_optimizer_cpu_offload: bool = False,
+                 params: list = None, buffers: list = None, memory_buffer: BaseBufferPool = None):
+        self.param_manager = UnitWiseParamManager(model, target_device, enable_optimizer_cpu_offload, params=params, buffers=buffers, memory_buffer=memory_buffer)
+        self._in_recompute: set = set()
+        self._register_hooks(model)
+    def _register_hooks(self, module: nn.Module):
+        def forward_pre_hook(mod, args):
+            self.param_manager.onload_module(mod)
+        def forward_hook(mod, args, output):
+            if mod in self._in_recompute:
+                return
+            self._in_recompute.add(mod)
+            self.param_manager.offload_module(mod)
+        def backward_pre_hook(mod, grad_output):
+            self.param_manager.onload_module(mod)
+        def backward_hook(mod, grad_input, grad_output):
+            self.param_manager.offload_module(mod)
+        module.register_forward_pre_hook(forward_pre_hook)
+        module.register_forward_hook(forward_hook)
+        module.register_full_backward_pre_hook(backward_pre_hook)
+        if is_leaf_module(module):
+            module.register_full_backward_hook(backward_hook)
+        else:
+            # Parent module backward_hook fires before child backward completes.
+            # Register on leaf children instead.
+            sub_modules = [m for m in module.modules() if is_leaf_module(m) and has_parameters(m)]
+            for sub_mod in sub_modules:
+                sub_mod.register_full_backward_hook(backward_hook)
+    def after_backward(self):
+        self._in_recompute.clear()
+        self.param_manager.move_gradients_to_cpu()
+    @property
+    def managed_param_ids(self):
+        return set(self.param_manager.param_offloaders.keys())
+class OffloadTrainingManager:
+    def __init__(self, model: nn.Module, target_device: torch.device, enable_optimizer_cpu_offload: bool = False, cpu_offload_split_threshold: int = None):
+        self.model = model
+        self.target_device = target_device
+        self.enable_optimizer_cpu_offload = enable_optimizer_cpu_offload
+        cpu_offload_split_threshold = cpu_offload_split_threshold * 1024 * 1024 if cpu_offload_split_threshold is not None else None
+        self._register_units(model, target_device, enable_optimizer_cpu_offload, cpu_offload_split_threshold)
+    def _register_units(self, model: nn.Module, target_device: torch.device, enable_optimizer_cpu_offload: bool, cpu_offload_split_threshold: int = None):
+        self.memory_buffer = PinnedArenaPool.from_model(model)
+        units = self._find_units_recursive(model, cpu_offload_split_threshold)
+        self.units = [UnitWiseHookManager(u, target_device, enable_optimizer_cpu_offload, memory_buffer=self.memory_buffer) for u in units]
+        managed_param_ids = set().union(*[unit.managed_param_ids for unit in self.units])
+        orphan_params, orphan_buffers = self._find_orphan_params_and_buffers(model, managed_param_ids)
+        for orphan_module in set(orphan_params.keys()) | set(orphan_buffers.keys()):
+            params = orphan_params.get(orphan_module, [])
+            buffers = orphan_buffers.get(orphan_module, [])
+            self.units.append(UnitWiseHookManager(orphan_module, target_device, enable_optimizer_cpu_offload, params=params, buffers=buffers, memory_buffer=self.memory_buffer))
+    def _find_orphan_params_and_buffers(self, model: nn.Module, managed_param_ids: set):
+        orphan_params_by_module = {}
+        for _, mod in model.named_modules():
+            for param in mod.parameters(recurse=False):
+                if id(param) not in managed_param_ids:
+                    orphan_params_by_module.setdefault(mod, []).append(param)
+        # Collect buffers grouped by owner module (every buffer is included; unlike params, no managed-ID filter is applied)
+        orphan_buffers_by_module = {}
+        for _, mod in model.named_modules():
+            for name, buf in mod.named_buffers(recurse=False):
+                orphan_buffers_by_module.setdefault(mod, []).append((mod, name, buf))
+        return orphan_params_by_module, orphan_buffers_by_module
+    def _find_units_recursive(self, module: nn.Module, cpu_offload_split_threshold: int = None) -> list:
+        if cpu_offload_split_threshold is None:
+            return [m for m in module.modules() if is_leaf_module(m) and has_parameters(m)]
+        if self._should_force_recurse(module, cpu_offload_split_threshold):
+            units = []
+            for child in module.children():
+                units.extend(self._find_units_recursive(child, cpu_offload_split_threshold))
+            return units
+        return [module]
+    def _should_force_recurse(self, module: nn.Module, cpu_offload_split_threshold: int = None) -> bool:
+        if is_leaf_module(module):
+            return False
+        if (
+            count_parameters(module) > cpu_offload_split_threshold
+            or ('forward' not in type(module).__dict__)
+            or (hasattr(module, 'encode') and hasattr(module, 'decode'))
+        ):
+            return True
+        return False
+    # run after backward() and before optimizer.step()
+    def after_backward(self):
+        for unit in self.units:
+            unit.after_backward()
+        torch.cuda.synchronize()
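The unit-selection recursion in `_find_units_recursive` can be illustrated without PyTorch. The following is a minimal sketch, not the package's implementation: `Node` and `find_units` are hypothetical stand-ins for `nn.Module` and the method above, and only the parameter-count condition of `_should_force_recurse` is modeled (the `forward` and `encode`/`decode` checks are omitted).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for nn.Module: a name, its own parameter count, and children."""
    name: str
    own_params: int = 0
    children: List["Node"] = field(default_factory=list)

    def total_params(self) -> int:
        return self.own_params + sum(c.total_params() for c in self.children)

    def walk(self):
        # Pre-order traversal, analogous to nn.Module.modules().
        yield self
        for child in self.children:
            yield from child.walk()


def find_units(node: Node, threshold: Optional[int] = None) -> List[Node]:
    # Without a threshold, every leaf that owns parameters becomes its own unit.
    if threshold is None:
        return [n for n in node.walk() if not n.children and n.total_params() > 0]
    # Simplified split rule (assumption): recurse only when a non-leaf exceeds the threshold.
    if node.children and node.total_params() > threshold:
        units = []
        for child in node.children:
            units.extend(find_units(child, threshold))
        return units
    # Small enough (or a leaf): offload this subtree as a single unit.
    return [node]


model = Node("model", children=[
    Node("block0", children=[Node("block0.attn", 60), Node("block0.mlp", 80)]),
    Node("block1", children=[Node("block1.attn", 10), Node("block1.mlp", 20)]),
    Node("norm", 5),
])

split_names = [u.name for u in find_units(model, threshold=100)]
leaf_names = [u.name for u in find_units(model)]
print(split_names)  # ['block0.attn', 'block0.mlp', 'block1', 'norm']
print(leaf_names)   # ['block0.attn', 'block0.mlp', 'block1.attn', 'block1.mlp', 'norm']
```

In this toy tree, `block0` (140 parameters) exceeds the threshold and is split into its children, while `block1` (30 parameters) stays whole, so a larger threshold yields fewer, coarser units.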

diffsynth-2.0.12/diffsynth/core/offload_training/memory_buffer.py ADDED

@@ -0,0 +1,136 @@
+import torch
+ALIGNMENT = 64
+def _align_up(x: int, alignment: int = ALIGNMENT) -> int:
+    return (x + alignment - 1) // alignment * alignment
+def _next_power_of_two(x: int) -> int:
+    """
+    Smallest power of two >= x.
+    For power-of-two x=2^k: (x-1) has bit_length=k, so 1<<k = x (unchanged).
+    For non-power-of-two: (x-1).bit_length() exceeds floor-log2(x), rounding up to next 2^n.
+    """
+    return 1 if x <= 1 else 1 << (x - 1).bit_length()
+def _prev_power_of_two(x: int) -> int:
+    """Largest power-of-two <= x."""
+    return 1 if x <= 1 else 1 << (x.bit_length() - 1)
+def _tensor_storage_size(tensor: torch.Tensor) -> int:
+    return _align_up(tensor.numel() * tensor.element_size())
+class BaseBufferPool:
+    """Naive per-tensor pin_memory allocation. No pre-allocation, no memory saving."""
+    def allocate_like(self, tensor: torch.Tensor) -> torch.Tensor:
+        return tensor.pin_memory()
+    @classmethod
+    def from_model(cls, model: torch.nn.Module, **kwargs):
+        return cls()
+class PinnedBuffer:
+    """Single pinned uint8 buffer with bump-pointer allocation. Lazy: actual memory allocated on first allocate_like."""
+    def __init__(self, size: int):
+        self._size = size
+        self._buf: torch.Tensor | None = None
+        self._offset = 0
+    def _ensure_allocated(self):
+        if self._buf is None:
+            self._buf = torch.empty(self._size, dtype=torch.uint8, device="cpu", pin_memory=True)
+    @property
+    def capacity(self) -> int:
+        return self._size
+    @property
+    def remaining(self) -> int:
+        return self._size if self._buf is None else self._buf.numel() - self._offset
+    @property
+    def used(self) -> int:
+        return self._offset
+    @classmethod
+    def from_tensor(cls, tensor: torch.Tensor, min_size: int = 1 * 1024**3):
+        size = max(_tensor_storage_size(tensor) + ALIGNMENT, min_size)
+        return cls(_next_power_of_two(size))
+    def allocate_like(self, tensor: torch.Tensor, *, copy: bool = True, non_blocking: bool = False) -> torch.Tensor | None:
+        """Try to allocate a view for tensor. Returns None if not enough space."""
+        num_bytes = tensor.numel() * tensor.element_size()
+        if num_bytes > self.remaining:
+            return None
+        self._ensure_allocated()
+        view = self._buf.narrow(0, self._offset, num_bytes).view(tensor.dtype).reshape(tuple(tensor.shape))
+        if copy:
+            view.copy_(tensor, non_blocking=bool(non_blocking and tensor.device.type == "cuda"))
+        self._offset = _align_up(self._offset + num_bytes)
+        return view
+class PinnedArenaPool(BaseBufferPool):
+    """Pinned arena pool — pre-allocate pinned memory, avoid per-tensor cudaHostAlloc overhead.
+    Pool strategy:
+      1. Sizing: from_model() scans all non-trainable params + buffers, sums their aligned sizes
+         as total_bytes. max_chunk_size is raised to fit the largest single tensor.
+      2. Decomposition: total_bytes is split into power-of-two chunks (min_chunk_size ~ max_chunk_size).
+         Each chunk becomes one PinnedBuffer (lazy — actual pin_memory on first use).
+      3. Allocation: allocate_like() sequentially probes each buffer for space (first-fit).
+         Each PinnedBuffer uses bump-pointer with ALIGNMENT padding between tensors.
+      4. Growth: if all existing buffers are full, _grow() appends a new PinnedBuffer sized
+         to the requesting tensor (at least min_chunk_size, power-of-two rounded).
+      5. Fallback: on any exception, falls back to per-tensor pin_memory().
+    """
+    def __init__(self, total_bytes: int, min_chunk_size: int = 1 * 1024**3, max_chunk_size: int = 4 * 1024**3):
+        self.min_chunk_size = _next_power_of_two(int(min_chunk_size))
+        self.max_chunk_size = _next_power_of_two(int(max_chunk_size))
+        self.min_chunk_size = min(self.min_chunk_size, self.max_chunk_size)
+        self._buffers = [PinnedBuffer(s) for s in self._decompose(total_bytes, self.min_chunk_size, self.max_chunk_size)]
+    @classmethod
+    def from_model(cls, model: torch.nn.Module, min_chunk_size: int = 1 * 1024**3, max_chunk_size: int = 4 * 1024**3):
+        """Size pool for all non-trainable params + buffers."""
+        tensors = [p for p in model.parameters() if not p.requires_grad] + list(model.buffers())
+        total = sum(_tensor_storage_size(t) for t in tensors)
+        max_tensor_size = max((_tensor_storage_size(t) for t in tensors), default=0)
+        max_chunk_size = _next_power_of_two(max_tensor_size) if max_tensor_size > max_chunk_size else max_chunk_size
+        return cls(total, min_chunk_size=min_chunk_size, max_chunk_size=max_chunk_size)
+    @staticmethod
+    def _decompose(total_bytes: int, min_chunk_size: int, max_chunk_size: int) -> list:
+        """Decompose total_bytes into power-of-two chunks capped by min/max."""
+        if total_bytes <= 0:
+            return []
+        chunks, remaining = [], total_bytes
+        while remaining > 0:
+            chunk = max(min(_prev_power_of_two(remaining), max_chunk_size), min_chunk_size)
+            chunks.append(chunk)
+            remaining -= chunk
+        chunks.sort(reverse=True)
+        return chunks
+    def _grow(self, tensor: torch.Tensor):
+        self._buffers.append(PinnedBuffer.from_tensor(tensor, min_size=self.min_chunk_size))
+    def allocate_like(self, tensor: torch.Tensor, *, copy: bool = True, require_contiguous: bool = True, non_blocking: bool = False) -> torch.Tensor:
+        """Allocate a pinned view. Falls back to per-tensor pin_memory on failure."""
+        src = tensor.detach()
+        if require_contiguous and not src.is_contiguous():
+            src = src.contiguous()
+        try:
+            for buf in self._buffers:
+                view = buf.allocate_like(src, copy=copy, non_blocking=non_blocking)
+                if view is not None:
+                    return view
+            self._grow(src)
+            return self._buffers[-1].allocate_like(src, copy=copy, non_blocking=non_blocking)
+        except Exception:
+            return src.pin_memory()
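The pool strategy described in the `PinnedArenaPool` docstring (power-of-two decomposition, first-fit bump-pointer allocation, growth on exhaustion) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: `Arena` and `Pool` are hypothetical names, a `bytearray` stands in for pinned host memory, and the lazy allocation, tensor copy, and exception fallback of the original are left out.

```python
ALIGNMENT = 64


def align_up(x: int, alignment: int = ALIGNMENT) -> int:
    return (x + alignment - 1) // alignment * alignment


def next_power_of_two(x: int) -> int:
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def prev_power_of_two(x: int) -> int:
    return 1 if x <= 1 else 1 << (x.bit_length() - 1)


def decompose(total: int, min_chunk: int, max_chunk: int) -> list:
    """Split total into power-of-two chunks clamped to [min_chunk, max_chunk]."""
    chunks, remaining = [], total
    while remaining > 0:
        chunk = max(min(prev_power_of_two(remaining), max_chunk), min_chunk)
        chunks.append(chunk)
        remaining -= chunk  # may go negative: the last chunk can overshoot
    return sorted(chunks, reverse=True)


class Arena:
    """Bump-pointer arena over a bytearray (stand-in for one pinned buffer)."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.offset = 0

    def allocate(self, num_bytes: int):
        if num_bytes > len(self.buf) - self.offset:
            return None  # not enough space left in this arena
        view = memoryview(self.buf)[self.offset:self.offset + num_bytes]
        self.offset = align_up(self.offset + num_bytes)  # pad to the next boundary
        return view


class Pool:
    def __init__(self, total: int, min_chunk: int, max_chunk: int):
        self.min_chunk = min_chunk
        self.arenas = [Arena(s) for s in decompose(total, min_chunk, max_chunk)]

    def allocate(self, num_bytes: int):
        # First-fit: probe arenas in order and take the first with room.
        for arena in self.arenas:
            view = arena.allocate(num_bytes)
            if view is not None:
                return view
        # Growth: append an arena sized to the request (power-of-two rounded).
        size = next_power_of_two(max(align_up(num_bytes) + ALIGNMENT, self.min_chunk))
        self.arenas.append(Arena(size))
        return self.arenas[-1].allocate(num_bytes)


pool = Pool(total=300, min_chunk=64, max_chunk=256)
initial_sizes = [len(a.buf) for a in pool.arenas]
views = [pool.allocate(n) for n in (100, 100, 50, 10)]
final_sizes = [len(a.buf) for a in pool.arenas]
print(initial_sizes)  # [256, 64]
print(final_sizes)    # [256, 64, 128]
```

Note that the decomposition can reserve more than `total` bytes: here 300 bytes become 256 + 64 = 320, because the remainder of 44 bytes is rounded up to the minimum chunk size. The fourth request finds both arenas full after alignment padding, which triggers growth.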
