PyPI - diffsynth - Versions diffs - 2.0.13__tar.gz → 2.0.15__tar.gz - Mend

diffsynth 2.0.13__tar.gz → 2.0.15__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (224)

{diffsynth-2.0.13 → diffsynth-2.0.15}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: diffsynth
-Version: 2.0.13
+Version: 2.0.15
 Summary: Enjoy the magic of Diffusion models!
 Author: ModelScope Team
 License: Apache-2.0

{diffsynth-2.0.13 → diffsynth-2.0.15}/README.md RENAMED

@@ -34,6 +34,8 @@ We believe that a well-developed open-source code framework can lower the thresh
 > Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher) and [mi804](https://github.com/mi804). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand.
+- **June 16, 2026**: We have added a new Template model for ACE-Step: [vocals2music](https://www.modelscope.cn/models/DiffSynth-Studio/acestep15xlsft-vocals2music). For more details, please refer to the [documentation](/docs/zh/Model_Details/ACE-Step.md) and [example code](/examples/ace_step/).
 - **June 15, 2026** We have open-sourced Image-to-LoRA V2, compressing the hours-long training process for image style LoRAs into a single model inference step, thereby exploring a new paradigm for LoRA model training. The [technical report](https://arxiv.org/abs/2606.13809) has been released. This release includes three models:
     * [DiffSynth-Studio/ZImage-i2L-v2](https://modelscope.cn/models/DiffSynth-Studio/ZImage-i2L-v2): Adapted for the Z-Image model
     * [DiffSynth-Studio/KleinBase4B-i2L-v2](https://modelscope.cn/models/DiffSynth-Studio/KleinBase4B-i2L-v2): Adapted for the FLUX.2-klein-base-4B model
@@ -1036,6 +1038,7 @@ Example code for Ideogram 4 is available at: [/examples/ideogram4/](/examples/id
 | Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
 |-|-|-|-|-|-|-|
 |[ideogram-ai/ideogram-4-fp8](https://www.modelscope.cn/models/ideogram-ai/ideogram-4-fp8)|[code](/examples/ideogram4/model_inference/ideogram-4-fp8.py)|-|-|-|-|-|
+|[DiffSynth-Studio/ideogram-4-bf16-repackage](https://www.modelscope.cn/models/DiffSynth-Studio/ideogram-4-bf16-repackage)|[code](/examples/ideogram4/model_inference/ideogram-4-bf16-repackage.py)|[code](/examples/ideogram4/model_inference_low_vram/ideogram-4-bf16-repackage.py)|[code](/examples/ideogram4/model_training/full/Ideogram-4-bf16-repackage.sh)|-|[code](/examples/ideogram4/model_training/lora/Ideogram-4-bf16-repackage.sh)|[code](/examples/ideogram4/model_training/validate_lora/Ideogram-4-bf16-repackage.py)|
 </details>
@@ -1396,6 +1399,7 @@ Example code for ACE-Step is available at: [/examples/ace_step/](/examples/ace_s
 |[ACE-Step/acestep-v15-xl-base](https://www.modelscope.cn/models/ACE-Step/acestep-v15-xl-base)|[code](/examples/ace_step/model_inference/acestep-v15-xl-base.py)|[code](/examples/ace_step/model_inference_low_vram/acestep-v15-xl-base.py)|[code](/examples/ace_step/model_training/full/acestep-v15-xl-base.sh)|[code](/examples/ace_step/model_training/validate_full/acestep-v15-xl-base.py)|[code](/examples/ace_step/model_training/lora/acestep-v15-xl-base.sh)|[code](/examples/ace_step/model_training/validate_lora/acestep-v15-xl-base.py)|
 |[ACE-Step/acestep-v15-xl-sft](https://www.modelscope.cn/models/ACE-Step/acestep-v15-xl-sft)|[code](/examples/ace_step/model_inference/acestep-v15-xl-sft.py)|[code](/examples/ace_step/model_inference_low_vram/acestep-v15-xl-sft.py)|[code](/examples/ace_step/model_training/full/acestep-v15-xl-sft.sh)|[code](/examples/ace_step/model_training/validate_full/acestep-v15-xl-sft.py)|[code](/examples/ace_step/model_training/lora/acestep-v15-xl-sft.sh)|[code](/examples/ace_step/model_training/validate_lora/acestep-v15-xl-sft.py)|
 |[ACE-Step/acestep-v15-xl-turbo](https://www.modelscope.cn/models/ACE-Step/acestep-v15-xl-turbo)|[code](/examples/ace_step/model_inference/acestep-v15-xl-turbo.py)|[code](/examples/ace_step/model_inference_low_vram/acestep-v15-xl-turbo.py)|[code](/examples/ace_step/model_training/full/acestep-v15-xl-turbo.sh)|[code](/examples/ace_step/model_training/validate_full/acestep-v15-xl-turbo.py)|[code](/examples/ace_step/model_training/lora/acestep-v15-xl-turbo.sh)|[code](/examples/ace_step/model_training/validate_lora/acestep-v15-xl-turbo.py)|
+|[DiffSynth-Studio/acestep15xlsft-lora-music](https://www.modelscope.cn/models/DiffSynth-Studio/acestep15xlsft-lora-music)|[code](/examples/ace_step/model_inference/acestep15xlsft-vocals2music.py)|[code](/examples/ace_step/model_inference_low_vram/acestep15xlsft-vocals2music.py)|[code](/examples/ace_step/model_training/full/acestep15xlsft-vocals2music.sh)|[code](/examples/ace_step/model_training/validate_full/acestep15xlsft-vocals2music.py)|-|-|
 </details>

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/configs/model_configs.py RENAMED

@@ -917,6 +917,14 @@ stable_diffusion_xl_series = [
         "state_dict_converter": "diffsynth.utils.state_dict_converters.stable_diffusion_vae.SDVAEStateDictConverter",
         "extra_kwargs": {"scaling_factor": 0.13025, "sample_size": 1024, "force_upcast": True},
     },
+    {
+        # Example: ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="sd_xl_base_1.0.safetensors")
+        "model_hash": "4cf64a799d04260df438c6f33c9a047e",
+        "model_name": "stable_diffusion_xl_unet",
+        "model_class": "diffsynth.models.stable_diffusion_xl_unet.SDXLUNet2DConditionModel",
+        "extra_kwargs": {"attention_head_dim": [5, 10, 20], "transformer_layers_per_block": [1, 2, 10], "use_linear_projection": True, "addition_embed_type": "text_time", "addition_time_embed_dim": 256, "projection_class_embeddings_input_dim": 2816},
+        "state_dict_converter": "diffsynth.utils.state_dict_converters.sdxl.SDXLUNetStateDictConverter_Original2Diffusers",
+    }
 ]
 stable_diffusion_series = [
@@ -1149,20 +1157,17 @@ ideogram4_series = [
         "extra_kwargs": {"keep_original_dtype": True},
     },
     {
-        # Example: ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="vae/diffusion_pytorch_model.safetensors")
-        "model_hash": "c54288e3ee12ca215898840682337b95",
-        "model_name": "ideogram4_vae_encoder",
-        "model_class": "diffsynth.models.ideogram4_vae.Ideogram4VAEEncoder",
-        "state_dict_converter": "diffsynth.models.ideogram4_vae.Ideogram4VAEEncoderStateDictConverter",
-        "extra_kwargs": {"keep_original_dtype": True},
+        # Example: ModelConfig(model_id="DiffSynth-Studio/ideogram-4-bf16-repackage", origin_file_pattern="transformer/diffusion_pytorch_model.safetensors")
+        "model_hash": "291b300b11c8c8e11978bd85a9c5f80c",
+        "model_name": "ideogram4_dit",
+        "model_class": "diffsynth.models.ideogram4_dit.Ideogram4DiT",
+        "extra_kwargs": {"config": {"emb_dim": 4608, "num_layers": 34, "num_heads": 18, "intermediate_size": 12288, "adanln_dim": 512, "in_channels": 128, "llm_features_dim": 53248, "rope_theta": 5000000, "mrope_section": [24, 20, 20], "norm_eps": 1e-05}},
     },
     {
-        # Example: ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="vae/diffusion_pytorch_model.safetensors")
-        "model_hash": "c54288e3ee12ca215898840682337b95",
-        "model_name": "ideogram4_vae_decoder",
-        "model_class": "diffsynth.models.ideogram4_vae.Ideogram4VAEDecoder",
-        "state_dict_converter": "diffsynth.models.ideogram4_vae.Ideogram4VAEDecoderStateDictConverter",
-        "extra_kwargs": {"keep_original_dtype": True},
+        # Example: ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="text_encoder/model.safetensors")
+        "model_hash": "6a269892c0757aacd46bd41b8d5a7aef",
+        "model_name": "ideogram4_text_encoder",
+        "model_class": "diffsynth.models.ideogram4_text_encoder.Ideogram4TextEncoder",
     },
 ]

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/configs/vram_management_module_maps.py RENAMED

@@ -380,6 +380,19 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
         "diffsynth.models.hidream_o1_image_dit.Qwen3VLTextRMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
         "diffsynth.models.hidream_o1_image_dit.Qwen3VLVisionModel": "diffsynth.core.vram.layers.AutoWrappedModule",
     },
+    "diffsynth.models.ideogram4_dit.Ideogram4DiT": {
+        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
+        "diffsynth.models.ideogram4_dit.Ideogram4RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+    },
+    "diffsynth.models.ideogram4_text_encoder.Ideogram4TextEncoder": {
+        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
+        "torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "transformers.models.qwen3_vl.modeling_qwen3_vl.Qwen3VLTextRotaryEmbedding": "diffsynth.core.vram.layers.AutoWrappedModule",
+        "transformers.models.qwen3_vl.modeling_qwen3_vl.Qwen3VLTextRMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
+    },
 }
 def QwenImageTextEncoder_Module_Map_Updater():

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/core/attention/attention.py RENAMED

@@ -26,6 +26,14 @@ try:
 except ModuleNotFoundError:
     XFORMERS_AVAILABLE = False
+try:
+    if "enable_gqa" in inspect.signature(torch.nn.functional.scaled_dot_product_attention).parameters:
+        TORCH_SUPPORT_GQA = True
+    else:
+        TORCH_SUPPORT_GQA = False
+except:
+    TORCH_SUPPORT_GQA = False
 def initialize_attention_priority():
     if os.environ.get('DIFFSYNTH_ATTENTION_IMPLEMENTATION') is not None:
@@ -68,7 +76,7 @@ def torch_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_pattern="b n
     q, k, v = rearrange_qkv(q, k, v, q_pattern, k_pattern, v_pattern, required_in_pattern, dims)
     if q.shape[1] != k.shape[1] or q.shape[1] != v.shape[1]:
         # Grouped Query Attention
-        if "enable_gqa" in inspect.signature(torch.nn.functional.scaled_dot_product_attention).parameters:
+        if TORCH_SUPPORT_GQA:
             out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask, scale=scale, is_causal=is_causal, enable_gqa=True)
         else:
             # In low-version torch, `enable_gqa` is not supported.
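
Note on the hunks above: the `enable_gqa` capability check moves out of `torch_sdpa` and into module import, so the signature appears to be inspected once rather than on every attention call. A minimal sketch of that pattern follows. The functions `sdpa_new`, `sdpa_old`, `supports_kwarg` and `attend` are hypothetical stand-ins for illustration and are not part of diffsynth or PyTorch.

```python
import inspect


def supports_kwarg(fn, name: str) -> bool:
    """Return True if `fn` accepts a parameter called `name`."""
    try:
        return name in inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some callables expose no introspectable signature.
        return False


# Hypothetical stand-ins for a newer and an older attention kernel.
def sdpa_new(q, k, v, enable_gqa=False):
    return "gqa" if enable_gqa else "plain"


def sdpa_old(q, k, v):
    return "plain"


# Evaluated once at import time, mirroring the role of TORCH_SUPPORT_GQA.
NEW_SUPPORTS_GQA = supports_kwarg(sdpa_new, "enable_gqa")
OLD_SUPPORTS_GQA = supports_kwarg(sdpa_old, "enable_gqa")


def attend(fn, supports_gqa, q, k, v):
    # Branch on the cached flag instead of re-inspecting the signature.
    if supports_gqa:
        return fn(q, k, v, enable_gqa=True)
    return fn(q, k, v)
```

Catching only `TypeError` and `ValueError` is a choice made for this sketch; the package code uses a bare `except`.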

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/core/data/operators.py RENAMED

@@ -2,6 +2,7 @@ import math, warnings
 import torch, torchvision, imageio, os
 import imageio.v3 as iio
 from PIL import Image
+from einops import repeat
 class DataProcessingPipeline:
@@ -283,7 +284,7 @@ class LoadAudioWithTorchaudio(DataProcessingOperator, FrameSamplerByRateMixin):
 class LoadPureAudioWithTorchaudio(DataProcessingOperator):
-    def __init__(self, target_sample_rate=None, max_audio_duration=None, padding=False):
+    def __init__(self, target_sample_rate=None, max_audio_duration=None, padding=False, channels=2):
         self.target_sample_rate = target_sample_rate
         self.max_audio_duration = max_audio_duration
         self.resample = True if target_sample_rate is not None else False
@@ -302,6 +303,8 @@ class LoadPureAudioWithTorchaudio(DataProcessingOperator):
                 elif current_samples < target_samples and self.padding:
                     padding = target_samples - current_samples
                     waveform = torch.nn.functional.pad(waveform, (0, padding))
+            if waveform.shape[0] == 1:
+                waveform = repeat(waveform, "C L -> (N C) L", N=2)
             return waveform, sample_rate
         except Exception as e:
             print(f"Cannot load audio in {data} due to {e}. The audio will be `None`.")
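
Note on the hunks above: a single-channel waveform is now duplicated so that callers receive two channels. In the lines visible in this diff the new `channels` argument is not referenced and the repeat count is fixed at 2. The sketch below shows the same behaviour on plain Python lists, assuming a `(C, L)` layout of channels by samples. `mono_to_stereo` is a hypothetical name; the package itself uses `einops.repeat` on a tensor.

```python
def mono_to_stereo(waveform):
    """Duplicate a mono waveform into two identical channels.

    `waveform` is a list of channels, each a list of samples (shape C x L).
    Input that already has more than one channel is returned unchanged.
    """
    if len(waveform) == 1:
        return [list(waveform[0]), list(waveform[0])]
    return waveform
```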

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/diffusion/ddim_scheduler.py RENAMED

@@ -87,8 +87,9 @@ class DDIMScheduler():
     def add_noise(self, original_samples, noise, timestep):
-        sqrt_alpha_prod = math.sqrt(self.alphas_cumprod[int(timestep.flatten().tolist()[0])])
-        sqrt_one_minus_alpha_prod = math.sqrt(1 - self.alphas_cumprod[int(timestep.flatten().tolist()[0])])
+        timestep_id = max(min(int(timestep.flatten().tolist()[0]), len(self.alphas_cumprod)-1), 0)
+        sqrt_alpha_prod = math.sqrt(self.alphas_cumprod[timestep_id])
+        sqrt_one_minus_alpha_prod = math.sqrt(1 - self.alphas_cumprod[timestep_id])
         noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
         return noisy_samples
@@ -97,8 +98,9 @@ class DDIMScheduler():
         if self.prediction_type == "epsilon":
             return noise
         else:
-            sqrt_alpha_prod = math.sqrt(self.alphas_cumprod[int(timestep.flatten().tolist()[0])])
-            sqrt_one_minus_alpha_prod = math.sqrt(1 - self.alphas_cumprod[int(timestep.flatten().tolist()[0])])
+            timestep_id = max(min(int(timestep.flatten().tolist()[0]), len(self.alphas_cumprod)-1), 0)
+            sqrt_alpha_prod = math.sqrt(self.alphas_cumprod[timestep_id])
+            sqrt_one_minus_alpha_prod = math.sqrt(1 - self.alphas_cumprod[timestep_id])
             target = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample
             return target
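
Note on the hunks above: both methods now clamp the timestep to a valid index before reading `alphas_cumprod`. This presumably guards against a timestep equal to the table length or below zero. A scalar sketch of the clamp and of the noising formula is given below. The four-entry schedule and the function names are made up for illustration.

```python
import math


def clamp_timestep_index(timestep: float, num_train_timesteps: int) -> int:
    """Clamp a timestep to a valid index into the cumulative-alpha table."""
    return max(min(int(timestep), num_train_timesteps - 1), 0)


def add_noise(sample: float, noise: float, timestep: float, alphas_cumprod: list) -> float:
    """Scalar version of x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise."""
    idx = clamp_timestep_index(timestep, len(alphas_cumprod))
    a_t = alphas_cumprod[idx]
    return math.sqrt(a_t) * sample + math.sqrt(1.0 - a_t) * noise


# Hypothetical 4-step schedule, for illustration only.
alphas_cumprod = [0.99, 0.9, 0.5, 0.1]
```

With this schedule, a timestep of 4 resolves to index 3 instead of reading past the end of the table.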

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/diffusion/flow_match.py RENAMED

@@ -214,7 +214,7 @@ class FlowMatchScheduler():
         logsnr_max = 18.0
         t_min = 1.0 / (1 + math.exp(0.5 * logsnr_max))
         t_max = 1.0 / (1 + math.exp(0.5 * logsnr_min))
-        step_intervals = torch.linspace(0.0, 1.0, num_inference_steps + 1, dtype=torch.float64)
+        step_intervals = torch.linspace(0.0, denoising_strength, num_inference_steps + 1, dtype=torch.float64)
         sigmas = []
         for i in range(num_inference_steps + 1):
             z = torch.special.ndtri(step_intervals[i])
@@ -230,7 +230,7 @@ class FlowMatchScheduler():
             one_minus_t = one_minus_t * (sigma_start / one_minus_t[0])
         sigmas = sigmas.flip(dims=(0,))
         timesteps = sigmas[:-1]
-        sigmas = 1 - sigmas
+        sigmas = (1 - sigmas)[:-1]
         return sigmas, timesteps
     @staticmethod
@@ -263,7 +263,7 @@ class FlowMatchScheduler():
     def set_training_weight(self):
         steps = 1000
-        x = self.timesteps
+        x = self.sigmas * self.num_train_timesteps
         y = torch.exp(-2 * ((x - steps / 2) / steps) ** 2)
         y_shifted = y - y.min()
         bsmntw_weighing = y_shifted * (steps / y_shifted.sum())
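
Note on the last hunk above: the training-weight curve now takes `self.sigmas * self.num_train_timesteps` as input instead of `self.timesteps`. The lines shown compute a Gaussian-shaped bump centred at `steps / 2`, shift it so the minimum is zero, and rescale it so the weights sum to `steps`. The sketch below reproduces only those visible lines in plain Python. Any processing that follows in the real method is omitted, the sigma values are made up, and `training_weights` is a hypothetical name.

```python
import math


def training_weights(sigmas, num_train_timesteps=1000, steps=1000):
    """Gaussian-shaped weights over sigma-derived timesteps, summing to `steps`."""
    x = [s * num_train_timesteps for s in sigmas]
    y = [math.exp(-2 * ((xi - steps / 2) / steps) ** 2) for xi in x]
    y_min = min(y)
    y_shifted = [yi - y_min for yi in y]
    total = sum(y_shifted)
    return [yi * (steps / total) for yi in y_shifted]


# Illustrative sigma schedule, not taken from the package.
weights = training_weights([1.0, 0.75, 0.5, 0.25])
```

Under these assumptions the largest weight falls on the sigma closest to 0.5, and the sigma furthest from it receives a weight of zero.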

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/models/ideogram4_dit.py RENAMED

@@ -5,6 +5,8 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
+from ..core.gradient import gradient_checkpoint_forward
 LLM_TOKEN_INDICATOR = 3
 OUTPUT_IMAGE_INDICATOR = 2
 IMAGE_POSITION_OFFSET = 65536
@@ -140,7 +142,7 @@ class Ideogram4MRoPE(nn.Module):
         pos = position_ids.permute(2, 0, 1).to(dtype=torch.float32)
         inv_freq = self.inv_freq.to(dtype=torch.float32)[None, None, :, None].expand(
             3, batch_size, -1, 1
-        )
+        ).to(pos.device)
         freqs = inv_freq @ pos.unsqueeze(2)
         freqs = freqs.transpose(2, 3)
@@ -291,7 +293,7 @@ class Ideogram4EmbedScalar(nn.Module):
         scaled = 1e4 * (x - self.range_min) / (self.range_max - self.range_min)
         emb = _sinusoidal_embedding(scaled, self.dim)
         emb = emb.to(
-            getattr(self.mlp_in, "compute_dtype", None) or self.mlp_in.weight.dtype
+            getattr(self.mlp_in, "compute_dtype", None) or getattr(self.mlp_in, "computation_dtype", None) or self.mlp_in.weight.dtype
         )
         emb = F.silu(self.mlp_in(emb))
         return self.mlp_out(emb)
@@ -375,6 +377,8 @@ class Ideogram4DiT(nn.Module):
         position_ids: torch.Tensor,
         segment_ids: torch.Tensor,
         indicator: torch.Tensor,
+        use_gradient_checkpointing: bool = False,
+        use_gradient_checkpointing_offload: bool = False,
     ) -> torch.Tensor:
         """Velocity prediction.
@@ -393,7 +397,7 @@ class Ideogram4DiT(nn.Module):
         assert in_channels == self.config.in_channels
         param_dtype = (
-            getattr(self.input_proj, "compute_dtype", None) or self.input_proj.weight.dtype
+            getattr(self.input_proj, "compute_dtype", None) or getattr(self.input_proj, "computation_dtype", None) or self.input_proj.weight.dtype
         )
         x = x.to(param_dtype)
         t = t.to(param_dtype)
@@ -428,7 +432,16 @@ class Ideogram4DiT(nn.Module):
         sin = sin.to(h.dtype)
         for layer in self.layers:
-            h = layer(h, segment_ids=segment_ids, cos=cos, sin=sin, adaln_input=adaln_input)
+            h = gradient_checkpoint_forward(
+                layer,
+                use_gradient_checkpointing=use_gradient_checkpointing,
+                use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
+                x=h,
+                segment_ids=segment_ids,
+                cos=cos,
+                sin=sin,
+                adaln_input=adaln_input,
+            )
         out = self.final_layer(h, c=adaln_input)
         return out.to(torch.float32)
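
Note on the hunks above: two of them extend the dtype lookup so that a layer may expose its compute dtype under either `compute_dtype` or `computation_dtype` before the code falls back to the weight dtype. The sketch below shows that fallback chain with `SimpleNamespace` stand-ins. Strings take the place of torch dtypes, and `resolve_dtype` is a hypothetical helper name.

```python
from types import SimpleNamespace


def resolve_dtype(layer):
    """Prefer an explicit compute dtype; otherwise use the weight's dtype."""
    return (
        getattr(layer, "compute_dtype", None)
        or getattr(layer, "computation_dtype", None)
        or layer.weight.dtype
    )


# Stand-in layers; strings replace torch dtypes for illustration.
plain = SimpleNamespace(weight=SimpleNamespace(dtype="float32"))
wrapped_a = SimpleNamespace(compute_dtype="bfloat16", weight=SimpleNamespace(dtype="float8"))
wrapped_b = SimpleNamespace(computation_dtype="bfloat16", weight=SimpleNamespace(dtype="float8"))
```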

diffsynth-2.0.15/diffsynth/models/ideogram4_vae.py ADDED

@@ -0,0 +1,74 @@
+import torch
+from einops import rearrange
+LATENT_SHIFT = (
+    0.01984364, 0.10149707, 0.29689495, 0.27188619, -0.21445648, -0.15979549,
+    0.05021099, -0.15083604, -0.15360136, -0.20131799, 0.01922352, 0.0622626,
+    0.10140969, -0.06739428, 0.3758261, -0.233712, 0.35164491, -0.02590912,
+    -0.0271935, -0.10833897, -0.1476848, -0.01130957, -0.2298372, 0.23526423,
+    -0.10893522, 0.11957631, 0.04047799, 0.3134589, -0.17225064, -0.18646109,
+    -0.34691978, -0.03571246, 0.02583857, 0.10190072, 0.28402294, 0.26952152,
+    -0.21634675, -0.17938656, 0.04358909, -0.15007621, -0.1548502, -0.18971131,
+    0.02710861, 0.05609494, 0.10697846, -0.06854968, 0.38167698, -0.24269937,
+    0.35705471, -0.03063305, -0.02946109, -0.11244286, -0.14336038, -0.01362137,
+    -0.21863696, 0.23228983, -0.11739769, 0.11693044, 0.02563311, 0.31356594,
+    -0.17420591, -0.19006285, -0.34905377, -0.04025005, 0.01924137, 0.07652984,
+    0.2995608, 0.2628057, -0.22011674, -0.12715361, 0.04879879, -0.14075719,
+    -0.15935895, -0.2123584, 0.01974813, 0.05523547, 0.10011992, -0.06428964,
+    0.37781868, -0.21491644, 0.34254215, -0.03153528, -0.0310082, -0.10761415,
+    -0.14730405, -0.02475182, -0.2285588, 0.2515081, -0.10445128, 0.12446,
+    0.07062869, 0.30880162, -0.18016875, -0.18869164, -0.34533499, -0.0129177,
+    0.02578168, 0.07993659, 0.28642181, 0.26038408, -0.22459419, -0.14820155,
+    0.04059549, -0.14043529, -0.16111187, -0.2020305, 0.02602069, 0.04852717,
+    0.10432153, -0.06309942, 0.38402443, -0.22397003, 0.34814481, -0.03774432,
+    -0.03381438, -0.11245691, -0.14128767, -0.02853208, -0.21752016, 0.24872463,
+    -0.11399775, 0.1222687, 0.05620835, 0.309178, -0.18065738, -0.19401479,
+    -0.34495114, -0.01760592,
+)
+LATENT_SCALE = (
+    1.63933691, 1.70204478, 1.73642566, 1.90004803, 1.6675316, 1.69059584,
+    1.56853198, 1.62314944, 1.89106626, 1.58086668, 1.60822129, 1.60962993,
+    1.63322129, 1.56074359, 1.73419528, 1.7919265, 1.64040632, 1.66802808,
+    1.60390303, 1.75480492, 1.63187587, 1.64334594, 1.61722884, 1.60146046,
+    1.63459219, 1.55291476, 1.68771497, 1.68415657, 1.78966054, 1.66631641,
+    1.65626686, 1.65976433, 1.63487607, 1.69513249, 1.72933756, 1.91310663,
+    1.67035057, 1.72286863, 1.56719251, 1.61934825, 1.88628859, 1.56911539,
+    1.59455129, 1.60829869, 1.62470611, 1.56052853, 1.73677003, 1.77563606,
+    1.63732541, 1.66370527, 1.59508952, 1.75153949, 1.63029275, 1.64517667,
+    1.61659342, 1.59722044, 1.64103121, 1.5408531, 1.68610394, 1.67772755,
+    1.78998563, 1.66621713, 1.65458955, 1.66041308, 1.64710857, 1.68163503,
+    1.74000294, 1.92784786, 1.67411194, 1.67395548, 1.57406532, 1.62199356,
+    1.87618195, 1.5584375, 1.57438785, 1.61711053, 1.63094305, 1.55644029,
+    1.73124302, 1.80666627, 1.6463621, 1.65932006, 1.60816188, 1.75682671,
+    1.64695873, 1.63121722, 1.61380832, 1.60478651, 1.63396035, 1.53505068,
+    1.65534289, 1.67132281, 1.80317197, 1.6767314, 1.65700938, 1.68426259,
+    1.65339716, 1.67540638, 1.73298504, 1.94067348, 1.67893609, 1.70635117,
+    1.5730906, 1.61928553, 1.87148809, 1.56244866, 1.56697152, 1.61584394,
+    1.62759496, 1.55480378, 1.73484107, 1.79055143, 1.64688773, 1.66121492,
+    1.60135887, 1.75254572, 1.64798332, 1.62989921, 1.61381592, 1.60792883,
+    1.63939668, 1.53075757, 1.65371318, 1.66801185, 1.80029087, 1.67591476,
+    1.65655173, 1.68533454,
+)
+def get_latent_norm(device: torch.device) -> tuple[torch.Tensor, torch.Tensor]:
+    shift = torch.tensor(LATENT_SHIFT, dtype=torch.float32, device=device)
+    scale = torch.tensor(LATENT_SCALE, dtype=torch.float32, device=device)
+    return shift, scale
+def decode(vae, latents, height, width, torch_dtype):
+    latent_shift, latent_scale = get_latent_norm(latents.device)
+    latents = latents.float() * latent_scale + latent_shift
+    latents = rearrange(latents, "B (H W) (P Q C) -> B C (H P) (W Q)", P=2, Q=2, H=height//16, W=width//16).to(torch.bfloat16)
+    latents = latents.to(torch_dtype)
+    image = vae._decode(latents)
+    return image
+def encode(vae, image, height, width, torch_dtype):
+    latents = vae._encode(image)[:, :32]
+    latent_shift, latent_scale = get_latent_norm(latents.device)
+    latents = rearrange(latents, "B C (H P) (W Q) -> B (H W) (P Q C)", P=2, Q=2, H=height//16, W=width//16).to(torch.bfloat16)
+    latents = (latents.float() - latent_shift) / latent_scale
+    latents = latents.to(torch_dtype)
+    return latents
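
Note on the file above: `encode` and `decode` apply inverse per-channel affine maps, `(x - shift) / scale` and `x * scale + shift`, around a 2x2 patch rearrangement. The sketch below checks the round trip on three channels and works out the token shape that the `rearrange` patterns imply. It assumes a patch size of 2, a VAE downsampling factor of 8 (from `height//16`) and 32 latent channels (from `[:, :32]`). The function names are hypothetical.

```python
def normalize(latent, shift, scale):
    """Per-channel normalization as applied in encode()."""
    return [(x - s) / c for x, s, c in zip(latent, shift, scale)]


def denormalize(latent, shift, scale):
    """Per-channel inverse as applied in decode()."""
    return [x * c + s for x, s, c in zip(latent, shift, scale)]


def token_grid(height, width, patch=2, vae_downsample=8, latent_channels=32):
    """Return (number of image tokens, features per token)."""
    stride = patch * vae_downsample
    grid_h, grid_w = height // stride, width // stride
    return grid_h * grid_w, patch * patch * latent_channels


# First three entries of LATENT_SHIFT and LATENT_SCALE from the diff.
shift = [0.01984364, 0.10149707, 0.29689495]
scale = [1.63933691, 1.70204478, 1.73642566]
```

The 128 features per token match both the length of the two constant tables and the `in_channels` value of 128 in the `ideogram4_dit` config entry.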

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/pipelines/ideogram4.py RENAMED

@@ -10,7 +10,8 @@ from ..diffusion.base_pipeline import BasePipeline, PipelineUnit
 from ..core import ModelConfig
 from ..models.ideogram4_dit import Ideogram4DiT, LLM_TOKEN_INDICATOR, OUTPUT_IMAGE_INDICATOR, IMAGE_POSITION_OFFSET
 from ..models.ideogram4_text_encoder import Ideogram4TextEncoder
-from ..models.ideogram4_vae import Ideogram4VAEEncoder, Ideogram4VAEDecoder
+from ..models.flux2_vae import Flux2VAE
+from ..models.ideogram4_vae import encode, decode
 from transformers import AutoTokenizer
@@ -25,8 +26,7 @@ class Ideogram4Pipeline(BasePipeline):
         self.text_encoder: Ideogram4TextEncoder = None
         self.dit: Ideogram4DiT = None
         self.dit_uncond: Ideogram4DiT = None
-        self.vae_encoder: Ideogram4VAEEncoder = None
-        self.vae_decoder: Ideogram4VAEDecoder = None
+        self.vae: Flux2VAE = None
         self.tokenizer: AutoTokenizer = None
         self.in_iteration_models = ("dit", "dit_uncond")
         self.units = [
@@ -55,8 +55,7 @@ class Ideogram4Pipeline(BasePipeline):
         else:
             pipe.dit = transformers
         pipe.text_encoder = model_pool.fetch_model("ideogram4_text_encoder")
-        pipe.vae_encoder = model_pool.fetch_model("ideogram4_vae_encoder")
-        pipe.vae_decoder = model_pool.fetch_model("ideogram4_vae_decoder")
+        pipe.vae = model_pool.fetch_model("flux2_vae")
         if tokenizer_config is not None:
             tokenizer_config.download_if_necessary()
@@ -112,16 +111,15 @@ class Ideogram4Pipeline(BasePipeline):
             if cfg_scale != 1:
                 models = {"dit": self.dit_uncond if self.dit_uncond is not None else self.dit}
                 noise_pred_nega = self.model_fn(timestep=timestep, **models, **inputs_shared, **inputs_nega)
-                # This is not a standard CFG implementation. We align it to the original version of Ideogram4.
-                noise_pred = cfg_scale * noise_pred_posi + (1.0 - cfg_scale) * noise_pred_nega
+                noise_pred = noise_pred_nega + cfg_scale * (noise_pred_posi - noise_pred_nega)
             else:
                 noise_pred = noise_pred_posi
             inputs_shared["latents"] = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **inputs_shared)
         # Decode
-        self.load_models_to_device(["vae_decoder"])
-        image = self.vae_decoder.decode(inputs_shared["latents"], inputs_shared["grid_h"], inputs_shared["grid_w"], self.dit.patch_size, self.torch_dtype)
+        self.load_models_to_device(["vae"])
+        image = decode(self.vae, inputs_shared["latents"], height, width, self.torch_dtype)
         image = self.vae_output_to_image(image)
         self.load_models_to_device([])
         return image
@@ -168,7 +166,7 @@ class Ideogram4Unit_PromptEmbedder(PipelineUnit):
                 f"prompt has {num_text_tokens} tokens, exceeds max_text_tokens={max_text_tokens}"
             )
-        patch = pipe.dit.patch_size * pipe.vae_encoder.ae_scale_factor
+        patch = pipe.dit.patch_size * 8
         grid_h = height // patch
         grid_w = width // patch
         num_image_tokens = grid_h * grid_w
@@ -239,7 +237,7 @@ class Ideogram4Unit_NoiseInitializer(PipelineUnit):
         )
     def process(self, pipe: "Ideogram4Pipeline", height, width, seed, rand_device):
-        patch = pipe.dit.patch_size * pipe.vae_encoder.ae_scale_factor
+        patch = pipe.dit.patch_size * 8
         grid_h = height // patch
         grid_w = width // patch
         num_image_tokens = grid_h * grid_w
@@ -251,18 +249,17 @@ class Ideogram4Unit_NoiseInitializer(PipelineUnit):
 class Ideogram4Unit_InputImageEmbedder(PipelineUnit):
     def __init__(self):
         super().__init__(
-            input_params=("input_image", "noise", "height", "width", "grid_h", "grid_w"),
+            input_params=("input_image", "noise", "height", "width"),
             output_params=("latents", "input_latents"),
-            onload_model_names=("vae_encoder",)
+            onload_model_names=("vae",)
         )
-    def process(self, pipe: "Ideogram4Pipeline", input_image, noise, height, width, grid_h, grid_w):
+    def process(self, pipe: "Ideogram4Pipeline", input_image, noise, height, width):
         if input_image is None:
             return {"latents": noise, "input_latents": None}
-        pipe.load_models_to_device(["vae_encoder"])
+        pipe.load_models_to_device(["vae"])
         image = pipe.preprocess_image(input_image)
-        input_latents = pipe.vae_encoder.encode(image, grid_h, grid_w, pipe.dit.patch_size)
+        input_latents = encode(pipe.vae, image, height, width, torch.bfloat16)
         if pipe.scheduler.training:
             return {"latents": noise, "input_latents": input_latents}
         else:
@@ -279,6 +276,8 @@ def model_fn_ideogram4(
     segment_ids=None,
     indicator=None,
     max_text_tokens=0,
+    use_gradient_checkpointing=False,
+    use_gradient_checkpointing_offload=False,
     **kwargs,
 ):
     t_ideogram4 = timestep.to(torch.float32)
@@ -292,5 +291,7 @@ def model_fn_ideogram4(
     out = dit(
         llm_features=llm_features, x=z, t=t_ideogram4,
         position_ids=position_ids, segment_ids=segment_ids, indicator=indicator,
+        use_gradient_checkpointing=use_gradient_checkpointing,
+        use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
     )
     return -out[:, max_text_tokens:]
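
Note on the guidance hunk above: the classifier-free guidance line was rewritten and the comment calling it non-standard was removed. Expanding `nega + s * (posi - nega)` gives `s * posi + (1 - s) * nega`, so the two forms agree algebraically and should differ only by floating-point rounding. The sketch below checks this on scalar values. The function names and the sample numbers are hypothetical.

```python
def cfg_old(posi, nega, cfg_scale):
    """Form used in 2.0.13."""
    return cfg_scale * posi + (1.0 - cfg_scale) * nega


def cfg_new(posi, nega, cfg_scale):
    """Form used in 2.0.15."""
    return nega + cfg_scale * (posi - nega)


# (positive prediction, negative prediction, guidance scale), made up for illustration.
pairs = [(0.3, -0.1, 4.0), (1.5, 0.25, 7.5), (-2.0, 0.5, 1.0)]
max_gap = max(abs(cfg_old(p, n, s) - cfg_new(p, n, s)) for p, n, s in pairs)
```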

{diffsynth-2.0.13 → diffsynth-2.0.15}/diffsynth/pipelines/stable_diffusion_xl.py RENAMED

@@ -13,6 +13,7 @@ from ..models.stable_diffusion_text_encoder import SDTextEncoder
 from ..models.stable_diffusion_xl_unet import SDXLUNet2DConditionModel
 from ..models.stable_diffusion_xl_text_encoder import SDXLTextEncoder2
 from ..models.stable_diffusion_vae import StableDiffusionVAE
+from ..utils.lora.sdxl import SdxlLoRALoader
 def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
@@ -53,6 +54,7 @@ class StableDiffusionXLPipeline(BasePipeline):
         ]
         self.model_fn = model_fn_stable_diffusion_xl
         self.compilable_models = ["unet"]
+        self.lora_loader = SdxlLoRALoader
     @staticmethod
     def from_pretrained(

diffsynth-2.0.15/diffsynth/utils/demucs/__init__.py ADDED

@@ -0,0 +1,21 @@
+import torch, torchaudio
+from diffsynth import load_model, ModelConfig
+from diffsynth.models.demucs import HTDemucs
+class AudioTrackSeparator(torch.nn.Module):
+    def __init__(self, torch_dtype=torch.float32, device="cuda", model_config=ModelConfig(model_id="DiffSynth-Studio/Demucs-Repackage", origin_file_pattern="model.safetensors")):
+        super().__init__()
+        model_config.download_if_necessary()
+        self.model = load_model(HTDemucs, model_config.path, torch_dtype=torch_dtype, device=device)
+    @torch.no_grad()
+    def __call__(self, audio, target_sample_rate=48000, **kwargs):
+        if isinstance(audio, str):
+            audio, sample_rate = torchaudio.load(audio)
+        else:
+            audio, sample_rate = audio
+        audio = audio.to(dtype=next(iter(self.model.parameters())).dtype, device=next(iter(self.model.parameters())).device)
+        vocals = self.model.extract_track(audio, sample_rate)
+        if target_sample_rate != 44100:
+            vocals = torchaudio.functional.resample(vocals, 44100, target_sample_rate)
+        return vocals

diffsynth-2.0.15/diffsynth/utils/dequantizer/__init__.py ADDED

@@ -0,0 +1,15 @@
+from diffsynth import load_state_dict
+import torch
+from safetensors.torch import save_file
+from tqdm import tqdm
+def dequantize(source_path, target_path, device="cuda", torch_dtype=torch.bfloat16):
+    sd = load_state_dict(source_path, device=device)
+    for k in tqdm([k for k in sd if k.endswith(".weight_scale")]):
+        weight_key = k[:-13] + ".weight"
+        weight = sd.pop(weight_key).to(torch_dtype)
+        scale = sd.pop(k).to(torch_dtype).unsqueeze(1)
+        sd[weight_key] = weight * scale
+    if target_path is not None:
+        save_file(sd, target_path)
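
Note on the file above: `dequantize` pairs each `*.weight_scale` entry with its `*.weight` entry (`k[:-13]` strips the 13-character suffix `.weight_scale`) and multiplies the weight by the scale broadcast along dimension 1. The sketch below mirrors that logic on nested lists, assuming the scale holds one value per output row. `dequantize_state_dict` and the sample keys are hypothetical.

```python
SUFFIX = ".weight_scale"  # 13 characters, matching k[:-13] in the diff


def dequantize_state_dict(sd):
    """Fold per-row scales into their weights and drop the scale entries."""
    out = dict(sd)
    for key in [k for k in sd if k.endswith(SUFFIX)]:
        weight_key = key[: -len(SUFFIX)] + ".weight"
        weight = out.pop(weight_key)
        scale = out.pop(key)
        # One scale per output row, like scale.unsqueeze(1) broadcasting.
        out[weight_key] = [[w * s for w in row] for row, s in zip(weight, scale)]
    return out


# Toy state dict, for illustration only.
sd = {
    "proj.weight": [[1.0, 2.0], [3.0, 4.0]],
    "proj.weight_scale": [0.5, 2.0],
    "proj.bias": [0.0, 0.0],
}
result = dequantize_state_dict(sd)
```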
