mmgp 3.0.9__tar.gz → 3.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: mmgp
3
- Version: 3.0.9
3
+ Version: 3.1.0
4
4
  Summary: Memory Management for the GPU Poor
5
5
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
6
6
  License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
17
17
 
18
18
 
19
19
  <p align="center">
20
- <H2>Memory Management 3.0.9 for the GPU Poor by DeepBeepMeep</H2>
20
+ <H2>Memory Management 3.1.0 for the GPU Poor by DeepBeepMeep</H2>
21
21
  </p>
22
22
 
23
23
 
@@ -100,7 +100,7 @@ For example:
100
100
  The smaller this number, the more VRAM is left for image data / longer videos, but the slower it gets because there will be lots of loading / unloading between the RAM and the VRAM. If a model is too big to fit in a budget, it will be broken down into multiple parts that will be unloaded / loaded consecutively. The speed of a low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
101
101
  - asyncTransfers: boolean, loads the next model part to the GPU while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
102
102
  - verboseLevel: number between 0 and 2 (1 by default), provides various levels of feedback on the different processes
103
- - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. As of 01/01/2025 it will work only on Linux or WSL since compilation relies on Triton which is not yet supported on Windows
103
+ - compile: list of model ids to compile, may accelerate by up to x2 depending on the type of GPU. It makes sense to compile only the model that is used frequently, such as the "transformer" model in the case of video or image generation. As of 01/01/2025 it will work only on Linux or WSL, since compilation relies on Triton, which is not yet supported on Windows
104
104
 
105
105
  If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models directly rather than using on-the-fly quantization: it will be faster and consume slightly less RAM.
106
106
 
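As a quick orientation for the option list above, here is a hedged usage sketch of how these settings might be passed to `offload.all` (whose signature appears in the hunks further down). The pipeline class, the model id, the `budgets` keyword name and the budget value are illustrative assumptions, not taken from the package documentation.

```python
# Illustrative sketch only; keyword names follow the option list above and may
# differ slightly from the actual API.
from mmgp import offload
from diffusers import DiffusionPipeline   # any supported pipeline; used as a placeholder here

pipe = DiffusionPipeline.from_pretrained("some/model-id")   # placeholder model id

offload.all(
    pipe,
    pinnedMemory=True,             # pin weights in RAM for faster RAM <-> VRAM transfers
    asyncTransfers=True,           # preload the next model part while the current one runs
    budgets={"transformer": 600},  # assumed keyword; MB of VRAM per model, placeholder value
    compile=["transformer"],       # compile only the frequently used model (see note above)
    verboseLevel=1,
)
```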
@@ -1,6 +1,6 @@
1
1
 
2
2
  <p align="center">
3
- <H2>Memory Management 3.0.9 for the GPU Poor by DeepBeepMeep</H2>
3
+ <H2>Memory Management 3.1.0 for the GPU Poor by DeepBeepMeep</H2>
4
4
  </p>
5
5
 
6
6
 
@@ -83,7 +83,7 @@ For example:
83
83
  The smaller this number, the more VRAM is left for image data / longer videos, but the slower it gets because there will be lots of loading / unloading between the RAM and the VRAM. If a model is too big to fit in a budget, it will be broken down into multiple parts that will be unloaded / loaded consecutively. The speed of a low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
84
84
  - asyncTransfers: boolean, loads the next model part to the GPU while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
85
85
  - verboseLevel: number between 0 and 2 (1 by default), provides various levels of feedback on the different processes
86
- - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. As of 01/01/2025 it will work only on Linux or WSL since compilation relies on Triton which is not yet supported on Windows
86
+ - compile: list of model ids to compile, may accelerate by up to x2 depending on the type of GPU. It makes sense to compile only the model that is used frequently, such as the "transformer" model in the case of video or image generation. As of 01/01/2025 it will work only on Linux or WSL, since compilation relies on Triton, which is not yet supported on Windows
87
87
 
88
88
  If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models directly rather than using on-the-fly quantization: it will be faster and consume slightly less RAM.
89
89
 
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = "mmgp"
3
- version = "3.0.9"
3
+ version = "3.1.0"
4
4
  authors = [
5
5
  { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
6
6
  ]
@@ -1,4 +1,4 @@
1
- # ------------------ Memory Management 3.0 for the GPU Poor by DeepBeepMeep (mmgp)------------------
1
+ # ------------------ Memory Management 3.1 for the GPU Poor by DeepBeepMeep (mmgp)------------------
2
2
  #
3
3
  # This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
4
4
  # This is a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
@@ -79,7 +79,7 @@ from mmgp import profile_type
79
79
  from optimum.quanto import freeze, qfloat8, qint4 , qint8, quantize, QModuleMixin, QTensor, quantize_module
80
80
 
81
81
 
82
-
82
+ shared_state = {}
83
83
 
84
84
  mmm = safetensors2.mmm
85
85
 
@@ -154,33 +154,75 @@ def _get_max_reservable_memory(perc_reserved_mem_max):
154
154
  perc_reserved_mem_max = 0.40 if os.name == 'nt' else 0.5
155
155
  return perc_reserved_mem_max * physical_memory
156
156
 
157
- def _detect_main_towers(model, verboseLevel=1):
157
+ def _detect_main_towers(model, min_floors = 5, verboseLevel=1):
158
158
  cur_blocks_prefix = None
159
159
  towers_modules= []
160
160
  towers_names= []
161
161
 
162
+ floors_modules= []
163
+ tower_name = None
164
+
165
+
162
166
  for submodule_name, submodule in model.named_modules():
167
+
163
168
  if submodule_name=='':
164
169
  continue
165
170
 
166
- if isinstance(submodule, torch.nn.ModuleList):
167
- newList =False
168
- if cur_blocks_prefix == None:
169
- cur_blocks_prefix = submodule_name + "."
170
- newList = True
171
- else:
172
- if not submodule_name.startswith(cur_blocks_prefix):
173
- cur_blocks_prefix = submodule_name + "."
174
- newList = True
171
+ if cur_blocks_prefix != None:
172
+ if submodule_name.startswith(cur_blocks_prefix):
173
+ depth_prefix = cur_blocks_prefix.split(".")
174
+ depth_name = submodule_name.split(".")
175
+ level = depth_name[len(depth_prefix)-1]
176
+ pre , num = _extract_num_from_str(level)
177
+
178
+ if num != cur_blocks_seq:
179
+ floors_modules.append(submodule)
175
180
 
176
- if newList and len(submodule)>=5:
177
- towers_names.append(submodule_name)
178
- towers_modules.append(submodule)
181
+ cur_blocks_seq = num
182
+ else:
183
+ if len(floors_modules) >= min_floors:
184
+ towers_modules += floors_modules
185
+ towers_names.append(tower_name)
186
+ tower_name = None
187
+ floors_modules= []
188
+ cur_blocks_prefix, cur_blocks_seq = None, -1
189
+
190
+ if cur_blocks_prefix == None:
191
+ pre , num = _extract_num_from_str(submodule_name)
192
+ if isinstance(submodule, (torch.nn.ModuleList)):
193
+ cur_blocks_prefix, cur_blocks_seq = pre + ".", -1
194
+ tower_name = submodule_name + ".*"
195
+ elif num >=0:
196
+ cur_blocks_prefix, cur_blocks_seq = pre, num
197
+ tower_name = submodule_name[ :-1] + "*"
198
+ floors_modules.append(submodule)
199
+
200
+ if len(floors_modules) >= min_floors:
201
+ towers_modules += floors_modules
202
+ towers_names.append(tower_name)
203
+
204
+ # for submodule_name, submodule in model.named_modules():
205
+ # if submodule_name=='':
206
+ # continue
207
+
208
+ # if isinstance(submodule, torch.nn.ModuleList):
209
+ # newList =False
210
+ # if cur_blocks_prefix == None:
211
+ # cur_blocks_prefix = submodule_name + "."
212
+ # newList = True
213
+ # else:
214
+ # if not submodule_name.startswith(cur_blocks_prefix):
215
+ # cur_blocks_prefix = submodule_name + "."
216
+ # newList = True
217
+
218
+ # if newList and len(submodule)>=5:
219
+ # towers_names.append(submodule_name)
220
+ # towers_modules.append(submodule)
179
221
 
180
- else:
181
- if cur_blocks_prefix is not None:
182
- if not submodule_name.startswith(cur_blocks_prefix):
183
- cur_blocks_prefix = None
222
+ # else:
223
+ # if cur_blocks_prefix is not None:
224
+ # if not submodule_name.startswith(cur_blocks_prefix):
225
+ # cur_blocks_prefix = None
184
226
 
185
227
  return towers_names, towers_modules
186
228
 
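For intuition on the rewritten `_detect_main_towers`: a ModuleList of repeated blocks shows up in `named_modules()` with numeric suffixes (`blocks.0`, `blocks.1`, ...), and a run of at least `min_floors` such "floors" under one prefix is treated as a "tower". A tiny standalone sketch of that naming pattern (not mmgp code, torch assumed installed):

```python
# Repeated blocks in a ModuleList get numeric suffixes, which is what the
# tower / floor detection keys on.
import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(4, 8)
        self.blocks = torch.nn.ModuleList([torch.nn.Linear(8, 8) for _ in range(6)])

for name, _ in ToyModel().named_modules():
    print(name)   # '', 'embed', 'blocks', 'blocks.0', ..., 'blocks.5'
```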
@@ -194,7 +236,7 @@ def _get_model(model_path):
194
236
  _path = Path(model_path).parts
195
237
  _filename = _path[-1]
196
238
  _path = _path[:-1]
197
- if len(_path)==1:
239
+ if len(_path)<=1:
198
240
  raise("file not found")
199
241
  else:
200
242
  from huggingface_hub import hf_hub_download #snapshot_download,
@@ -369,8 +411,16 @@ def _welcome():
369
411
  if welcome_displayed:
370
412
  return
371
413
  welcome_displayed = True
372
- print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.0) by DeepBeepMeep ************{ENDC}{UNBOLD}")
414
+ print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.1) by DeepBeepMeep ************{ENDC}{UNBOLD}")
373
415
 
416
+ def _extract_num_from_str(num_in_str):
417
+ for i in range(len(num_in_str)):
418
+ if not num_in_str[-i-1:].isnumeric():
419
+ if i == 0:
420
+ return num_in_str, -1
421
+ else:
422
+ return num_in_str[: -i], int(num_in_str[-i:])
423
+ return "", int(num_in_str)
374
424
 
375
425
  def _quantize_dirty_hack(model):
376
426
  # dirty hack: add a hook on state_dict() to return a fake non quantized state_dict if called by Lora Diffusers initialization functions
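The new `_extract_num_from_str` helper above splits the longest trailing run of digits off a module-name segment. A quick standalone check of the behaviour it implements (the function is re-typed here from the hunk purely for illustration):

```python
# Re-typed from the hunk above: returns (prefix, -1) when the segment has no
# trailing digits, and ("", n) when it is purely numeric.
def _extract_num_from_str(num_in_str):
    for i in range(len(num_in_str)):
        if not num_in_str[-i-1:].isnumeric():
            if i == 0:
                return num_in_str, -1
            else:
                return num_in_str[:-i], int(num_in_str[-i:])
    return "", int(num_in_str)

assert _extract_num_from_str("single_blocks.12") == ("single_blocks.", 12)
assert _extract_num_from_str("norm_out") == ("norm_out", -1)
assert _extract_num_from_str("7") == ("", 7)
```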
@@ -581,6 +631,255 @@ def _quantize(model_to_quantize, weights=qint8, verboseLevel = 1, threshold = 10
581
631
 
582
632
  return True
583
633
 
634
+ def load_loras_into_model(model, lora_path, lora_multi = None, verboseLevel = -1):
635
+ verboseLevel = _compute_verbose_level(verboseLevel)
636
+
637
+ if inject_adapter_in_model == None or set_weights_and_activate_adapters == None or get_peft_kwargs == None:
638
+ raise Exception("Unable to load Lora, missing 'peft' and / or 'diffusers' modules")
639
+
640
+ if not isinstance(lora_path, list):
641
+ lora_path = [lora_path]
642
+
643
+ if lora_multi is None:
644
+ lora_multi = [1. for _ in lora_path]
645
+
646
+ for i, path in enumerate(lora_path):
647
+ adapter_name = str(i)
648
+
649
+ state_dict = safetensors2.torch_load_file(path)
650
+
651
+ keys = list(state_dict.keys())
652
+ if len(keys) == 0:
653
+ raise Exception(f"Empty Lora '{path}'")
654
+
655
+
656
+ network_alphas = {}
657
+ for k in keys:
658
+ if "alpha" in k:
659
+ alpha_value = state_dict.pop(k)
660
+ if not ( (torch.is_tensor(alpha_value) and torch.is_floating_point(alpha_value)) or isinstance(
661
+ alpha_value, float
662
+ )):
663
+ network_alphas[k] = torch.tensor( float(alpha_value.item() ) )
664
+
665
+ pos = keys[0].find(".")
666
+ prefix = keys[0][0:pos]
667
+ if not any( prefix.startswith(some_prefix) for some_prefix in ["diffusion_model", "transformer"]):
668
+ msg = f"No compatible weight was found in Lora file '{path}'. Please check that it is compatible with the Diffusers format."
669
+ raise Exception(msg)
670
+
671
+ transformer = model
672
+
673
+ transformer_keys = [k for k in keys if k.startswith(prefix)]
674
+ state_dict = {
675
+ k.replace(f"{prefix}.", ""): v for k, v in state_dict.items() if k in transformer_keys
676
+ }
677
+
678
+ sd_keys = state_dict.keys()
679
+ if len(sd_keys) == 0:
680
+ print(f"No compatible weight was found in Lora file '{path}'. Please check that it is compatible with the Diffusers format.")
681
+ return
682
+
683
+ # is_correct_format = all("lora" in key for key in state_dict.keys())
684
+
685
+
686
+
687
+
688
+ # check with first key if is not in peft format
689
+ # first_key = next(iter(state_dict.keys()))
690
+ # if "lora_A" not in first_key:
691
+ # state_dict = convert_unet_state_dict_to_peft(state_dict)
692
+
693
+ if adapter_name in getattr(transformer, "peft_config", {}):
694
+ raise ValueError(
695
+ f"Adapter name {adapter_name} already in use in the transformer - please select a new adapter name."
696
+ )
697
+
698
+ rank = {}
699
+ for key, val in state_dict.items():
700
+ if "lora_B" in key:
701
+ rank[key] = val.shape[1]
702
+
703
+ if network_alphas is not None and len(network_alphas) >= 1:
704
+ alpha_keys = [k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix]
705
+ network_alphas = {k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys}
706
+
707
+ lora_config_kwargs = get_peft_kwargs(rank, network_alpha_dict=network_alphas, peft_state_dict=state_dict)
708
+
709
+ lora_config = LoraConfig(**lora_config_kwargs)
710
+ peft_kwargs = {}
711
+ peft_kwargs["low_cpu_mem_usage"] = True
712
+ inject_adapter_in_model(lora_config, model, adapter_name=adapter_name, **peft_kwargs)
713
+
714
+ incompatible_keys = set_peft_model_state_dict(transformer, state_dict, adapter_name, **peft_kwargs)
715
+
716
+ warn_msg = ""
717
+ if incompatible_keys is not None:
718
+ # Check only for unexpected keys.
719
+ unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None)
720
+ if unexpected_keys:
721
+ pass
722
+ if verboseLevel >=1:
723
+ print(f"Lora '{path}' was loaded in model '{_get_module_name(model)}'")
724
+ set_weights_and_activate_adapters(model,[ str(i) for i in range(len(lora_multi))], lora_multi)
725
+
726
+
727
+ def fast_load_transformers_model(model_path: str, do_quantize = False, quantizationType = qint8, pinToMemory = False, partialPinning = False, verboseLevel = -1):
728
+ """
729
+ quick version of .from_pretrained of the transformers library
730
+ used to build a model and load the corresponding weights (quantized or not)
731
+ """
732
+
733
+
734
+ import os.path
735
+ from accelerate import init_empty_weights
736
+
737
+ if not (model_path.endswith(".sft") or model_path.endswith(".safetensors")):
738
+ raise Exception("full model path to file expected")
739
+
740
+ model_path = _get_model(model_path)
741
+ verboseLevel = _compute_verbose_level(verboseLevel)
742
+
743
+ with safetensors2.safe_open(model_path) as f:
744
+ metadata = f.metadata()
745
+
746
+ if metadata is None:
747
+ transformer_config = None
748
+ else:
749
+ transformer_config = metadata.get("config", None)
750
+
751
+ if transformer_config == None:
752
+ config_fullpath = os.path.join(os.path.dirname(model_path), "config.json")
753
+
754
+ if not os.path.isfile(config_fullpath):
755
+ raise Exception("a 'config.json' that describes the model is required in the directory of the model or inside the safetensor file")
756
+
757
+ with open(config_fullpath, "r", encoding="utf-8") as reader:
758
+ text = reader.read()
759
+ transformer_config= json.loads(text)
760
+
761
+
762
+ if "architectures" in transformer_config:
763
+ architectures = transformer_config["architectures"]
764
+ class_name = architectures[0]
765
+
766
+ module = __import__("transformers")
767
+ map = { "T5WithLMHeadModel" : "T5EncoderModel"}
768
+ class_name = map.get(class_name, class_name)
769
+ transfomer_class = getattr(module, class_name)
770
+ from transformers import AutoConfig
771
+
772
+ import tempfile
773
+ with tempfile.NamedTemporaryFile("w", delete = False, encoding ="utf-8") as fp:
774
+ fp.write(json.dumps(transformer_config))
775
+ fp.close()
776
+ config_obj = AutoConfig.from_pretrained(fp.name)
777
+ os.remove(fp.name)
778
+
779
+ #needed to keep inits of non persistent buffers
780
+ with init_empty_weights():
781
+ model = transfomer_class(config_obj)
782
+
783
+ model = model.base_model
784
+
785
+ elif "_class_name" in transformer_config:
786
+ class_name = transformer_config["_class_name"]
787
+
788
+ module = __import__("diffusers")
789
+ transfomer_class = getattr(module, class_name)
790
+
791
+ with init_empty_weights():
792
+ model = transfomer_class.from_config(transformer_config)
793
+
794
+
795
+ torch.set_default_device('cpu')
796
+
797
+ model._config = transformer_config
798
+
799
+ load_model_data(model,model_path, do_quantize = do_quantize, quantizationType = quantizationType, pinToMemory= pinToMemory, partialPinning= partialPinning, verboseLevel=verboseLevel )
800
+
801
+ return model
802
+
803
+
804
+
805
+ def load_model_data(model, file_path: str, do_quantize = False, quantizationType = qint8, pinToMemory = False, partialPinning = False, verboseLevel = -1):
806
+ """
807
+ Load a model, detect if it has been previously quantized using quanto and do the extra setup if necessary
808
+ """
809
+
810
+ file_path = _get_model(file_path)
811
+ verboseLevel = _compute_verbose_level(verboseLevel)
812
+
813
+ model = _remove_model_wrapper(model)
814
+
815
+ # if pinToMemory and do_quantize:
816
+ # raise Exception("Pinning and Quantization can not be used at the same time")
817
+
818
+ if not (".safetensors" in file_path or ".sft" in file_path):
819
+ if pinToMemory:
820
+ raise Exception("Pinning to memory while loading only supported for safe tensors files")
821
+ state_dict = torch.load(file_path, weights_only=True)
822
+ if "module" in state_dict:
823
+ state_dict = state_dict["module"]
824
+ else:
825
+ state_dict, metadata = _safetensors_load_file(file_path)
826
+
827
+ if metadata is None:
828
+ quantization_map = None
829
+ else:
830
+ quantization_map = metadata.get("quantization_map", None)
831
+ config = metadata.get("config", None)
832
+ if config is not None:
833
+ model._config = config
834
+
835
+
836
+
837
+ if quantization_map is None:
838
+ pos = str.rfind(file_path, ".")
839
+ if pos > 0:
840
+ quantization_map_path = file_path[:pos]
841
+ quantization_map_path += "_map.json"
842
+
843
+ if os.path.isfile(quantization_map_path):
844
+ with open(quantization_map_path, 'r') as f:
845
+ quantization_map = json.load(f)
846
+
847
+
848
+
849
+ if quantization_map is None :
850
+ if "quanto" in file_path and not do_quantize:
851
+ print("Model seems to be quantized by quanto but no quantization map was found whether inside the model or in a separate '{file_path[:json]}_map.json' file")
852
+ else:
853
+ _requantize(model, state_dict, quantization_map)
854
+
855
+ missing_keys , unexpected_keys = model.load_state_dict(state_dict, False, assign = True )
856
+ # if len(missing_keys) > 0:
857
+ # sd_crap = { k : None for k in missing_keys}
858
+ # missing_keys , unexpected_keys = model.load_state_dict(sd_crap, strict =False, assign = True )
859
+ del state_dict
860
+
861
+ for k,p in model.named_parameters():
862
+ if p.is_meta:
863
+ txt = f"Incompatible State Dictionary or 'Init_Empty_Weights' not set since parameter '{k}' has no data"
864
+ raise Exception(txt)
865
+ for k,b in model.named_buffers():
866
+ if b.is_meta:
867
+ txt = f"Incompatible State Dictionary or 'Init_Empty_Weights' not set since buffer '{k}' has no data"
868
+ raise Exception(txt)
869
+
870
+ if do_quantize:
871
+ if quantization_map is None:
872
+ if _quantize(model, quantizationType, verboseLevel=verboseLevel, model_id=file_path):
873
+ quantization_map = model._quanto_map
874
+ else:
875
+ if verboseLevel >=1:
876
+ print("Model already quantized")
877
+
878
+ if pinToMemory:
879
+ _pin_to_memory(model, file_path, partialPinning = partialPinning, verboseLevel = verboseLevel)
880
+
881
+ return
882
+
584
883
  def get_model_name(model):
585
884
  return model.name
586
885
 
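The three functions above (`load_loras_into_model`, `fast_load_transformers_model`, `load_model_data`) are moved up to module level in this release; the matching removals appear further down. A hedged usage sketch, assuming they are reached through the `offload` module like the rest of the API; the file paths are placeholders:

```python
from mmgp import offload   # assuming the functions are exposed on the offload module

# Build a model directly from a single safetensors file; its config is read from the
# file metadata or from a sibling config.json, and it can be quantized on the fly.
transformer = offload.fast_load_transformers_model(
    "ckpts/transformer/model_fp16.safetensors",    # placeholder path
    do_quantize=True,
    pinToMemory=False,
    verboseLevel=1,
)

# Inject one or more LoRAs with per-LoRA multipliers.
offload.load_loras_into_model(
    transformer,
    ["loras/style.safetensors"],                   # placeholder path
    lora_multi=[0.8],
)
```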
@@ -612,6 +911,7 @@ class offload:
612
911
  self.async_transfers = False
613
912
  global last_offload_obj
614
913
  last_offload_obj = self
914
+
615
915
 
616
916
  def add_module_to_blocks(self, model_id, blocks_name, submodule, prev_block_name):
617
917
 
@@ -669,7 +969,7 @@ class offload:
669
969
  return False
670
970
  return True
671
971
 
672
-
972
+ @torch.compiler.disable()
673
973
  def gpu_load_blocks(self, model_id, blocks_name):
674
974
  # cl = clock.start()
675
975
 
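`gpu_load_blocks` and `gpu_unload_blocks` are now decorated with `torch.compiler.disable()` so the block-swapping logic stays out of compiled graphs. A minimal standalone illustration of that decorator (not mmgp code):

```python
# Code under torch.compiler.disable() runs eagerly even when reached from a
# compiled function, which keeps weight transfers from being traced.
import torch

@torch.compiler.disable()          # same decorator as in the hunk above
def swap_blocks(blocks, device):
    # plain eager loop: never traced into a compiled graph
    for block in blocks:
        block.to(device, non_blocking=True)

blocks = torch.nn.ModuleList([torch.nn.Linear(8, 8) for _ in range(2)])
swap_blocks(blocks, "cpu")
```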
@@ -715,7 +1015,7 @@ class offload:
715
1015
  # cl.stop()
716
1016
  # print(f"load time: {cl.format_time_gap()}")
717
1017
 
718
-
1018
+ @torch.compiler.disable()
719
1019
  def gpu_unload_blocks(self, model_id, blocks_name):
720
1020
  # cl = clock.start()
721
1021
  if blocks_name != None:
@@ -736,7 +1036,7 @@ class offload:
736
1036
  # cl.stop()
737
1037
  # print(f"unload time: {cl.format_time_gap()}")
738
1038
 
739
-
1039
+ # @torch.compiler.disable()
740
1040
  def gpu_load(self, model_id):
741
1041
  model = self.models[model_id]
742
1042
  self.active_models.append(model)
@@ -818,10 +1118,10 @@ class offload:
818
1118
 
819
1119
  return False
820
1120
 
821
- def hook_load_data_if_needed(self, target_module, model_id,blocks_name, context):
1121
+ def hook_preload_blocks_for_compilation(self, target_module, model_id,blocks_name, context):
822
1122
 
823
- @torch.compiler.disable()
824
- def load_data_if_needed(module, *args, **kwargs):
1123
+ # @torch.compiler.disable()
1124
+ def preload_blocks_for_compile(module, *args, **kwargs):
825
1125
  some_context = context #for debugging
826
1126
  if blocks_name == None:
827
1127
  if self.ready_to_check_mem():
@@ -835,8 +1135,9 @@ class offload:
835
1135
  self.empty_cache_if_needed()
836
1136
  self.loaded_blocks[model_id] = blocks_name
837
1137
  self.gpu_load_blocks(model_id, blocks_name)
838
-
839
- target_module.register_forward_pre_hook(load_data_if_needed)
1138
+ # needs to be registered before the forward so as not to break the efficiency of the compilation chain
1139
+ # it should be at the top of the compilation, as this type of hook in the middle of a chain seems to hurt memory performance
1140
+ target_module.register_forward_pre_hook(preload_blocks_for_compile)
840
1141
 
841
1142
 
842
1143
  def hook_check_empty_cache_needed(self, target_module, model_id,blocks_name, previous_method, context):
@@ -909,267 +1210,18 @@ class offload:
909
1210
  print(f"Hooked in model '{model_id}' ({model_name})")
910
1211
 
911
1212
 
912
- # Not implemented yet, but why would one want to get rid of these features ?
913
- # def unhook_module(module: torch.nn.Module):
914
- # if not hasattr(module,"_mm_id"):
915
- # return
916
-
917
- # delattr(module, "_mm_id")
918
-
919
- # def unhook_all(parent_module: torch.nn.Module):
920
- # for module in parent_module.components.items():
921
- # self.unhook_module(module)
922
-
923
- import torch
924
-
925
-
926
-
927
-
928
- def load_loras_into_model(model, lora_path, lora_multi = None, verboseLevel = -1):
929
- verboseLevel = _compute_verbose_level(verboseLevel)
930
-
931
- if inject_adapter_in_model == None or set_weights_and_activate_adapters == None or get_peft_kwargs == None:
932
- raise Exception("Unable to load Lora, missing 'peft' and / or 'diffusers' modules")
933
-
934
- if not isinstance(lora_path, list):
935
- lora_path = [lora_path]
936
-
937
- if lora_multi is None:
938
- lora_multi = [1. for _ in lora_path]
939
-
940
- for i, path in enumerate(lora_path):
941
- adapter_name = str(i)
942
-
943
- state_dict = safetensors2.torch_load_file(path)
944
-
945
- keys = list(state_dict.keys())
946
- if len(keys) == 0:
947
- raise Exception(f"Empty Lora '{path}'")
948
-
949
-
950
- network_alphas = {}
951
- for k in keys:
952
- if "alpha" in k:
953
- alpha_value = state_dict.pop(k)
954
- if not ( (torch.is_tensor(alpha_value) and torch.is_floating_point(alpha_value)) or isinstance(
955
- alpha_value, float
956
- )):
957
- network_alphas[k] = torch.tensor( float(alpha_value.item() ) )
958
-
959
- pos = keys[0].find(".")
960
- prefix = keys[0][0:pos]
961
- if not any( prefix.startswith(some_prefix) for some_prefix in ["diffusion_model", "transformer"]):
962
- msg = f"No compatible weight was found in Lora file '{path}'. Please check that it is compatible with the Diffusers format."
963
- raise Exception(msg)
964
-
965
- transformer = model
966
-
967
- transformer_keys = [k for k in keys if k.startswith(prefix)]
968
- state_dict = {
969
- k.replace(f"{prefix}.", ""): v for k, v in state_dict.items() if k in transformer_keys
970
- }
971
-
972
- sd_keys = state_dict.keys()
973
- if len(sd_keys) == 0:
974
- print(f"No compatible weight was found in Lora file '{path}'. Please check that it is compatible with the Diffusers format.")
975
- return
976
-
977
- # is_correct_format = all("lora" in key for key in state_dict.keys())
978
-
979
-
980
-
981
-
982
- # check with first key if is not in peft format
983
- # first_key = next(iter(state_dict.keys()))
984
- # if "lora_A" not in first_key:
985
- # state_dict = convert_unet_state_dict_to_peft(state_dict)
986
-
987
- if adapter_name in getattr(transformer, "peft_config", {}):
988
- raise ValueError(
989
- f"Adapter name {adapter_name} already in use in the transformer - please select a new adapter name."
990
- )
991
-
992
- rank = {}
993
- for key, val in state_dict.items():
994
- if "lora_B" in key:
995
- rank[key] = val.shape[1]
996
-
997
- if network_alphas is not None and len(network_alphas) >= 1:
998
- alpha_keys = [k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix]
999
- network_alphas = {k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys}
1000
-
1001
- lora_config_kwargs = get_peft_kwargs(rank, network_alpha_dict=network_alphas, peft_state_dict=state_dict)
1002
-
1003
- lora_config = LoraConfig(**lora_config_kwargs)
1004
- peft_kwargs = {}
1005
- peft_kwargs["low_cpu_mem_usage"] = True
1006
- inject_adapter_in_model(lora_config, model, adapter_name=adapter_name, **peft_kwargs)
1007
-
1008
- incompatible_keys = set_peft_model_state_dict(transformer, state_dict, adapter_name, **peft_kwargs)
1009
-
1010
- warn_msg = ""
1011
- if incompatible_keys is not None:
1012
- # Check only for unexpected keys.
1013
- unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None)
1014
- if unexpected_keys:
1015
- pass
1016
- if verboseLevel >=1:
1017
- print(f"Lora '{path}' was loaded in model '{_get_module_name(model)}'")
1018
- set_weights_and_activate_adapters(model,[ str(i) for i in range(len(lora_multi))], lora_multi)
1019
-
1020
-
1021
- def fast_load_transformers_model(model_path: str, do_quantize = False, quantizationType = qint8, pinToMemory = False, partialPinning = False, verboseLevel = -1):
1022
- """
1023
- quick version of .LoadfromPretrained of the transformers library
1024
- used to build a model and load the corresponding weights (quantized or not)
1025
- """
1026
-
1027
-
1028
- import os.path
1029
- from accelerate import init_empty_weights
1030
-
1031
- if not (model_path.endswith(".sft") or model_path.endswith(".safetensors")):
1032
- raise Exception("full model path to file expected")
1033
-
1034
- model_path = _get_model(model_path)
1035
- verboseLevel = _compute_verbose_level(verboseLevel)
1036
-
1037
- with safetensors2.safe_open(model_path) as f:
1038
- metadata = f.metadata()
1039
-
1040
- if metadata is None:
1041
- transformer_config = None
1042
- else:
1043
- transformer_config = metadata.get("config", None)
1044
-
1045
- if transformer_config == None:
1046
- config_fullpath = os.path.join(os.path.dirname(model_path), "config.json")
1047
-
1048
- if not os.path.isfile(config_fullpath):
1049
- raise Exception("a 'config.json' that describes the model is required in the directory of the model or inside the safetensor file")
1050
-
1051
- with open(config_fullpath, "r", encoding="utf-8") as reader:
1052
- text = reader.read()
1053
- transformer_config= json.loads(text)
1054
-
1055
-
1056
- if "architectures" in transformer_config:
1057
- architectures = transformer_config["architectures"]
1058
- class_name = architectures[0]
1059
-
1060
- module = __import__("transformers")
1061
- transfomer_class = getattr(module, class_name)
1062
- from transformers import AutoConfig
1063
-
1064
- import tempfile
1065
- with tempfile.NamedTemporaryFile("w", delete = False, encoding ="utf-8") as fp:
1066
- fp.write(json.dumps(transformer_config))
1067
- fp.close()
1068
- config_obj = AutoConfig.from_pretrained(fp.name)
1069
- os.remove(fp.name)
1070
-
1071
- #needed to keep inits of non persistent buffers
1072
- with init_empty_weights():
1073
- model = transfomer_class(config_obj)
1074
-
1075
- model = model.base_model
1076
-
1077
- elif "_class_name" in transformer_config:
1078
- class_name = transformer_config["_class_name"]
1079
-
1080
- module = __import__("diffusers")
1081
- transfomer_class = getattr(module, class_name)
1082
-
1083
- with init_empty_weights():
1084
- model = transfomer_class.from_config(transformer_config)
1085
-
1086
-
1087
- torch.set_default_device('cpu')
1088
-
1089
- model._config = transformer_config
1090
-
1091
- load_model_data(model,model_path, do_quantize = do_quantize, quantizationType = quantizationType, pinToMemory= pinToMemory, partialPinning= partialPinning, verboseLevel=verboseLevel )
1092
-
1093
- return model
1094
-
1095
-
1096
-
1097
- def load_model_data(model, file_path: str, do_quantize = False, quantizationType = qint8, pinToMemory = False, partialPinning = False, verboseLevel = -1):
1098
- """
1099
- Load a model, detect if it has been previously quantized using quanto and do the extra setup if necessary
1100
- """
1101
-
1102
- file_path = _get_model(file_path)
1103
- verboseLevel = _compute_verbose_level(verboseLevel)
1104
-
1105
- model = _remove_model_wrapper(model)
1106
-
1107
- # if pinToMemory and do_quantize:
1108
- # raise Exception("Pinning and Quantization can not be used at the same time")
1109
-
1110
- if not (".safetensors" in file_path or ".sft" in file_path):
1111
- if pinToMemory:
1112
- raise Exception("Pinning to memory while loading only supported for safe tensors files")
1113
- state_dict = torch.load(file_path, weights_only=True)
1114
- if "module" in state_dict:
1115
- state_dict = state_dict["module"]
1116
- else:
1117
- state_dict, metadata = _safetensors_load_file(file_path)
1118
-
1119
- if metadata is None:
1120
- quantization_map = None
1121
- else:
1122
- quantization_map = metadata.get("quantization_map", None)
1123
- config = metadata.get("config", None)
1124
- if config is not None:
1125
- model._config = config
1126
-
1127
-
1128
-
1129
- if quantization_map is None:
1130
- pos = str.rfind(file_path, ".")
1131
- if pos > 0:
1132
- quantization_map_path = file_path[:pos]
1133
- quantization_map_path += "_map.json"
1134
-
1135
- if os.path.isfile(quantization_map_path):
1136
- with open(quantization_map_path, 'r') as f:
1137
- quantization_map = json.load(f)
1138
-
1139
-
1140
-
1141
- if quantization_map is None :
1142
- if "quanto" in file_path and not do_quantize:
1143
- print("Model seems to be quantized by quanto but no quantization map was found whether inside the model or in a separate '{file_path[:json]}_map.json' file")
1144
- else:
1145
- _requantize(model, state_dict, quantization_map)
1146
-
1147
- missing_keys , unexpected_keys = model.load_state_dict(state_dict, strict = quantization_map is None, assign = True )
1148
- del state_dict
1149
-
1150
- if do_quantize:
1151
- if quantization_map is None:
1152
- if _quantize(model, quantizationType, verboseLevel=verboseLevel, model_id=file_path):
1153
- quantization_map = model._quanto_map
1154
- else:
1155
- if verboseLevel >=1:
1156
- print("Model already quantized")
1157
-
1158
- if pinToMemory:
1159
- _pin_to_memory(model, file_path, partialPinning = partialPinning, verboseLevel = verboseLevel)
1160
-
1161
- return
1162
-
1163
- def save_model(model, file_path, do_quantize = False, quantizationType = qint8, verboseLevel = -1 ):
1213
+ def save_model(model, file_path, do_quantize = False, quantizationType = qint8, verboseLevel = -1, config_file_path = None ):
1164
1214
  """save the weights of a model and quantize them if requested
1165
1215
  These weights can be loaded again using 'load_model_data'
1166
1216
  """
1167
1217
 
1168
1218
  config = None
1169
-
1170
1219
  verboseLevel = _compute_verbose_level(verboseLevel)
1171
-
1172
- if hasattr(model, "_config"):
1220
+ if config_file_path !=None:
1221
+ with open(config_file_path, "r", encoding="utf-8") as reader:
1222
+ text = reader.read()
1223
+ config= json.loads(text)
1224
+ elif hasattr(model, "_config"):
1173
1225
  config = model._config
1174
1226
  elif hasattr(model, "config"):
1175
1227
  config_fullpath = None
@@ -1195,7 +1247,7 @@ def save_model(model, file_path, do_quantize = False, quantizationType = qint8,
1195
1247
  print(f"Saving file '{file_path}'")
1196
1248
  safetensors2.torch_write_file(model.state_dict(), file_path , quantization_map = quantization_map, config = config)
1197
1249
  if verboseLevel >=1:
1198
- print(f"File '{file_path} saved")
1250
+ print(f"File '{file_path}' saved")
1199
1251
 
1200
1252
 
1201
1253
 
@@ -1286,7 +1338,6 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
1286
1338
  max_reservable_memory = _get_max_reservable_memory(perc_reserved_mem_max)
1287
1339
 
1288
1340
  estimatesBytesToPin = 0
1289
-
1290
1341
  for model_id in models:
1291
1342
  current_model: torch.nn.Module = models[model_id]
1292
1343
  # make sure that no RAM or GPU memory is allocated for gradient / training
@@ -1302,7 +1353,6 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
1302
1353
 
1303
1354
  for n, p in current_model.named_parameters():
1304
1355
  p.requires_grad = False
1305
- p = p.detach()
1306
1356
  if isinstance(p, QTensor):
1307
1357
  # # fix quanto bug (seems to have been fixed)
1308
1358
  # if not modelPinned and p._scale.dtype == torch.float32:
@@ -1352,21 +1402,21 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
1352
1402
  # Hook forward methods of modules
1353
1403
  for model_id in models:
1354
1404
  current_model: torch.nn.Module = models[model_id]
1355
- current_budget = model_budgets[model_id]
1356
- current_size = 0
1357
- cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
1358
- self.loaded_blocks[model_id] = None
1359
1405
  towers_names, towers_modules = _detect_main_towers(current_model)
1360
- towers_names = [n +"." for n in towers_names]
1361
1406
  if self.verboseLevel>=2 and len(towers_names)>0:
1362
1407
  print(f"Potential iterative blocks found in model '{model_id}':{towers_names}")
1363
1408
  # compile main iterative modules stacks ("towers")
1364
- if compileAllModels or model_id in modelsToCompile :
1409
+ compilationInThisOne = compileAllModels or model_id in modelsToCompile
1410
+ if compilationInThisOne:
1365
1411
  if self.verboseLevel>=1:
1366
- print(f"Pytorch compilation of model '{model_id}' is scheduled.")
1367
- for tower in towers_modules:
1368
- for submodel in tower:
1369
- submodel.forward= torch.compile(submodel.forward, backend= "inductor", mode="default" ) # , fullgraph= True, mode= "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs",
1412
+ if len(towers_modules)>0:
1413
+ print(f"Pytorch compilation of model '{model_id}' is scheduled.")
1414
+ else:
1415
+ print(f"Pytorch compilation of model '{model_id}' is not yet supported.")
1416
+
1417
+ for submodel in towers_modules:
1418
+ # for submodel in tower:
1419
+ submodel.forward= torch.compile(submodel.forward, backend= "inductor", mode="default" ) # , fullgraph= True, mode= "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs",
1370
1420
  #dynamic=True,
1371
1421
 
1372
1422
  if pinAllModels or model_id in modelsToPin:
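Compilation is now applied per floor rather than per tower wrapper: each repeated block's `forward` is swapped for a compiled version. A standalone sketch of that pattern (not mmgp code):

```python
# Per-block compilation, mirroring the torch.compile call in the hunk above.
import torch

blocks = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(4)])
for block in blocks:
    block.forward = torch.compile(block.forward, backend="inductor", mode="default")

x = torch.randn(2, 16)
for block in blocks:
    x = block(x)   # each call dispatches to the compiled forward
```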
@@ -1376,6 +1426,11 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
1376
1426
  else:
1377
1427
  _pin_to_memory(current_model, model_id, partialPinning= partialPinning, perc_reserved_mem_max=perc_reserved_mem_max, verboseLevel=verboseLevel)
1378
1428
 
1429
+ current_budget = model_budgets[model_id]
1430
+ current_size = 0
1431
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
1432
+ self.loaded_blocks[model_id] = None
1433
+
1379
1434
  for submodule_name, submodule in current_model.named_modules():
1380
1435
  # create a fake 'accelerate' parameter so that the _execution_device property returns always "cuda"
1381
1436
  # (it is queried in many pipelines even if offloading is not properly implemented)
@@ -1384,44 +1439,43 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
1384
1439
 
1385
1440
  if submodule_name=='':
1386
1441
  continue
1387
- newListItem = False
1442
+
1388
1443
  if current_budget > 0:
1389
- if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)): #
1390
- if cur_blocks_prefix == None:
1391
- cur_blocks_prefix = submodule_name + "."
1444
+ if cur_blocks_prefix != None:
1445
+ if submodule_name.startswith(cur_blocks_prefix):
1446
+ depth_prefix = cur_blocks_prefix.split(".")
1447
+ depth_name = submodule_name.split(".")
1448
+ level = depth_name[len(depth_prefix)-1]
1449
+ pre , num = _extract_num_from_str(level)
1450
+ if num != cur_blocks_seq and (cur_blocks_seq == -1 or current_size > current_budget):
1451
+ prev_blocks_name = cur_blocks_name
1452
+ cur_blocks_name = cur_blocks_prefix + str(num)
1453
+ # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
1454
+ cur_blocks_seq = num
1392
1455
  else:
1393
- #if cur_blocks_prefix != submodule_name[:len(cur_blocks_prefix)]:
1394
- if not submodule_name.startswith(cur_blocks_prefix):
1395
- cur_blocks_prefix = submodule_name + "."
1396
- cur_blocks_name,cur_blocks_seq = None, -1
1397
- else:
1398
-
1399
- if cur_blocks_prefix is not None:
1400
- if submodule_name.startswith(cur_blocks_prefix):
1401
- num = int(submodule_name[len(cur_blocks_prefix):].split(".")[0])
1402
- newListItem= num != cur_blocks_seq
1403
- if num != cur_blocks_seq and (cur_blocks_name == None or current_size > current_budget):
1404
- prev_blocks_name = cur_blocks_name
1405
- cur_blocks_name = cur_blocks_prefix + str(num)
1406
- # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
1407
- cur_blocks_seq = num
1408
- else:
1409
- cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
1410
-
1456
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
1457
+
1458
+ if cur_blocks_prefix == None:
1459
+ pre , num = _extract_num_from_str(submodule_name)
1460
+ if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
1461
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_seq = pre + ".", None, -1
1462
+ elif num >=0:
1463
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_seq = pre, None, num
1464
+ cur_blocks_name = submodule_name
1465
+ # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
1466
+
1467
+
1411
1468
  if hasattr(submodule, "forward"):
1412
1469
  submodule_method = getattr(submodule, "forward")
1413
1470
  if callable(submodule_method):
1414
1471
  if len(submodule_name.split("."))==1:
1415
1472
  self.hook_change_module(submodule, current_model, model_id, submodule_name, submodule_method)
1416
- elif newListItem:
1417
- self.hook_load_data_if_needed(submodule, model_id, cur_blocks_name, context = submodule_name )
1473
+ elif compilationInThisOne and submodule in towers_modules:
1474
+ self.hook_preload_blocks_for_compilation(submodule, model_id, cur_blocks_name, context = submodule_name )
1418
1475
  else:
1419
1476
  self.hook_check_empty_cache_needed(submodule, model_id, cur_blocks_name, submodule_method, context = submodule_name )
1420
1477
 
1421
-
1422
- current_size = self.add_module_to_blocks(model_id, cur_blocks_name, submodule, prev_blocks_name)
1423
-
1424
-
1478
+ current_size = self.add_module_to_blocks(model_id, cur_blocks_name, submodule, prev_blocks_name)
1425
1479
 
1426
1480
 
1427
1481
  if self.verboseLevel >=2:
@@ -1467,11 +1521,12 @@ def profile(pipe_or_dict_of_modules, profile_no: profile_type = profile_type.Ve
1467
1521
  models_to_scan = ("text_encoder", "text_encoder_2")
1468
1522
  candidates_to_quantize = ("t5", "llama", "llm")
1469
1523
  for model_id in models_to_scan:
1470
- name = module_names[model_id]
1471
- for candidate in candidates_to_quantize:
1472
- if candidate in name:
1473
- default_extraModelsToQuantize.append(model_id)
1474
- break
1524
+ if model_id in module_names:
1525
+ name = module_names[model_id]
1526
+ for candidate in candidates_to_quantize:
1527
+ if candidate in name:
1528
+ default_extraModelsToQuantize.append(model_id)
1529
+ break
1475
1530
 
1476
1531
 
1477
1532
  # transformer (video or image generator) should be as small as possible not to occupy space that could be used by actual image data
@@ -1480,6 +1535,7 @@ def profile(pipe_or_dict_of_modules, profile_no: profile_type = profile_type.Ve
1480
1535
  default_budgets = { "transformer" : 600 , "text_encoder": 3000, "text_encoder_2": 3000 }
1481
1536
  extraModelsToQuantize = None
1482
1537
  asyncTransfers = True
1538
+ budgets = None
1483
1539
 
1484
1540
  if profile_no == profile_type.HighRAM_HighVRAM:
1485
1541
  pinnedMemory= True
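The profile presets remain the highest-level entry point. A hedged usage sketch, assuming the pipeline object is built beforehand; the pipeline class and model id are placeholders, while the preset name is taken from the hunk above:

```python
from mmgp import offload, profile_type
from diffusers import DiffusionPipeline   # placeholder pipeline class

pipe = DiffusionPipeline.from_pretrained("some/model-id")   # placeholder model id
offload.profile(pipe, profile_type.HighRAM_HighVRAM)        # pick the preset matching your RAM / VRAM
```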
@@ -156,19 +156,32 @@ def torch_write_file(sd, file_path, quantization_map = None, config = None):
156
156
  pos = 0
157
157
  i = 0
158
158
  mx = 100000
159
+ metadata = dict()
159
160
  for k , t in sd.items():
160
- entry = {}
161
- dtypestr= map[t.dtype]
162
- entry["dtype"] = dtypestr
163
- entry["shape"] = list(t.shape)
164
- size = torch.numel(t) * t.element_size()
165
- entry["data_offsets"] = [pos, pos + size]
166
- pos += size
167
- sf_sd[k] = entry
161
+ if torch.is_tensor(t):
162
+ entry = {}
163
+ dtypestr= map[t.dtype]
164
+ entry["dtype"] = dtypestr
165
+ entry["shape"] = list(t.shape)
166
+ size = torch.numel(t) * t.element_size()
167
+ if size == 0:
168
+ pass
169
+ entry["data_offsets"] = [pos, pos + size]
170
+ pos += size
171
+ sf_sd[k] = entry
172
+ else:
173
+ if isinstance(t, str):
174
+ metadata[k] = t
175
+ else:
176
+ try:
177
+ b64 = base64.b64encode(json.dumps(t, ensure_ascii=False).encode('utf8')).decode('utf8')
178
+ metadata[k + "_base64"] = b64
179
+ except:
180
+ pass
181
+
168
182
  i+=1
169
183
  if i==mx:
170
184
  break
171
- metadata = dict()
172
185
  if not quantization_map is None:
173
186
  metadata["quantization_format"] = "quanto"
174
187
  metadata["quantization_map_base64"] = base64.b64encode(json.dumps(quantization_map, ensure_ascii=False).encode('utf8')).decode('utf8')
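Safetensors headers only allow string-to-string metadata, which is why non-tensor entries are JSON-encoded and base64-wrapped above. The round trip looks like this (standalone sketch, hypothetical map content):

```python
# Standalone sketch of the *_base64 metadata convention used above.
import base64, json

quantization_map = {"blocks.0.proj.weight": {"weights": "qint8"}}   # hypothetical content
encoded = base64.b64encode(
    json.dumps(quantization_map, ensure_ascii=False).encode("utf8")
).decode("utf8")

decoded = json.loads(base64.b64decode(encoded).decode("utf8"))
assert decoded == quantization_map
```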
@@ -192,9 +205,9 @@ def torch_write_file(sd, file_path, quantization_map = None, config = None):
192
205
 
193
206
  i = 0
194
207
  for k , t in sd.items():
195
- size = torch.numel(t) * t.element_size()
196
- if size != 0:
197
- if len(t.shape) == 0:
208
+ if torch.is_tensor(t):
209
+ size = torch.numel(t) * t.element_size()
210
+ if size != 0:
198
211
  dtype = t.dtype
199
212
  # convert in a friendly format, scalars types not supported by numpy
200
213
  if dtype == torch.bfloat16:
@@ -202,11 +215,8 @@ def torch_write_file(sd, file_path, quantization_map = None, config = None):
202
215
  elif dtype == torch.float8_e5m2 or dtype == torch.float8_e4m3fn:
203
216
  t = t.view(torch.uint8)
204
217
  buffer = t.numpy().tobytes()
205
- else:
206
- buffer = t.view(torch.uint8).numpy().tobytes()
207
- bytes_written = writer.write(buffer)
208
- assert bytes_written == size
209
-
218
+ bytes_written = writer.write(buffer)
219
+ assert bytes_written == size
210
220
  i+=1
211
221
  if i==mx:
212
222
  break
@@ -297,13 +307,12 @@ class SafeTensorFile:
297
307
  length = data_offsets[1]-data_offsets[0]
298
308
  map_idx = next(iter_tensor_no)
299
309
  offset = current_pos - maps[map_idx][1]
300
- if len(shape) == 0:
301
- if length == 0:
302
- t = torch.empty(0, dtype=dtype)
303
- else:
304
- # don't waste a memory view for a scalar
305
- t = torch.frombuffer(bytearray(maps[map_idx][0][offset:offset + length]), dtype=torch.uint8)
306
- t = t.view(dtype)
310
+ if length == 0:
311
+ t = torch.empty(shape, dtype=dtype)
312
+ elif len(shape) == 0:
313
+ # don't waste a memory view for a scalar
314
+ t = torch.frombuffer(bytearray(maps[map_idx][0][offset:offset + length]), dtype=torch.uint8)
315
+ t = t.view(dtype)
307
316
  else:
308
317
  mv = memoryview(maps[map_idx][0])[offset:offset + length]
309
318
  t = torch.frombuffer(mv, dtype=dtype)
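The read side mirrors the write path: tensors are rebuilt zero-copy from the mapped bytes with `torch.frombuffer`, and zero-length entries are now materialised with `torch.empty(shape, ...)` instead of being special-cased by shape. A tiny standalone illustration (not the library's code):

```python
# frombuffer rebuilds a tensor over existing bytes without copying them.
import torch

src = torch.arange(6, dtype=torch.float32).reshape(2, 3)
raw = bytearray(src.numpy().tobytes())              # stands in for the memory-mapped bytes

t = torch.frombuffer(memoryview(raw), dtype=torch.float32).reshape(2, 3)
assert torch.equal(t, src)

empty = torch.empty((0, 3), dtype=torch.float32)    # the new zero-length path above
assert empty.numel() == 0
```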
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: mmgp
3
- Version: 3.0.9
3
+ Version: 3.1.0
4
4
  Summary: Memory Management for the GPU Poor
5
5
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
6
6
  License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
17
17
 
18
18
 
19
19
  <p align="center">
20
- <H2>Memory Management 3.0.9 for the GPU Poor by DeepBeepMeep</H2>
20
+ <H2>Memory Management 3.1.0 for the GPU Poor by DeepBeepMeep</H2>
21
21
  </p>
22
22
 
23
23
 
@@ -100,7 +100,7 @@ For example:
100
100
  The smaller this number, the more VRAM is left for image data / longer videos, but the slower it gets because there will be lots of loading / unloading between the RAM and the VRAM. If a model is too big to fit in a budget, it will be broken down into multiple parts that will be unloaded / loaded consecutively. The speed of a low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
101
101
  - asyncTransfers: boolean, loads the next model part to the GPU while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
102
102
  - verboseLevel: number between 0 and 2 (1 by default), provides various levels of feedback on the different processes
103
- - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. As of 01/01/2025 it will work only on Linux or WSL since compilation relies on Triton which is not yet supported on Windows
103
+ - compile: list of model ids to compile, may accelerate by up to x2 depending on the type of GPU. It makes sense to compile only the model that is used frequently, such as the "transformer" model in the case of video or image generation. As of 01/01/2025 it will work only on Linux or WSL, since compilation relies on Triton, which is not yet supported on Windows
104
104
 
105
105
  If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models directly rather than using on-the-fly quantization: it will be faster and consume slightly less RAM.
106
106
 