mmgp 3.1.3.tar.gz → 3.1.4.post1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

This version of mmgp has been flagged as a potentially problematic release.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: mmgp
- Version: 3.1.3
+ Version: 3.1.4.post1
  Summary: Memory Management for the GPU Poor
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
  License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
 
 
  <p align="center">
- <H2>Memory Management 3.1.0 for the GPU Poor by DeepBeepMeep</H2>
+ <H2>Memory Management 3.1.4 for the GPU Poor by DeepBeepMeep</H2>
  </p>
 
 
@@ -26,10 +26,10 @@ This a replacement for the accelerate library that should in theory manage offlo
  times in a pipe (eg VAE).
 
  Requirements:
- - VRAM: minimum 12 GB, recommended 24 GB (RTX 3090/ RTX 4090)
+ - VRAM: minimum 6 GB, recommended 24 GB (RTX 3090/ RTX 4090)
  - RAM: minimum 24 GB, recommended 48 GB
 
- This module features 5 profiles in order to able to run the model at a decent speed on a low end consumer config (32 GB of RAM and 12 VRAM) and to run it at a very good speed (if not the best) on a high end consumer config (48 GB of RAM and 24 GB of VRAM).\
+ This module features 5 profiles in order to be able to run the model at a decent speed on a low end consumer config (24 GB of RAM and 6 GB of VRAM) and to run it at a very good speed (if not the best) on a high end consumer config (48 GB of RAM and 24 GB of VRAM).\
  These RAM requirements are for Linux systems. Due to different memory management Windows will require an extra 16 GB of RAM to run the corresponding profile.
 
  Each profile may use a combination of the following:
@@ -41,7 +41,25 @@ Each profile may use a combination of the following:
  - Automated on the fly quantization or ability to load pre quantized models
  - Pretrained Lora support with low RAM requirements
  - Support for pytorch compilation on Linux and WSL (supported on pure Windows but requires a complex Triton installation).
- -
+
+ ## Sample applications that use mmgp
+ It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
+ - Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP\
+ A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM.
+
+ - HunyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP\
+ One of the best open source text to video generators.
+
+ - FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP\
+ One of the best inpainting / outpainting tools based on Flux that can run with less than 12 GB of VRAM.
+
+ - Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP\
+ This application includes two models: a text to world generator and an image / video to world generator (probably the best open source image to video generator).
+
+ - OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP\
+ A very powerful Flux derived application that can be used to transfer an object of your choice into a prompted scene. With mmgp you can run it with only 6 GB of VRAM.
+
+
  ## Installation
  First you need to install the module in your current project with:
  ```shell
@@ -74,7 +92,7 @@ Profile 2 (High RAM) and 4 (Low RAM)are the most recommended profiles since they
  If you use a Flux derived application, profiles 1 and 3 will offer much faster generation times.
  In any case, a safe approach is to start from profile 5 (default profile) and then go down progressively to profile 4 and then to profile 2 as long as the app remains responsive or doesn't trigger any out of memory error.
 
- By default the 'transformer' will be quantized to 8 bits for all profiles. If you don't want that you may specify the optional parameter *quantizeTransformer = False*.
+ By default the model named 'transformer' will be quantized to 8 bits for all profiles. If you don't want that, you may specify the optional parameter *quantizeTransformer = False*.
 
  Every parameter set automatically by a profile can be overridden with one or multiple parameters accepted by *offload.all* (see below):
  ```
@@ -100,13 +118,20 @@ For example:
  The smaller this number, the more VRAM is left for image data / longer videos, but also the slower the generation because there will be lots of loading / unloading between the RAM and the VRAM. If a model is too big to fit in a budget, it will be broken down into multiple parts that will be unloaded / loaded consecutively. The speed of a low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
  - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
  - verboseLevel: number between 0 and 2 (1 by default), provides various levels of feedback on the different processes
- - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sens to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. As of 01/01/2025 it will work only on Linux or WSL since compilation relies on Triton which is not yet supported on Windows
+ - compile: list of model ids to compile, may accelerate generation by up to x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used, such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires a separate installation on Windows: https://github.com/woct0rdho/triton-windows
 
  If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models directly rather than using on the fly quantization; it will be faster and consume slightly less RAM.
 
  ## Going further
 
  The module includes several tools to package a light version of your favorite video / image generator:
+ - *extract_models(string prefix, obj to explore)*\
+ This tool will try to detect for you the models that are embedded in a pipeline or in some custom class. It will save you time by building the pipe dictionary required by *offload.all* or *offload.profile*. The prefix corresponds to the text that will appear before the name of each model in the dictionary.
+
+ - *load_loras_into_model(model, lora_path, lora_multi)*\
+ Load into a model a list of Loras described by a list of paths *lora_path* and a list of *weights coefficients* *lora_multi*.
+ The Lora files must be in the *diffusers* format. This function also works on non diffusers models. However if there is already official Lora support for a model it is recommended to use the official diffusers functions.
+
  - *save_model(model, file_path, do_quantize = False, quantizationType = qint8 )*\
  Save the tensors of a model already loaded in memory in safetensors format (much faster to reload). You can save it in a quantized format (the default qint8 quantization is recommended).
  The resulting safetensors file will contain extra fields in its metadata such as the quantization map and its configuration, so you will be able to move the file around without files such as *config.json* or *file_map.json*.
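
To make the parameter list above concrete, here is a hedged sketch of a call to *offload.all* (assembled from the README text in this diff, not an excerpt from the package; the *budgets* keyword and its MB unit are assumptions, the other keyword names appear verbatim above):

```python
# Hedged usage sketch; replace the placeholder with a real pipeline or model dict.
import torch
from mmgp import offload

pipe = {"transformer": torch.nn.Linear(8, 8)}  # placeholder for the real main model

offload.all(
    pipe,
    quantizeTransformer=True,       # on-the-fly 8-bit quantization of the 'transformer' (default)
    pinnedMemory=True,              # pin weights to reserved RAM for faster transfers
    asyncTransfers=True,            # preload the next model part while the current one runs
    budgets={"transformer": 3000},  # assumed keyword: per-model VRAM budget (likely MB)
    compile=["transformer"],        # compile only the frequently used model (requires Triton)
    verboseLevel=1,                 # 0 to 2, 1 by default
)
```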
@@ -120,16 +145,13 @@ Initialize (build the model hierarchy in memory) and fast load the corresponding
  The advantages over the original *from_pretrained* method are that a full model can fit into a single file with a filename of your choosing (therefore you can have multiple 'transformer' versions of the same model in the same directory) and that prequantized models are processed in a transparent way.
  Last but not least, you can also pin the whole model, or the most important part of it (partialPin = True), to RAM on the fly in a more efficient way (faster and requiring less RAM) than if you did it through *offload.all* or *offload.profile*.
 
- - *load_loras_into_model(model, lora_path, lora_multi)
- Load in a model a list of Lora described by a list of path *lora_path* and a list of *weights coefficients*.
- The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions.
 
  The typical workflow will be:
  1) Temporarily insert the *save_model* function just after a model has been fully loaded to save a copy of the model / quantized model.
  2) Replace the full initializing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object), or replace only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data*, after the initializing logic.
 
  ## Special cases
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you may try to use *extract_models* or create a dictionary that manually maps all the models.\
  For instance:
 
 
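
To make the two-step workflow above concrete, a hedged sketch (the function names *save_model* and *fast_load_transformers_model* come from this README; the file name is illustrative and the *partialPin* keyword is only mentioned in the prose, so it is left commented out):

```python
import torch
from mmgp import offload

transformer = torch.nn.Linear(8, 8)  # placeholder for the app's fully loaded main model

# Step 1, run once: save a single-file, optionally quantized copy of the loaded model
# (qint8 is the documented default quantization type).
offload.save_model(transformer, "transformer_int8.safetensors", do_quantize=True)

# Step 2, from then on: replace the original initializing / loading logic with a fast
# load of the prequantized file.
transformer = offload.fast_load_transformers_model(
    "transformer_int8.safetensors",
    # partialPin=True,  # per the text above; exact keyword not confirmed by this diff
)
```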
@@ -143,9 +165,9 @@ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.
  ```
 
 
- Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
+ Please note it is recommended to always have one model whose Id is 'transformer' so that you can leverage the predefined profiles. The 'transformer' corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
 
- Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
+ Be careful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
  where there is a huge 40 GB model to download and load. It is cumbersome as it is a 32 bits model and contains the decoder part of T5 that is not used.
  I suggest you instead use one of the 16 bits encoder-only versions available around, for instance:
  ```
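
For the "no explicit pipe object" case described above, a hedged sketch of the manual mapping (the dictionary keys mirror the pipe example quoted in the hunk header; *extract_models* is the helper added in this release, its signature paraphrased from the README):

```python
import torch
from mmgp import offload

# Placeholders standing in for submodels the host application has already built.
text_encoder = torch.nn.Linear(8, 8)
dit = torch.nn.Linear(8, 8)  # the main image / video model, so it gets the 'transformer' id
vae = torch.nn.Linear(8, 8)

# Manual mapping, as in the pipe example quoted above.
pipe = {"text_encoder": text_encoder, "transformer": dit, "vae": vae}

# Alternatively, let the new helper try to discover the embedded models for you:
# pipe = offload.extract_models("mymodel", wrapper_object)

offload.all(pipe)  # or offload.profile(pipe, ...) to pick one of the five predefined profiles
```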
@@ -1,6 +1,6 @@
 
  <p align="center">
- <H2>Memory Management 3.1.0 for the GPU Poor by DeepBeepMeep</H2>
+ <H2>Memory Management 3.1.4 for the GPU Poor by DeepBeepMeep</H2>
  </p>
 
 
@@ -9,10 +9,10 @@ This a replacement for the accelerate library that should in theory manage offlo
  times in a pipe (eg VAE).
 
  Requirements:
- - VRAM: minimum 12 GB, recommended 24 GB (RTX 3090/ RTX 4090)
+ - VRAM: minimum 6 GB, recommended 24 GB (RTX 3090/ RTX 4090)
  - RAM: minimum 24 GB, recommended 48 GB
 
- This module features 5 profiles in order to able to run the model at a decent speed on a low end consumer config (32 GB of RAM and 12 VRAM) and to run it at a very good speed (if not the best) on a high end consumer config (48 GB of RAM and 24 GB of VRAM).\
+ This module features 5 profiles in order to be able to run the model at a decent speed on a low end consumer config (24 GB of RAM and 6 GB of VRAM) and to run it at a very good speed (if not the best) on a high end consumer config (48 GB of RAM and 24 GB of VRAM).\
  These RAM requirements are for Linux systems. Due to different memory management Windows will require an extra 16 GB of RAM to run the corresponding profile.
 
  Each profile may use a combination of the following:
@@ -24,7 +24,25 @@ Each profile may use a combination of the following:
  - Automated on the fly quantization or ability to load pre quantized models
  - Pretrained Lora support with low RAM requirements
  - Support for pytorch compilation on Linux and WSL (supported on pure Windows but requires a complex Triton installation).
- -
+
+ ## Sample applications that use mmgp
+ It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
+ - Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP\
+ A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM.
+
+ - HunyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP\
+ One of the best open source text to video generators.
+
+ - FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP\
+ One of the best inpainting / outpainting tools based on Flux that can run with less than 12 GB of VRAM.
+
+ - Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP\
+ This application includes two models: a text to world generator and an image / video to world generator (probably the best open source image to video generator).
+
+ - OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP\
+ A very powerful Flux derived application that can be used to transfer an object of your choice into a prompted scene. With mmgp you can run it with only 6 GB of VRAM.
+
+
  ## Installation
  First you need to install the module in your current project with:
  ```shell
@@ -57,7 +75,7 @@ Profile 2 (High RAM) and 4 (Low RAM)are the most recommended profiles since they
  If you use a Flux derived application, profiles 1 and 3 will offer much faster generation times.
  In any case, a safe approach is to start from profile 5 (default profile) and then go down progressively to profile 4 and then to profile 2 as long as the app remains responsive or doesn't trigger any out of memory error.
 
- By default the 'transformer' will be quantized to 8 bits for all profiles. If you don't want that you may specify the optional parameter *quantizeTransformer = False*.
+ By default the model named 'transformer' will be quantized to 8 bits for all profiles. If you don't want that, you may specify the optional parameter *quantizeTransformer = False*.
 
  Every parameter set automatically by a profile can be overridden with one or multiple parameters accepted by *offload.all* (see below):
  ```
@@ -83,13 +101,20 @@ For example:
  The smaller this number, the more VRAM is left for image data / longer videos, but also the slower the generation because there will be lots of loading / unloading between the RAM and the VRAM. If a model is too big to fit in a budget, it will be broken down into multiple parts that will be unloaded / loaded consecutively. The speed of a low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
  - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
  - verboseLevel: number between 0 and 2 (1 by default), provides various levels of feedback on the different processes
- - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sens to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. As of 01/01/2025 it will work only on Linux or WSL since compilation relies on Triton which is not yet supported on Windows
+ - compile: list of model ids to compile, may accelerate generation by up to x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used, such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires a separate installation on Windows: https://github.com/woct0rdho/triton-windows
 
  If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models directly rather than using on the fly quantization; it will be faster and consume slightly less RAM.
 
  ## Going further
 
  The module includes several tools to package a light version of your favorite video / image generator:
+ - *extract_models(string prefix, obj to explore)*\
+ This tool will try to detect for you the models that are embedded in a pipeline or in some custom class. It will save you time by building the pipe dictionary required by *offload.all* or *offload.profile*. The prefix corresponds to the text that will appear before the name of each model in the dictionary.
+
+ - *load_loras_into_model(model, lora_path, lora_multi)*\
+ Load into a model a list of Loras described by a list of paths *lora_path* and a list of *weights coefficients* *lora_multi*.
+ The Lora files must be in the *diffusers* format. This function also works on non diffusers models. However if there is already official Lora support for a model it is recommended to use the official diffusers functions.
+
  - *save_model(model, file_path, do_quantize = False, quantizationType = qint8 )*\
  Save the tensors of a model already loaded in memory in safetensors format (much faster to reload). You can save it in a quantized format (the default qint8 quantization is recommended).
  The resulting safetensors file will contain extra fields in its metadata such as the quantization map and its configuration, so you will be able to move the file around without files such as *config.json* or *file_map.json*.
@@ -103,16 +128,13 @@ Initialize (build the model hierarchy in memory) and fast load the corresponding
  The advantages over the original *from_pretrained* method are that a full model can fit into a single file with a filename of your choosing (therefore you can have multiple 'transformer' versions of the same model in the same directory) and that prequantized models are processed in a transparent way.
  Last but not least, you can also pin the whole model, or the most important part of it (partialPin = True), to RAM on the fly in a more efficient way (faster and requiring less RAM) than if you did it through *offload.all* or *offload.profile*.
 
- - *load_loras_into_model(model, lora_path, lora_multi)
- Load in a model a list of Lora described by a list of path *lora_path* and a list of *weights coefficients*.
- The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions.
 
  The typical workflow will be:
  1) Temporarily insert the *save_model* function just after a model has been fully loaded to save a copy of the model / quantized model.
  2) Replace the full initializing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object), or replace only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data*, after the initializing logic.
 
  ## Special cases
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you may try to use *extract_models* or create a dictionary that manually maps all the models.\
  For instance:
 
 
@@ -126,9 +148,9 @@ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.
  ```
 
 
- Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
+ Please note it is recommended to always have one model whose Id is 'transformer' so that you can leverage the predefined profiles. The 'transformer' corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
 
- Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
+ Be careful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
  where there is a huge 40 GB model to download and load. It is cumbersome as it is a 32 bits model and contains the decoder part of T5 that is not used.
  I suggest you instead use one of the 16 bits encoder-only versions available around, for instance:
  ```
@@ -1,6 +1,6 @@
  [project]
  name = "mmgp"
- version = "3.1.3"
+ version = "3.1.4-1"
  authors = [
  { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
  ]
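
A side note on the version strings in this release (an observation about the diff, not content from the package): the "3.1.4-1" written in pyproject.toml is normalized under PEP 440 to the "3.1.4.post1" that appears in the distribution file name and the package metadata above:

```python
from packaging.version import Version

# PEP 440 treats a trailing "-N" as a post-release, which is why the sdist is
# named 3.1.4.post1 while pyproject.toml says 3.1.4-1.
assert str(Version("3.1.4-1")) == "3.1.4.post1"
```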
@@ -149,12 +149,12 @@ def _compute_verbose_level(level):
  safetensors2.verboseLevel = level
  return level
 
- def _get_max_reservable_memory(perc_reserved_mem_max):
+ def _get_perc_reserved_mem_max(perc_reserved_mem_max):
  if perc_reserved_mem_max<=0:
  perc_reserved_mem_max = 0.40 if os.name == 'nt' else 0.5
- return perc_reserved_mem_max * physical_memory
+ return perc_reserved_mem_max
 
- def _detect_main_towers(model, min_floors = 5, verboseLevel=1):
+ def _detect_main_towers(model, min_floors = 5):
  cur_blocks_prefix = None
  towers_modules= []
  towers_names= []
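
A reading of the refactor above (an illustrative restatement, not code from the package): the renamed helper now returns only the fraction of physical RAM that may be reserved, and the caller multiplies it by total RAM itself. Assuming *physical_memory* holds the total RAM in bytes (obtaining it via psutil is an assumption here):

```python
import os
import psutil  # assumption: one way to obtain the machine's physical memory

physical_memory = psutil.virtual_memory().total

def get_perc_reserved_mem_max(perc_reserved_mem_max: float) -> float:
    # Mirrors the renamed helper in the hunk above: fall back to a default
    # fraction (0.40 on Windows, 0.5 elsewhere) when no valid value is given.
    if perc_reserved_mem_max <= 0:
        perc_reserved_mem_max = 0.40 if os.name == "nt" else 0.5
    return perc_reserved_mem_max

# The caller (offload.all, see the hunk further down) now does the multiplication:
max_reservable_memory = get_perc_reserved_mem_max(0) * physical_memory
```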
@@ -191,39 +191,16 @@ def _detect_main_towers(model, min_floors = 5, verboseLevel=1):
  pre , num = _extract_num_from_str(submodule_name)
  if isinstance(submodule, (torch.nn.ModuleList)):
  cur_blocks_prefix, cur_blocks_seq = pre + ".", -1
- tower_name = submodule_name + ".*"
+ tower_name = submodule_name #+ ".*"
  elif num >=0:
  cur_blocks_prefix, cur_blocks_seq = pre, num
- tower_name = submodule_name[ :-1] + "*"
+ tower_name = submodule_name[ :-1] #+ "*"
  floors_modules.append(submodule)
 
  if len(floors_modules) >= min_floors:
  towers_modules += floors_modules
  towers_names.append(tower_name)
 
- # for submodule_name, submodule in model.named_modules():
- # if submodule_name=='':
- # continue
-
- # if isinstance(submodule, torch.nn.ModuleList):
- # newList =False
- # if cur_blocks_prefix == None:
- # cur_blocks_prefix = submodule_name + "."
- # newList = True
- # else:
- # if not submodule_name.startswith(cur_blocks_prefix):
- # cur_blocks_prefix = submodule_name + "."
- # newList = True
-
- # if newList and len(submodule)>=5:
- # towers_names.append(submodule_name)
- # towers_modules.append(submodule)
-
- # else:
- # if cur_blocks_prefix is not None:
- # if not submodule_name.startswith(cur_blocks_prefix):
- # cur_blocks_prefix = None
-
  return towers_names, towers_modules
 
 
@@ -261,30 +238,7 @@ def _remove_model_wrapper(model):
  return sub_module
  return model
 
- # def force_load_tensor(t):
- # c = torch.nn.Parameter(t + 0)
- # torch.utils.swap_tensors(t, c)
- # del c
-
-
- # for n,m in model_to_quantize.named_modules():
- # # do not read quantized weights (detected them directly or behind an adapter)
- # if isinstance(m, QModuleMixin) or hasattr(m, "base_layer") and isinstance(m.base_layer, QModuleMixin):
- # if hasattr(m, "bias") and m.bias is not None:
- # force_load_tensor(m.bias.data)
- # # m.bias.data = m.bias.data + 0
- # else:
- # for n, p in m.named_parameters(recurse = False):
- # data = getattr(m, n)
- # force_load_tensor(data)
- # # setattr(m,n, torch.nn.Parameter(data + 0 ) )
-
- # for b in m.buffers(recurse = False):
- # # b.data = b.data + 0
- # b.data = torch.nn.Buffer(b.data + 0)
- # force_load_tensor(b.data)
-
-
+
 
  def _move_to_pinned_tensor(source_tensor, big_tensor, offset, length):
  dtype= source_tensor.dtype
@@ -324,17 +278,11 @@ def _force_load_parameter(p):
  torch.utils.swap_tensors(p, q)
  del q
 
- def _pin_to_memory(model, model_id, partialPinning = False, perc_reserved_mem_max = 0, verboseLevel = 1):
- if verboseLevel>=1 :
- if partialPinning:
- print(f"Partial pinning of data of '{model_id}' to reserved RAM")
- else:
- print(f"Pinning data of '{model_id}' to reserved RAM")
+ def _pin_to_memory(model, model_id, partialPinning = False, verboseLevel = 1):
+
 
- max_reservable_memory = _get_max_reservable_memory(perc_reserved_mem_max)
  if partialPinning:
  towers_names, _ = _detect_main_towers(model)
- towers_names = [n +"." for n in towers_names]
 
 
  BIG_TENSOR_MAX_SIZE = 2**28 # 256 MB
@@ -353,6 +301,20 @@ def _pin_to_memory(model, model_id, partialPinning = False, perc_reserved_mem_ma
  params_list = params_list + [ (k + '.' + n, p, False) for n, p in sub_module.named_parameters(recurse=False)] + [ (k + '.' + n, p, True) for n, p in sub_module.named_buffers(recurse=False)]
 
 
+ if verboseLevel>=1 :
+ if partialPinning:
+ if len(params_list) == 0:
+ print(f"Unable to apply Partial of '{model_id}' as no isolated main structures were found")
+ else:
+ print(f"Partial pinning of data of '{model_id}' to reserved RAM")
+ else:
+ print(f"Pinning data of '{model_id}' to reserved RAM")
+
+ if partialPinning and len(params_list) == 0:
+ return
+
+
+
  for n, p, _ in params_list:
  if isinstance(p, QTensor):
  if p._qtype == qint4:
@@ -442,10 +404,10 @@ def _pin_to_memory(model, model_id, partialPinning = False, perc_reserved_mem_ma
  gc.collect()
 
  if verboseLevel >=1:
- if total_tensor_bytes <= total:
- print(f"The whole model was pinned to reserved RAM: {last_big_tensor} large blocks spread across {total/ONE_MB:.2f} MB")
+ if partialPinning:
+ print(f"The model was partially pinned to reserved RAM: {last_big_tensor} large blocks spread across {total/ONE_MB:.2f} MB")
  else:
- print(f"{total/ONE_MB:.2f} MB were pinned to reserved RAM out of {total_tensor_bytes/ONE_MB:.2f} MB")
+ print(f"The whole model was pinned to reserved RAM: {last_big_tensor} large blocks spread across {total/ONE_MB:.2f} MB")
 
  model._already_pinned = True
 
@@ -461,13 +423,14 @@ def _welcome():
  print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.1) by DeepBeepMeep ************{ENDC}{UNBOLD}")
 
  def _extract_num_from_str(num_in_str):
- for i in range(len(num_in_str)):
+ size = len(num_in_str)
+ for i in range(size):
  if not num_in_str[-i-1:].isnumeric():
  if i == 0:
  return num_in_str, -1
  else:
  return num_in_str[: -i], int(num_in_str[-i:])
- return "", int(num_in_str)
+ return "", -1 if size == 0 else int(num_in_str)
 
  def _quantize_dirty_hack(model):
  # dirty hack: add a hook on state_dict() to return a fake non quantized state_dict if called by Lora Diffusers initialization functions
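
The change above guards *_extract_num_from_str* against empty names. As an illustrative, standalone restatement (indentation inferred from the flattened diff lines, not copied from the package):

```python
def extract_num_from_str(num_in_str: str):
    # Split a trailing integer off a module name, e.g. "blocks.11" -> ("blocks.", 11).
    size = len(num_in_str)
    for i in range(size):
        if not num_in_str[-i - 1:].isnumeric():
            if i == 0:
                return num_in_str, -1
            return num_in_str[:-i], int(num_in_str[-i:])
    # New in this release: an empty string now yields -1 instead of int("") raising ValueError.
    return "", -1 if size == 0 else int(num_in_str)

assert extract_num_from_str("blocks.11") == ("blocks.", 11)
assert extract_num_from_str("vae") == ("vae", -1)
assert extract_num_from_str("") == ("", -1)      # previously raised ValueError
assert extract_num_from_str("123") == ("", 123)  # all-numeric names keep the old behaviour
```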
@@ -1425,7 +1388,9 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
  # torch._logging.set_logs(recompiles=True)
  # torch._inductor.config.realize_opcount_threshold = 100 # workaround bug "AssertionError: increase TRITON_MAX_BLOCK['X'] to 4096."
 
- max_reservable_memory = _get_max_reservable_memory(perc_reserved_mem_max)
+
+ perc_reserved_mem_max = _get_perc_reserved_mem_max(perc_reserved_mem_max)
+ max_reservable_memory = perc_reserved_mem_max * physical_memory
 
  estimatesBytesToPin = 0
  for model_id in models:
@@ -1486,7 +1451,7 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
 
  if estimatesBytesToPin > 0 and estimatesBytesToPin >= (max_reservable_memory - total_pinned_bytes):
  if self.verboseLevel >=1:
- print(f"Switching to partial pinning since full requirements for pinned models is {estimatesBytesToPin/ONE_MB:0.1f} MB while estimated reservable RAM is {max_reservable_memory/ONE_MB:0.1f} MB" )
+ print(f"Switching to partial pinning since full requirements for pinned models is {estimatesBytesToPin/ONE_MB:0.1f} MB while estimated reservable RAM is {max_reservable_memory/ONE_MB:0.1f} MB. You may increase the value of parameter 'perc_reserved_mem_max' to a value higher than {perc_reserved_mem_max:0.2f} to force full pinnning." )
  partialPinning = True
 
  # Hook forward methods of modules
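
The new hint in this message refers to the *perc_reserved_mem_max* argument that *offload.all* now resolves through *_get_perc_reserved_mem_max* (see the earlier hunk). A hedged usage sketch (the 0.6 value and the placeholder pipe are illustrative):

```python
import torch
from mmgp import offload

pipe = {"transformer": torch.nn.Linear(8, 8)}  # placeholder for the real model dict

# Reserve up to 60% of physical RAM for pinned tensors instead of the default
# fraction (0.40 on Windows, 0.5 elsewhere), which can avoid the automatic
# fallback to partial pinning for large models.
offload.all(pipe, pinnedMemory=True, perc_reserved_mem_max=0.6)
```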
@@ -1498,7 +1463,7 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
  if compilationInThisOne:
  if self.verboseLevel>=1:
  if len(towers_modules)>0:
- print(f"Pytorch compilation of '{model_id}' is scheduled for these modules : {towers_names}.")
+ print(f"Pytorch compilation of '{model_id}' is scheduled for these modules : {towers_names}*.")
  else:
  print(f"Pytorch compilation of model '{model_id}' is not yet supported.")
 
@@ -1511,7 +1476,7 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
  if self.verboseLevel >=1:
  print(f"Model '{model_id}' already pinned to reserved memory")
  else:
- _pin_to_memory(current_model, model_id, partialPinning= partialPinning, perc_reserved_mem_max=perc_reserved_mem_max, verboseLevel=verboseLevel)
+ _pin_to_memory(current_model, model_id, partialPinning= partialPinning, verboseLevel=verboseLevel)
 
  current_budget = model_budgets[model_id]
  current_size = 0
@@ -1538,7 +1503,7 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
  if num != cur_blocks_seq and (cur_blocks_seq == -1 or current_size > current_budget):
  prev_blocks_name = cur_blocks_name
  cur_blocks_name = cur_blocks_prefix + str(num)
- print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
+ # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
  cur_blocks_seq = num
  else:
  cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
@@ -1550,7 +1515,7 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
  elif num >=0:
  cur_blocks_prefix, prev_blocks_name, cur_blocks_seq = pre, None, num
  cur_blocks_name = submodule_name
- print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
+ # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
 
 
  if hasattr(submodule, "forward"):
File without changes
File without changes
File without changes
File without changes