mmgp 2.0.2__tar.gz → 2.0.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of mmgp might be problematic.
- {mmgp-2.0.2 → mmgp-2.0.4}/LICENSE.md +1 -1
- {mmgp-2.0.2/src/mmgp.egg-info → mmgp-2.0.4}/PKG-INFO +137 -137
- {mmgp-2.0.2 → mmgp-2.0.4}/pyproject.toml +16 -16
- {mmgp-2.0.2 → mmgp-2.0.4}/setup.cfg +4 -4
- {mmgp-2.0.2 → mmgp-2.0.4/src/mmgp.egg-info}/PKG-INFO +137 -137
- {mmgp-2.0.2 → mmgp-2.0.4}/src/mmgp.py +4 -4
- {mmgp-2.0.2 → mmgp-2.0.4}/README.md +0 -0
- {mmgp-2.0.2 → mmgp-2.0.4}/src/__init__.py +0 -0
- {mmgp-2.0.2 → mmgp-2.0.4}/src/mmgp.egg-info/SOURCES.txt +0 -0
- {mmgp-2.0.2 → mmgp-2.0.4}/src/mmgp.egg-info/dependency_links.txt +0 -0
- {mmgp-2.0.2 → mmgp-2.0.4}/src/mmgp.egg-info/requires.txt +0 -0
- {mmgp-2.0.2 → mmgp-2.0.4}/src/mmgp.egg-info/top_level.txt +0 -0
LICENSE.md
@@ -1,2 +1,2 @@
-GNU GENERAL PUBLIC LICENSE
+GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
PKG-INFO
@@ -1,137 +1,137 @@
Metadata-Version: 2.1
Name: mmgp
-Version: 2.0.2
+Version: 2.0.4
Summary: Memory Management for the GPU Poor
Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
License: GNU GENERAL PUBLIC LICENSE
         Version 3, 29 June 2007
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: torch>=2.1.0
Requires-Dist: optimum-quanto
Requires-Dist: accelerate

<p align="center">
<H2>Memory Management 2.0 for the GPU Poor by DeepBeepMeep</H2>
</p>

This module contains multiple optimisations so that models such as Flux (and derivatives), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU limited to 12 to 24 GB of VRAM.
It is a replacement for the accelerate library, which should in theory manage offloading but does not work properly with models that are loaded / unloaded several times in a pipe (e.g. the VAE).

Requirements:
- VRAM: minimum 12 GB, recommended 24 GB (RTX 3090 / RTX 4090)
- RAM: minimum 24 GB, recommended 48 GB

This module features 5 profiles so that the model can run at a decent speed on a low end consumer config (32 GB of RAM and 12 GB of VRAM) and at a very good speed on a high end consumer config (48 GB of RAM and 24 GB of VRAM).

Each profile may use the following:
- Smart preloading of models in RAM to reduce RAM requirements
- Smart automated loading / unloading of models in the GPU to avoid unloading models that may be needed again soon
- Smart slicing of models to reduce the memory occupied by models in VRAM
- Ability to pin models in reserved RAM to accelerate transfers to VRAM
- Async transfers to VRAM to avoid a pause when loading a new slice of a model
- Automated on-the-fly quantization, or the ability to load quantized models

## Installation
First install the module in your current project with:
```shell
pip install mmgp
```

## Usage

It is almost plug and play and just needs to be invoked from the main app right after the model pipeline has been created.
1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
```
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
```

2) Once every potential Lora has been loaded and merged, add the following lines for a quick setup:
```
from mmgp import offload, profile_type
offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
```

You can choose between 5 profiles depending on your hardware:
- HighRAM_HighVRAM_Fastest: at least 48 GB of RAM and 24 GB of VRAM: the fastest, well suited for an RTX 3090 / RTX 4090
- HighRAM_LowVRAM_Fast (recommended): at least 48 GB of RAM and 12 GB of VRAM: a bit slower, better suited for an RTX 3070/3080/4070/4080, or for an RTX 3090 / RTX 4090 with large picture batches or long videos
- LowRAM_HighVRAM_Medium: at least 32 GB of RAM and 24 GB of VRAM: so-so speed, but suited for an RTX 3090 / RTX 4090 with limited RAM
- LowRAM_LowVRAM_Slow: at least 32 GB of RAM and 12 GB of VRAM: if you have little VRAM or generate longer videos
- VerylowRAM_LowVRAM_Slowest: at least 24 GB of RAM and 10 GB of VRAM: if you don't have much, it won't be fast, but maybe it will work

By default the 'transformer' will be quantized to 8 bits for all profiles. If you don't want that, you may pass the optional parameter *quantizeTransformer = False*, as in the sketch below.
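For example, a minimal sketch of the quick setup without on-the-fly quantization (this assumes *offload.profile* accepts the optional *quantizeTransformer* flag described here, the same way *offload.all* does):

```python
from mmgp import offload, profile_type

# Quick setup as above, but keep the 'transformer' in bfloat16 instead of
# quantizing it to 8 bits on the fly (assumed keyword, see the note above).
offload.profile(pipe, profile_type.HighRAM_HighVRAM_Fastest, quantizeTransformer=False)
```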
## Alternatively, you may want to create your own profile with specific parameters

For example:
```
from mmgp import offload
offload.all(pipe, pinInRAM=True, modelsToQuantize=["text_encoder_2"])
```
- pinInRAM: boolean (applied to all models) or list of model ids to pin in RAM. Every model pinned in RAM will load much faster (about 4 times), but this requires more RAM.
- modelsToQuantize: list of model ids to quantize on the fly. If the corresponding model is already quantized, this option is ignored.
- quantizeTransformer: boolean, True by default. The 'transformer' model in the pipe usually contains the video or image generator and is quantized on the fly to 8 bits by default. If you want to save disk space and reduce loading time, you may instead load a prequantized model directly. If you don't want to quantize the image generator at all, set *quantizeTransformer* to *False* to turn off on-the-fly quantization.
- budgets: either a number of megabytes (applied to all models; 0 means an unlimited budget) or a dictionary that maps model ids to megabytes. This defines the VRAM budget allocated to each model (in practice the real figure is about 2.5 times this number). The smaller the number, the more VRAM is left for image data / longer videos, but the slower the generation, because there will be a lot of loading / unloading between RAM and VRAM. Turning on pinInRAM greatly accelerates (about 4x) small budgets but usually consumes about 50% more RAM. A combined example is sketched right after this list.
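A minimal sketch combining these parameters (the parameter names are those documented above; the model ids and budget values are illustrative):

```python
from mmgp import offload

# Pin only the main generator in reserved RAM, quantize the second text encoder
# on the fly, and cap per-model VRAM budgets (values in megabytes, 0 = unlimited).
offload.all(
    pipe,
    pinInRAM=["transformer"],
    modelsToQuantize=["text_encoder_2"],
    budgets={"transformer": 3000, "text_encoder_2": 3000, "vae": 0},
)
```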
## Going further

The module includes several tools to package a light version of your favorite video / image generator:
- *save_model(model, file_path, do_quantize = False, quantization_type = qint8)*\
Save the tensors of a model already loaded in memory in safetensors format (much faster to reload). You can save it in a quantized format (the default qint8 quantization is recommended).
If the model is saved in a quantized format, an extra file ending with '_map.json' will be created; it is needed to reload the model.

- *load_model_data(model, file_path: str)*\
Load the tensor data of a file into a model that has already been initialized with no data. Quantized models previously saved with save_model are detected and handled.

- *fast_load_transformers_model(model_path: str)*\
Initialize (build the model hierarchy in memory) and fast-load the corresponding tensors of a 'transformers' library model.
The advantages over the original *from_pretrained* method are that the full model can fit into a single file with a filename of your choosing (therefore you can keep multiple 'transformers' versions of the same model in the same directory) and that prequantized models are processed transparently.
Please note that you need to keep the original transformers 'config.json' file in the same directory.

The typical workflow, sketched below, will be:
1) temporarily insert the *save_model* function just after a model has been fully loaded, to save a copy of the model / quantized model.
2) replace the full initializing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object), or replace only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data*, after the initializing logic.
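A minimal sketch of that workflow for a 'transformers' text encoder (the file name is illustrative, and this assumes the three tools are exposed on the *offload* class, as the imports in the earlier examples suggest):

```python
from mmgp import offload
from optimum.quanto import qint8

# Step 1 (one-off): save a quantized copy right after the model has been fully loaded.
offload.save_model(text_encoder_2, "t5_xxl_encoder_qint8.safetensors",
                   do_quantize=True, quantization_type=qint8)

# Step 2 (every run): skip from_pretrained and fast-load the single file instead.
# Keep the original transformers 'config.json' in the same directory.
text_encoder_2 = offload.fast_load_transformers_model("t5_xxl_encoder_qint8.safetensors")
```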
## Special cases
Sometimes there isn't an explicit pipe object, because each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
For instance:

- for Flux derived models:
```
pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae }
```
- for Mochi:
```
pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae": self.decoder }
```

Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model). A combined sketch follows below.
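Putting it together, a minimal sketch for a manually built pipe (variable names as in the Flux example above; this assumes the dictionary can be passed to *offload.profile* / *offload.all* in place of a pipeline object, as this section implies):

```python
from mmgp import offload, profile_type

# Manually mapped models; the main generator must be registered under the id "transformer".
pipe = {"text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae}
offload.profile(pipe, profile_type.LowRAM_LowVRAM_Slow)
```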
Be careful: lots of models use T5 XXL as a text encoder. However, quite often their pipeline configurations point at the official Google T5 XXL repository, which hosts a huge 40 GB model to download and load. This is cumbersome, as it is a 32-bit model and it contains the decoder part of T5, which is not used.
I suggest you instead use one of the 16-bit encoder-only versions available around, for instance:
```
text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
```

Sometimes just providing the pipe won't be sufficient, and you will need to change the content of the core model:
- For instance, you may need to disable existing CPU offload logic (such as manual calls that move tensors between cuda and the cpu).
- mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled: it will create tensors on the cpu device, and this may cause issues.

You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on Twitter @deepbeepmeep.

Thanks to
---------
- Huggingface / accelerate for the hooking examples
- Huggingface / quanto for their very useful quantizer
- gau-nernst for his RAM pinning samples
pyproject.toml
@@ -1,16 +1,16 @@
[project]
name = "mmgp"
-version = "2.0.2"
+version = "2.0.4"
authors = [
  { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
]
description = "Memory Management for the GPU Poor"
readme = "README.md"
requires-python = ">=3.10"
license = { file = "LICENSE.md" }
dependencies = [
    "torch >= 2.1.0",
    "optimum-quanto",
    "accelerate"
]
setup.cfg
@@ -1,4 +1,4 @@
-[egg_info]
-tag_build =
-tag_date = 0
+[egg_info]
+tag_build =
+tag_date = 0
src/mmgp.egg-info/PKG-INFO
@@ -1,137 +1,137 @@
-Version: 2.0.2
+Version: 2.0.4
The rest of this file is identical to the PKG-INFO diff shown above.
src/mmgp.py
@@ -680,7 +680,7 @@ class offload:
        if (budgets!= None or budget >0) :
            self.async_transfers = True

-       pinInRAM = True
+       #pinInRAM = True
        # compile not working yet or slower
        compile = False # True
        #quantizeTransformer = False
@@ -752,7 +752,7 @@ class offload:
                p._scale = p._scale.to(torch.bfloat16)
                current_model_size += torch.numel(p._scale) * sizeofbfloat16
                current_model_size += torch.numel(p._data) * sizeofbfloat16 / 2
-               if
+               if modelPinned and not already_loaded:
                    # Force flushing the lazy load so that reserved memory can be freed when we are ready to pin
                    p._scale = p._scale + 0
                    p._data = p._data + 0
@@ -760,7 +760,7 @@ class offload:
                if p.data.dtype == torch.float32:
                    # convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
                    p.data = p.data.to(torch.bfloat16)
-
+               elif modelPinned:
                    # force reading the tensors from the disk by pretending to modify them
                    p.data = p.data + 0
@@ -774,7 +774,7 @@ class offload:
                # force reading the tensors from the disk by pretending to modify them
                b.data = b.data + 0

-           current_model_size += torch.numel(
+           current_model_size += torch.numel(b.data) * b.data.element_size()

            if model_id not in self.models:
                self.models[model_id] = current_model
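For context, the first hunk stops hard-coding *pinInRAM = True*, so in 2.0.4 RAM pinning appears to be left to the caller even when a budget is set. A sketch under that assumption, using the parameters documented in the README above (the budget value is illustrative):

```python
from mmgp import offload

# 2.0.2 forced pinning on in this code path; in 2.0.4 you request it explicitly.
offload.all(pipe, pinInRAM=True, budgets=3000)
```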