mmgp 3.1.4.post1__tar.gz → 3.1.4.post151__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {mmgp-3.1.4.post1/src/mmgp.egg-info → mmgp-3.1.4.post151}/PKG-INFO +19 -12
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/README.md +19 -12
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/pyproject.toml +1 -1
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp/offload.py +609 -265
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp/safetensors2.py +10 -4
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151/src/mmgp.egg-info}/PKG-INFO +19 -12
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/LICENSE.md +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/setup.cfg +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/__init__.py +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp/__init__.py +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp.egg-info/SOURCES.txt +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp.egg-info/dependency_links.txt +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp.egg-info/requires.txt +0 -0
- {mmgp-3.1.4.post1 → mmgp-3.1.4.post151}/src/mmgp.egg-info/top_level.txt +0 -0
PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: mmgp
-Version: 3.1.4.post1
+Version: 3.1.4.post151
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
 
 
 <p align="center">
-<H2>Memory Management 3.1.4 for the GPU Poor by DeepBeepMeep</H2>
+<H2>Memory Management 3.1.4-151 for the GPU Poor by DeepBeepMeep</H2>
 </p>
 
 
@@ -44,21 +44,23 @@ Each profile may use a combination of the following:
 
 ## Sample applications that use mmgp
 It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
-- Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP
+- Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP :\
 A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM
 
-- HuanyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP
+- HuanyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP :\
 One of the best open source Text to Video generator
 
-- FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP
+- FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP :\
 One of the best inpainting / outpainting tools based on Flux that can run with less than 12 GB of VRAM.
 
-- Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP
+- Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP :\
 This application include two models: a text to world generator and a image / video to world (probably the best open source image to video generator).
 
-- OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP
+- OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP :\
 A Flux derived application very powerful that can be used to transfer an object of your choice in a prompted scene. With mmgp you can run it with only 6 GB of VRAM.
 
+- YuE GP: https://github.com/deepbeepmeep/YuEGP :\
+A great song generator (instruments + singer's voice) based on prompted Lyrics and a genre description. Thanks to mmgp you can run it with less than 10 GB of VRAM without waiting forever.
 
 ## Installation
 First you need to install the module in your current project with:
@@ -88,7 +90,7 @@ You can choose between 5 profiles depending on your hardware:
 - LowRAM_LowVRAM (4): at least 32 GB of RAM and 12 GB of VRAM : if you have little VRAM or want to generate longer videos / more images
 - VerylowRAM_LowVRAM (5): at least 24 GB of RAM and 10 GB of VRAM : if you don't have much it won't be fast but maybe it will work
 
-Profile 2 (High RAM) and 4 (Low RAM)are the most recommended profiles since they are versatile (support for long videos for a slight performance cost).\
+Profile 2 (High RAM) and 4 (Low RAM) are the most recommended profiles since they are versatile (support for long videos for a slight performance cost).\
 If you use Flux derived applciation profile 1 and 3 will offer much faster generation times.
 In any case, a safe approach is to start from profile 5 (default profile) and then go down progressively to profile 4 and then to profile 2 as long as the app remains responsive or doesn't trigger any out of memory error.
 
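For context when reading the profile hunk above: a host application applies one of these numbered profiles through mmgp's offload module. The sketch below is not part of this diff; the profile_type member names, the pipe keys and the use of a diffusers Flux pipeline are assumptions inferred from the README text.

```python
# Hedged sketch: apply profile 2 (HighRAM_LowVRAM) to a diffusers pipeline.
# Names below (profile_type members, pipe keys) are inferred from the README,
# not verified against the mmgp source.
from diffusers import FluxPipeline
from mmgp import offload, profile_type

# Load the pipeline on the CPU first; mmgp decides what moves to the GPU and when.
pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev")

pipe = {
    "transformer": pipeline.transformer,        # the image generator, quantized on the fly by default
    "text_encoder": pipeline.text_encoder,
    "text_encoder_2": pipeline.text_encoder_2,
    "vae": pipeline.vae,
}

# Start from profile 5 and move down to 4, then 2, while the app stays responsive.
offload.profile(pipe, profile_type.HighRAM_LowVRAM)
```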
@@ -114,11 +116,13 @@ For example:
 - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
 - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
 - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
-- budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the budget in
+- budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
 The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
+- workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
 - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
 - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
 - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
+- coTenantsMap: a dictionary that maps a model id to a list of other models with which it accepts to share the VRAM at the same time. This is useful to avoid unefficient loading / unloading when two models processes are interleaved. For instance *coTenantsMap = { "text_encoder_2": ["text_encoder"] }* , here when *text_encoder_2* is loaded it won't unload *text_encoder*. Please note that the reverse is not true as these maps by design are not symetrical to allow tailored workflows. If you need to have as well *text_encoder* that won't unload *text_encoder_2* if it is already loaded *coTenantsMap = { "text_encoder_2": ["text_encoder"], "text_encoder": ["text_encoder_2"] }*
 
 If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models direclty rather than using on the fly quantization, it will be faster and consume slightly less RAM.
 
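To make the budgets, workingVRAM and coTenantsMap entries added in this hunk concrete, here is a hedged sketch of how they might be passed. It assumes, as the README implies but this diff does not show, that offload.profile (or offload.all) accepts them as keyword arguments; the pipe variable is the dictionary built in the previous sketch, and the model ids and megabyte figures are illustrative only.

```python
# Hedged sketch of the options documented above (values are illustrative).
offload.profile(
    pipe,
    profile_type.LowRAM_LowVRAM,
    pinnedMemory=["transformer"],              # pin the big model to RAM for faster (re)loads
    budgets={"transformer": 3000, "*": 1000},  # VRAM budget in MB per model, "*" = default
    workingVRAM={"transformer": 2000},         # keep at least this much VRAM free for the data itself
    asyncTransfers=True,                       # prefetch the next model part while computing
    coTenantsMap={"text_encoder_2": ["text_encoder"]},  # text_encoder_2 won't evict text_encoder
    verboseLevel=1,
)
```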
@@ -126,11 +130,14 @@ If you are short on RAM and plan to work with quantized models, it is recommende
 
 The module includes several tools to package a light version of your favorite video / image generator:
 - *extract_models(string prefix, obj to explore)*\
-This tool will try to detect for you models that are embedded in a pipeline or in some custom class. It will save you time by building a pipe dictionary required
+This tool will try to detect for you models that are embedded in a pipeline or in some custom class. It will save you time by building a pipe dictionary required by *offload.all* or "offload.profile*. The prefix correponds to the text that will appear before the name of each model in the dictionary.
 
-- *load_loras_into_model(model, lora_path, lora_multi)*\
+- *load_loras_into_model(model, lora_path, lora_multi, activate_all_loras = True)*\
 Load in a model a list of Lora described by a list of path *lora_path* and a list of *weights coefficients*.
-The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions.
+The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions. By default all the load loras will be activated or they can be activated later using *activate_loras*.
+
+-*activate_loras(model, lora_nos, lora_multi = None )*\
+Activate the loras whose nos are in the list of nos. Every lora that is not this list and that was activated previously will be disactivated.
 
 - *save_model(model, file_path, do_quantize = False, quantizationType = qint8 )*\
 Save tensors of a model already loaded in memory in a safetensor format (much faster to reload). You can save it in a quantized format (default qint8 quantization recommended).
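The helper functions in the hunk above cover a typical packaging flow: discover the models, attach LoRAs, then save a prequantized copy. The sketch below, continuing the earlier Flux example, follows the signatures as printed in the README; that these helpers live on the offload module, that extract_models returns the pipe dictionary, and that lora indices are 0-based are all assumptions, and the file paths are placeholders.

```python
# Hedged sketch of the packaging helpers documented above; paths are placeholders.
pipe = offload.extract_models("flux", pipeline)  # assumed to return the pipe dictionary

# Load two LoRAs (diffusers format) without activating them immediately.
offload.load_loras_into_model(
    pipeline.transformer,
    ["loras/style_a.safetensors", "loras/style_b.safetensors"],  # placeholder paths
    [1.0, 0.8],                 # one weight coefficient per LoRA
    activate_all_loras=False,
)
offload.activate_loras(pipeline.transformer, [0])  # activate the first LoRA only (0-based assumed)

# Save a prequantized copy so later runs skip on-the-fly quantization.
offload.save_model(pipeline.transformer, "flux-transformer-qint8.safetensors", do_quantize=True)
```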
README.md

@@ -1,6 +1,6 @@
 
 <p align="center">
-<H2>Memory Management 3.1.4 for the GPU Poor by DeepBeepMeep</H2>
+<H2>Memory Management 3.1.4-151 for the GPU Poor by DeepBeepMeep</H2>
 </p>
 
 
@@ -27,21 +27,23 @@ Each profile may use a combination of the following:
 
 ## Sample applications that use mmgp
 It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
-- Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP
+- Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP :\
 A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM
 
-- HuanyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP
+- HuanyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP :\
 One of the best open source Text to Video generator
 
-- FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP
+- FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP :\
 One of the best inpainting / outpainting tools based on Flux that can run with less than 12 GB of VRAM.
 
-- Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP
+- Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP :\
 This application include two models: a text to world generator and a image / video to world (probably the best open source image to video generator).
 
-- OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP
+- OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP :\
 A Flux derived application very powerful that can be used to transfer an object of your choice in a prompted scene. With mmgp you can run it with only 6 GB of VRAM.
 
+- YuE GP: https://github.com/deepbeepmeep/YuEGP :\
+A great song generator (instruments + singer's voice) based on prompted Lyrics and a genre description. Thanks to mmgp you can run it with less than 10 GB of VRAM without waiting forever.
 
 ## Installation
 First you need to install the module in your current project with:
@@ -71,7 +73,7 @@ You can choose between 5 profiles depending on your hardware:
 - LowRAM_LowVRAM (4): at least 32 GB of RAM and 12 GB of VRAM : if you have little VRAM or want to generate longer videos / more images
 - VerylowRAM_LowVRAM (5): at least 24 GB of RAM and 10 GB of VRAM : if you don't have much it won't be fast but maybe it will work
 
-Profile 2 (High RAM) and 4 (Low RAM)are the most recommended profiles since they are versatile (support for long videos for a slight performance cost).\
+Profile 2 (High RAM) and 4 (Low RAM) are the most recommended profiles since they are versatile (support for long videos for a slight performance cost).\
 If you use Flux derived applciation profile 1 and 3 will offer much faster generation times.
 In any case, a safe approach is to start from profile 5 (default profile) and then go down progressively to profile 4 and then to profile 2 as long as the app remains responsive or doesn't trigger any out of memory error.
 
@@ -97,11 +99,13 @@ For example:
 - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
 - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
 - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
-- budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the budget in
+- budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
 The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
+- workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
 - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
 - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
 - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
+- coTenantsMap: a dictionary that maps a model id to a list of other models with which it accepts to share the VRAM at the same time. This is useful to avoid unefficient loading / unloading when two models processes are interleaved. For instance *coTenantsMap = { "text_encoder_2": ["text_encoder"] }* , here when *text_encoder_2* is loaded it won't unload *text_encoder*. Please note that the reverse is not true as these maps by design are not symetrical to allow tailored workflows. If you need to have as well *text_encoder* that won't unload *text_encoder_2* if it is already loaded *coTenantsMap = { "text_encoder_2": ["text_encoder"], "text_encoder": ["text_encoder_2"] }*
 
 If you are short on RAM and plan to work with quantized models, it is recommended to load pre-quantized models direclty rather than using on the fly quantization, it will be faster and consume slightly less RAM.
 
@@ -109,11 +113,14 @@ If you are short on RAM and plan to work with quantized models, it is recommende
 
 The module includes several tools to package a light version of your favorite video / image generator:
 - *extract_models(string prefix, obj to explore)*\
-This tool will try to detect for you models that are embedded in a pipeline or in some custom class. It will save you time by building a pipe dictionary required
+This tool will try to detect for you models that are embedded in a pipeline or in some custom class. It will save you time by building a pipe dictionary required by *offload.all* or "offload.profile*. The prefix correponds to the text that will appear before the name of each model in the dictionary.
 
-- *load_loras_into_model(model, lora_path, lora_multi)*\
+- *load_loras_into_model(model, lora_path, lora_multi, activate_all_loras = True)*\
 Load in a model a list of Lora described by a list of path *lora_path* and a list of *weights coefficients*.
-The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions.
+The Lora file must be in the *diffusers* format. This function works also on non diffusers models. However if there is already an official Lora support for a model it is recommended to use the official diffusers functions. By default all the load loras will be activated or they can be activated later using *activate_loras*.
+
+-*activate_loras(model, lora_nos, lora_multi = None )*\
+Activate the loras whose nos are in the list of nos. Every lora that is not this list and that was activated previously will be disactivated.
 
 - *save_model(model, file_path, do_quantize = False, quantizationType = qint8 )*\
 Save tensors of a model already loaded in memory in a safetensor format (much faster to reload). You can save it in a quantized format (default qint8 quantization recommended).
@@ -167,4 +174,4 @@ Thanks to
 ---------
 - Huggingface / accelerate for the hooking examples
 - Huggingface / quanto for their very useful quantizer
-- gau-nernst for his Pinnig RAM samples
+- gau-nernst for his Pinnig RAM samples