mmgp 1.2.0__py3-none-any.whl → 2.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
- mmgp-2.0.1.dist-info/METADATA +137 -0
- mmgp-2.0.1.dist-info/RECORD +7 -0
- mmgp.py +685 -155
- mmgp-1.2.0.dist-info/METADATA +0 -109
- mmgp-1.2.0.dist-info/RECORD +0 -7
- {mmgp-1.2.0.dist-info → mmgp-2.0.1.dist-info}/LICENSE.md +0 -0
- {mmgp-1.2.0.dist-info → mmgp-2.0.1.dist-info}/WHEEL +0 -0
- {mmgp-1.2.0.dist-info → mmgp-2.0.1.dist-info}/top_level.txt +0 -0
mmgp-2.0.1.dist-info/METADATA
ADDED
@@ -0,0 +1,137 @@

Metadata-Version: 2.1
Name: mmgp
Version: 2.0.1
Summary: Memory Management for the GPU Poor
Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
License: GNU GENERAL PUBLIC LICENSE
         Version 3, 29 June 2007
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: torch>=2.1.0
Requires-Dist: optimum-quanto
Requires-Dist: accelerate

<p align="center">
<H2>Memory Management 2.0 for the GPU Poor by DeepBeepMeep</H2>
</p>

This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU card limited to 12 to 24 GB of VRAM.
It is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several times in a pipe (e.g. the VAE).

Requirements:
- VRAM: minimum 12 GB, recommended 24 GB (RTX 3090 / RTX 4090)
- RAM: minimum 24 GB, recommended 48 GB

This module features 5 profiles so that the model can run at a decent speed on a low-end consumer config (32 GB of RAM and 12 GB of VRAM) and at a very good speed on a high-end consumer config (48 GB of RAM and 24 GB of VRAM).

Each profile may use the following:
- Smart preloading of models in RAM to reduce RAM requirements
- Smart automated loading / unloading of models in the GPU to avoid unloading models that may be needed again soon
- Smart slicing of models to reduce the memory they occupy in VRAM
- Ability to pin models in reserved RAM to accelerate transfers to VRAM
- Async transfers to VRAM to avoid a pause when loading a new slice of a model
- Automated on-the-fly quantization, or the ability to load already quantized models

## Installation
First install the module in your current project with:
```shell
pip install mmgp
```


## Usage

It is almost plug and play and just needs to be invoked from the main app right after the model pipeline has been created.
1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
```
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
```

2) Once every potential Lora has been loaded and merged, add the following lines for a quick setup:
```
from mmgp import offload, profile_type
offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
```

You can choose between 5 profiles depending on your hardware:
- HighRAM_HighVRAM_Fastest: at least 48 GB of RAM and 24 GB of VRAM: the fastest option, well suited for an RTX 3090 / RTX 4090
- HighRAM_LowVRAM_Fast (recommended): at least 48 GB of RAM and 12 GB of VRAM: a bit slower, better suited for an RTX 3070/3080/4070/4080, or for an RTX 3090 / RTX 4090 with large picture batches or long videos
- LowRAM_HighVRAM_Medium: at least 32 GB of RAM and 24 GB of VRAM: middling speed, but adapted to an RTX 3090 / RTX 4090 with limited RAM
- LowRAM_LowVRAM_Slow: at least 32 GB of RAM and 12 GB of VRAM: if you have little VRAM or generate longer videos
- VerylowRAM_LowVRAM_Slowest: at least 24 GB of RAM and 10 GB of VRAM: if you don't have much hardware it won't be fast, but it may still work

By default the 'transformer' model will be quantized to 8 bits for all profiles. If you don't want that, you may pass the optional parameter *quantizeTransformer = False*.
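For example, a minimal sketch that keeps the image / video generator unquantized (it reuses the `pipe` object created in step 1; `profile` simply forwards the flag to `offload.all`):
```
from mmgp import offload, profile_type

# Keep the generator in 16 bits (needs more VRAM, best image quality)
# while still applying the memory management profile.
offload.profile(pipe, profile_type.HighRAM_HighVRAM_Fastest, quantizeTransformer=False)
```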

## Alternatively, you may want to create your own profile with specific parameters

For example:
```
from mmgp import offload
offload.all(pipe, pinInRAM=True, modelsToQuantize = ["text_encoder_2"] )
```
- pinInRAM: either a boolean (applied to all models) or a list of model ids to pin in RAM. Every model pinned in RAM loads much faster (about 4 times) but this requires more RAM.
- modelsToQuantize: list of model ids to quantize on the fly. If the corresponding model is already quantized, this option is ignored.
- quantizeTransformer: boolean, True by default. The 'transformer' model in the pipe usually contains the video or image generator and is quantized on the fly to 8 bits by default. If you want to save disk space and reduce loading time, you may prefer to load a prequantized model directly; if you don't want the image generator to be quantized at all, set *quantizeTransformer* to *False* to turn off on-the-fly quantization.
- budgets: either a number of megabytes (applied to all models; 0 means unlimited) or a dictionary that maps model ids to megabytes. It defines the VRAM budget allocated to each model (in practice the real usage is about 2.5 times this number). The smaller this number, the more VRAM is left for image data / longer videos, but also the slower the generation, because there will be lots of loading / unloading between RAM and VRAM. Turning on pinInRAM greatly accelerates (about 4x) small budgets but usually consumes about 50% more RAM.

A combined example of these parameters is sketched below.
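The sketch below combines these options; the model ids must match the ones actually present in your pipe, and the budget values are only illustrative:
```
from mmgp import offload

offload.all(
    pipe,
    pinInRAM=["transformer"],                              # list form: pin only the main generator in reserved RAM
    modelsToQuantize=["text_encoder_2"],                   # also quantize the T5 text encoder on the fly
    budgets={"transformer": 600, "text_encoder_2": 3000},  # per-model VRAM budgets in MB
    verboseLevel=2,                                        # print what gets loaded / unloaded and the submodel sizes
)
```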

## Going further

The module includes several tools to package a light version of your favorite video / image generator:
- *save_model(model, file_path, do_quantize = False, quantization_type = qint8)*\
Save the tensors of a model already loaded in memory in safetensors format (much faster to reload). You can save it in a quantized format (the default qint8 quantization is recommended).
If the model is saved in a quantized format, an extra file ending in '_map.json' is created and is needed to reload the model later.

- *load_model_data(model, file_path: str)*\
Load into RAM the tensor data of a model that has already been initialized without data. Detects and handles quantized models previously saved with save_model.

- *fast_load_transformers_model(model_path: str)*\
Initialize (build the model hierarchy in memory) and fast-load the corresponding tensors of a 'transformers' library model.
The advantages over the original *from_pretrained* method are that the full model fits into a single file with a filename of your choosing (therefore you can keep multiple 'transformers' versions of the same model in the same directory) and that prequantized models are handled transparently.
Please note that you need to keep the original transformers 'config.json' file in the same directory.


The typical workflow will be:
1) temporarily insert the *save_model* function just after a model has been fully loaded, to save a copy of the model / quantized model.
2) replace the full initializing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call on a transformers object), or replace only the tensor loading calls (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data* after the initializing logic.

A sketch of this workflow is shown right after this list.
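This is a minimal sketch for a text encoder; the variable and file names are placeholders (note that in this release load_model_data detects quantized files by the substring 'quanto' in the file name, so it is kept in the example):
```
from mmgp import offload

# One-off step: after the text encoder has been fully loaded the usual way,
# save a quantized single-file copy (this also writes 't5_xxl_quanto_map.json').
offload.save_model(text_encoder, "t5_xxl_quanto.safetensors", do_quantize=True)

# From then on: rebuild the model and load the quantized weights in one call
# instead of the original from_pretrained logic ('config.json' must sit in the same directory).
text_encoder = offload.fast_load_transformers_model("t5_xxl_quanto.safetensors")
```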

## Special cases
Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
For instance:


- for flux derived models:
```
pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
```
- for mochi:
```
pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
```


Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).

Be careful: lots of models use T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository, where there is a huge 40 GB model to download and load. This is cumbersome, as it is a 32-bit model and it contains the decoder part of T5, which is not used.
I suggest you use instead one of the 16-bit encoder-only versions available around, for instance:
```
text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
```
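
Putting the pieces together, here is a sketch for a Flux-derived app that loads its submodels separately (`clip`, `t5`, `model` and `ae` stand for whatever objects your app has already created on the CPU):
```
from mmgp import offload, profile_type

# Manual 'pipe': a plain dict of CPU-loaded submodels; the main generator
# must be registered under the mandatory 'transformer' id.
pipe = {"text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae}

# The dict can be passed wherever a pipeline object is expected.
offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
```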

Sometimes just providing the pipe won't be sufficient, and you may need to change the content of the core model:
- For instance, you may need to disable CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu).
- mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled and will create tensors on the cpu device, which may cause issues.

You are free to use my module for non commercial use as long as you give me proper credit. You may contact me on twitter @deepbeepmeep

Thanks to
---------
- Huggingface / accelerate for the hooking examples
- Huggingface / quanto for their very useful quantizer
- gau-nernst for his Pinning RAM samples
mmgp-2.0.1.dist-info/RECORD
ADDED
@@ -0,0 +1,7 @@
__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
mmgp.py,sha256=UuAUF76QIve6j2qEkYucxaOq4uo9mKptP4AdkQKr8Eg,45152
mmgp-2.0.1.dist-info/LICENSE.md,sha256=HjzvY2grdtdduZclbZ46B2M-XpT4MDCxFub5ZwTWq2g,93
mmgp-2.0.1.dist-info/METADATA,sha256=y-6bIJqU6FrX4NMVXheTjs7n2PeoG-kilyyULqgxnt4,8601
mmgp-2.0.1.dist-info/WHEEL,sha256=PZUExdf71Ui_so67QXpySuHtCi3-J3wvF4ORK6k_S8U,91
mmgp-2.0.1.dist-info/top_level.txt,sha256=waGaepj2qVfnS2yAOkaMu4r9mJaVjGbEi6AwOUogU_U,14
mmgp-2.0.1.dist-info/RECORD,,
mmgp.py
CHANGED

@@ -1,24 +1,28 @@
-# ------------------ Memory Management for the GPU Poor by DeepBeepMeep (mmgp)------------------
+# ------------------ Memory Management 2.0 for the GPU Poor by DeepBeepMeep (mmgp)------------------
 #
 # This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
 # This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
 # times in a pipe (eg VAE).
 #
 # Requirements:
-# -
-# - RAM: minimum
+# - VRAM: minimum 12 GB, recommended 24 GB (RTX 3090/ RTX 4090)
+# - RAM: minimum 24 GB, recommended 48 - 64 GB
 #
 # It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
 # 1) First make sure that the pipeline explictly loads the models in the CPU device
 # for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
 # 2) Once every potential Lora has been loaded and merged, add the following lines:
+# For a quick setup, you may want to choose between 4 profiles depending on your hardware, for instance:
+# from mmgp import offload, profile_type
+# offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
+# Alternatively you may want to your own parameters, for instance:
 # from mmgp import offload
-# offload.all(pipe)
+# offload.all(pipe, pinInRAM=true, modelsToQuantize = ["text_encoder_2"] )
 # The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits so that it can fit into 24 GB of VRAM.
 # If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option quantizeTransformer to False to turn off on the fly quantization.
 # You can specify a list of additional models string ids to quantize (for instance the text_encoder) using the optional argument modelsToQuantize. This may be useful if you have less than 48 GB of RAM.
 # Note that there is little advantage on the GPU / VRAM side to quantize text encoders as their inputs are usually quite light.
-# Conversely if you have more than
+# Conversely if you have more than 48GB RAM you may want to enable RAM pinning with the option pinInRAM = True. You will get in return super fast loading / unloading of models
 # (this can save significant time if the same pipeline is run multiple times in a row)
 #
 # Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
@@ -53,10 +57,15 @@ import torch
 import gc
 import time
 import functools
+import sys
+import json
+
 from optimum.quanto import freeze, qfloat8, qint8, quantize, QModuleMixin, QTensor



+ONE_MB = 1048576
+
 cotenants_map = {
     "text_encoder": ["vae", "text_encoder_2"],
     "text_encoder_2": ["vae", "text_encoder"],
@@ -79,10 +88,107 @@ def move_tensors(obj, device):
|
|
|
79
88
|
else:
|
|
80
89
|
raise TypeError("Tensor or list / dict of tensors expected")
|
|
81
90
|
|
|
91
|
+
def _quantize(model_to_quantize, weights=qint8, verboseLevel = 1, threshold = 1000000000, model_id = None):
|
|
92
|
+
|
|
93
|
+
sizeofbfloat16 = torch.bfloat16.itemsize
|
|
94
|
+
|
|
95
|
+
def compute_submodule_size(submodule):
|
|
96
|
+
size = 0
|
|
97
|
+
for p in submodule.parameters(recurse=False):
|
|
98
|
+
size += torch.numel(p.data) * sizeofbfloat16
|
|
99
|
+
|
|
100
|
+
for p in submodule.buffers(recurse=False):
|
|
101
|
+
size += torch.numel(p.data) * sizeofbfloat16
|
|
102
|
+
|
|
103
|
+
return size
|
|
104
|
+
|
|
105
|
+
total_size =0
|
|
106
|
+
total_excluded = 0
|
|
107
|
+
exclude_list = []
|
|
108
|
+
submodule_size = 0
|
|
109
|
+
submodule_names = []
|
|
110
|
+
cur_blocks_prefix = None
|
|
111
|
+
prev_blocks_prefix = None
|
|
112
|
+
|
|
113
|
+
print(f"Quantization of model '{model_id}' started")
|
|
114
|
+
|
|
115
|
+
for submodule_name, submodule in model_to_quantize.named_modules():
|
|
116
|
+
if isinstance(submodule, QModuleMixin):
|
|
117
|
+
if verboseLevel>=1:
|
|
118
|
+
print("No quantization to do as model is already quantized")
|
|
119
|
+
return False
|
|
120
|
+
|
|
121
|
+
|
|
122
|
+
if submodule_name=='':
|
|
123
|
+
continue
|
|
124
|
+
|
|
125
|
+
|
|
126
|
+
flush = False
|
|
127
|
+
if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
|
|
128
|
+
if cur_blocks_prefix == None:
|
|
129
|
+
cur_blocks_prefix = submodule_name + "."
|
|
130
|
+
flush = True
|
|
131
|
+
else:
|
|
132
|
+
#if cur_blocks_prefix != submodule_name[:len(cur_blocks_prefix)]:
|
|
133
|
+
if not submodule_name.startswith(cur_blocks_prefix):
|
|
134
|
+
cur_blocks_prefix = submodule_name + "."
|
|
135
|
+
flush = True
|
|
136
|
+
else:
|
|
137
|
+
if cur_blocks_prefix is not None:
|
|
138
|
+
#if not cur_blocks_prefix == submodule_name[0:len(cur_blocks_prefix)]:
|
|
139
|
+
if not submodule_name.startswith(cur_blocks_prefix):
|
|
140
|
+
cur_blocks_prefix = None
|
|
141
|
+
flush = True
|
|
142
|
+
|
|
143
|
+
if flush:
|
|
144
|
+
if submodule_size <= threshold:
|
|
145
|
+
exclude_list += submodule_names
|
|
146
|
+
if verboseLevel >=2:
|
|
147
|
+
print(f"Excluded size {submodule_size/ONE_MB:.1f} MB: {prev_blocks_prefix} : {submodule_names}")
|
|
148
|
+
total_excluded += submodule_size
|
|
149
|
+
|
|
150
|
+
submodule_size = 0
|
|
151
|
+
submodule_names = []
|
|
152
|
+
prev_blocks_prefix = cur_blocks_prefix
|
|
153
|
+
size = compute_submodule_size(submodule)
|
|
154
|
+
submodule_size += size
|
|
155
|
+
total_size += size
|
|
156
|
+
submodule_names.append(submodule_name)
|
|
157
|
+
|
|
158
|
+
if submodule_size > 0 and submodule_size <= threshold:
|
|
159
|
+
exclude_list += submodule_names
|
|
160
|
+
if verboseLevel >=2:
|
|
161
|
+
print(f"Excluded size {submodule_size/ONE_MB:.1f} MB: {prev_blocks_prefix} : {submodule_names}")
|
|
162
|
+
total_excluded += submodule_size
|
|
163
|
+
|
|
164
|
+
perc_excluded =total_excluded/ total_size if total_size >0 else 1
|
|
165
|
+
if verboseLevel >=2:
|
|
166
|
+
print(f"Total Excluded {total_excluded/ONE_MB:.1f} MB oF {total_size/ONE_MB:.1f} that is {perc_excluded*100:.2f}%")
|
|
167
|
+
if perc_excluded >= 0.10:
|
|
168
|
+
print(f"Too many many modules are excluded, there is something wrong with the selection, switch back to full quantization.")
|
|
169
|
+
exclude_list = None
|
|
170
|
+
|
|
171
|
+
# we are obviously loading a model that has been already quantized
|
|
172
|
+
|
|
173
|
+
quantize(model_to_quantize,weights, exclude= exclude_list)
|
|
174
|
+
freeze(model_to_quantize)
|
|
175
|
+
torch.cuda.empty_cache()
|
|
176
|
+
gc.collect()
|
|
177
|
+
print(f"Quantization of model '{model_id}' done")
|
|
178
|
+
|
|
179
|
+
return True
|
|
82
180
|
|
|
83
181
|
def get_model_name(model):
|
|
84
182
|
return model.name
|
|
85
183
|
|
|
184
|
+
import enum
|
|
185
|
+
class profile_type(int, enum.Enum):
|
|
186
|
+
HighRAM_HighVRAM_Fastest = 1
|
|
187
|
+
HighRAM_LowVRAM_Fast = 2
|
|
188
|
+
LowRAM_HighVRAM_Medium = 3
|
|
189
|
+
LowRAM_LowVRAM_Slow = 4
|
|
190
|
+
VerylowRAM_LowVRAM_Slowest = 5
|
|
191
|
+
|
|
86
192
|
class HfHook:
|
|
87
193
|
def __init__(self):
|
|
88
194
|
self.execution_device = "cuda"
|
|
@@ -94,28 +200,57 @@ class offload:
|
|
|
94
200
|
def __init__(self):
|
|
95
201
|
self.active_models = []
|
|
96
202
|
self.active_models_ids = []
|
|
203
|
+
self.active_subcaches = {}
|
|
97
204
|
self.models = {}
|
|
98
|
-
self.
|
|
205
|
+
self.verboseLevel = 0
|
|
99
206
|
self.models_to_quantize = []
|
|
100
207
|
self.pinned_modules_data = {}
|
|
101
|
-
self.
|
|
102
|
-
self.
|
|
208
|
+
self.blocks_of_modules = {}
|
|
209
|
+
self.blocks_of_modules_sizes = {}
|
|
210
|
+
self.compile = False
|
|
103
211
|
self.device_mem_capacity = torch.cuda.get_device_properties(0).total_memory
|
|
104
212
|
self.last_reserved_mem_check =0
|
|
213
|
+
self.loaded_blocks = {}
|
|
214
|
+
self.prev_blocks_names = {}
|
|
215
|
+
self.next_blocks_names = {}
|
|
216
|
+
self.default_stream = torch.cuda.default_stream(torch.device("cuda")) # torch.cuda.current_stream()
|
|
217
|
+
self.transfer_stream = torch.cuda.Stream()
|
|
218
|
+
self.async_transfers = False
|
|
219
|
+
|
|
105
220
|
|
|
106
|
-
def
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
|
|
221
|
+
def add_module_to_blocks(self, model_id, blocks_name, submodule, prev_block_name):
|
|
222
|
+
|
|
223
|
+
entry_name = model_id if blocks_name is None else model_id + "/" + blocks_name
|
|
224
|
+
if entry_name in self.blocks_of_modules:
|
|
225
|
+
blocks_params = self.blocks_of_modules[entry_name]
|
|
226
|
+
blocks_params_size = self.blocks_of_modules_sizes[entry_name]
|
|
112
227
|
else:
|
|
113
|
-
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
self.
|
|
228
|
+
blocks_params = []
|
|
229
|
+
self.blocks_of_modules[entry_name] = blocks_params
|
|
230
|
+
blocks_params_size = 0
|
|
231
|
+
if blocks_name !=None:
|
|
232
|
+
prev_entry_name = None if prev_block_name == None else model_id + "/" + prev_block_name
|
|
233
|
+
self.prev_blocks_names[entry_name] = prev_entry_name
|
|
234
|
+
if not prev_block_name == None:
|
|
235
|
+
self.next_blocks_names[prev_entry_name] = entry_name
|
|
236
|
+
|
|
237
|
+
for p in submodule.parameters(recurse=False):
|
|
238
|
+
blocks_params.append(p)
|
|
239
|
+
if isinstance(p, QTensor):
|
|
240
|
+
blocks_params_size += p._data.nbytes
|
|
241
|
+
blocks_params_size += p._scale.nbytes
|
|
242
|
+
else:
|
|
243
|
+
blocks_params_size += p.data.nbytes
|
|
244
|
+
|
|
245
|
+
for p in submodule.buffers(recurse=False):
|
|
246
|
+
blocks_params.append(p)
|
|
247
|
+
blocks_params_size += p.data.nbytes
|
|
248
|
+
|
|
249
|
+
|
|
250
|
+
self.blocks_of_modules_sizes[entry_name] = blocks_params_size
|
|
251
|
+
|
|
252
|
+
return blocks_params_size
|
|
253
|
+
|
|
119
254
|
|
|
120
255
|
def can_model_be_cotenant(self, model_id):
|
|
121
256
|
potential_cotenants= cotenants_map.get(model_id, None)
|
|
@@ -126,45 +261,113 @@ class offload:
|
|
|
126
261
|
return False
|
|
127
262
|
return True
|
|
128
263
|
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
264
|
+
@torch.compiler.disable()
|
|
265
|
+
def gpu_load_blocks(self, model_id, blocks_name, async_load = False):
|
|
266
|
+
if blocks_name != None:
|
|
267
|
+
self.loaded_blocks[model_id] = blocks_name
|
|
268
|
+
|
|
269
|
+
def cpu_to_gpu(stream_to_use, blocks_params, record_for_stream = None):
|
|
270
|
+
with torch.cuda.stream(stream_to_use):
|
|
271
|
+
for p in blocks_params:
|
|
272
|
+
if isinstance(p, QTensor):
|
|
273
|
+
p._data = p._data.cuda(non_blocking=True)
|
|
274
|
+
p._scale = p._scale.cuda(non_blocking=True)
|
|
275
|
+
else:
|
|
276
|
+
p.data = p.data.cuda(non_blocking=True)
|
|
277
|
+
|
|
278
|
+
if record_for_stream != None:
|
|
279
|
+
if isinstance(p, QTensor):
|
|
280
|
+
p._data.record_stream(record_for_stream)
|
|
281
|
+
p._scale.record_stream(record_for_stream)
|
|
282
|
+
else:
|
|
283
|
+
p.data.record_stream(record_for_stream)
|
|
284
|
+
|
|
285
|
+
|
|
286
|
+
entry_name = model_id if blocks_name is None else model_id + "/" + blocks_name
|
|
287
|
+
if self.verboseLevel >=2:
|
|
288
|
+
model = self.models[model_id]
|
|
134
289
|
model_name = model._get_name()
|
|
135
|
-
print(f"Loading model {
|
|
136
|
-
|
|
137
|
-
|
|
290
|
+
print(f"Loading model {entry_name} ({model_name}) in GPU")
|
|
291
|
+
|
|
292
|
+
|
|
293
|
+
if self.async_transfers and blocks_name != None:
|
|
294
|
+
first = self.prev_blocks_names[entry_name] == None
|
|
295
|
+
next_blocks_entry = self.next_blocks_names[entry_name] if entry_name in self.next_blocks_names else None
|
|
296
|
+
if first:
|
|
297
|
+
cpu_to_gpu(torch.cuda.current_stream(), self.blocks_of_modules[entry_name])
|
|
298
|
+
# if next_blocks_entry != None:
|
|
299
|
+
# self.transfer_stream.wait_stream(self.default_stream)
|
|
300
|
+
# else:
|
|
301
|
+
# self.transfer_stream.wait_stream(self.default_stream)
|
|
302
|
+
torch.cuda.synchronize()
|
|
303
|
+
|
|
304
|
+
if next_blocks_entry != None:
|
|
305
|
+
cpu_to_gpu(self.transfer_stream, self.blocks_of_modules[next_blocks_entry]) #, self.default_stream
|
|
306
|
+
|
|
138
307
|
else:
|
|
139
|
-
|
|
140
|
-
|
|
308
|
+
# if self.async_transfers:
|
|
309
|
+
# self.transfer_stream.wait_stream(self.default_stream)
|
|
310
|
+
cpu_to_gpu(self.default_stream, self.blocks_of_modules[entry_name])
|
|
311
|
+
torch.cuda.synchronize()
|
|
312
|
+
|
|
313
|
+
|
|
314
|
+
@torch.compiler.disable()
|
|
315
|
+
def gpu_unload_blocks(self, model_id, blocks_name):
|
|
316
|
+
if blocks_name != None:
|
|
317
|
+
self.loaded_blocks[model_id] = None
|
|
318
|
+
|
|
319
|
+
blocks_name = model_id if blocks_name is None else model_id + "/" + blocks_name
|
|
320
|
+
|
|
321
|
+
if self.verboseLevel >=2:
|
|
322
|
+
model = self.models[model_id]
|
|
323
|
+
model_name = model._get_name()
|
|
324
|
+
print(f"Unloading model {blocks_name} ({model_name}) from GPU")
|
|
325
|
+
|
|
326
|
+
blocks_params = self.blocks_of_modules[blocks_name]
|
|
327
|
+
|
|
328
|
+
if model_id in self.pinned_modules_data:
|
|
329
|
+
pinned_parameters_data = self.pinned_modules_data[model_id]
|
|
330
|
+
for p in blocks_params:
|
|
141
331
|
if isinstance(p, QTensor):
|
|
142
|
-
|
|
143
|
-
p.
|
|
332
|
+
data = pinned_parameters_data[p]
|
|
333
|
+
p._data = data[0]
|
|
334
|
+
p._scale = data[1]
|
|
144
335
|
else:
|
|
145
|
-
p.data = p
|
|
146
|
-
|
|
336
|
+
p.data = pinned_parameters_data[p]
|
|
337
|
+
else:
|
|
338
|
+
for p in blocks_params:
|
|
339
|
+
if isinstance(p, QTensor):
|
|
340
|
+
p._data = p._data.cpu()
|
|
341
|
+
p._scale = p._scale.cpu()
|
|
342
|
+
else:
|
|
343
|
+
p.data = p.data.cpu()
|
|
344
|
+
|
|
345
|
+
|
|
346
|
+
|
|
147
347
|
@torch.compiler.disable()
|
|
348
|
+
def gpu_load(self, model_id):
|
|
349
|
+
model = self.models[model_id]
|
|
350
|
+
self.active_models.append(model)
|
|
351
|
+
self.active_models_ids.append(model_id)
|
|
352
|
+
|
|
353
|
+
self.gpu_load_blocks(model_id, None)
|
|
354
|
+
|
|
355
|
+
# torch.cuda.current_stream().synchronize()
|
|
356
|
+
|
|
148
357
|
def unload_all(self):
|
|
149
|
-
for
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
for p in module_params:
|
|
156
|
-
if isinstance(p, QTensor):
|
|
157
|
-
data = pinned_parameters_data[p]
|
|
158
|
-
p._data = data[0]
|
|
159
|
-
p._scale = data[1]
|
|
160
|
-
else:
|
|
161
|
-
p.data = pinned_parameters_data[p]
|
|
162
|
-
|
|
358
|
+
for model_id in self.active_models_ids:
|
|
359
|
+
self.gpu_unload_blocks(model_id, None)
|
|
360
|
+
loaded_block = self.loaded_blocks[model_id]
|
|
361
|
+
if loaded_block != None:
|
|
362
|
+
self.gpu_unload_blocks(model_id, loaded_block)
|
|
363
|
+
self.loaded_blocks[model_id] = None
|
|
163
364
|
|
|
164
365
|
self.active_models = []
|
|
165
366
|
self.active_models_ids = []
|
|
367
|
+
self.active_subcaches = []
|
|
166
368
|
torch.cuda.empty_cache()
|
|
167
369
|
gc.collect()
|
|
370
|
+
self.last_reserved_mem_check = time.time()
|
|
168
371
|
|
|
169
372
|
def move_args_to_gpu(self, *args, **kwargs):
|
|
170
373
|
new_args= []
|
|
@@ -188,10 +391,12 @@ class offload:
|
|
|
188
391
|
|
|
189
392
|
return new_args, new_kwargs
|
|
190
393
|
|
|
191
|
-
def ready_to_check_mem(self
|
|
394
|
+
def ready_to_check_mem(self):
|
|
395
|
+
if self.compile:
|
|
396
|
+
return
|
|
192
397
|
cur_clock = time.time()
|
|
193
398
|
# can't check at each call if we can empty the cuda cache as quering the reserved memory value is a time consuming operation
|
|
194
|
-
if
|
|
399
|
+
if (cur_clock - self.last_reserved_mem_check)<0.200:
|
|
195
400
|
return False
|
|
196
401
|
self.last_reserved_mem_check = cur_clock
|
|
197
402
|
return True
|
|
@@ -199,20 +404,70 @@ class offload:
|
|
|
199
404
|
|
|
200
405
|
def empty_cache_if_needed(self):
|
|
201
406
|
mem_reserved = torch.cuda.memory_reserved()
|
|
202
|
-
|
|
407
|
+
mem_threshold = 0.9*self.device_mem_capacity
|
|
408
|
+
if mem_reserved >= mem_threshold:
|
|
203
409
|
mem_allocated = torch.cuda.memory_allocated()
|
|
204
410
|
if mem_allocated <= 0.70 * mem_reserved:
|
|
205
411
|
# print(f"Cuda empty cache triggered as Allocated Memory ({mem_allocated/1024000:0f} MB) is lot less than Cached Memory ({mem_reserved/1024000:0f} MB) ")
|
|
206
412
|
torch.cuda.empty_cache()
|
|
413
|
+
tm= time.time()
|
|
414
|
+
if self.verboseLevel >=2:
|
|
415
|
+
print(f"Empty Cuda cache at {tm}")
|
|
207
416
|
# print(f"New cached memory after purge is {torch.cuda.memory_reserved()/1024000:0f} MB) ")
|
|
208
417
|
|
|
209
|
-
|
|
210
|
-
|
|
211
|
-
|
|
418
|
+
|
|
419
|
+
def any_param_or_buffer(self, target_module: torch.nn.Module):
|
|
420
|
+
|
|
421
|
+
for _ in target_module.parameters(recurse= False):
|
|
422
|
+
return True
|
|
423
|
+
|
|
424
|
+
for _ in target_module.buffers(recurse= False):
|
|
425
|
+
return True
|
|
426
|
+
|
|
427
|
+
return False
|
|
428
|
+
|
|
429
|
+
|
|
430
|
+
|
|
431
|
+
def hook_me_light(self, target_module, model_id,blocks_name, previous_method, context):
|
|
432
|
+
|
|
433
|
+
anyParam = self.any_param_or_buffer(target_module)
|
|
434
|
+
|
|
435
|
+
def check_empty_cuda_cache(module, *args, **kwargs):
|
|
436
|
+
if self.ready_to_check_mem():
|
|
212
437
|
self.empty_cache_if_needed()
|
|
213
438
|
return previous_method(*args, **kwargs)
|
|
214
|
-
|
|
215
|
-
|
|
439
|
+
|
|
440
|
+
|
|
441
|
+
def load_module_blocks(module, *args, **kwargs):
|
|
442
|
+
#some_context = context #for debugging
|
|
443
|
+
if blocks_name == None:
|
|
444
|
+
if self.ready_to_check_mem():
|
|
445
|
+
self.empty_cache_if_needed()
|
|
446
|
+
else:
|
|
447
|
+
loaded_block = self.loaded_blocks[model_id]
|
|
448
|
+
if (loaded_block == None or loaded_block != blocks_name) :
|
|
449
|
+
if loaded_block != None:
|
|
450
|
+
self.gpu_unload_blocks(model_id, loaded_block)
|
|
451
|
+
if self.ready_to_check_mem():
|
|
452
|
+
self.empty_cache_if_needed()
|
|
453
|
+
self.loaded_blocks[model_id] = blocks_name
|
|
454
|
+
self.gpu_load_blocks(model_id, blocks_name)
|
|
455
|
+
return previous_method(*args, **kwargs)
|
|
456
|
+
|
|
457
|
+
if hasattr(target_module, "_mm_id"):
|
|
458
|
+
orig_model_id = getattr(target_module, "_mm_id")
|
|
459
|
+
if self.verboseLevel >=2:
|
|
460
|
+
print(f"Model '{model_id}' shares module '{target_module._get_name()}' with module '{orig_model_id}' ")
|
|
461
|
+
assert not anyParam
|
|
462
|
+
return
|
|
463
|
+
setattr(target_module, "_mm_id", model_id)
|
|
464
|
+
|
|
465
|
+
|
|
466
|
+
if blocks_name != None and anyParam:
|
|
467
|
+
setattr(target_module, "forward", functools.update_wrapper(functools.partial(load_module_blocks, target_module), previous_method) )
|
|
468
|
+
#print(f"new cache:{blocks_name}")
|
|
469
|
+
else:
|
|
470
|
+
setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_empty_cuda_cache, target_module), previous_method) )
|
|
216
471
|
|
|
217
472
|
|
|
218
473
|
def hook_me(self, target_module, model, model_id, module_id, previous_method):
|
|
@@ -236,13 +491,9 @@ class offload:
|
|
|
236
491
|
return
|
|
237
492
|
setattr(target_module, "_mm_id", model_id)
|
|
238
493
|
|
|
239
|
-
# create a fake accelerate parameter so that the _execution_device property returns always "cuda"
|
|
240
|
-
# (it is queried in many pipelines even if offloading is not properly implemented)
|
|
241
|
-
if not hasattr(target_module, "_hf_hook"):
|
|
242
|
-
setattr(target_module, "_hf_hook", HfHook())
|
|
243
494
|
setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_change_module, target_module), previous_method) )
|
|
244
495
|
|
|
245
|
-
if not self.
|
|
496
|
+
if not self.verboseLevel >=1:
|
|
246
497
|
return
|
|
247
498
|
|
|
248
499
|
if module_id == None or module_id =='':
|
|
@@ -262,22 +513,185 @@ class offload:
|
|
|
262
513
|
# self.unhook_module(module)
|
|
263
514
|
|
|
264
515
|
|
|
516
|
+
@staticmethod
|
|
517
|
+
def fast_load_transformers_model(model_path: str):
|
|
518
|
+
"""
|
|
519
|
+
quick version of .LoadfromPretrained of the transformers library
|
|
520
|
+
used to build a model and load the corresponding weights (quantized or not)
|
|
521
|
+
"""
|
|
522
|
+
|
|
523
|
+
from transformers import AutoConfig
|
|
524
|
+
|
|
525
|
+
if model_path.endswith(".sft") or model_path.endswith(".safetensors"):
|
|
526
|
+
config_path = model_path[ : model_path.rfind("/")]
|
|
527
|
+
else:
|
|
528
|
+
raise("full model path expected")
|
|
529
|
+
config_fullpath = config_path +"/config.json"
|
|
530
|
+
|
|
531
|
+
import os.path
|
|
532
|
+
if not os.path.isfile(config_fullpath):
|
|
533
|
+
raise("a 'config.json' that describes the model is required in the directory of the model")
|
|
534
|
+
|
|
535
|
+
with open(config_fullpath, "r", encoding="utf-8") as reader:
|
|
536
|
+
text = reader.read()
|
|
537
|
+
transformer_config= json.loads(text)
|
|
538
|
+
architectures = transformer_config["architectures"]
|
|
539
|
+
class_name = architectures[0]
|
|
540
|
+
|
|
541
|
+
module = __import__("transformers")
|
|
542
|
+
transfomer_class = getattr(module, class_name)
|
|
543
|
+
|
|
544
|
+
config = AutoConfig.from_pretrained(config_path)
|
|
545
|
+
|
|
546
|
+
from accelerate import init_empty_weights
|
|
547
|
+
#needed to keep inits of non persistent buffers
|
|
548
|
+
with init_empty_weights():
|
|
549
|
+
model = transfomer_class(config)
|
|
550
|
+
|
|
551
|
+
model = model.base_model
|
|
552
|
+
torch.set_default_device('cpu')
|
|
553
|
+
model.apply(model._initialize_weights)
|
|
554
|
+
|
|
555
|
+
#missing_keys, unexpected_keys =
|
|
556
|
+
offload.load_model_data(model,model_path, strict = True )
|
|
557
|
+
|
|
558
|
+
return model
|
|
559
|
+
# # text_encoder.final_layer_norm = text_encoder.norm
|
|
560
|
+
# model = model.base_model
|
|
561
|
+
# model.final_layer_norm = model.norm
|
|
562
|
+
# self.model = model
|
|
563
|
+
|
|
564
|
+
|
|
565
|
+
|
|
566
|
+
@staticmethod
|
|
567
|
+
def load_model_data(model, file_path: str, device=torch.device('cpu'), strict = True):
|
|
568
|
+
"""
|
|
569
|
+
Load a model, detect if it has been previously quantized using quanto and do the extra setup if necessary
|
|
570
|
+
"""
|
|
571
|
+
from optimum.quanto import requantize
|
|
572
|
+
import safetensors.torch
|
|
573
|
+
|
|
574
|
+
if "quanto" in file_path.lower():
|
|
575
|
+
pos = str.rfind(file_path, ".")
|
|
576
|
+
if pos > 0:
|
|
577
|
+
quantization_map_path = file_path[:pos]
|
|
578
|
+
quantization_map_path += "_map.json"
|
|
579
|
+
|
|
580
|
+
|
|
581
|
+
with open(quantization_map_path, 'r') as f:
|
|
582
|
+
quantization_map = json.load(f)
|
|
583
|
+
|
|
584
|
+
state_dict = safetensors.torch.load_file(file_path)
|
|
585
|
+
|
|
586
|
+
# change dtype of current meta model parameters because 'requantize' won't update the dtype on non quantized parameters
|
|
587
|
+
for k, p in model.named_parameters():
|
|
588
|
+
if not k in quantization_map and k in state_dict:
|
|
589
|
+
p_in_sd = state_dict[k]
|
|
590
|
+
if p.data.dtype != p_in_sd.data.dtype:
|
|
591
|
+
p.data = p.data.to(p_in_sd.data.dtype)
|
|
592
|
+
|
|
593
|
+
requantize(model, state_dict, quantization_map, device)
|
|
594
|
+
|
|
595
|
+
# for k, p in model.named_parameters():
|
|
596
|
+
# if p.data.dtype == torch.float32:
|
|
597
|
+
# pass
|
|
598
|
+
|
|
599
|
+
|
|
600
|
+
# del state_dict
|
|
601
|
+
return
|
|
602
|
+
|
|
603
|
+
else:
|
|
604
|
+
if ".safetensors" in file_path or ".sft" in file_path:
|
|
605
|
+
state_dict = safetensors.torch.load_file(file_path)
|
|
606
|
+
|
|
607
|
+
else:
|
|
608
|
+
|
|
609
|
+
state_dict = torch.load(file_path, weights_only=True)
|
|
610
|
+
if "module" in state_dict:
|
|
611
|
+
state_dict = state_dict["module"]
|
|
612
|
+
|
|
613
|
+
|
|
614
|
+
model.load_state_dict(state_dict, strict = strict, assign = True ) #strict=True,
|
|
615
|
+
|
|
616
|
+
|
|
617
|
+
return
|
|
618
|
+
|
|
619
|
+
@staticmethod
|
|
620
|
+
def save_model(model, file_path, do_quantize = False, quantization_type = qint8 ):
|
|
621
|
+
"""save the weights of a model and quantize them if requested
|
|
622
|
+
These weights can be loaded again using 'load_model_data'
|
|
623
|
+
"""
|
|
624
|
+
import safetensors.torch
|
|
625
|
+
pos = str.rfind(file_path, ".")
|
|
626
|
+
if pos > 0:
|
|
627
|
+
file_path = file_path[:pos]
|
|
628
|
+
|
|
629
|
+
if do_quantize:
|
|
630
|
+
_quantize(model, weights=quantization_type)
|
|
631
|
+
|
|
632
|
+
# # state_dict = {k: v.clone().contiguous() for k, v in model.state_dict().items()}
|
|
633
|
+
# state_dict = {k: v for k, v in model.state_dict().items()}
|
|
634
|
+
|
|
635
|
+
|
|
636
|
+
|
|
637
|
+
safetensors.torch.save_file(model.state_dict(), file_path + '.safetensors')
|
|
638
|
+
|
|
639
|
+
if do_quantize:
|
|
640
|
+
from optimum.quanto import quantization_map
|
|
641
|
+
|
|
642
|
+
with open(file_path + '_map.json', 'w') as f:
|
|
643
|
+
json.dump(quantization_map(model), f)
|
|
644
|
+
|
|
265
645
|
|
|
266
646
|
|
|
267
647
|
@classmethod
|
|
268
|
-
def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False,
|
|
648
|
+
def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verboseLevel = 1, modelsToQuantize = None, budgets= 0, info = None):
|
|
649
|
+
"""Hook to a pipeline or a group of modules in order to reduce their VRAM requirements:
|
|
650
|
+
pipe_or_dict_of_modules : the pipeline object or a dictionary of modules of the model
|
|
651
|
+
quantizeTransformer: set True by default will quantize on the fly the video / image model
|
|
652
|
+
pinInRAM: move models in reserved memor. This allows very fast performance but requires 50% extra RAM (usually >=64 GB)
|
|
653
|
+
modelsToQuantize: a list of models to be also quantized on the fly (e.g the text_encoder), useful to reduce bith RAM and VRAM consumption
|
|
654
|
+
budgets: 0 by default (unlimited). If non 0, it corresponds to the maximum size in MB that every model will occupy at any moment
|
|
655
|
+
(in fact the real usage is twice this number). It is very efficient to reduce VRAM consumption but this feature may be very slow
|
|
656
|
+
if pinInRAM is not enabled
|
|
657
|
+
"""
|
|
658
|
+
|
|
269
659
|
self = cls()
|
|
270
|
-
self.
|
|
660
|
+
self.verboseLevel = verboseLevel
|
|
271
661
|
self.pinned_modules_data = {}
|
|
662
|
+
model_budgets = {}
|
|
272
663
|
|
|
664
|
+
# model_budgets = {"text_encoder_2": 3400 }
|
|
665
|
+
HEADER = '\033[95m'
|
|
666
|
+
ENDC = '\033[0m'
|
|
667
|
+
BOLD ='\033[1m'
|
|
668
|
+
UNBOLD ='\033[0m'
|
|
669
|
+
|
|
670
|
+
print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 2.0) by DeepBeepMeep ************{ENDC}{UNBOLD}")
|
|
671
|
+
if info != None:
|
|
672
|
+
print(info)
|
|
673
|
+
budget = 0
|
|
674
|
+
if not budgets is None:
|
|
675
|
+
if isinstance(budgets , dict):
|
|
676
|
+
model_budgets = budgets
|
|
677
|
+
else:
|
|
678
|
+
budget = int(budgets) * ONE_MB
|
|
679
|
+
|
|
680
|
+
if (budgets!= None or budget >0) :
|
|
681
|
+
self.async_transfers = True
|
|
682
|
+
|
|
683
|
+
pinInRAM = True
|
|
273
684
|
# compile not working yet or slower
|
|
274
|
-
compile = False
|
|
275
|
-
|
|
685
|
+
compile = False # True
|
|
686
|
+
#quantizeTransformer = False
|
|
687
|
+
#self.async_transfers = False
|
|
688
|
+
self.compile = compile
|
|
689
|
+
|
|
276
690
|
pipe = None
|
|
277
|
-
preloadInRAM = True
|
|
278
691
|
torch.set_default_device('cuda')
|
|
279
692
|
if hasattr(pipe_or_dict_of_modules, "components"):
|
|
280
|
-
|
|
693
|
+
# commented as it not very useful and generates warnings
|
|
694
|
+
#pipe_or_dict_of_modules.to("cpu") #XXXX
|
|
281
695
|
# create a fake Accelerate parameter so that lora loading doesn't change the device
|
|
282
696
|
pipe_or_dict_of_modules.hf_device_map = torch.device("cuda")
|
|
283
697
|
pipe = pipe_or_dict_of_modules
|
|
@@ -291,114 +705,178 @@ class offload:
|
|
|
291
705
|
modelsToQuantize = [modelsToQuantize]
|
|
292
706
|
if quantizeTransformer:
|
|
293
707
|
modelsToQuantize.append("transformer")
|
|
708
|
+
|
|
294
709
|
self.models_to_quantize = modelsToQuantize
|
|
710
|
+
models_already_loaded = []
|
|
711
|
+
|
|
712
|
+
modelsToPin = None
|
|
713
|
+
pinAllModels = False
|
|
714
|
+
if isinstance(pinInRAM, bool):
|
|
715
|
+
pinAllModels = pinInRAM
|
|
716
|
+
elif isinstance(pinInRAM, list):
|
|
717
|
+
modelsToPin = pinInRAM
|
|
718
|
+
else:
|
|
719
|
+
modelsToPin = [pinInRAM]
|
|
720
|
+
|
|
295
721
|
# del models["transformer"] # to test everything but the transformer that has a much longer loading
|
|
296
|
-
|
|
722
|
+
sizeofbfloat16 = torch.bfloat16.itemsize
|
|
723
|
+
#
|
|
724
|
+
# models = { 'transformer': pipe_or_dict_of_modules["transformer"]} # to test only the transformer
|
|
725
|
+
|
|
726
|
+
|
|
297
727
|
for model_id in models:
|
|
298
728
|
current_model: torch.nn.Module = models[model_id]
|
|
729
|
+
modelPinned = pinAllModels or (modelsToPin != None and model_id in modelsToPin)
|
|
299
730
|
# make sure that no RAM or GPU memory is not allocated for gradiant / training
|
|
300
|
-
current_model.to("cpu").eval()
|
|
301
|
-
|
|
731
|
+
current_model.to("cpu").eval()
|
|
732
|
+
already_loaded = False
|
|
302
733
|
# Quantize model just before transferring it to the RAM to keep OS cache file
|
|
303
734
|
# open as short as possible. Indeed it seems that as long as the lazy safetensors
|
|
304
735
|
# are not fully fully loaded, the OS won't be able to release the corresponding cache file in RAM.
|
|
305
736
|
if model_id in self.models_to_quantize:
|
|
306
|
-
print(f"Quantization of model '{model_id}' started")
|
|
307
|
-
quantize(current_model, weights=qint8)
|
|
308
|
-
freeze(current_model)
|
|
309
|
-
print(f"Quantization of model '{model_id}' done")
|
|
310
|
-
torch.cuda.empty_cache()
|
|
311
|
-
gc.collect()
|
|
312
737
|
|
|
738
|
+
already_quantized = _quantize(current_model, weights=qint8, verboseLevel = self.verboseLevel, model_id=model_id)
|
|
739
|
+
if not already_quantized:
|
|
740
|
+
already_loaded = True
|
|
741
|
+
models_already_loaded.append(model_id)
|
|
313
742
|
|
|
314
|
-
|
|
315
|
-
if preloadInRAM: #
|
|
316
|
-
# load all the remaining unread lazy safetensors in RAM to free open cache files
|
|
317
|
-
for p in current_model.parameters():
|
|
318
|
-
# Preread every tensor in RAM except tensors that have just been quantified
|
|
319
|
-
# and are no longer needed
|
|
320
|
-
if isinstance(p, QTensor):
|
|
321
|
-
# fix quanto bug (see below) now as he won't have any opportunity to do it during RAM pinning
|
|
322
|
-
if not pinInRAM and p._scale.dtype == torch.float32:
|
|
323
|
-
p._scale = p._scale.to(torch.bfloat16)
|
|
324
743
|
|
|
744
|
+
current_model_size = 0
|
|
745
|
+
# load all the remaining unread lazy safetensors in RAM to free open cache files
|
|
746
|
+
for p in current_model.parameters():
|
|
747
|
+
# Preread every tensor in RAM except tensors that have just been quantified
|
|
748
|
+
# and are no longer needed
|
|
749
|
+
if isinstance(p, QTensor):
|
|
750
|
+
# fix quanto bug (see below) now as he won't have any opportunity to do it during RAM pinning
|
|
751
|
+
if not modelPinned and p._scale.dtype == torch.float32:
|
|
752
|
+
p._scale = p._scale.to(torch.bfloat16)
|
|
753
|
+
current_model_size += torch.numel(p._scale) * sizeofbfloat16
|
|
754
|
+
current_model_size += torch.numel(p._data) * sizeofbfloat16 / 2
|
|
755
|
+
if pinInRAM and not already_loaded:
|
|
756
|
+
# Force flushing the lazy load so that reserved memory can be freed when we are ready to pin
|
|
757
|
+
p._scale = p._scale + 0
|
|
758
|
+
p._data = p._data + 0
|
|
759
|
+
else:
|
|
760
|
+
if p.data.dtype == torch.float32:
|
|
761
|
+
# convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
|
|
762
|
+
p.data = p.data.to(torch.bfloat16)
|
|
325
763
|
else:
|
|
326
|
-
|
|
327
|
-
|
|
328
|
-
|
|
329
|
-
|
|
330
|
-
|
|
331
|
-
|
|
332
|
-
|
|
764
|
+
# force reading the tensors from the disk by pretending to modify them
|
|
765
|
+
p.data = p.data + 0
|
|
766
|
+
|
|
767
|
+
current_model_size += torch.numel(p.data) * p.data.element_size()
|
|
768
|
+
|
|
769
|
+
for b in current_model.buffers():
|
|
770
|
+
if b.data.dtype == torch.float32:
|
|
771
|
+
# convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
|
|
772
|
+
b.data = b.data.to(torch.bfloat16)
|
|
773
|
+
else:
|
|
774
|
+
# force reading the tensors from the disk by pretending to modify them
|
|
775
|
+
b.data = b.data + 0
|
|
776
|
+
|
|
777
|
+
current_model_size += torch.numel(p.data) * p.data.element_size()
|
|
778
|
+
|
|
779
|
+
if model_id not in self.models:
|
|
780
|
+
self.models[model_id] = current_model
|
|
781
|
+
|
|
782
|
+
|
|
783
|
+
model_budget = model_budgets[model_id] * ONE_MB if model_id in model_budgets else budget
|
|
784
|
+
|
|
785
|
+
if model_budget > 0 and model_budget > current_model_size:
|
|
786
|
+
model_budget = 0
|
|
787
|
+
|
|
788
|
+
model_budgets[model_id] = model_budget
|
|
789
|
+
|
|
790
|
+
# Pin in RAM models only once they have been fully loaded otherwise there will be some contention (at least on Linux OS) in the non pageable memory
|
|
791
|
+
# between partially loaded lazy safetensors and pinned tensors
|
|
792
|
+
for model_id in models:
|
|
793
|
+
current_model: torch.nn.Module = models[model_id]
|
|
794
|
+
if not (pinAllModels or modelsToPin != None and model_id in modelsToPin):
|
|
795
|
+
continue
|
|
796
|
+
if verboseLevel>=1:
|
|
797
|
+
print(f"Pinning tensors of '{model_id}' in RAM")
|
|
798
|
+
gc.collect()
|
|
799
|
+
pinned_parameters_data = {}
|
|
800
|
+
for p in current_model.parameters():
|
|
801
|
+
if isinstance(p, QTensor):
|
|
802
|
+
# pin in memory both quantized data and scales of quantized parameters
|
|
803
|
+
# but don't pin .data as it corresponds to the original tensor that we don't want to reload
|
|
804
|
+
p._data = p._data.pin_memory()
|
|
805
|
+
# fix quanto bug (that seems to have been fixed since&) that allows _scale to be float32 if the original weight was float32
|
|
806
|
+
# (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
|
|
807
|
+
p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
|
|
808
|
+
pinned_parameters_data[p]=[p._data, p._scale]
|
|
809
|
+
else:
|
|
810
|
+
p.data = p.data.pin_memory()
|
|
811
|
+
pinned_parameters_data[p]=p.data
|
|
812
|
+
for b in current_model.buffers():
|
|
813
|
+
b.data = b.data.pin_memory()
|
|
814
|
+
|
|
815
|
+
pinned_buffers_data = {b: b.data for b in current_model.buffers()}
|
|
816
|
+
pinned_parameters_data.update(pinned_buffers_data)
|
|
817
|
+
self.pinned_modules_data[model_id]=pinned_parameters_data
|
|
333
818
|
|
|
334
|
-
addModelFlag = False
|
|
335
819
|
|
|
336
|
-
|
|
820
|
+
# Hook forward methods of modules
|
|
821
|
+
for model_id in models:
|
|
822
|
+
current_model: torch.nn.Module = models[model_id]
|
|
823
|
+
current_budget = model_budgets[model_id]
|
|
824
|
+
current_size = 0
|
|
825
|
+
cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
|
|
826
|
+
self.loaded_blocks[model_id] = None
|
|
827
|
+
|
|
337
828
|
for submodule_name, submodule in current_model.named_modules():
|
|
829
|
+
# create a fake accelerate parameter so that the _execution_device property returns always "cuda"
|
|
830
|
+
# (it is queried in many pipelines even if offloading is not properly implemented)
|
|
831
|
+
if not hasattr(submodule, "_hf_hook"):
|
|
832
|
+
setattr(submodule, "_hf_hook", HfHook())
|
|
833
|
+
|
|
834
|
+
if submodule_name=='':
|
|
835
|
+
continue
|
|
836
|
+
|
|
837
|
+
if current_budget > 0:
|
|
838
|
+
if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
|
|
839
|
+
if cur_blocks_prefix == None:
|
|
840
|
+
cur_blocks_prefix = submodule_name + "."
|
|
841
|
+
else:
|
|
842
|
+
#if cur_blocks_prefix != submodule_name[:len(cur_blocks_prefix)]:
|
|
843
|
+
if not submodule_name.startswith(cur_blocks_prefix):
|
|
844
|
+
cur_blocks_prefix = submodule_name + "."
|
|
845
|
+
cur_blocks_name,cur_blocks_seq = None, -1
|
|
846
|
+
else:
|
|
847
|
+
|
|
848
|
+
if cur_blocks_prefix is not None:
|
|
849
|
+
#if cur_blocks_prefix == submodule_name[0:len(cur_blocks_prefix)]:
|
|
850
|
+
if submodule_name.startswith(cur_blocks_prefix):
|
|
851
|
+
num = int(submodule_name[len(cur_blocks_prefix):].split(".")[0])
|
|
852
|
+
if num != cur_blocks_seq and (cur_blocks_name == None or current_size > current_budget):
|
|
853
|
+
prev_blocks_name = cur_blocks_name
|
|
854
|
+
cur_blocks_name = cur_blocks_prefix + str(num)
|
|
855
|
+
# print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
|
|
856
|
+
cur_blocks_seq = num
|
|
857
|
+
else:
|
|
858
|
+
cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
|
|
859
|
+
|
|
338
860
|
if hasattr(submodule, "forward"):
|
|
339
861
|
submodule_method = getattr(submodule, "forward")
|
|
340
862
|
if callable(submodule_method):
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
# hook only the first two levels of modules with the full suite of processing
|
|
863
|
+
if len(submodule_name.split("."))==1:
|
|
864
|
+
# hook only the first level of modules with the full suite of processing
|
|
344
865
|
self.hook_me(submodule, current_model, model_id, submodule_name, submodule_method)
|
|
345
|
-
else:
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
|
|
349
|
-
|
|
350
|
-
new_candidate = submodule_name[0:pos+3]
|
|
351
|
-
if len(new_candidate.split("."))<=4:
|
|
352
|
-
current_block_sequence = new_candidate
|
|
353
|
-
# force a memory check when initiating a new sequence of blocks as the shapes of tensor will certainly change
|
|
354
|
-
# and memory reusability is less likely
|
|
355
|
-
# we limit this check to the first level of blocks as quering the cuda cache is time consuming
|
|
356
|
-
forceMemoryCheck = True
|
|
357
|
-
else:
|
|
358
|
-
if current_block_sequence != submodule_name[0:len(current_block_sequence)]:
|
|
359
|
-
current_block_sequence = None
|
|
360
|
-
self.hook_me_light(submodule, forceMemoryCheck, submodule_method)
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
if addModelFlag:
|
|
364
|
-
if model_id not in self.models:
|
|
365
|
-
self.models[model_id] = current_model
|
|
366
|
-
|
|
367
|
-
# Pin in RAM models only once they have been fully loaded otherwise there may be some contention in the non pageable memory
|
|
368
|
-
# between partially loaded lazy safetensors and pinned tensors
|
|
369
|
-
if pinInRAM:
|
|
370
|
-
if verbose:
|
|
371
|
-
print("Pinning model tensors in RAM")
|
|
372
|
-
torch.cuda.empty_cache()
|
|
373
|
-
gc.collect()
|
|
374
|
-
for model_id in models:
|
|
375
|
-
pinned_parameters_data = {}
|
|
376
|
-
current_model: torch.nn.Module = models[model_id]
|
|
377
|
-
for p in current_model.parameters():
|
|
378
|
-
if isinstance(p, QTensor):
|
|
379
|
-
# pin in memory both quantized data and scales of quantized parameters
|
|
380
|
-
# but don't pin .data as it corresponds to the original tensor that we don't want to reload
|
|
381
|
-
p._data = p._data.pin_memory()
|
|
382
|
-
# fix quanto bug that allows _scale to be float32 if the original weight was float32
|
|
383
|
-
# (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
|
|
384
|
-
p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
|
|
385
|
-
pinned_parameters_data[p]=[p._data, p._scale]
|
|
386
|
-
else:
|
|
387
|
-
p.data = p.data.pin_memory()
|
|
388
|
-
pinned_parameters_data[p]=p.data
|
|
389
|
-
for b in current_model.buffers():
|
|
390
|
-
b.data = b.data.pin_memory()
|
|
866
|
+
else:
|
|
867
|
+
# force a memory check when initiating a new sequence of blocks as the shapes of tensor will certainly change
|
|
868
|
+
# and memory reusability is less likely
|
|
869
|
+
# we limit this check to the first level of blocks as quering the cuda cache is time consuming
|
|
870
|
+
self.hook_me_light(submodule, model_id, cur_blocks_name, submodule_method, context = submodule_name)
|
|
391
871
|
|
|
392
|
-
|
|
393
|
-
|
|
394
|
-
self.pinned_modules_data[model_id]=pinned_parameters_data
|
|
872
|
+
if compile and cur_blocks_name != None and model_id == "transformer" and "_blocks" in submodule_name:
|
|
873
|
+
submodule.compile(mode="reduce-overhead" ) #mode= "max-autotune"
|
|
395
874
|
|
|
396
|
-
|
|
397
|
-
self.params_of_modules[model_id] = module_params
|
|
398
|
-
self.collect_module_parameters(current_model,module_params)
|
|
875
|
+
current_size = self.add_module_to_blocks(model_id, cur_blocks_name, submodule, prev_blocks_name)
|
|
399
876
|
|
|
400
|
-
|
|
401
|
-
|
|
877
|
+
|
|
878
|
+
if compile and False:
|
|
879
|
+
if verboseLevel>=1:
|
|
402
880
|
print("Torch compilation started")
|
|
403
881
|
torch._dynamo.config.cache_size_limit = 10000
|
|
404
882
|
# if pipe != None and hasattr(pipe, "__call__"):
|
|
@@ -409,13 +887,65 @@ class offload:
         current_model.compile(mode= "max-autotune")
         #models["transformer"].compile()

-        if
+        if verboseLevel>=1:
            print("Torch compilation done")

+        if verboseLevel >=2:
+            for n,b in self.blocks_of_modules_sizes.items():
+                print(f"Size of submodel '{n}': {b/ONE_MB:.1f} MB")
+
         torch.cuda.empty_cache()
         gc.collect()

-
         return self

-
+
+
+    @staticmethod
+    def profile(pipe_or_dict_of_modules,profile_no: profile_type, quantizeTransformer = True):
+        """Apply a configuration profile that depends on your hardware:
+        pipe_or_dict_of_modules : the pipeline object or a dictionary of modules of the model
+        profile_name : num of the profile:
+            HighRAM_HighVRAM_Fastest (=1): at least 48 GB of RAM and 24 GB of VRAM : the fastest well suited for a RTX 3090 / RTX 4090
+            HighRAM_LowVRAM_Fast (=2): at least 48 GB of RAM and 12 GB of VRAM : a bit slower, better suited for RTX 3070/3080/4070/4080
+                or for RTX 3090 / RTX 4090 with large pictures batches or long videos
+            LowRAM_HighVRAM_Medium (=3): at least 32 GB of RAM and 24 GB of VRAM : so so speed but adapted for RTX 3090 / RTX 4090 with limited RAM
+            LowRAM_LowVRAM_Slow (=4): at least 32 GB of RAM and 12 GB of VRAM : if have little VRAM or generate longer videos
+            VerylowRAM_LowVRAM_Slowest (=5): at least 24 GB of RAM and 10 GB of VRAM : if you don't have much it won't be fast but maybe it will work
+        quantizeTransformer: bool = True, the main model is quantized by default for all the profiles, you may want to disable that to get the best image quality
+        """
+
+
+        modules = pipe_or_dict_of_modules
+        if hasattr(modules, "components"):
+            modules= modules.components
+        any_T5 = False
+        if "text_encoder_2" in modules:
+            text_encoder_2 = modules["text_encoder_2"]
+            any_T5 = "t5" in text_encoder_2.__module__.lower()
+        extra_mod_to_quantize = ("text_encoder_2" if any_T5 else "text_encoder")
+
+        # transformer (video or image generator) should be as small as possible to not occupy space that could be used by actual image data
+        # on the other hand the text encoder should be quite large (as long as it fits in 10 GB of VRAM) to reduce sequence offloading
+
+        budgets = { "transformer" : 600 , "text_encoder": 3000, "text_encoder_2": 3000 }
+
+        if profile_no == profile_type.HighRAM_HighVRAM_Fastest:
+            info = "You have chosen a Very Fast profile that requires at least 48 GB of RAM and 24 GB of VRAM."
+            return offload.all(pipe_or_dict_of_modules, pinInRAM= True, info = info, quantizeTransformer= quantizeTransformer)
+        elif profile_no == profile_type.HighRAM_LowVRAM_Fast:
+            info = "You have chosen a Fast profile that requires at least 48 GB of RAM and 12 GB of VRAM."
+            return offload.all(pipe_or_dict_of_modules, pinInRAM= True, budgets=budgets, info = info, quantizeTransformer= quantizeTransformer )
+        elif profile_no == profile_type.LowRAM_HighVRAM_Medium:
+            info = "You have chosen a Medium speed profile that requires at least 32 GB of RAM and 24 GB of VRAM."
+            return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , info = info, quantizeTransformer= quantizeTransformer)
+        elif profile_no == profile_type.LowRAM_LowVRAM_Slow:
+            info = "You have chosen the Slow profile that requires at least 32 GB of RAM and 12 GB of VRAM."
+            return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , budgets=budgets, info = info, quantizeTransformer= quantizeTransformer)
+        elif profile_no == profile_type.VerylowRAM_LowVRAM_Slowest:
+            budgets["transformer"] = 400
+            info = "You have chosen the Slowest profile that requires at least 24 GB of RAM and 10 GB of VRAM."
+            return offload.all(pipe_or_dict_of_modules, pinInRAM= False, modelsToQuantize= extra_mod_to_quantize , budgets=budgets, info = info, quantizeTransformer= quantizeTransformer)
+        else:
+            raise("Unknown profile")
+
mmgp-1.2.0.dist-info/METADATA DELETED
@@ -1,109 +0,0 @@
-Metadata-Version: 2.1
-Name: mmgp
-Version: 1.2.0
-Summary: Memory Management for the GPU Poor
-Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
-License: GNU GENERAL PUBLIC LICENSE
-        Version 3, 29 June 2007
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown
-License-File: LICENSE.md
-Requires-Dist: torch>=2.1.0
-Requires-Dist: optimum-quanto
-
-
-<p align="center">
-    <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
-</p>
-
-
-This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
-This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
-times in a pipe (eg VAE).
-
-Requirements:
-- GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
-- RAM: minimum 48 GB, recommended 64 GB
-
-## Usage
-First you need to install the module in your current project with:
-```shell
-pip install mmgp
-```
-
-It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
-1) First make sure that the pipeline explictly loads the models in the CPU device, for instance:
-```
-pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
-```
-
-2) Once every potential Lora has been loaded and merged, add the following lines:
-
-```
-from mmgp import offload
-offload.all(pipe)
-```
-
-## Options
-The 'transformer' model in the pipe contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
-
-You can specify a list of additional models string ids to quantize (for instance the text_encoder) using the optional argument *modelsToQuantize* for instance *modelsToQuantize = ["text_encoder_2"]*.This may be useful if you have less than 48 GB of RAM.
-
-Note that there is little advantage on the GPU / VRAM side to quantize text encoders as their inputs are usually quite light.
-
-Conversely if you have more than 64GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
-(this can save significant time if the same pipeline is run multiple times in a row)
-
-In Summary, if you have:
-- Between 32 GB and 48 GB of RAM
-```
-offload.all(pipe, modelsToQuantize = ["text_encoder_2"]) # for Flux models
-#OR
-offload.all(pipe, modelsToQuantize = ["text_encoder"]) # for HunyuanVideo models
-
-```
-
-- Between 48 GB and 64 GB of RAM
-```
-offload.all(pipe)
-```
-- More than 64 GB of RAM
-```
-offload.all(pipe), pinInRAM = True
-```
-
-## Special
-Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
-For instance :
-
-
-- for flux derived models:
-```
-pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
-```
-- for mochi:
-```
-pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
-```
-
-
-Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
-
-Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
-where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
-I suggest you use instead one of the 16 bits encoder only version available around, for instance:
-```
-text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
-```
-
-Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
-- For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
-- mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
-
-You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
-
-Thanks to
----------
-- Huggingface / accelerate for the hooking examples
-- Huggingface / quanto for their very useful quantizer
-- gau-nernst for his Pinnig RAM samples
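
The "Special" section of the removed 1.2.0 README above covers the case where there is no pipeline object and each submodel is loaded separately. The sketch below combines that dictionary layout with the README's encoder-only T5 tip for a Flux-derived model; the `CLIPTextModel`, `FluxTransformer2DModel` and `AutoencoderKL` loading calls are assumptions added here for illustration and are not taken from the package.

```
import torch
from transformers import CLIPTextModel, T5EncoderModel
from diffusers import AutoencoderKL, FluxTransformer2DModel
from mmgp import offload

repo = "black-forest-labs/FLUX.1-dev"

# Load each submodel separately; use a 16-bit, encoder-only T5 as the README recommends.
clip = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder", torch_dtype=torch.float16)
t5 = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_2", torch_dtype=torch.float16)
model = FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer", torch_dtype=torch.bfloat16)
ae = AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=torch.bfloat16)

# One entry must be named "transformer"; it is quantized on the fly unless disabled.
pipe = {"text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae}

# Add modelsToQuantize=["text_encoder_2"] or pinInRAM=True depending on available RAM,
# as the Options section above describes.
offload.all(pipe)
```

Under 2.0 the same dictionary can be handed to `offload.profile()` instead, since the new code accepts either a pipeline or a dictionary of modules.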
mmgp-1.2.0.dist-info/RECORD DELETED
@@ -1,7 +0,0 @@
-__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-mmgp.py,sha256=IijgE22bUPl98VvXwoC1qngmkdWU11YXjkiksp8o1hY,21418
-mmgp-1.2.0.dist-info/LICENSE.md,sha256=HjzvY2grdtdduZclbZ46B2M-XpT4MDCxFub5ZwTWq2g,93
-mmgp-1.2.0.dist-info/METADATA,sha256=jRXi-iNZ_3zNNVxMC1qmVDd7ylq8kAr5Y5FgYyBvVh4,4897
-mmgp-1.2.0.dist-info/WHEEL,sha256=PZUExdf71Ui_so67QXpySuHtCi3-J3wvF4ORK6k_S8U,91
-mmgp-1.2.0.dist-info/top_level.txt,sha256=waGaepj2qVfnS2yAOkaMu4r9mJaVjGbEi6AwOUogU_U,14
-mmgp-1.2.0.dist-info/RECORD,,
File without changes

File without changes

File without changes