mmgp 1.2.0__py3-none-any.whl → 2.0.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of mmgp might be problematic.

mmgp-2.0.0.dist-info/METADATA ADDED
@@ -0,0 +1,137 @@
+ Metadata-Version: 2.1
+ Name: mmgp
+ Version: 2.0.0
+ Summary: Memory Management for the GPU Poor
+ Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
+ License: GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE.md
+ Requires-Dist: torch>=2.1.0
+ Requires-Dist: optimum-quanto
+ Requires-Dist: accelerate
+
+
+ <p align="center">
+ <H2>Memory Management 2.0 for the GPU Poor by DeepBeepMeep</H2>
+ </p>
+
+
+ This module contains multiple optimisations so that models such as Flux (and derivatives), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU card limited to 12 to 24 GB of VRAM.
+ It is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several
+ times in a pipe (e.g. a VAE).
+
+ Requirements:
+ - VRAM: minimum 12 GB, recommended 24 GB (RTX 3090 / RTX 4090)
+ - RAM: minimum 24 GB, recommended 48 GB
+
+ This module features 5 profiles so that a model can run at a decent speed on a low-end consumer config (32 GB of RAM and 12 GB of VRAM) and at a very good speed on a high-end consumer config (48 GB of RAM and 24 GB of VRAM).
+
+ Each profile may use the following:
+ - Smart preloading of models in RAM to reduce RAM requirements
+ - Smart automated loading / unloading of models on the GPU to avoid unloading models that may be needed again soon
+ - Smart slicing of models to reduce the VRAM they occupy
+ - Ability to pin models in reserved RAM to accelerate transfers to VRAM
+ - Async transfers to VRAM to avoid a pause when loading a new slice of a model (see the short sketch below)
+ - Automated on-the-fly quantization, or the ability to load prequantized models
+
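+ For background, the pinning and async-transfer items above rely on standard PyTorch mechanics; a minimal, illustrative sketch (not mmgp's actual code) looks like this:
+ ```
+ import torch
+
+ # a CPU tensor pinned in non-pageable RAM can be copied to VRAM asynchronously
+ weights = torch.empty(1024, 1024, dtype=torch.bfloat16).pin_memory()
+ copy_stream = torch.cuda.Stream()
+ with torch.cuda.stream(copy_stream):
+     weights_gpu = weights.to("cuda", non_blocking=True)
+ # make sure the copy is finished before the default stream uses the tensor
+ torch.cuda.current_stream().wait_stream(copy_stream)
+ ```
+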
+ ## Installation
+ First you need to install the module in your current project with:
+ ```shell
+ pip install mmgp
+ ```
+
+
+ ## Usage
+
+ It is almost plug and play and just needs to be invoked from the main app right after the model pipeline has been created.
+ 1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
+ ```
+ pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
+ ```
+
+ 2) Once every potential Lora has been loaded and merged, add the following lines for a quick setup:
+ ```
+ from mmgp import offload, profile_type
+ offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
+ ```
+
+ You can choose between 5 profiles depending on your hardware:
+ - HighRAM_HighVRAM_Fastest: at least 48 GB of RAM and 24 GB of VRAM: the fastest, well suited for an RTX 3090 / RTX 4090
+ - HighRAM_LowVRAM_Fast (recommended): at least 48 GB of RAM and 12 GB of VRAM: a bit slower, better suited for an RTX 3070/3080/4070/4080
+ or for an RTX 3090 / RTX 4090 with large picture batches or long videos
+ - LowRAM_HighVRAM_Medium: at least 32 GB of RAM and 24 GB of VRAM: so-so speed but suited for an RTX 3090 / RTX 4090 with limited RAM
+ - LowRAM_LowVRAM_Slow: at least 32 GB of RAM and 12 GB of VRAM: if you have little VRAM or generate longer videos
+ - VerylowRAM_LowVRAM_Slowest: at least 24 GB of RAM and 10 GB of VRAM: if you don't have much of either, it won't be fast but maybe it will work
+
+ By default, the 'transformer' will be quantized to 8 bits for all profiles. If you don't want that, you may pass the optional parameter *quantizeTransformer = False*.
+
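+ For instance, to keep the image generator unquantized (same call as above, just with the extra flag):
+ ```
+ from mmgp import offload, profile_type
+ offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast, quantizeTransformer = False)
+ ```
+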
+ ## Alternatively you may create your own profile with specific parameters
+
+ For example:
+ ```
+ from mmgp import offload
+ offload.all(pipe, pinInRAM=True, modelsToQuantize = ["text_encoder_2"] )
+ ```
+ - pinInRAM: boolean (applies to all models) or list of model ids to pin in RAM. Every model pinned in RAM will load much faster (about 4x) but this requires more RAM.
+ - modelsToQuantize: list of model ids to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
+ - quantizeTransformer: boolean, True by default. The 'transformer' model in the pipe usually contains the video or image generator and is quantized on the fly to 8 bits by default. If you want to save disk space and reduce loading time, you may instead load a prequantized model directly; if you don't want to quantize the image generator at all, set *quantizeTransformer* to *False* to turn off on-the-fly quantization.
+ - budgets: either a number in megabytes (applied to all models; 0 means unlimited) or a dictionary that maps model ids to megabytes. It defines the VRAM budget allocated to each model (in practice the real usage is about 2.5 times this number). The smaller this number, the more VRAM is left for image data / longer videos, but the slower the generation, because there will be a lot of loading / unloading between RAM and VRAM. Turning on pinInRAM greatly accelerates (about 4x) runs with small budgets but usually consumes about 50% more RAM. See the combined sketch below.
+
+
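+ A minimal sketch combining these options (the model ids and megabyte figures are only examples, borrowed from the built-in profiles; adapt them to your own pipe):
+ ```
+ from mmgp import offload
+ offload.all(pipe,
+             pinInRAM = ["transformer"],                       # pin only the main model in reserved RAM
+             modelsToQuantize = ["text_encoder_2"],            # also quantize the T5 encoder on the fly
+             budgets = { "transformer": 600, "text_encoder_2": 3000 } )   # per-model VRAM budgets in MB
+ ```
+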
+ ## Going further
+
+ The module includes several tools to package a light version of your favorite video / image generator:
+ - *save_model(model, file_path, do_quantize = False, quantization_type = qint8 )*\
+ Save the tensors of a model already loaded in memory in safetensors format (much faster to reload). You can save it in a quantized format (the default qint8 quantization is recommended).
+ If the model is saved in a quantized format, an extra file ending with '_map.json' will be created; it is needed to reload the model later.
+
+ - *load_model_data(model, file_path: str)*\
+ Load into RAM the tensor data of a model that was initialized without data. Detects and handles quantized models saved previously with save_model.
+
+ - *fast_load_transformers_model(model_path: str)*\
+ Initialize (build the model hierarchy in memory) and fast load the corresponding tensors of a 'transformers' library model.
+ The advantage over the original *from_pretrained* function is that the full model can fit into a single file with a filename of your choosing (therefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized models are processed transparently.
+ Please note that you need to keep the original transformers 'config.json' file in the same directory.
+
+
+ The typical workflow will be:
+ 1) temporarily insert a *save_model* call just after a model has been fully loaded, to save a copy of the (optionally quantized) model;
+ 2) replace the full initializing / loading logic with *fast_load_transformers_model* (if there is a 'from_pretrained' call to a transformers object), or replace only the tensor loading calls (such as *torch.load* / *safetensors.torch.load_file* followed by *load_state_dict*) with *load_model_data* after the initializing logic, as in the sketch below.
+
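+ A minimal sketch of that workflow (file names and the `model` variable are placeholders; note that *load_model_data* recognizes a quantized checkpoint by the presence of 'quanto' in the file name):
+ ```
+ from mmgp import offload
+ from optimum.quanto import qint8
+
+ # run once: save a quantized copy of a model that is already fully loaded
+ offload.save_model(model, "transformer_quanto_int8.safetensors", do_quantize = True, quantization_type = qint8)
+
+ # afterwards: reload the light copy into a freshly initialized (empty) model
+ offload.load_model_data(model, "transformer_quanto_int8.safetensors")
+
+ # or, for a 'transformers' library model (keep its config.json next to the file)
+ text_encoder = offload.fast_load_transformers_model("ckpts/t5_quanto_int8.safetensors")
+ ```
+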
+ ## Special cases
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
+ For instance:
+
+
+ - for Flux derived models:
+ ```
+ pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
+ ```
+ - for Mochi:
+ ```
+ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
+ ```
+
+
+ Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
+
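+ The resulting dictionary is then passed to mmgp exactly like a pipeline object, for example (the profile chosen here is arbitrary):
+ ```
+ from mmgp import offload, profile_type
+ offload.profile(pipe, profile_type.LowRAM_LowVRAM_Slow)
+ ```
+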
+ Be careful: lots of models use T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository,
+ where there is a huge 40 GB model to download and load. It is cumbersome, as it is a 32-bit model and contains the decoder part of T5, which is not used.
+ I suggest you use instead one of the 16-bit, encoder-only versions available around, for instance:
+ ```
+ text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
+ ```
+
+ Sometimes just providing the pipe won't be sufficient, as you will need to change the content of the core model:
+ - For instance, you may need to disable an existing CPU offload logic (such as manual calls that move tensors between cuda and the cpu)
+ - mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled and will create tensors on the cpu device, which may cause some issues.
+
+ You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on twitter @deepbeepmeep
+
+ Thanks to
+ ---------
+ - Huggingface / accelerate for the hooking examples
+ - Huggingface / quanto for their very useful quantizer
+ - gau-nernst for his RAM pinning samples
mmgp-2.0.0.dist-info/RECORD ADDED
@@ -0,0 +1,7 @@
+ __init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ mmgp.py,sha256=iJsy5WKd-X1lee37YWdFm8NrHYXa325_jAunzu7zdYM,45231
+ mmgp-2.0.0.dist-info/LICENSE.md,sha256=HjzvY2grdtdduZclbZ46B2M-XpT4MDCxFub5ZwTWq2g,93
+ mmgp-2.0.0.dist-info/METADATA,sha256=u2SiQXefqXAwyXkpJFwben-9n9l9z80dsbRXJpYnqMM,8609
+ mmgp-2.0.0.dist-info/WHEEL,sha256=PZUExdf71Ui_so67QXpySuHtCi3-J3wvF4ORK6k_S8U,91
+ mmgp-2.0.0.dist-info/top_level.txt,sha256=waGaepj2qVfnS2yAOkaMu4r9mJaVjGbEi6AwOUogU_U,14
+ mmgp-2.0.0.dist-info/RECORD,,
mmgp.py CHANGED
@@ -1,24 +1,28 @@
1
- # ------------------ Memory Management for the GPU Poor by DeepBeepMeep (mmgp)------------------
1
+ # ------------------ Memory Management 2.0 for the GPU Poor by DeepBeepMeep (mmgp)------------------
2
2
  #
3
3
  # This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
4
4
  # This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
5
5
  # times in a pipe (eg VAE).
6
6
  #
7
7
  # Requirements:
8
- # - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
9
- # - RAM: minimum 48 GB, recommended 64 GB
8
+ # - VRAM: minimum 12 GB, recommended 24 GB (RTX 3090/ RTX 4090)
9
+ # - RAM: minimum 24 GB, recommended 48 - 64 GB
10
10
  #
11
11
  # It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
12
12
  # 1) First make sure that the pipeline explictly loads the models in the CPU device
13
13
  # for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
14
14
  # 2) Once every potential Lora has been loaded and merged, add the following lines:
15
+ # For a quick setup, you may want to choose between 4 profiles depending on your hardware, for instance:
16
+ # from mmgp import offload, profile_type
17
+ # offload.profile(pipe, profile_type.HighRAM_LowVRAM_Fast)
18
+ # Alternatively you may want to use your own parameters, for instance:
15
19
  # from mmgp import offload
16
- # offload.all(pipe)
20
+ # offload.all(pipe, pinInRAM=True, modelsToQuantize = ["text_encoder_2"] )
17
21
  # The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits so that it can fit into 24 GB of VRAM.
18
22
  # If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option quantizeTransformer to False to turn off on the fly quantization.
19
23
  # You can specify a list of additional models string ids to quantize (for instance the text_encoder) using the optional argument modelsToQuantize. This may be useful if you have less than 48 GB of RAM.
20
24
  # Note that there is little advantage on the GPU / VRAM side to quantize text encoders as their inputs are usually quite light.
21
- # Conversely if you have more than 64GB RAM you may want to enable RAM pinning with the option pinInRAM = True. You will get in return super fast loading / unloading of models
25
+ # Conversely if you have more than 48GB RAM you may want to enable RAM pinning with the option pinInRAM = True. You will get in return super fast loading / unloading of models
22
26
  # (this can save significant time if the same pipeline is run multiple times in a row)
23
27
  #
24
28
  # Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
@@ -53,10 +57,15 @@ import torch
53
57
  import gc
54
58
  import time
55
59
  import functools
60
+ import sys
61
+ import json
62
+
56
63
  from optimum.quanto import freeze, qfloat8, qint8, quantize, QModuleMixin, QTensor
57
64
 
58
65
 
59
66
 
67
+ ONE_MB = 1048576
68
+
60
69
  cotenants_map = {
61
70
  "text_encoder": ["vae", "text_encoder_2"],
62
71
  "text_encoder_2": ["vae", "text_encoder"],
@@ -79,10 +88,107 @@ def move_tensors(obj, device):
79
88
  else:
80
89
  raise TypeError("Tensor or list / dict of tensors expected")
81
90
 
91
+ def _quantize(model_to_quantize, weights=qint8, verboseLevel = 1, threshold = 1000000000, model_id = None):
92
+
93
+ sizeofbfloat16 = torch.bfloat16.itemsize
94
+
95
+ def compute_submodule_size(submodule):
96
+ size = 0
97
+ for p in submodule.parameters(recurse=False):
98
+ size += torch.numel(p.data) * sizeofbfloat16
99
+
100
+ for p in submodule.buffers(recurse=False):
101
+ size += torch.numel(p.data) * sizeofbfloat16
102
+
103
+ return size
104
+
105
+ total_size =0
106
+ total_excluded = 0
107
+ exclude_list = []
108
+ submodule_size = 0
109
+ submodule_names = []
110
+ cur_blocks_prefix = None
111
+ prev_blocks_prefix = None
112
+
113
+ print(f"Quantization of model '{model_id}' started")
114
+
115
+ for submodule_name, submodule in model_to_quantize.named_modules():
116
+ if isinstance(submodule, QModuleMixin):
117
+ if verboseLevel>=1:
118
+ print("No quantization to do as model is already quantized")
119
+ return False
120
+
121
+
122
+ if submodule_name=='':
123
+ continue
124
+
125
+
126
+ flush = False
127
+ if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
128
+ if cur_blocks_prefix == None:
129
+ cur_blocks_prefix = submodule_name + "."
130
+ flush = True
131
+ else:
132
+ #if cur_blocks_prefix != submodule_name[:len(cur_blocks_prefix)]:
133
+ if not submodule_name.startswith(cur_blocks_prefix):
134
+ cur_blocks_prefix = submodule_name + "."
135
+ flush = True
136
+ else:
137
+ if cur_blocks_prefix is not None:
138
+ #if not cur_blocks_prefix == submodule_name[0:len(cur_blocks_prefix)]:
139
+ if not submodule_name.startswith(cur_blocks_prefix):
140
+ cur_blocks_prefix = None
141
+ flush = True
142
+
143
+ if flush:
144
+ if submodule_size <= threshold:
145
+ exclude_list += submodule_names
146
+ if verboseLevel >=2:
147
+ print(f"Excluded size {submodule_size/ONE_MB:.1f} MB: {prev_blocks_prefix} : {submodule_names}")
148
+ total_excluded += submodule_size
149
+
150
+ submodule_size = 0
151
+ submodule_names = []
152
+ prev_blocks_prefix = cur_blocks_prefix
153
+ size = compute_submodule_size(submodule)
154
+ submodule_size += size
155
+ total_size += size
156
+ submodule_names.append(submodule_name)
157
+
158
+ if submodule_size > 0 and submodule_size <= threshold:
159
+ exclude_list += submodule_names
160
+ if verboseLevel >=2:
161
+ print(f"Excluded size {submodule_size/ONE_MB:.1f} MB: {prev_blocks_prefix} : {submodule_names}")
162
+ total_excluded += submodule_size
163
+
164
+ perc_excluded =total_excluded/ total_size if total_size >0 else 1
165
+ if verboseLevel >=2:
166
+ print(f"Total Excluded {total_excluded/ONE_MB:.1f} MB oF {total_size/ONE_MB:.1f} that is {perc_excluded*100:.2f}%")
167
+ if perc_excluded >= 0.10:
168
+ print(f"Too many many modules are excluded, there is something wrong with the selection, switch back to full quantization.")
169
+ exclude_list = None
170
+
171
+ # we are obviously loading a model that has been already quantized
172
+
173
+ quantize(model_to_quantize,weights, exclude= exclude_list)
174
+ freeze(model_to_quantize)
175
+ torch.cuda.empty_cache()
176
+ gc.collect()
177
+ print(f"Quantization of model '{model_id}' done")
178
+
179
+ return True
82
180
 
83
181
  def get_model_name(model):
84
182
  return model.name
85
183
 
184
+ import enum
185
+ class profile_type(int, enum.Enum):
186
+ HighRAM_HighVRAM_Fastest = 1
187
+ HighRAM_LowVRAM_Fast = 2
188
+ LowRAM_HighVRAM_Medium = 3
189
+ LowRAM_LowVRAM_Slow = 4
190
+ VerylowRAM_LowVRAM_Slowest = 5
191
+
86
192
  class HfHook:
87
193
  def __init__(self):
88
194
  self.execution_device = "cuda"
@@ -94,28 +200,57 @@ class offload:
94
200
  def __init__(self):
95
201
  self.active_models = []
96
202
  self.active_models_ids = []
203
+ self.active_subcaches = {}
97
204
  self.models = {}
98
- self.verbose = False
205
+ self.verboseLevel = 0
99
206
  self.models_to_quantize = []
100
207
  self.pinned_modules_data = {}
101
- self.params_of_modules = {}
102
- self.pinTensors = False
208
+ self.blocks_of_modules = {}
209
+ self.blocks_of_modules_sizes = {}
210
+ self.compile = False
103
211
  self.device_mem_capacity = torch.cuda.get_device_properties(0).total_memory
104
212
  self.last_reserved_mem_check =0
213
+ self.loaded_blocks = {}
214
+ self.prev_blocks_names = {}
215
+ self.next_blocks_names = {}
216
+ self.default_stream = torch.cuda.default_stream(torch.device("cuda")) # torch.cuda.current_stream()
217
+ self.transfer_stream = torch.cuda.Stream()
218
+ self.async_transfers = False
219
+
105
220
 
106
- def collect_module_parameters(self, module: torch.nn.Module, module_params):
107
- if isinstance(module, (torch.nn.ModuleList, torch.nn.Sequential)):
108
- for i in range(len(module)):
109
- current_layer = module[i]
110
- module_params.extend(current_layer.parameters())
111
- module_params.extend(current_layer.buffers())
221
+ def add_module_to_blocks(self, model_id, blocks_name, submodule, prev_block_name):
222
+
223
+ entry_name = model_id if blocks_name is None else model_id + "/" + blocks_name
224
+ if entry_name in self.blocks_of_modules:
225
+ blocks_params = self.blocks_of_modules[entry_name]
226
+ blocks_params_size = self.blocks_of_modules_sizes[entry_name]
112
227
  else:
113
- for p in module.parameters(recurse=False):
114
- module_params.append(p)
115
- for p in module.buffers(recurse=False):
116
- module_params.append(p)
117
- for sub_module in module.children():
118
- self.collect_module_parameters(sub_module, module_params)
228
+ blocks_params = []
229
+ self.blocks_of_modules[entry_name] = blocks_params
230
+ blocks_params_size = 0
231
+ if blocks_name !=None:
232
+ prev_entry_name = None if prev_block_name == None else model_id + "/" + prev_block_name
233
+ self.prev_blocks_names[entry_name] = prev_entry_name
234
+ if not prev_block_name == None:
235
+ self.next_blocks_names[prev_entry_name] = entry_name
236
+
237
+ for p in submodule.parameters(recurse=False):
238
+ blocks_params.append(p)
239
+ if isinstance(p, QTensor):
240
+ blocks_params_size += p._data.nbytes
241
+ blocks_params_size += p._scale.nbytes
242
+ else:
243
+ blocks_params_size += p.data.nbytes
244
+
245
+ for p in submodule.buffers(recurse=False):
246
+ blocks_params.append(p)
247
+ blocks_params_size += p.data.nbytes
248
+
249
+
250
+ self.blocks_of_modules_sizes[entry_name] = blocks_params_size
251
+
252
+ return blocks_params_size
253
+
119
254
 
120
255
  def can_model_be_cotenant(self, model_id):
121
256
  potential_cotenants= cotenants_map.get(model_id, None)
@@ -126,45 +261,113 @@ class offload:
126
261
  return False
127
262
  return True
128
263
 
129
- def gpu_load(self, model_id):
130
- model = self.models[model_id]
131
- self.active_models.append(model)
132
- self.active_models_ids.append(model_id)
133
- if self.verbose:
264
+ @torch.compiler.disable()
265
+ def gpu_load_blocks(self, model_id, blocks_name, async_load = False):
266
+ if blocks_name != None:
267
+ self.loaded_blocks[model_id] = blocks_name
268
+
269
+ def cpu_to_gpu(stream_to_use, blocks_params, record_for_stream = None):
270
+ with torch.cuda.stream(stream_to_use):
271
+ for p in blocks_params:
272
+ if isinstance(p, QTensor):
273
+ p._data = p._data.cuda(non_blocking=True)
274
+ p._scale = p._scale.cuda(non_blocking=True)
275
+ else:
276
+ p.data = p.data.cuda(non_blocking=True)
277
+
278
+ if record_for_stream != None:
279
+ if isinstance(p, QTensor):
280
+ p._data.record_stream(record_for_stream)
281
+ p._scale.record_stream(record_for_stream)
282
+ else:
283
+ p.data.record_stream(record_for_stream)
284
+
285
+
286
+ entry_name = model_id if blocks_name is None else model_id + "/" + blocks_name
287
+ if self.verboseLevel >=2:
288
+ model = self.models[model_id]
134
289
  model_name = model._get_name()
135
- print(f"Loading model {model_name} ({model_id}) in GPU")
136
- if not self.pinInRAM:
137
- model.to("cuda")
290
+ print(f"Loading model {entry_name} ({model_name}) in GPU")
291
+
292
+
293
+ if self.async_transfers and blocks_name != None:
294
+ first = self.prev_blocks_names[entry_name] == None
295
+ next_blocks_entry = self.next_blocks_names[entry_name] if entry_name in self.next_blocks_names else None
296
+ if first:
297
+ cpu_to_gpu(torch.cuda.current_stream(), self.blocks_of_modules[entry_name])
298
+ # if next_blocks_entry != None:
299
+ # self.transfer_stream.wait_stream(self.default_stream)
300
+ # else:
301
+ # self.transfer_stream.wait_stream(self.default_stream)
302
+ torch.cuda.synchronize()
303
+
304
+ if next_blocks_entry != None:
305
+ cpu_to_gpu(self.transfer_stream, self.blocks_of_modules[next_blocks_entry]) #, self.default_stream
306
+
138
307
  else:
139
- module_params = self.params_of_modules[model_id]
140
- for p in module_params:
308
+ # if self.async_transfers:
309
+ # self.transfer_stream.wait_stream(self.default_stream)
310
+ cpu_to_gpu(self.default_stream, self.blocks_of_modules[entry_name])
311
+ torch.cuda.synchronize()
312
+
313
+
314
+ @torch.compiler.disable()
315
+ def gpu_unload_blocks(self, model_id, blocks_name):
316
+ if blocks_name != None:
317
+ self.loaded_blocks[model_id] = None
318
+
319
+ blocks_name = model_id if blocks_name is None else model_id + "/" + blocks_name
320
+
321
+ if self.verboseLevel >=2:
322
+ model = self.models[model_id]
323
+ model_name = model._get_name()
324
+ print(f"Unloading model {blocks_name} ({model_name}) from GPU")
325
+
326
+ blocks_params = self.blocks_of_modules[blocks_name]
327
+
328
+ if model_id in self.pinned_modules_data:
329
+ pinned_parameters_data = self.pinned_modules_data[model_id]
330
+ for p in blocks_params:
141
331
  if isinstance(p, QTensor):
142
- p._data = p._data.cuda(non_blocking=True)
143
- p._scale = p._scale.cuda(non_blocking=True)
332
+ data = pinned_parameters_data[p]
333
+ p._data = data[0]
334
+ p._scale = data[1]
144
335
  else:
145
- p.data = p.data.cuda(non_blocking=True) #
146
- # torch.cuda.current_stream().synchronize()
336
+ p.data = pinned_parameters_data[p]
337
+ else:
338
+ for p in blocks_params:
339
+ if isinstance(p, QTensor):
340
+ p._data = p._data.cpu()
341
+ p._scale = p._scale.cpu()
342
+ else:
343
+ p.data = p.data.cpu()
344
+
345
+
346
+
147
347
  @torch.compiler.disable()
348
+ def gpu_load(self, model_id):
349
+ model = self.models[model_id]
350
+ self.active_models.append(model)
351
+ self.active_models_ids.append(model_id)
352
+
353
+ self.gpu_load_blocks(model_id, None)
354
+
355
+ # torch.cuda.current_stream().synchronize()
356
+
148
357
  def unload_all(self):
149
- for model, model_id in zip(self.active_models, self.active_models_ids):
150
- if not self.pinInRAM:
151
- model.to("cpu")
152
- else:
153
- module_params = self.params_of_modules[model_id]
154
- pinned_parameters_data = self.pinned_modules_data[model_id]
155
- for p in module_params:
156
- if isinstance(p, QTensor):
157
- data = pinned_parameters_data[p]
158
- p._data = data[0]
159
- p._scale = data[1]
160
- else:
161
- p.data = pinned_parameters_data[p]
162
-
358
+ for model_id in self.active_models_ids:
359
+ self.gpu_unload_blocks(model_id, None)
360
+ loaded_block = self.loaded_blocks[model_id]
361
+ if loaded_block != None:
362
+ self.gpu_unload_blocks(model_id, loaded_block)
363
+ self.loaded_blocks[model_id] = None
163
364
 
164
365
  self.active_models = []
165
366
  self.active_models_ids = []
367
+ self.active_subcaches = []
166
368
  torch.cuda.empty_cache()
167
369
  gc.collect()
370
+ self.last_reserved_mem_check = time.time()
168
371
 
169
372
  def move_args_to_gpu(self, *args, **kwargs):
170
373
  new_args= []
@@ -188,10 +391,12 @@ class offload:
188
391
 
189
392
  return new_args, new_kwargs
190
393
 
191
- def ready_to_check_mem(self, forceMemoryCheck):
394
+ def ready_to_check_mem(self):
395
+ if self.compile:
396
+ return
192
397
  cur_clock = time.time()
193
398
  # can't check at each call if we can empty the cuda cache as quering the reserved memory value is a time consuming operation
194
- if not forceMemoryCheck and (cur_clock - self.last_reserved_mem_check)<0.200:
399
+ if (cur_clock - self.last_reserved_mem_check)<0.200:
195
400
  return False
196
401
  self.last_reserved_mem_check = cur_clock
197
402
  return True
@@ -199,20 +404,70 @@ class offload:
199
404
 
200
405
  def empty_cache_if_needed(self):
201
406
  mem_reserved = torch.cuda.memory_reserved()
202
- if mem_reserved >= 0.9*self.device_mem_capacity:
407
+ mem_threshold = 0.9*self.device_mem_capacity
408
+ if mem_reserved >= mem_threshold:
203
409
  mem_allocated = torch.cuda.memory_allocated()
204
410
  if mem_allocated <= 0.70 * mem_reserved:
205
411
  # print(f"Cuda empty cache triggered as Allocated Memory ({mem_allocated/1024000:0f} MB) is lot less than Cached Memory ({mem_reserved/1024000:0f} MB) ")
206
412
  torch.cuda.empty_cache()
413
+ tm= time.time()
414
+ if self.verboseLevel >=2:
415
+ print(f"Empty Cuda cache at {tm}")
207
416
  # print(f"New cached memory after purge is {torch.cuda.memory_reserved()/1024000:0f} MB) ")
208
417
 
209
- def hook_me_light(self, target_module, forceMemoryCheck, previous_method):
210
- def check_empty_cache(module, *args, **kwargs):
211
- if self.ready_to_check_mem(forceMemoryCheck):
418
+
419
+ def any_param_or_buffer(self, target_module: torch.nn.Module):
420
+
421
+ for _ in target_module.parameters(recurse= False):
422
+ return True
423
+
424
+ for _ in target_module.buffers(recurse= False):
425
+ return True
426
+
427
+ return False
428
+
429
+
430
+
431
+ def hook_me_light(self, target_module, model_id,blocks_name, previous_method, context):
432
+
433
+ anyParam = self.any_param_or_buffer(target_module)
434
+
435
+ def check_empty_cuda_cache(module, *args, **kwargs):
436
+ if self.ready_to_check_mem():
212
437
  self.empty_cache_if_needed()
213
438
  return previous_method(*args, **kwargs)
214
-
215
- setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_empty_cache, target_module), previous_method) )
439
+
440
+
441
+ def load_module_blocks(module, *args, **kwargs):
442
+ #some_context = context #for debugging
443
+ if blocks_name == None:
444
+ if self.ready_to_check_mem():
445
+ self.empty_cache_if_needed()
446
+ else:
447
+ loaded_block = self.loaded_blocks[model_id]
448
+ if (loaded_block == None or loaded_block != blocks_name) :
449
+ if loaded_block != None:
450
+ self.gpu_unload_blocks(model_id, loaded_block)
451
+ if self.ready_to_check_mem():
452
+ self.empty_cache_if_needed()
453
+ self.loaded_blocks[model_id] = blocks_name
454
+ self.gpu_load_blocks(model_id, blocks_name)
455
+ return previous_method(*args, **kwargs)
456
+
457
+ if hasattr(target_module, "_mm_id"):
458
+ orig_model_id = getattr(target_module, "_mm_id")
459
+ if self.verboseLevel >=2:
460
+ print(f"Model '{model_id}' shares module '{target_module._get_name()}' with module '{orig_model_id}' ")
461
+ assert not anyParam
462
+ return
463
+ setattr(target_module, "_mm_id", model_id)
464
+
465
+
466
+ if blocks_name != None and anyParam:
467
+ setattr(target_module, "forward", functools.update_wrapper(functools.partial(load_module_blocks, target_module), previous_method) )
468
+ #print(f"new cache:{blocks_name}")
469
+ else:
470
+ setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_empty_cuda_cache, target_module), previous_method) )
216
471
 
217
472
 
218
473
  def hook_me(self, target_module, model, model_id, module_id, previous_method):
@@ -236,13 +491,9 @@ class offload:
236
491
  return
237
492
  setattr(target_module, "_mm_id", model_id)
238
493
 
239
- # create a fake accelerate parameter so that the _execution_device property returns always "cuda"
240
- # (it is queried in many pipelines even if offloading is not properly implemented)
241
- if not hasattr(target_module, "_hf_hook"):
242
- setattr(target_module, "_hf_hook", HfHook())
243
494
  setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_change_module, target_module), previous_method) )
244
495
 
245
- if not self.verbose:
496
+ if not self.verboseLevel >=1:
246
497
  return
247
498
 
248
499
  if module_id == None or module_id =='':
@@ -262,22 +513,185 @@ class offload:
262
513
  # self.unhook_module(module)
263
514
 
264
515
 
516
+ @staticmethod
517
+ def fast_load_transformers_model(model_path: str):
518
+ """
519
+ quick version of .from_pretrained() of the transformers library
520
+ used to build a model and load the corresponding weights (quantized or not)
521
+ """
522
+
523
+ from transformers import AutoConfig
524
+
525
+ if model_path.endswith(".sft") or model_path.endswith(".safetensors"):
526
+ config_path = model_path[ : model_path.rfind("/")]
527
+ else:
528
+ raise("full model path expected")
529
+ config_fullpath = config_path +"/config.json"
530
+
531
+ import os.path
532
+ if not os.path.isfile(config_fullpath):
533
+ raise("a 'config.json' that describes the model is required in the directory of the model")
534
+
535
+ with open(config_fullpath, "r", encoding="utf-8") as reader:
536
+ text = reader.read()
537
+ transformer_config= json.loads(text)
538
+ architectures = transformer_config["architectures"]
539
+ class_name = architectures[0]
540
+
541
+ module = __import__("transformers")
542
+ transfomer_class = getattr(module, class_name)
543
+
544
+ config = AutoConfig.from_pretrained(config_path)
545
+
546
+ from accelerate import init_empty_weights
547
+ #needed to keep inits of non persistent buffers
548
+ with init_empty_weights():
549
+ model = transfomer_class(config)
550
+
551
+ model = model.base_model
552
+ torch.set_default_device('cpu')
553
+ model.apply(model._initialize_weights)
554
+
555
+ #missing_keys, unexpected_keys =
556
+ offload.load_model_data(model,model_path, strict = True )
557
+
558
+ return model
559
+ # # text_encoder.final_layer_norm = text_encoder.norm
560
+ # model = model.base_model
561
+ # model.final_layer_norm = model.norm
562
+ # self.model = model
563
+
564
+
565
+
566
+ @staticmethod
567
+ def load_model_data(model, file_path: str, device=torch.device('cpu'), strict = True):
568
+ """
569
+ Load a model, detect if it has been previously quantized using quanto and do the extra setup if necessary
570
+ """
571
+ from optimum.quanto import requantize
572
+ import safetensors.torch
573
+
574
+ if "quanto" in file_path.lower():
575
+ pos = str.rfind(file_path, ".")
576
+ if pos > 0:
577
+ quantization_map_path = file_path[:pos]
578
+ quantization_map_path += "_map.json"
579
+
580
+
581
+ with open(quantization_map_path, 'r') as f:
582
+ quantization_map = json.load(f)
583
+
584
+ state_dict = safetensors.torch.load_file(file_path)
585
+
586
+ # change dtype of current meta model parameters because 'requantize' won't update the dtype on non quantized parameters
587
+ for k, p in model.named_parameters():
588
+ if not k in quantization_map and k in state_dict:
589
+ p_in_sd = state_dict[k]
590
+ if p.data.dtype != p_in_sd.data.dtype:
591
+ p.data = p.data.to(p_in_sd.data.dtype)
592
+
593
+ requantize(model, state_dict, quantization_map, device)
594
+
595
+ # for k, p in model.named_parameters():
596
+ # if p.data.dtype == torch.float32:
597
+ # pass
598
+
599
+
600
+ # del state_dict
601
+ return
602
+
603
+ else:
604
+ if ".safetensors" in file_path or ".sft" in file_path:
605
+ state_dict = safetensors.torch.load_file(file_path)
606
+
607
+ else:
608
+
609
+ state_dict = torch.load(file_path, weights_only=True)
610
+ if "module" in state_dict:
611
+ state_dict = state_dict["module"]
612
+
613
+
614
+ model.load_state_dict(state_dict, strict = strict, assign = True ) #strict=True,
615
+
616
+
617
+ return
618
+
619
+ @staticmethod
620
+ def save_model(model, file_path, do_quantize = False, quantization_type = qint8 ):
621
+ """save the weights of a model and quantize them if requested
622
+ These weights can be loaded again using 'load_model_data'
623
+ """
624
+ import safetensors.torch
625
+ pos = str.rfind(file_path, ".")
626
+ if pos > 0:
627
+ file_path = file_path[:pos]
628
+
629
+ if do_quantize:
630
+ _quantize(model, weights=quantization_type)
631
+
632
+ # # state_dict = {k: v.clone().contiguous() for k, v in model.state_dict().items()}
633
+ # state_dict = {k: v for k, v in model.state_dict().items()}
634
+
635
+
636
+
637
+ safetensors.torch.save_file(model.state_dict(), file_path + '.safetensors')
638
+
639
+ if do_quantize:
640
+ from optimum.quanto import quantization_map
641
+
642
+ with open(file_path + '_map.json', 'w') as f:
643
+ json.dump(quantization_map(model), f)
644
+
265
645
 
266
646
 
267
647
  @classmethod
268
- def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verbose = True, modelsToQuantize = None ):
648
+ def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verboseLevel = 1, modelsToQuantize = None, budgets= 0, info = None):
649
+ """Hook to a pipeline or a group of modules in order to reduce their VRAM requirements:
650
+ pipe_or_dict_of_modules : the pipeline object or a dictionary of modules of the model
651
+ quantizeTransformer: set True by default will quantize on the fly the video / image model
652
+ pinInRAM: move models in reserved memory. This allows very fast performance but requires 50% extra RAM (usually >=64 GB)
653
+ modelsToQuantize: a list of models to be also quantized on the fly (e.g. the text_encoder), useful to reduce both RAM and VRAM consumption
654
+ budgets: 0 by default (unlimited). If non 0, it corresponds to the maximum size in MB that every model will occupy at any moment
655
+ (in fact the real usage is twice this number). It is very efficient to reduce VRAM consumption but this feature may be very slow
656
+ if pinInRAM is not enabled
657
+ """
658
+
269
659
  self = cls()
270
- self.verbose = verbose
660
+ self.verboseLevel = verboseLevel
271
661
  self.pinned_modules_data = {}
662
+ model_budgets = {}
272
663
 
664
+ # model_budgets = {"text_encoder_2": 3400 }
665
+ HEADER = '\033[95m'
666
+ ENDC = '\033[0m'
667
+ BOLD ='\033[1m'
668
+ UNBOLD ='\033[0m'
669
+
670
+ print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 2.0) by DeepBeepMeep ************{ENDC}{UNBOLD}")
671
+ if info != None:
672
+ print(info)
673
+ budget = 0
674
+ if not budgets is None:
675
+ if isinstance(budgets , dict):
676
+ model_budgets = budgets
677
+ else:
678
+ budget = int(budgets) * ONE_MB
679
+
680
+ if (budgets!= None or budget >0) :
681
+ self.async_transfers = True
682
+
683
+ #pinInRAM = True
273
684
  # compile not working yet or slower
274
685
  compile = False
275
- self.pinInRAM = pinInRAM
686
+ #quantizeTransformer = False
687
+ #self.async_transfers = False
688
+ self.compile = compile
689
+
276
690
  pipe = None
277
- preloadInRAM = True
278
691
  torch.set_default_device('cuda')
279
692
  if hasattr(pipe_or_dict_of_modules, "components"):
280
- pipe_or_dict_of_modules.to("cpu") #XXXX
693
+ # commented as it not very useful and generates warnings
694
+ #pipe_or_dict_of_modules.to("cpu") #XXXX
281
695
  # create a fake Accelerate parameter so that lora loading doesn't change the device
282
696
  pipe_or_dict_of_modules.hf_device_map = torch.device("cuda")
283
697
  pipe = pipe_or_dict_of_modules
@@ -291,114 +705,181 @@ class offload:
291
705
  modelsToQuantize = [modelsToQuantize]
292
706
  if quantizeTransformer:
293
707
  modelsToQuantize.append("transformer")
708
+
294
709
  self.models_to_quantize = modelsToQuantize
710
+ models_already_loaded = []
711
+
712
+ modelsToPin = None
713
+ pinAllModels = False
714
+ if isinstance(pinInRAM, bool):
715
+ pinAllModels = pinInRAM
716
+ elif isinstance(pinInRAM, list):
717
+ modelsToPin = pinInRAM
718
+ else:
719
+ modelsToPin = [pinInRAM]
720
+
295
721
  # del models["transformer"] # to test everything but the transformer that has a much longer loading
296
- # models = { 'transformer': pipe_or_dict_of_modules["transformer"]} # to test only the transformer
722
+ sizeofbfloat16 = torch.bfloat16.itemsize
723
+ #
724
+ # models = { 'transformer': pipe_or_dict_of_modules["transformer"]} # to test only the transformer
725
+
726
+
297
727
  for model_id in models:
298
728
  current_model: torch.nn.Module = models[model_id]
729
+ modelPinned = pinAllModels or (modelsToPin != None and model_id in modelsToPin)
299
730
  # make sure that no RAM or GPU memory is not allocated for gradiant / training
300
- current_model.to("cpu").eval() #XXXXX
301
-
731
+ current_model.to("cpu").eval()
732
+ already_loaded = False
302
733
  # Quantize model just before transferring it to the RAM to keep OS cache file
303
734
  # open as short as possible. Indeed it seems that as long as the lazy safetensors
304
735
  # are not fully fully loaded, the OS won't be able to release the corresponding cache file in RAM.
305
736
  if model_id in self.models_to_quantize:
306
- print(f"Quantization of model '{model_id}' started")
307
- quantize(current_model, weights=qint8)
308
- freeze(current_model)
309
- print(f"Quantization of model '{model_id}' done")
310
- torch.cuda.empty_cache()
311
- gc.collect()
312
737
 
738
+ already_quantized = _quantize(current_model, weights=qint8, verboseLevel = self.verboseLevel, model_id=model_id)
739
+ if not already_quantized:
740
+ already_loaded = True
741
+ models_already_loaded.append(model_id)
313
742
 
314
-
315
- if preloadInRAM: #
316
- # load all the remaining unread lazy safetensors in RAM to free open cache files
317
- for p in current_model.parameters():
318
- # Preread every tensor in RAM except tensors that have just been quantified
319
- # and are no longer needed
320
- if isinstance(p, QTensor):
321
- # fix quanto bug (see below) now as he won't have any opportunity to do it during RAM pinning
322
- if not pinInRAM and p._scale.dtype == torch.float32:
323
- p._scale = p._scale.to(torch.bfloat16)
324
743
 
744
+ current_model_size = 0
745
+ # load all the remaining unread lazy safetensors in RAM to free open cache files
746
+ for p in current_model.parameters():
747
+ # Preread every tensor in RAM except tensors that have just been quantified
748
+ # and are no longer needed
749
+ if isinstance(p, QTensor):
750
+ # fix quanto bug (see below) now as he won't have any opportunity to do it during RAM pinning
751
+ if not modelPinned and p._scale.dtype == torch.float32:
752
+ p._scale = p._scale.to(torch.bfloat16)
753
+ current_model_size += torch.numel(p._scale) * sizeofbfloat16
754
+ current_model_size += torch.numel(p._data) * sizeofbfloat16 / 2
755
+ if pinInRAM and not already_loaded:
756
+ # Force flushing the lazy load so that reserved memory can be freed when we are ready to pin
757
+ p._scale = p._scale + 0
758
+ p._data = p._data + 0
759
+ else:
760
+ if p.data.dtype == torch.float32:
761
+ # convert any leftover float32 weights to bfloat16 to halve the model memory footprint
762
+ p.data = p.data.to(torch.bfloat16)
325
763
  else:
326
- if p.data.dtype == torch.float32:
327
- # convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
328
- p.data = p.data.to(torch.bfloat16)
329
- else:
330
- # force reading the tensors from the disk by pretending to modify them
331
- p.data = p.data + 0
332
-
764
+ # force reading the tensors from the disk by pretending to modify them
765
+ p.data = p.data + 0
766
+
767
+ current_model_size += torch.numel(p.data) * p.data.element_size()
768
+
769
+ for b in current_model.buffers():
770
+ if b.data.dtype == torch.float32:
771
+ # convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
772
+ b.data = b.data.to(torch.bfloat16)
773
+ else:
774
+ # force reading the tensors from the disk by pretending to modify them
775
+ b.data = b.data + 0
776
+
777
+ current_model_size += torch.numel(p.data) * p.data.element_size()
778
+
779
+ if model_id not in self.models:
780
+ self.models[model_id] = current_model
781
+
782
+
783
+ model_budget = model_budgets[model_id] * ONE_MB if model_id in model_budgets else budget
784
+
785
+ if model_budget > 0 and model_budget > current_model_size:
786
+ model_budget = 0
787
+
788
+ model_budgets[model_id] = model_budget
789
+
790
+ # Pin in RAM models only once they have been fully loaded otherwise there will be some contention (at least on Linux OS) in the non pageable memory
791
+ # between partially loaded lazy safetensors and pinned tensors
792
+ for model_id in models:
793
+ current_model: torch.nn.Module = models[model_id]
794
+ if not (pinAllModels or modelsToPin != None and model_id in modelsToPin):
795
+ continue
796
+ if verboseLevel>=1:
797
+ print(f"Pinning tensors of '{model_id}' in RAM")
798
+ gc.collect()
799
+ pinned_parameters_data = {}
800
+ for p in current_model.parameters():
801
+ if isinstance(p, QTensor):
802
+ # pin in memory both quantized data and scales of quantized parameters
803
+ # but don't pin .data as it corresponds to the original tensor that we don't want to reload
804
+ p._data = p._data.pin_memory()
805
+ # fix quanto bug (that seems to have been fixed since) that allows _scale to be float32 if the original weight was float32
806
+ # (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
807
+ if p._scale.dtype == torch.float32:
808
+ pass
809
+
810
+ p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
811
+ pinned_parameters_data[p]=[p._data, p._scale]
812
+ else:
813
+ p.data = p.data.pin_memory()
814
+ pinned_parameters_data[p]=p.data
815
+ for b in current_model.buffers():
816
+ b.data = b.data.pin_memory()
817
+
818
+ pinned_buffers_data = {b: b.data for b in current_model.buffers()}
819
+ pinned_parameters_data.update(pinned_buffers_data)
820
+ self.pinned_modules_data[model_id]=pinned_parameters_data
333
821
 
334
- addModelFlag = False
335
822
 
336
- current_block_sequence = None
823
+ # Hook forward methods of modules
824
+ for model_id in models:
825
+ current_model: torch.nn.Module = models[model_id]
826
+ current_budget = model_budgets[model_id]
827
+ current_size = 0
828
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
829
+ self.loaded_blocks[model_id] = None
830
+
337
831
  for submodule_name, submodule in current_model.named_modules():
832
+ # create a fake accelerate parameter so that the _execution_device property returns always "cuda"
833
+ # (it is queried in many pipelines even if offloading is not properly implemented)
834
+ if not hasattr(submodule, "_hf_hook"):
835
+ setattr(submodule, "_hf_hook", HfHook())
836
+
837
+ if submodule_name=='':
838
+ continue
839
+
840
+ if current_budget > 0:
841
+ if isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
842
+ if cur_blocks_prefix == None:
843
+ cur_blocks_prefix = submodule_name + "."
844
+ else:
845
+ #if cur_blocks_prefix != submodule_name[:len(cur_blocks_prefix)]:
846
+ if not submodule_name.startswith(cur_blocks_prefix):
847
+ cur_blocks_prefix = submodule_name + "."
848
+ cur_blocks_name,cur_blocks_seq = None, -1
849
+ else:
850
+
851
+ if cur_blocks_prefix is not None:
852
+ #if cur_blocks_prefix == submodule_name[0:len(cur_blocks_prefix)]:
853
+ if submodule_name.startswith(cur_blocks_prefix):
854
+ num = int(submodule_name[len(cur_blocks_prefix):].split(".")[0])
855
+ if num != cur_blocks_seq and (cur_blocks_name == None or current_size > current_budget):
856
+ prev_blocks_name = cur_blocks_name
857
+ cur_blocks_name = cur_blocks_prefix + str(num)
858
+ # print(f"new block: {model_id}/{cur_blocks_name} - {submodule_name}")
859
+ cur_blocks_seq = num
860
+ else:
861
+ cur_blocks_prefix, prev_blocks_name, cur_blocks_name,cur_blocks_seq = None, None, None, -1
862
+
338
863
  if hasattr(submodule, "forward"):
339
864
  submodule_method = getattr(submodule, "forward")
340
865
  if callable(submodule_method):
341
- addModelFlag = True
342
- if submodule_name=='' or len(submodule_name.split("."))==1:
343
- # hook only the first two levels of modules with the full suite of processing
866
+ if len(submodule_name.split("."))==1:
867
+ # hook only the first level of modules with the full suite of processing
344
868
  self.hook_me(submodule, current_model, model_id, submodule_name, submodule_method)
345
- else:
346
- forceMemoryCheck = False
347
- pos = submodule_name.find(".0.")
348
- if pos > 0:
349
- if current_block_sequence == None:
350
- new_candidate = submodule_name[0:pos+3]
351
- if len(new_candidate.split("."))<=4:
352
- current_block_sequence = new_candidate
353
- # force a memory check when initiating a new sequence of blocks as the shapes of tensor will certainly change
354
- # and memory reusability is less likely
355
- # we limit this check to the first level of blocks as quering the cuda cache is time consuming
356
- forceMemoryCheck = True
357
- else:
358
- if current_block_sequence != submodule_name[0:len(current_block_sequence)]:
359
- current_block_sequence = None
360
- self.hook_me_light(submodule, forceMemoryCheck, submodule_method)
361
-
362
-
363
- if addModelFlag:
364
- if model_id not in self.models:
365
- self.models[model_id] = current_model
366
-
367
- # Pin in RAM models only once they have been fully loaded otherwise there may be some contention in the non pageable memory
368
- # between partially loaded lazy safetensors and pinned tensors
369
- if pinInRAM:
370
- if verbose:
371
- print("Pinning model tensors in RAM")
372
- torch.cuda.empty_cache()
373
- gc.collect()
374
- for model_id in models:
375
- pinned_parameters_data = {}
376
- current_model: torch.nn.Module = models[model_id]
377
- for p in current_model.parameters():
378
- if isinstance(p, QTensor):
379
- # pin in memory both quantized data and scales of quantized parameters
380
- # but don't pin .data as it corresponds to the original tensor that we don't want to reload
381
- p._data = p._data.pin_memory()
382
- # fix quanto bug that allows _scale to be float32 if the original weight was float32
383
- # (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
384
- p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
385
- pinned_parameters_data[p]=[p._data, p._scale]
386
- else:
387
- p.data = p.data.pin_memory()
388
- pinned_parameters_data[p]=p.data
389
- for b in current_model.buffers():
390
- b.data = b.data.pin_memory()
869
+ else:
870
+ # force a memory check when initiating a new sequence of blocks as the shapes of tensor will certainly change
871
+ # and memory reusability is less likely
872
+ # we limit this check to the first level of blocks as querying the cuda cache is time consuming
873
+ self.hook_me_light(submodule, model_id, cur_blocks_name, submodule_method, context = submodule_name)
391
874
 
392
- pinned_buffers_data = {b: b.data for b in current_model.buffers()}
393
- pinned_parameters_data.update(pinned_buffers_data)
394
- self.pinned_modules_data[model_id]=pinned_parameters_data
875
+ # if compile and cur_blocks_name != None and model_id == "transformer" and "_blocks" in submodule_name:
876
+ # submodule.compile(mode="reduce-overhead" ) #mode= "max-autotune"
877
+
878
+ current_size = self.add_module_to_blocks(model_id, cur_blocks_name, submodule, prev_blocks_name)
395
879
 
396
- module_params = []
397
- self.params_of_modules[model_id] = module_params
398
- self.collect_module_parameters(current_model,module_params)
399
880
 
400
881
  if compile:
401
- if verbose:
882
+ if verboseLevel>=1:
402
883
  print("Torch compilation started")
403
884
  torch._dynamo.config.cache_size_limit = 10000
404
885
  # if pipe != None and hasattr(pipe, "__call__"):
@@ -409,13 +890,65 @@ class offload:
409
890
  current_model.compile(mode= "max-autotune")
410
891
  #models["transformer"].compile()
411
892
 
412
- if verbose:
893
+ if verboseLevel>=1:
413
894
  print("Torch compilation done")
414
895
 
896
+ if verboseLevel >=2:
897
+ for n,b in self.blocks_of_modules_sizes.items():
898
+ print(f"Size of submodel '{n}': {b/ONE_MB:.1f} MB")
899
+
415
900
  torch.cuda.empty_cache()
416
901
  gc.collect()
417
902
 
418
-
419
903
  return self
420
904
 
421
-
905
+
906
+
907
+ @staticmethod
908
+ def profile(pipe_or_dict_of_modules,profile_no: profile_type, quantizeTransformer = True):
909
+ """Apply a configuration profile that depends on your hardware:
910
+ pipe_or_dict_of_modules : the pipeline object or a dictionary of modules of the model
911
+ profile_name : num of the profile:
912
+ HighRAM_HighVRAM_Fastest (=1): at least 48 GB of RAM and 24 GB of VRAM : the fastest well suited for a RTX 3090 / RTX 4090
913
+ HighRAM_LowVRAM_Fast (=2): at least 48 GB of RAM and 12 GB of VRAM : a bit slower, better suited for RTX 3070/3080/4070/4080
914
+ or for RTX 3090 / RTX 4090 with large pictures batches or long videos
915
+ LowRAM_HighVRAM_Medium (=3): at least 32 GB of RAM and 24 GB of VRAM : so so speed but adapted for RTX 3090 / RTX 4090 with limited RAM
916
+ LowRAM_LowVRAM_Slow (=4): at least 32 GB of RAM and 12 GB of VRAM : if you have little VRAM or generate longer videos
917
+ VerylowRAM_LowVRAM_Slowest (=5): at least 24 GB of RAM and 10 GB of VRAM : if you don't have much it won't be fast but maybe it will work
918
+ quantizeTransformer: bool = True, the main model is quantized by default for all the profiles, you may want to disable that to get the best image quality
919
+ """
920
+
921
+
922
+ modules = pipe_or_dict_of_modules
923
+ if hasattr(modules, "components"):
924
+ modules= modules.components
925
+ any_T5 = False
926
+ if "text_encoder_2" in modules:
927
+ text_encoder_2 = modules["text_encoder_2"]
928
+ any_T5 = "t5" in text_encoder_2.__module__.lower()
929
+ extra_mod_to_quantize = ("text_encoder_2" if any_T5 else "text_encoder")
930
+
931
+ # transformer (video or image generator) should be as small as possible to not occupy space that could be used by actual image data
932
+ # on the other hand the text encoder should be quite large (as long as it fits in 10 GB of VRAM) to reduce sequence offloading
933
+
934
+ budgets = { "transformer" : 600 , "text_encoder": 3000, "text_encoder_2": 3000 }
935
+
936
+ if profile_no == profile_type.HighRAM_HighVRAM_Fastest:
937
+ info = "You have chosen a Very Fast profile that requires at least 48 GB of RAM and 24 GB of VRAM."
938
+ return offload.all(pipe_or_dict_of_modules, pinInRAM= True, info = info, quantizeTransformer= quantizeTransformer)
939
+ elif profile_no == profile_type.HighRAM_LowVRAM_Fast:
940
+ info = "You have chosen a Fast profile that requires at least 48 GB of RAM and 12 GB of VRAM."
941
+ return offload.all(pipe_or_dict_of_modules, pinInRAM= True, budgets=budgets, info = info, quantizeTransformer= quantizeTransformer )
942
+ elif profile_no == profile_type.LowRAM_HighVRAM_Medium:
943
+ info = "You have chosen a Medium speed profile that requires at least 32 GB of RAM and 24 GB of VRAM."
944
+ return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , info = info, quantizeTransformer= quantizeTransformer)
945
+ elif profile_no == profile_type.LowRAM_LowVRAM_Slow:
946
+ info = "You have chosen the Slowest profile that requires at least 32 GB of RAM and 12 GB of VRAM."
947
+ return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , budgets=budgets, info = info, quantizeTransformer= quantizeTransformer)
948
+ elif profile_no == profile_type.VerylowRAM_LowVRAM_Slowest:
949
+ budgets["transformer"] = 400
950
+ info = "You have chosen the Slowest profile that requires at least 24 GB of RAM and 10 GB of VRAM."
951
+ return offload.all(pipe_or_dict_of_modules, pinInRAM= False, modelsToQuantize= extra_mod_to_quantize , budgets=budgets, info = info, quantizeTransformer= quantizeTransformer)
952
+ else:
953
+ raise("Unknown profile")
954
+
mmgp-1.2.0.dist-info/METADATA DELETED
@@ -1,109 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: mmgp
3
- Version: 1.2.0
4
- Summary: Memory Management for the GPU Poor
5
- Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
6
- License: GNU GENERAL PUBLIC LICENSE
7
- Version 3, 29 June 2007
8
- Requires-Python: >=3.10
9
- Description-Content-Type: text/markdown
10
- License-File: LICENSE.md
11
- Requires-Dist: torch>=2.1.0
12
- Requires-Dist: optimum-quanto
13
-
14
-
15
- <p align="center">
16
- <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
17
- </p>
18
-
19
-
20
- This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
21
- This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
22
- times in a pipe (eg VAE).
23
-
24
- Requirements:
25
- - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
26
- - RAM: minimum 48 GB, recommended 64 GB
27
-
28
- ## Usage
29
- First you need to install the module in your current project with:
30
- ```shell
31
- pip install mmgp
32
- ```
33
-
34
- It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
35
- 1) First make sure that the pipeline explictly loads the models in the CPU device, for instance:
36
- ```
37
- pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
38
- ```
39
-
40
- 2) Once every potential Lora has been loaded and merged, add the following lines:
41
-
42
- ```
43
- from mmgp import offload
44
- offload.all(pipe)
45
- ```
46
-
47
- ## Options
48
- The 'transformer' model in the pipe contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
49
-
50
- You can specify a list of additional models string ids to quantize (for instance the text_encoder) using the optional argument *modelsToQuantize* for instance *modelsToQuantize = ["text_encoder_2"]*.This may be useful if you have less than 48 GB of RAM.
51
-
52
- Note that there is little advantage on the GPU / VRAM side to quantize text encoders as their inputs are usually quite light.
53
-
54
- Conversely if you have more than 64GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
55
- (this can save significant time if the same pipeline is run multiple times in a row)
56
-
57
- In Summary, if you have:
58
- - Between 32 GB and 48 GB of RAM
59
- ```
60
- offload.all(pipe, modelsToQuantize = ["text_encoder_2"]) # for Flux models
61
- #OR
62
- offload.all(pipe, modelsToQuantize = ["text_encoder"]) # for HunyuanVideo models
63
-
64
- ```
65
-
66
- - Between 48 GB and 64 GB of RAM
67
- ```
68
- offload.all(pipe)
69
- ```
70
- - More than 64 GB of RAM
71
- ```
72
- offload.all(pipe), pinInRAM = True
73
- ```
74
-
75
- ## Special
76
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
77
- For instance :
78
-
79
-
80
- - for flux derived models:
81
- ```
82
- pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
83
- ```
84
- - for mochi:
85
- ```
86
- pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
87
- ```
88
-
89
-
90
- Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
91
-
92
- Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
93
- where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
94
- I suggest you use instead one of the 16 bits encoder only version available around, for instance:
95
- ```
96
- text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
97
- ```
98
-
99
- Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
100
- - For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
101
- - mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
102
-
103
- You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
104
-
105
- Thanks to
106
- ---------
107
- - Huggingface / accelerate for the hooking examples
108
- - Huggingface / quanto for their very useful quantizer
109
- - gau-nernst for his Pinnig RAM samples
mmgp-1.2.0.dist-info/RECORD DELETED
@@ -1,7 +0,0 @@
1
- __init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
- mmgp.py,sha256=IijgE22bUPl98VvXwoC1qngmkdWU11YXjkiksp8o1hY,21418
3
- mmgp-1.2.0.dist-info/LICENSE.md,sha256=HjzvY2grdtdduZclbZ46B2M-XpT4MDCxFub5ZwTWq2g,93
4
- mmgp-1.2.0.dist-info/METADATA,sha256=jRXi-iNZ_3zNNVxMC1qmVDd7ylq8kAr5Y5FgYyBvVh4,4897
5
- mmgp-1.2.0.dist-info/WHEEL,sha256=PZUExdf71Ui_so67QXpySuHtCi3-J3wvF4ORK6k_S8U,91
6
- mmgp-1.2.0.dist-info/top_level.txt,sha256=waGaepj2qVfnS2yAOkaMu4r9mJaVjGbEi6AwOUogU_U,14
7
- mmgp-1.2.0.dist-info/RECORD,,
File without changes