mmgp 1.1.0__tar.gz → 1.2.0__tar.gz


mmgp-1.2.0/LICENSE.md ADDED
@@ -0,0 +1,2 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
mmgp-1.2.0/PKG-INFO ADDED
@@ -0,0 +1,109 @@
+ Metadata-Version: 2.1
+ Name: mmgp
+ Version: 1.2.0
+ Summary: Memory Management for the GPU Poor
+ Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
+ License: GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE.md
+ Requires-Dist: torch>=2.1.0
+ Requires-Dist: optimum-quanto
+
+
+ <p align="center">
+ <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
+ </p>
+
+
+ This module contains multiple optimisations so that models such as Flux (and derivatives), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU limited to 24 GB of VRAM.
+ It is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several
+ times in a pipe (e.g. the VAE).
+
+ Requirements:
+ - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
+ - RAM: minimum 48 GB, recommended 64 GB
+
+ ## Usage
+ First you need to install the module in your current project with:
+ ```shell
+ pip install mmgp
+ ```
+
+ It is almost plug and play: it just needs to be invoked from the main app right after the model pipeline has been created.
+ 1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
+ ```
+ pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
+ ```
+
+ 2) Once every LoRA has been loaded and merged, add the following lines:
+
+ ```
+ from mmgp import offload
+ offload.all(pipe)
+ ```
+
+ ## Options
+ The 'transformer' model in the pipe, which usually contains the video or image generator, is quantized on the fly to 8 bits by default. If you want to save disk space and reduce loading time, you may want to load a prequantized model directly. In that case you need to set the option *quantizeTransformer* to *False* to turn off on-the-fly quantization.
+
+ You can specify a list of additional model string ids to quantize (for instance the text_encoder) using the optional argument *modelsToQuantize*, for instance *modelsToQuantize = ["text_encoder_2"]*. This may be useful if you have less than 48 GB of RAM.
+
+ Note that there is little advantage on the GPU / VRAM side to quantizing text encoders, as their inputs are usually quite light.
+
+ Conversely, if you have more than 64 GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. In return you will get super fast loading / unloading of models
+ (this can save significant time if the same pipeline is run multiple times in a row).
+
+ In summary, if you have:
+ - Between 32 GB and 48 GB of RAM
+ ```
+ offload.all(pipe, modelsToQuantize = ["text_encoder_2"]) # for Flux models
+ # OR
+ offload.all(pipe, modelsToQuantize = ["text_encoder"]) # for HunyuanVideo models
+
+ ```
+
+ - Between 48 GB and 64 GB of RAM
+ ```
+ offload.all(pipe)
+ ```
+ - More than 64 GB of RAM
+ ```
+ offload.all(pipe, pinInRAM = True)
+ ```
+
+ ## Special
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
+ For instance:
+
+
+ - for Flux derived models:
+ ```
+ pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae }
+ ```
+ - for Mochi:
+ ```
+ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae": self.decoder }
+ ```
+
+
+ Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
+
+ Be careful: lots of models use T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository,
+ where there is a huge 40 GB model to download and load. This is cumbersome, as it is a 32-bit model and contains the decoder part of T5, which is not used.
+ I suggest you instead use one of the 16-bit encoder-only versions available, for instance:
+ ```
+ text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
+ ```
+
+ Sometimes just providing the pipe won't be sufficient, as you may need to change the original model / app code:
+ - For instance, you may need to disable CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu).
+ - mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled: it will create tensors on the cpu device, and this may cause issues.
+
+ You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on Twitter @deepbeepmeep
+
+ Thanks to
+ ---------
+ - Huggingface / accelerate for the hooking examples
+ - Huggingface / quanto for their very useful quantizer
+ - gau-nernst for his RAM pinning samples
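
Taken together, the README steps above amount to the following end-to-end sketch. It assumes the diffusers `FluxPipeline` shown in the examples and roughly 32-48 GB of system RAM; the prompt, step count, and output file name are illustrative placeholders, not part of the package.

```python
import torch
from diffusers import FluxPipeline

from mmgp import offload

# 1) Load the whole pipeline on the CPU; mmgp moves weights to the GPU on demand later.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cpu")

# 2) Load and merge any LoRAs here, before the offloading hooks are installed.

# 3) Install the offloading hooks. The transformer is quantized to 8 bits on the fly by
#    default; also quantizing text_encoder_2 helps when RAM is between 32 and 48 GB.
offload.all(pipe, modelsToQuantize=["text_encoder_2"])

# 4) Generate as usual.
image = pipe("a photo of a red fox in the snow", num_inference_steps=4).images[0]
image.save("fox.png")
```
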
@@ -12,6 +12,7 @@ Requirements:
  - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
  - RAM: minimum 48 GB, recommended 64 GB

+ ## Usage
  First you need to install the module in your current project with:
  ```shell
  pip install mmgp
@@ -27,13 +28,38 @@ It is almost plug and play and just needs to be invoked from the main app just a

  ```
  from mmgp import offload
- offload.me(pipe)
+ offload.all(pipe)
  ```
+
+ ## Options
  The 'transformer' model in the pipe, which usually contains the video or image generator, is quantized on the fly to 8 bits by default. If you want to save disk space and reduce loading time, you may want to load a prequantized model directly. In that case you need to set the option *quantizeTransformer* to *False* to turn off on-the-fly quantization.

- If you have more than 64GB RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
+ You can specify a list of additional model string ids to quantize (for instance the text_encoder) using the optional argument *modelsToQuantize*, for instance *modelsToQuantize = ["text_encoder_2"]*. This may be useful if you have less than 48 GB of RAM.
+
+ Note that there is little advantage on the GPU / VRAM side to quantizing text encoders, as their inputs are usually quite light.
+
+ Conversely, if you have more than 64 GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. In return you will get super fast loading / unloading of models
  (this can save significant time if the same pipeline is run multiple times in a row)

+ In summary, if you have:
+ - Between 32 GB and 48 GB of RAM
+ ```
+ offload.all(pipe, modelsToQuantize = ["text_encoder_2"]) # for Flux models
+ # OR
+ offload.all(pipe, modelsToQuantize = ["text_encoder"]) # for HunyuanVideo models
+
+ ```
+
+ - Between 48 GB and 64 GB of RAM
+ ```
+ offload.all(pipe)
+ ```
+ - More than 64 GB of RAM
+ ```
+ offload.all(pipe, pinInRAM = True)
+ ```
+
+ ## Special
  Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
  For instance:

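
The '## Special' section introduced above covers apps that have no pipeline object at all. Below is a sketch of that path for a Flux-derived app; the repository id, subfolders, and dtypes are illustrative assumptions based on the standard Flux layout and should be adapted to the checkpoints the app actually uses.

```python
import torch
from diffusers import AutoencoderKL, FluxTransformer2DModel
from transformers import CLIPTextModel, T5EncoderModel

from mmgp import offload

repo = "black-forest-labs/FLUX.1-dev"

# Load each submodel individually, on the CPU, as the app would.
clip = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder", torch_dtype=torch.bfloat16)
t5 = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_2", torch_dtype=torch.bfloat16)
model = FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer", torch_dtype=torch.bfloat16)
ae = AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=torch.bfloat16)

# No pipeline object: hand mmgp a plain dict instead. Exactly one entry must be
# named "transformer"; it is the model quantized on the fly by default.
pipe = {"text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae}
offload.all(pipe)
```
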
@@ -1,6 +1,6 @@
  [project]
  name = "mmgp"
- version = "1.1.0"
+ version = "1.2.0"
  authors = [
  { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
  ]
@@ -13,10 +13,12 @@
  # for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
  # 2) Once every LoRA has been loaded and merged, add the following lines:
  # from mmgp import offload
- # offload.me(pipe)
- # The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option quantizeTransformer to False to turn off on the fly quantization.
- #
- # If you have more than 64GB RAM you may want to enable RAM pinning with the option pinInRAM = True. You will get in return super fast loading / unloading of models
+ # offload.all(pipe)
+ # The 'transformer' model, which usually contains the video or image generator, is quantized on the fly to 8 bits by default so that it can fit into 24 GB of VRAM.
+ # If you want to save disk space and reduce loading time, you may want to load a prequantized model directly. In that case you need to set the option quantizeTransformer to False to turn off on-the-fly quantization.
+ # You can specify a list of additional model string ids to quantize (for instance the text_encoder) using the optional argument modelsToQuantize. This may be useful if you have less than 48 GB of RAM.
+ # Note that there is little advantage on the GPU / VRAM side to quantizing text encoders, as their inputs are usually quite light.
+ # Conversely, if you have more than 64 GB of RAM you may want to enable RAM pinning with the option pinInRAM = True. In return you will get super fast loading / unloading of models
  # (this can save significant time if the same pipeline is run multiple times in a row)
  #
  # Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
@@ -263,7 +265,7 @@ class offload:


  @classmethod
- def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verbose = True):
+ def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verbose = True, modelsToQuantize = None ):
      self = cls()
      self.verbose = verbose
      self.pinned_modules_data = {}
@@ -284,8 +286,12 @@ class offload:

  models = {k: v for k, v in pipe_or_dict_of_modules.items() if isinstance(v, torch.nn.Module)}

+ modelsToQuantize = modelsToQuantize if modelsToQuantize is not None else []
+ if not isinstance(modelsToQuantize, list):
+     modelsToQuantize = [modelsToQuantize]
  if quantizeTransformer:
-     self.models_to_quantize = ["transformer"]
+     modelsToQuantize.append("transformer")
+ self.models_to_quantize = modelsToQuantize
  # del models["transformer"] # to test everything but the transformer that has a much longer loading
  # models = { 'transformer': pipe_or_dict_of_modules["transformer"]} # to test only the transformer
  for model_id in models:
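
The behaviour of the new modelsToQuantize argument added in this hunk can be restated as a small standalone helper. The helper name is hypothetical and exists only for illustration; the logic mirrors the added lines above: a bare string id is wrapped into a one-item list, and "transformer" is appended automatically unless quantizeTransformer is turned off.

```python
def resolve_models_to_quantize(modelsToQuantize=None, quantizeTransformer=True):
    # Normalize the argument exactly as offload.all does in the hunk above.
    modelsToQuantize = modelsToQuantize if modelsToQuantize is not None else []
    if not isinstance(modelsToQuantize, list):
        modelsToQuantize = [modelsToQuantize]   # a bare string id becomes a one-item list
    if quantizeTransformer:
        modelsToQuantize.append("transformer")  # the main model is quantized by default
    return modelsToQuantize

assert resolve_models_to_quantize() == ["transformer"]
assert resolve_models_to_quantize("text_encoder_2") == ["text_encoder_2", "transformer"]
assert resolve_models_to_quantize(["text_encoder"], quantizeTransformer=False) == ["text_encoder"]
```
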