mmgp 1.0.4.tar.gz → 1.0.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Note: this version of mmgp has been flagged as potentially problematic.

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mmgp
-Version: 1.0.4
+Version: 1.0.5
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -685,38 +685,48 @@ Requires-Dist: optimum-quanto
 
 **------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
 
-This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
-This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
+This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
+This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
+times in a pipe (eg VAE).
 
 Requirements:
-- GPU: RTX 3090/ RTX 4090
+- GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
 - RAM: minimum 48 GB, recommended 64 GB
 
 It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
 1) First make sure that the pipeline explictly loads the models in the CPU device
 for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
 2) Once every potential Lora has been loaded and merged, add the following lines:
-from mmgp import offload
-offload.me(pipe)
-If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
-Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
+
+*from mmgp import offload*
+*offload.me(pipe)*
+
+The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
+
+If you have more than 64GB RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
+(this can save significant time if the same pipeline is run multiple times in a row)
+
+Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
 
 For instance :
-for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
-for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
+for flux derived models: *pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }*
+for mochi: *pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }*
 
-Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
+Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
 
-Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
+Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
 where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
 I suggest you use instead one of the 16 bits encoder only version available around, for instance:
-text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
+*text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)*
 
-You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
+Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
+- For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
+- mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
 
+You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
 
-Thanks
--------
+Thanks to
+---------
 Huggingface / accelerate for the hooking examples
 Huggingface / quanto for their very useful quantizer
-gau-nernst for his Pinnig RAM examples
+gau-nernst for his Pinnig RAM samples
mmgp-1.0.5/README.md ADDED
@@ -0,0 +1,47 @@
+**------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
+
+This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
+This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
+times in a pipe (eg VAE).
+
+Requirements:
+- GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
+- RAM: minimum 48 GB, recommended 64 GB
+
+It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
+1) First make sure that the pipeline explictly loads the models in the CPU device
+for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
+2) Once every potential Lora has been loaded and merged, add the following lines:
+
+*from mmgp import offload*
+*offload.me(pipe)*
+
+The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
+
+If you have more than 64GB RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
+(this can save significant time if the same pipeline is run multiple times in a row)
+
+Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
+
+For instance :
+for flux derived models: *pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }*
+for mochi: *pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }*
+
+Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
+
+Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
+where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
+I suggest you use instead one of the 16 bits encoder only version available around, for instance:
+*text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)*
+
+Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
+- For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
+- mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
+
+You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
+
+Thanks to
+---------
+Huggingface / accelerate for the hooking examples
+Huggingface / quanto for their very useful quantizer
+gau-nernst for his Pinnig RAM samples
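
The new README above describes the intended workflow only in fragments. Below is a minimal, self-contained sketch of that workflow for a Flux pipeline; the commented-out call that passes *pinInRAM* and *quantizeTransformer* is an assumption (the diff never shows how these options are supplied), and the rest follows the README's own examples.

    # Sketch of the usage described in mmgp-1.0.5/README.md above.
    import torch
    from diffusers import FluxPipeline
    from transformers import T5EncoderModel

    from mmgp import offload

    # Optional: a 16-bit, encoder-only T5 XXL (the README's suggestion) instead of
    # the 40 GB official Google checkpoint referenced by many pipeline configs.
    text_encoder_2 = T5EncoderModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="text_encoder_2",
        torch_dtype=torch.float16,
    )

    # 1) Load the pipeline explicitly on the CPU device.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell",
        text_encoder_2=text_encoder_2,
        torch_dtype=torch.bfloat16,
    ).to("cpu")

    # 2) After every LoRA has been loaded and merged, hand the pipeline to mmgp.
    #    The 'transformer' model is quantized on the fly to 8 bits by default.
    offload.me(pipe)

    # Assumed, not confirmed by the diff: passing the documented options as
    # keyword arguments of offload.me.
    # offload.me(pipe, pinInRAM=True, quantizeTransformer=False)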
@@ -70,5 +70,5 @@ sort_first = [
 [tool.setuptools_scm]
 write_to = "src/_version.py"
 parentdir_prefix_version = "mmgp-"
-fallback_version = "1.0.4"
+fallback_version = "1.0.5"
 version_scheme = "post-release"
@@ -12,5 +12,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
 
-__version__ = version = '1.0.4'
-__version_tuple__ = version_tuple = (1, 0, 4)
+__version__ = version = '1.0.5'
+__version_tuple__ = version_tuple = (1, 0, 5)
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mmgp
-Version: 1.0.4
+Version: 1.0.5
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -685,38 +685,48 @@ Requires-Dist: optimum-quanto
 
 **------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
 
-This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
-This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
+This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
+This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
+times in a pipe (eg VAE).
 
 Requirements:
-- GPU: RTX 3090/ RTX 4090
+- GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
 - RAM: minimum 48 GB, recommended 64 GB
 
 It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
 1) First make sure that the pipeline explictly loads the models in the CPU device
 for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
 2) Once every potential Lora has been loaded and merged, add the following lines:
-from mmgp import offload
-offload.me(pipe)
-If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
-Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
+
+*from mmgp import offload*
+*offload.me(pipe)*
+
+The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
+
+If you have more than 64GB RAM you may want to enable RAM pinning with the option *pinInRAM = True*. You will get in return super fast loading / unloading of models
+(this can save significant time if the same pipeline is run multiple times in a row)
+
+Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
 
 For instance :
-for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
-for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
+for flux derived models: *pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }*
+for mochi: *pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }*
 
-Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
+Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model).
 
-Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
+Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
 where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
 I suggest you use instead one of the 16 bits encoder only version available around, for instance:
-text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
+*text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)*
 
-You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
+Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
+- For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
+- mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
 
+You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
 
-Thanks
--------
+Thanks to
+---------
 Huggingface / accelerate for the hooking examples
 Huggingface / quanto for their very useful quantizer
-gau-nernst for his Pinnig RAM examples
+gau-nernst for his Pinnig RAM samples
@@ -1,10 +1,11 @@
-# ------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------
+# ------------------ Memory Management for the GPU Poor by DeepBeepMeep (mmgp)------------------
 #
-# This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
-# This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
+# This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
+# This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
+# times in a pipe (eg VAE).
 #
 # Requirements:
-# - GPU: RTX 3090/ RTX 4090
+# - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
 # - RAM: minimum 48 GB, recommended 64 GB
 #
 # It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
@@ -13,27 +14,35 @@
 # 2) Once every potential Lora has been loaded and merged, add the following lines:
 # from mmgp import offload
 # offload.me(pipe)
-# If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
-# Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
+# The 'transformer' model that contains usually the video or image generator is quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. In that case you need to set the option quantizeTransformer to False to turn off on the fly quantization.
+#
+# If you have more than 64GB RAM you may want to enable RAM pinning with the option pinInRAM = True. You will get in return super fast loading / unloading of models
+# (this can save significant time if the same pipeline is run multiple times in a row)
+#
+# Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.
 #
 # For instance :
 # for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
 # for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
 #
-# Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
+# Please note that there should be always one model whose Id is 'transformer'. It corresponds to the main image / video model which usually needs to be quantized (this is done on the fly by default when loading the model)
 #
-# Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
+# Becareful, lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
 # where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
 # I suggest you use instead one of the 16 bits encoder only version available around, for instance:
 # text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
 #
-# You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
+# Sometime just providing the pipe won't be sufficient as you will need to change the content of the core model:
+# - For instance you may need to disable an existing CPU offload logic that already exists (such as manual calls to move tensors between cuda and the cpu)
+# - mmpg to tries to fake the device as being "cuda" but sometimes some code won't be fooled and it will create tensors in the cpu device and this may cause some issues.
+#
+# You are free to use my module for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
 #
-# Thanks
-# -------
+# Thanks to
+# ---------
 # Huggingface / accelerate for the hooking examples
 # Huggingface / quanto for their very useful quantizer
-# gau-nernst for his Pinnig RAM examples
+# gau-nernst for his Pinnig RAM samples
 
 
 #
@@ -45,8 +54,6 @@ import functools
 from optimum.quanto import freeze, qfloat8, qint8, quantize, QModuleMixin, QTensor
 
 
-# config Dimension X (CogVideo derived ) : Quantization False: because Lora applied later
-
 
 cotenants_map = {
     "text_encoder": ["vae", "text_encoder_2"],
mmgp-1.0.4/README.md DELETED
@@ -1,37 +0,0 @@
-**------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
-
-This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
-This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
-
-Requirements:
-- GPU: RTX 3090/ RTX 4090
-- RAM: minimum 48 GB, recommended 64 GB
-
-It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
-1) First make sure that the pipeline explictly loads the models in the CPU device
-for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
-2) Once every potential Lora has been loaded and merged, add the following lines:
-from mmgp import offload
-offload.me(pipe)
-If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
-Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
-
-For instance :
-for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
-for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
-
-Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
-
-Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
-where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
-I suggest you use instead one of the 16 bits encoder only version available around, for instance:
-text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
-
-You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
-
-
-Thanks
--------
-Huggingface / accelerate for the hooking examples
-Huggingface / quanto for their very useful quantizer
-gau-nernst for his Pinnig RAM examples
File without changes
File without changes
File without changes
File without changes