mmgp 1.0.4__tar.gz → 1.0.6__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of mmgp might be problematic.

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: mmgp
3
- Version: 1.0.4
3
+ Version: 1.0.6
4
4
  Summary: Memory Management for the GPU Poor
5
5
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
6
6
  License: GNU GENERAL PUBLIC LICENSE
@@ -683,40 +683,73 @@ License-File: LICENSE.md
683
683
  Requires-Dist: torch>=2.1.0
684
684
  Requires-Dist: optimum-quanto
685
685
 
686
- **------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
687
686
 
688
- This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
689
- This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
687
+ <p align="center">
688
+ <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
689
+ </p>
690
+
691
+
692
+ This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU card limited to 24 GB of VRAM.
693
+ This is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several
694
+ times in a pipe (e.g. the VAE).
690
695
 
691
696
  Requirements:
692
- - GPU: RTX 3090/ RTX 4090
697
+ - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
693
698
  - RAM: minimum 48 GB, recommended 64 GB
694
699
 
700
+ First you need to install the module in your current project with:
701
+ ```shell
702
+ pip install mmgp
703
+ ```
704
+
695
705
  It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
696
- 1) First make sure that the pipeline explictly loads the models in the CPU device
697
- for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
706
+ 1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
707
+ ```
708
+ pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
709
+ ```
710
+
698
711
  2) Once every potential Lora has been loaded and merged, add the following lines:
712
+
713
+ ```
699
714
  from mmgp import offload
700
715
  offload.me(pipe)
701
- If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
702
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
716
+ ```
717
+ The 'transformer' model in the pipe, which usually contains the video or image generator, is quantized on the fly to 8 bits by default. If you want to save disk space and reduce the loading time, you may want to load a prequantized model directly. In that case you need to set the option *quantizeTransformer* to *False* to turn off on-the-fly quantization.
703
718
 
719
+ If you have more than 64 GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. In return you will get very fast loading / unloading of models
720
+ (this can save significant time if the same pipeline is run multiple times in a row).
721
+
722
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
704
723
  For instance:
705
- for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
706
- for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
707
724
 
708
- Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
709
725
 
710
- Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
726
+ - for flux derived models:
727
+ ```
728
+ pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
729
+ ```
730
+ - for mochi:
731
+ ```
732
+ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
733
+ ```
734
+
735
+
736
+ Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
737
+
738
+ Be careful: lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
711
739
  where there is a huge 40 GB model to download and load. It is cumbersome as it is a 32-bit model and contains the decoder part of T5, which is not used.
712
740
  I suggest you instead use one of the 16-bit encoder-only versions available, for instance:
741
+ ```
713
742
  text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
743
+ ```
714
744
 
715
- You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
745
+ Sometimes just providing the pipe won't be sufficient, as you may need to change the code of the core model:
746
+ - For instance, you may need to disable the model's own CPU offload logic (such as manual calls that move tensors between cuda and the cpu)
747
+ - mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled, will create tensors on the cpu device, and this may cause issues.
716
748
 
749
+ You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on twitter @deepbeepmeep
717
750
 
718
- Thanks
719
- -------
720
- Huggingface / accelerate for the hooking examples
721
- Huggingface / quanto for their very useful quantizer
722
- gau-nernst for his Pinnig RAM examples
751
+ Thanks to
752
+ ---------
753
+ - Huggingface / accelerate for the hooking examples
754
+ - Huggingface / quanto for their very useful quantizer
755
+ - gau-nernst for his RAM pinning samples
mmgp-1.0.6/README.md ADDED
@@ -0,0 +1,70 @@
1
+
2
+ <p align="center">
3
+ <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
4
+ </p>
5
+
6
+
7
+ This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU card limited to 24 GB of VRAM.
8
+ This is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several
9
+ times in a pipe (e.g. the VAE).
10
+
11
+ Requirements:
12
+ - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
13
+ - RAM: minimum 48 GB, recommended 64 GB
14
+
15
+ First you need to install the module in your current project with:
16
+ ```shell
17
+ pip install mmgp
18
+ ```
19
+
20
+ It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
21
+ 1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
22
+ ```
23
+ pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
24
+ ```
25
+
26
+ 2) Once every potential Lora has been loaded and merged, add the following lines:
27
+
28
+ ```
29
+ from mmgp import offload
30
+ offload.me(pipe)
31
+ ```
32
+ The 'transformer' model in the pipe, which usually contains the video or image generator, is quantized on the fly to 8 bits by default. If you want to save disk space and reduce the loading time, you may want to load a prequantized model directly. In that case you need to set the option *quantizeTransformer* to *False* to turn off on-the-fly quantization.
33
+
34
+ If you have more than 64 GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. In return you will get very fast loading / unloading of models
35
+ (this can save significant time if the same pipeline is run multiple times in a row).
36
+
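As an illustration only (not taken from the package), the two options above would be passed alongside the pipe. Whether `offload.me` accepts them as keyword arguments is an assumption based on the option names mentioned in this README.

```
from mmgp import offload

# Hedged sketch: the option names quantizeTransformer and pinInRAM come from the text above;
# passing them as keyword arguments of offload.me() is an assumption, not a documented API.
offload.me(
    pipe,
    quantizeTransformer=False,  # skip on-the-fly 8-bit quantization, e.g. when loading a prequantized transformer
    pinInRAM=True,              # pin model tensors in RAM for fast loading / unloading (needs plenty of RAM)
)
```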
37
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
38
+ For instance (a usage sketch follows these two examples):
39
+
40
+
41
+ - for flux derived models:
42
+ ```
43
+ pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
44
+ ```
45
+ - for mochi:
46
+ ```
47
+ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
48
+ ```
49
+
50
+
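The resulting dictionary is then handed to the same call as for a regular pipeline. A minimal sketch, assuming `clip`, `t5`, `model` and `ae` were created earlier by the host application:

```
from mmgp import offload

# Sketch reusing the flux-style mapping above; clip, t5, model and ae are assumed to exist.
pipe = {"text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae": ae}
offload.me(pipe)
```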
51
+ Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
52
+
53
+ Be careful: lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
54
+ where there is a huge 40 GB model to download and load. It is cumbersome as it is a 32-bit model and contains the decoder part of T5, which is not used.
55
+ I suggest you instead use one of the 16-bit encoder-only versions available, for instance:
56
+ ```
57
+ text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
58
+ ```
59
+
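The custom encoder can then be injected when the pipeline is created. A sketch, assuming a diffusers-style `from_pretrained` that accepts submodel overrides as keyword arguments:

```
import torch
from transformers import T5EncoderModel
from diffusers import FluxPipeline

# Sketch: load a 16-bit encoder-only T5 and pass it to the pipeline instead of the default one.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", text_encoder_2=text_encoder_2, torch_dtype=torch.bfloat16
).to("cpu")
```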
60
+ Sometimes just providing the pipe won't be sufficient, as you may need to change the code of the core model (see the sketch below):
61
+ - For instance, you may need to disable the model's own CPU offload logic (such as manual calls that move tensors between cuda and the cpu)
62
+ - mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled, will create tensors on the cpu device, and this may cause issues.
63
+
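Below is a purely illustrative sketch (not part of the package) of the kind of change described above; the tensor shape and variable names are hypothetical.

```
import torch

# Hypothetical workaround inside the model's own code:
# 1) disable the model's built-in manual offloading, e.g. comment out calls such as
#        latents = latents.to("cpu")
#    since mmgp now decides where each model lives;
# 2) create new tensors explicitly on the GPU instead of trusting the (faked) module device:
shape = (1, 4, 64, 64)  # illustrative shape only
noise = torch.randn(shape, device="cuda", dtype=torch.bfloat16)
```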
64
+ You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on twitter @deepbeepmeep
65
+
66
+ Thanks to
67
+ ---------
68
+ - Huggingface / accelerate for the hooking examples
69
+ - Huggingface / quanto for their very useful quantizer
70
+ - gau-nernst for his RAM pinning samples
@@ -0,0 +1,19 @@
1
+ [project]
2
+ name = "mmgp"
3
+ version = "1.0.6"
4
+ authors = [
5
+ { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
6
+ ]
7
+ description = "Memory Management for the GPU Poor"
8
+ readme = "README.md"
9
+ requires-python = ">=3.10"
10
+ license = { file = "LICENSE.md" }
11
+ dependencies = [
12
+ "torch >= 2.1.0",
13
+ "optimum-quanto",
14
+ ]
15
+
16
+ [tool.setuptools.packages.find]
17
+ # All the following settings are optional:
18
+ where = ["src"]
19
+ namespaces = false
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: mmgp
3
- Version: 1.0.4
3
+ Version: 1.0.6
4
4
  Summary: Memory Management for the GPU Poor
5
5
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
6
6
  License: GNU GENERAL PUBLIC LICENSE
@@ -683,40 +683,73 @@ License-File: LICENSE.md
683
683
  Requires-Dist: torch>=2.1.0
684
684
  Requires-Dist: optimum-quanto
685
685
 
686
- **------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
687
686
 
688
- This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
689
- This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
687
+ <p align="center">
688
+ <H2>Memory Management for the GPU Poor by DeepBeepMeep</H2>
689
+ </p>
690
+
691
+
692
+ This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a GPU card limited to 24 GB of VRAM.
693
+ This is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several
694
+ times in a pipe (e.g. the VAE).
690
695
 
691
696
  Requirements:
692
- - GPU: RTX 3090/ RTX 4090
697
+ - GPU: RTX 3090/ RTX 4090 (24 GB of VRAM)
693
698
  - RAM: minimum 48 GB, recommended 64 GB
694
699
 
700
+ First you need to install the module in your current project with:
701
+ ```shell
702
+ pip install mmgp
703
+ ```
704
+
695
705
  It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
696
- 1) First make sure that the pipeline explictly loads the models in the CPU device
697
- for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
706
+ 1) First make sure that the pipeline explicitly loads the models on the CPU device, for instance:
707
+ ```
708
+ pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
709
+ ```
710
+
698
711
  2) Once every potential Lora has been loaded and merged, add the following lines:
712
+
713
+ ```
699
714
  from mmgp import offload
700
715
  offload.me(pipe)
701
- If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
702
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
716
+ ```
717
+ The 'transformer' model in the pipe, which usually contains the video or image generator, is quantized on the fly to 8 bits by default. If you want to save disk space and reduce the loading time, you may want to load a prequantized model directly. In that case you need to set the option *quantizeTransformer* to *False* to turn off on-the-fly quantization.
703
718
 
719
+ If you have more than 64 GB of RAM you may want to enable RAM pinning with the option *pinInRAM = True*. In return you will get very fast loading / unloading of models
720
+ (this can save significant time if the same pipeline is run multiple times in a row).
721
+
722
+ Sometimes there isn't an explicit pipe object, as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
704
723
  For instance:
705
- for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
706
- for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
707
724
 
708
- Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
709
725
 
710
- Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
726
+ - for flux derived models:
727
+ ```
728
+ pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
729
+ ```
730
+ - for mochi:
731
+ ```
732
+ pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
733
+ ```
734
+
735
+
736
+ Please note that there should always be one model whose id is 'transformer'. It corresponds to the main image / video model, which usually needs to be quantized (this is done on the fly by default when loading the model).
737
+
738
+ Be careful: lots of models use the T5 XXL as a text encoder. However, quite often their corresponding pipeline configurations point at the official Google T5 XXL repository
711
739
  where there is a huge 40 GB model to download and load. It is cumbersome as it is a 32-bit model and contains the decoder part of T5, which is not used.
712
740
  I suggest you instead use one of the 16-bit encoder-only versions available, for instance:
741
+ ```
713
742
  text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
743
+ ```
714
744
 
715
- You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
745
+ Sometimes just providing the pipe won't be sufficient, as you may need to change the code of the core model:
746
+ - For instance, you may need to disable the model's own CPU offload logic (such as manual calls that move tensors between cuda and the cpu)
747
+ - mmgp tries to fake the device as being "cuda", but sometimes some code won't be fooled, will create tensors on the cpu device, and this may cause issues.
716
748
 
749
+ You are free to use my module for non-commercial use as long as you give me proper credit. You may contact me on twitter @deepbeepmeep
717
750
 
718
- Thanks
719
- -------
720
- Huggingface / accelerate for the hooking examples
721
- Huggingface / quanto for their very useful quantizer
722
- gau-nernst for his Pinnig RAM examples
751
+ Thanks to
752
+ ---------
753
+ - Huggingface / accelerate for the hooking examples
754
+ - Huggingface / quanto for their very useful quantizer
755
+ - gau-nernst for his RAM pinning samples
@@ -1,9 +1,6 @@
1
1
  LICENSE.md
2
2
  README.md
3
3
  pyproject.toml
4
- src/__init__.py
5
- src/_version.py
6
- src/mmgp.py
7
4
  src/mmgp.egg-info/PKG-INFO
8
5
  src/mmgp.egg-info/SOURCES.txt
9
6
  src/mmgp.egg-info/dependency_links.txt
@@ -0,0 +1 @@
1
+
mmgp-1.0.4/README.md DELETED
@@ -1,37 +0,0 @@
1
- **------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------**
2
-
3
- This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
4
- This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
5
-
6
- Requirements:
7
- - GPU: RTX 3090/ RTX 4090
8
- - RAM: minimum 48 GB, recommended 64 GB
9
-
10
- It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
11
- 1) First make sure that the pipeline explictly loads the models in the CPU device
12
- for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
13
- 2) Once every potential Lora has been loaded and merged, add the following lines:
14
- from mmgp import offload
15
- offload.me(pipe)
16
- If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
17
- Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
18
-
19
- For instance :
20
- for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
21
- for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
22
-
23
- Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
24
-
25
- Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
26
- where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
27
- I suggest you use instead one of the 16 bits encoder only version available around, for instance:
28
- text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
29
-
30
- You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
31
-
32
-
33
- Thanks
34
- -------
35
- Huggingface / accelerate for the hooking examples
36
- Huggingface / quanto for their very useful quantizer
37
- gau-nernst for his Pinnig RAM examples
mmgp-1.0.4/pyproject.toml DELETED
@@ -1,74 +0,0 @@
1
- [project]
2
- name = "mmgp"
3
- authors = [
4
- { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
5
- ]
6
- description = "Memory Management for the GPU Poor"
7
- readme = "README.md"
8
- requires-python = ">=3.10"
9
- license = { file = "LICENSE.md" }
10
- dynamic = ["version"]
11
- dependencies = [
12
- "torch >= 2.1.0",
13
- "optimum-quanto",
14
- ]
15
-
16
- [project.optional-dependencies]
17
-
18
-
19
- [build-system]
20
- build-backend = "setuptools.build_meta"
21
- requires = ["setuptools>=64", "wheel", "setuptools_scm>=8"]
22
-
23
- [tool.ruff]
24
- line-length = 110
25
- target-version = "py310"
26
- extend-exclude = ["/usr/lib/*"]
27
-
28
- [tool.ruff.lint]
29
- ignore = [
30
- "E501", # line too long - will be fixed in format
31
- ]
32
-
33
- [tool.ruff.format]
34
- quote-style = "double"
35
- indent-style = "space"
36
- line-ending = "auto"
37
- skip-magic-trailing-comma = false
38
- docstring-code-format = true
39
- exclude = [
40
- "src/_version.py", # generated by setuptools_scm
41
- ]
42
-
43
- [tool.ruff.lint.isort]
44
- combine-as-imports = true
45
- force-wrap-aliases = true
46
- known-local-folder = ["src"]
47
- known-first-party = ["mmgp"]
48
-
49
- [tool.pyright]
50
- include = ["src"]
51
- exclude = [
52
- "**/__pycache__", # cache directories
53
- "./typings", # generated type stubs
54
- ]
55
- stubPath = "./typings"
56
-
57
- [tool.tomlsort]
58
- in_place = true
59
- no_sort_tables = true
60
- spaces_before_inline_comment = 1
61
- spaces_indent_inline_array = 2
62
- trailing_comma_inline_array = true
63
- sort_first = [
64
- "project",
65
- "build-system",
66
- "tool.setuptools",
67
- ]
68
-
69
- # needs to be last for CI reasons
70
- [tool.setuptools_scm]
71
- write_to = "src/_version.py"
72
- parentdir_prefix_version = "mmgp-"
73
- fallback_version = "1.0.4"
74
- version_scheme = "post-release"
File without changes
@@ -1,16 +0,0 @@
1
- # file generated by setuptools_scm
2
- # don't change, don't track in version control
3
- TYPE_CHECKING = False
4
- if TYPE_CHECKING:
5
- from typing import Tuple, Union
6
- VERSION_TUPLE = Tuple[Union[int, str], ...]
7
- else:
8
- VERSION_TUPLE = object
9
-
10
- version: str
11
- __version__: str
12
- __version_tuple__: VERSION_TUPLE
13
- version_tuple: VERSION_TUPLE
14
-
15
- __version__ = version = '1.0.4'
16
- __version_tuple__ = version_tuple = (1, 0, 4)
@@ -1,3 +0,0 @@
1
- __init__
2
- _version
3
- mmgp
mmgp-1.0.4/src/mmgp.py DELETED
@@ -1,408 +0,0 @@
1
- # ------------------ Memory Management for the GPU Poor by DeepBeepMeep ------------------
2
- #
3
- # This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... run smoothly on a 24 GB GPU limited card
4
- # This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several times in a pipe (eg VAE)
5
- #
6
- # Requirements:
7
- # - GPU: RTX 3090/ RTX 4090
8
- # - RAM: minimum 48 GB, recommended 64 GB
9
- #
10
- # It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.
11
- # 1) First make sure that the pipeline explictly loads the models in the CPU device
12
- # for instance: pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")
13
- # 2) Once every potential Lora has been loaded and merged, add the following lines:
14
- # from mmgp import offload
15
- # offload.me(pipe)
16
- # If you don't have enough RAM you may disable RAM pinning but model switching option pinInRAM= False
17
- # Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models^.
18
- #
19
- # For instance :
20
- # for flux derived models: pipe = { "text_encoder": clip, "text_encoder_2": t5, "transformer": model, "vae":ae }
21
- # for mochi: pipe = { "text_encoder": self.text_encoder, "transformer": self.dit, "vae":self.decoder }
22
- #
23
- # Please note that there should be always one model whose Id is 'transformer'. It is corresponds to the main image / video model which usually needs to be quantized (this is done by default)
24
- #
25
- # Becareful, lots of models uses the T5 XXL as a text encoder. However, quite often their corresponding pipeline configuratons points at the official Google T5 XXL repository
26
- # where there is a huge 40GB model to download and load. It is cumbersorme as it is a 32 bits model and contains the decoder part of T5 that is not used.
27
- # I suggest you use instead one of the 16 bits encoder only version available around, for instance:
28
- # text_encoder_2 = T5EncoderModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", torch_dtype=torch.float16)
29
- #
30
- # You are free to use my code for non commercial use as long you give me proper credits. You may contact me on twitter @deepbeepmeep
31
- #
32
- # Thanks
33
- # -------
34
- # Huggingface / accelerate for the hooking examples
35
- # Huggingface / quanto for their very useful quantizer
36
- # gau-nernst for his Pinnig RAM examples
37
-
38
-
39
- #
40
- import torch
41
- #
42
- import gc
43
- import time
44
- import functools
45
- from optimum.quanto import freeze, qfloat8, qint8, quantize, QModuleMixin, QTensor
46
-
47
-
48
- # config Dimension X (CogVideo derived ) : Quantization False: because Lora applied later
49
-
50
-
51
- cotenants_map = {
52
- "text_encoder": ["vae", "text_encoder_2"],
53
- "text_encoder_2": ["vae", "text_encoder"],
54
- }
55
-
56
- # useful functions to move a group of tensors (to design custom offload patches)
57
- def move_tensors(obj, device):
58
- if torch.is_tensor(obj):
59
- return obj.to(device)
60
- elif isinstance(obj, dict):
61
- _dict = {}
62
- for k, v in obj.items():
63
- _dict[k] = move_tensors(v, device)
64
- return _dict
65
- elif isinstance(obj, list):
66
- _list = []
67
- for v in obj:
68
- _list.append(move_tensors(v, device))
69
- return _list
70
- else:
71
- raise TypeError("Tensor or list / dict of tensors expected")
72
-
73
-
74
- def get_model_name(model):
75
- return model.name
76
-
77
- class HfHook:
78
- def __init__(self):
79
- self.execution_device = "cuda"
80
-
81
- def detach_hook(self, module):
82
- pass
83
-
84
- class offload:
85
- def __init__(self):
86
- self.active_models = []
87
- self.active_models_ids = []
88
- self.models = {}
89
- self.verbose = False
90
- self.models_to_quantize = []
91
- self.pinned_modules_data = {}
92
- self.params_of_modules = {}
93
- self.pinTensors = False
94
- self.device_mem_capacity = torch.cuda.get_device_properties(0).total_memory
95
- self.last_reserved_mem_check =0
96
-
97
- def collect_module_parameters(self, module: torch.nn.Module, module_params):
98
- if isinstance(module, (torch.nn.ModuleList, torch.nn.Sequential)):
99
- for i in range(len(module)):
100
- current_layer = module[i]
101
- module_params.extend(current_layer.parameters())
102
- module_params.extend(current_layer.buffers())
103
- else:
104
- for p in module.parameters(recurse=False):
105
- module_params.append(p)
106
- for p in module.buffers(recurse=False):
107
- module_params.append(p)
108
- for sub_module in module.children():
109
- self.collect_module_parameters(sub_module, module_params)
110
-
111
- def can_model_be_cotenant(self, model_id):
112
- potential_cotenants= cotenants_map.get(model_id, None)
113
- if potential_cotenants is None:
114
- return False
115
- for existing_cotenant in self.active_models_ids:
116
- if existing_cotenant not in potential_cotenants:
117
- return False
118
- return True
119
-
120
- def gpu_load(self, model_id):
121
- model = self.models[model_id]
122
- self.active_models.append(model)
123
- self.active_models_ids.append(model_id)
124
- if self.verbose:
125
- model_name = model._get_name()
126
- print(f"Loading model {model_name} ({model_id}) in GPU")
127
- if not self.pinInRAM:
128
- model.to("cuda")
129
- else:
130
- module_params = self.params_of_modules[model_id]
131
- for p in module_params:
132
- if isinstance(p, QTensor):
133
- p._data = p._data.cuda(non_blocking=True)
134
- p._scale = p._scale.cuda(non_blocking=True)
135
- else:
136
- p.data = p.data.cuda(non_blocking=True) #
137
- # torch.cuda.current_stream().synchronize()
138
- @torch.compiler.disable()
139
- def unload_all(self):
140
- for model, model_id in zip(self.active_models, self.active_models_ids):
141
- if not self.pinInRAM:
142
- model.to("cpu")
143
- else:
144
- module_params = self.params_of_modules[model_id]
145
- pinned_parameters_data = self.pinned_modules_data[model_id]
146
- for p in module_params:
147
- if isinstance(p, QTensor):
148
- data = pinned_parameters_data[p]
149
- p._data = data[0]
150
- p._scale = data[1]
151
- else:
152
- p.data = pinned_parameters_data[p]
153
-
154
-
155
- self.active_models = []
156
- self.active_models_ids = []
157
- torch.cuda.empty_cache()
158
- gc.collect()
159
-
160
- def move_args_to_gpu(self, *args, **kwargs):
161
- new_args= []
162
- new_kwargs={}
163
- for arg in args:
164
- if torch.is_tensor(arg):
165
- if arg.dtype == torch.float32:
166
- arg = arg.to(torch.bfloat16).cuda(non_blocking=True)
167
- else:
168
- arg = arg.cuda(non_blocking=True)
169
- new_args.append(arg)
170
-
171
- for k in kwargs:
172
- arg = kwargs[k]
173
- if torch.is_tensor(arg):
174
- if arg.dtype == torch.float32:
175
- arg = arg.to(torch.bfloat16).cuda(non_blocking=True)
176
- else:
177
- arg = arg.cuda(non_blocking=True)
178
- new_kwargs[k]= arg
179
-
180
- return new_args, new_kwargs
181
-
182
- def ready_to_check_mem(self, forceMemoryCheck):
183
- cur_clock = time.time()
184
- # can't check at each call if we can empty the cuda cache as quering the reserved memory value is a time consuming operation
185
- if not forceMemoryCheck and (cur_clock - self.last_reserved_mem_check)<0.200:
186
- return False
187
- self.last_reserved_mem_check = cur_clock
188
- return True
189
-
190
-
191
- def empty_cache_if_needed(self):
192
- mem_reserved = torch.cuda.memory_reserved()
193
- if mem_reserved >= 0.9*self.device_mem_capacity:
194
- mem_allocated = torch.cuda.memory_allocated()
195
- if mem_allocated <= 0.70 * mem_reserved:
196
- # print(f"Cuda empty cache triggered as Allocated Memory ({mem_allocated/1024000:0f} MB) is lot less than Cached Memory ({mem_reserved/1024000:0f} MB) ")
197
- torch.cuda.empty_cache()
198
- # print(f"New cached memory after purge is {torch.cuda.memory_reserved()/1024000:0f} MB) ")
199
-
200
- def hook_me_light(self, target_module, forceMemoryCheck, previous_method):
201
- def check_empty_cache(module, *args, **kwargs):
202
- if self.ready_to_check_mem(forceMemoryCheck):
203
- self.empty_cache_if_needed()
204
- return previous_method(*args, **kwargs)
205
-
206
- setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_empty_cache, target_module), previous_method) )
207
-
208
-
209
- def hook_me(self, target_module, model, model_id, module_id, previous_method):
210
- def check_change_module(module, *args, **kwargs):
211
- performEmptyCacheTest = False
212
- if not model_id in self.active_models_ids:
213
- new_model_id = getattr(module, "_mm_id")
214
- # do not always unload existing models if it is more efficient to keep in them in the GPU
215
- # (e.g: small modules whose calls are text encoders)
216
- if not self.can_model_be_cotenant(new_model_id) :
217
- self.unload_all()
218
- performEmptyCacheTest = False
219
- self.gpu_load(new_model_id)
220
- # transfer leftovers inputs that were incorrectly created in the RAM (mostly due to some .device tests that returned incorrectly "cpu")
221
- args, kwargs = self.move_args_to_gpu(*args, **kwargs)
222
- if performEmptyCacheTest:
223
- self.empty_cache_if_needed()
224
- return previous_method(*args, **kwargs)
225
-
226
- if hasattr(target_module, "_mm_id"):
227
- return
228
- setattr(target_module, "_mm_id", model_id)
229
-
230
- # create a fake accelerate parameter so that the _execution_device property returns always "cuda"
231
- # (it is queried in many pipelines even if offloading is not properly implemented)
232
- if not hasattr(target_module, "_hf_hook"):
233
- setattr(target_module, "_hf_hook", HfHook())
234
- setattr(target_module, "forward", functools.update_wrapper(functools.partial(check_change_module, target_module), previous_method) )
235
-
236
- if not self.verbose:
237
- return
238
-
239
- if module_id == None or module_id =='':
240
- model_name = model._get_name()
241
- print(f"Hooked in model '{model_id}' ({model_name})")
242
-
243
-
244
- # Not implemented yet, but why would one want to get rid of these features ?
245
- # def unhook_module(module: torch.nn.Module):
246
- # if not hasattr(module,"_mm_id"):
247
- # return
248
-
249
- # delattr(module, "_mm_id")
250
-
251
- # def unhook_all(parent_module: torch.nn.Module):
252
- # for module in parent_module.components.items():
253
- # self.unhook_module(module)
254
-
255
-
256
-
257
-
258
- @classmethod
259
- def all(cls, pipe_or_dict_of_modules, quantizeTransformer = True, pinInRAM = False, verbose = True):
260
- self = cls()
261
- self.verbose = verbose
262
- self.pinned_modules_data = {}
263
-
264
- # compile not working yet or slower
265
- compile = False
266
- self.pinInRAM = pinInRAM
267
- pipe = None
268
- preloadInRAM = True
269
- torch.set_default_device('cuda')
270
- if hasattr(pipe_or_dict_of_modules, "components"):
271
- pipe_or_dict_of_modules.to("cpu") #XXXX
272
- # create a fake Accelerate parameter so that lora loading doesn't change the device
273
- pipe_or_dict_of_modules.hf_device_map = torch.device("cuda")
274
- pipe = pipe_or_dict_of_modules
275
- pipe_or_dict_of_modules= pipe_or_dict_of_modules.components
276
-
277
-
278
- models = {k: v for k, v in pipe_or_dict_of_modules.items() if isinstance(v, torch.nn.Module)}
279
-
280
- if quantizeTransformer:
281
- self.models_to_quantize = ["transformer"]
282
- # del models["transformer"] # to test everything but the transformer that has a much longer loading
283
- # models = { 'transformer': pipe_or_dict_of_modules["transformer"]} # to test only the transformer
284
- for model_id in models:
285
- current_model: torch.nn.Module = models[model_id]
286
- # make sure that no RAM or GPU memory is not allocated for gradiant / training
287
- current_model.to("cpu").eval() #XXXXX
288
-
289
- # Quantize model just before transferring it to the RAM to keep OS cache file
290
- # open as short as possible. Indeed it seems that as long as the lazy safetensors
291
- # are not fully fully loaded, the OS won't be able to release the corresponding cache file in RAM.
292
- if model_id in self.models_to_quantize:
293
- print(f"Quantization of model '{model_id}' started")
294
- quantize(current_model, weights=qint8)
295
- freeze(current_model)
296
- print(f"Quantization of model '{model_id}' done")
297
- torch.cuda.empty_cache()
298
- gc.collect()
299
-
300
-
301
-
302
- if preloadInRAM: #
303
- # load all the remaining unread lazy safetensors in RAM to free open cache files
304
- for p in current_model.parameters():
305
- # Preread every tensor in RAM except tensors that have just been quantified
306
- # and are no longer needed
307
- if isinstance(p, QTensor):
308
- # fix quanto bug (see below) now as he won't have any opportunity to do it during RAM pinning
309
- if not pinInRAM and p._scale.dtype == torch.float32:
310
- p._scale = p._scale.to(torch.bfloat16)
311
-
312
- else:
313
- if p.data.dtype == torch.float32:
314
- # convert any left overs float32 weight to bloat16 to divide by 2 the model memory footprint
315
- p.data = p.data.to(torch.bfloat16)
316
- else:
317
- # force reading the tensors from the disk by pretending to modify them
318
- p.data = p.data + 0
319
-
320
-
321
- addModelFlag = False
322
-
323
- current_block_sequence = None
324
- for submodule_name, submodule in current_model.named_modules():
325
- if hasattr(submodule, "forward"):
326
- submodule_method = getattr(submodule, "forward")
327
- if callable(submodule_method):
328
- addModelFlag = True
329
- if submodule_name=='' or len(submodule_name.split("."))==1:
330
- # hook only the first two levels of modules with the full suite of processing
331
- self.hook_me(submodule, current_model, model_id, submodule_name, submodule_method)
332
- else:
333
- forceMemoryCheck = False
334
- pos = submodule_name.find(".0.")
335
- if pos > 0:
336
- if current_block_sequence == None:
337
- new_candidate = submodule_name[0:pos+3]
338
- if len(new_candidate.split("."))<=4:
339
- current_block_sequence = new_candidate
340
- # force a memory check when initiating a new sequence of blocks as the shapes of tensor will certainly change
341
- # and memory reusability is less likely
342
- # we limit this check to the first level of blocks as quering the cuda cache is time consuming
343
- forceMemoryCheck = True
344
- else:
345
- if current_block_sequence != submodule_name[0:len(current_block_sequence)]:
346
- current_block_sequence = None
347
- self.hook_me_light(submodule, forceMemoryCheck, submodule_method)
348
-
349
-
350
- if addModelFlag:
351
- if model_id not in self.models:
352
- self.models[model_id] = current_model
353
-
354
- # Pin in RAM models only once they have been fully loaded otherwise there may be some contention in the non pageable memory
355
- # between partially loaded lazy safetensors and pinned tensors
356
- if pinInRAM:
357
- if verbose:
358
- print("Pinning model tensors in RAM")
359
- torch.cuda.empty_cache()
360
- gc.collect()
361
- for model_id in models:
362
- pinned_parameters_data = {}
363
- current_model: torch.nn.Module = models[model_id]
364
- for p in current_model.parameters():
365
- if isinstance(p, QTensor):
366
- # pin in memory both quantized data and scales of quantized parameters
367
- # but don't pin .data as it corresponds to the original tensor that we don't want to reload
368
- p._data = p._data.pin_memory()
369
- # fix quanto bug that allows _scale to be float32 if the original weight was float32
370
- # (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
371
- p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
372
- pinned_parameters_data[p]=[p._data, p._scale]
373
- else:
374
- p.data = p.data.pin_memory()
375
- pinned_parameters_data[p]=p.data
376
- for b in current_model.buffers():
377
- b.data = b.data.pin_memory()
378
-
379
- pinned_buffers_data = {b: b.data for b in current_model.buffers()}
380
- pinned_parameters_data.update(pinned_buffers_data)
381
- self.pinned_modules_data[model_id]=pinned_parameters_data
382
-
383
- module_params = []
384
- self.params_of_modules[model_id] = module_params
385
- self.collect_module_parameters(current_model,module_params)
386
-
387
- if compile:
388
- if verbose:
389
- print("Torch compilation started")
390
- torch._dynamo.config.cache_size_limit = 10000
391
- # if pipe != None and hasattr(pipe, "__call__"):
392
- # pipe.__call__= torch.compile(pipe.__call__, mode= "max-autotune")
393
-
394
- for model_id in models:
395
- current_model: torch.nn.Module = models[model_id]
396
- current_model.compile(mode= "max-autotune")
397
- #models["transformer"].compile()
398
-
399
- if verbose:
400
- print("Torch compilation done")
401
-
402
- torch.cuda.empty_cache()
403
- gc.collect()
404
-
405
-
406
- return self
407
-
408
-
File without changes
File without changes